8GB to 70B: A Real Hardware Guide for Local LLMs

The idea of running a local LLM (Large Language Model) has always appealed to me, especially concerning data privacy and cost control. However, when I first delved into this, I realized through my own experiences how misleading market claims like "a few GB of RAM is enough" can be. In real-world scenarios, running a 70B parameter model with 8GB of VRAM is only possible with significant optimizations, which come with certain trade-offs.

In this post, I will share my experiences, the problems I encountered, and the solutions I found, from hardware selection to optimization techniques for local LLMs. My goal is to offer a concrete, practical, and "good enough" perspective to anyone interested in this field. As we begin, we must remember that VRAM is the most critical part of this equation.

At the core of running an LLM locally is keeping the model's weights in the GPU's VRAM. As the model size grows, the amount of VRAM it needs naturally increases. For example, a 7 billion parameter (7B) model in 16-bit float (FP16) format requires about 14GB of VRAM, while a 70B parameter model can demand up to 140GB. These values are far beyond the hardware owned by an average user.

While working on AI-powered operations for my side product and a production planning model for a client project, I had the opportunity to experiment with models of different sizes. I clearly saw that practical VRAM usage can differ from the theoretical requirement, especially as the context window grows. A 7B model with a common quantization like Q4_K_M can generally run with around 5-6GB of VRAM. For a 13B model this rises to 8-10GB, and for a 70B model it can reach 40-50GB. The exact figure also depends on parameters like context window and batch size.
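The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate for the weights only; the function name and the 4.5 bits-per-weight value used for a 4-bit K-quant are assumptions for illustration, and real usage is higher once the context cache and runtime buffers are added.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    # 4.5 bits per weight is an assumed average for a 4-bit K-quant,
    # not a value taken from the GGUF specification.
    q4 = weight_memory_gb(size, 4.5)
    print(f"{size}B: FP16 ~{fp16:.0f} GB, 4-bit ~{q4:.1f} GB (weights only)")
```

The gap between these numbers and the 5-6GB or 40-50GB observed in practice is the overhead that grows with the context window.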

💡 VRAM Monitoring Tips

You can monitor the real-time status of your GPU and VRAM with the nvidia-smi command. Running watch -n 1 nvidia-smi refreshes the readout every second, which helps you see how much memory is consumed when loading a model or performing inference.

While 8GB or 12GB VRAM cards are common in the market, running large models like 70B on these cards requires more than just VRAM; significant optimizations like quantization are essential. Sometimes, even running a 7B model with full performance and long contexts can be challenging with 8GB of VRAM. At this point, not only the VRAM capacity but also the memory bandwidth of the GPU becomes important. Higher bandwidth allows model weights to be read and processed faster, increasing inference speed. If fitting a model into VRAM is an achievement, making it run fast afterward is another challenge.
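To see why bandwidth matters, here is a back-of-the-envelope sketch. It assumes token generation is memory-bound and that every weight is read once per generated token, so it gives an upper bound rather than a prediction. The function name and the 400 GB/s figure are hypothetical placeholders, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token requires one full pass over the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical GPU with 400 GB/s of memory bandwidth and a 4 GB quantized model.
print(f"ceiling: ~{max_tokens_per_second(400, 4):.0f} tokens/s")
```

Real speeds land below this ceiling because of compute cost, cache reads, and kernel overhead, but the ratio explains why a smaller quantized model is usually faster on the same card.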

Quantization is a lifesaver for those of us who want to run LLMs locally. Essentially, it means representing the model's weights using fewer bits. For example, using int8 (8-bit) or int4 (4-bit) instead of float16 (16-bit) significantly reduces model size and thus VRAM requirements. This way, I can run a 70B model, which would normally require 140GB of VRAM, with 4-bit quantization using around 40GB of VRAM.
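The idea of "fewer bits per weight" can be illustrated with a toy example. This is a simple symmetric absmax scheme written for illustration only; it is not the K-quant algorithm that GGUF files use, and the function names and sample weights are made up.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [c * scale for c in codes]


weights = [0.42, -0.13, 0.07, -0.31]  # made-up sample values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f} (error {abs(original - approx):.3f})")
```

Each weight is stored as a small integer plus a shared scale, which is where the memory saving comes from, and the rounding error printed above is the source of the quality loss.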

In my experience, especially with GGUF format models used by projects like llama.cpp, quantization levels like Q4_K_M generally offer a good balance. These formats keep the model's performance and output quality at acceptable levels while significantly reducing VRAM consumption. While prompt engineering for a client project, I closely observed the differences in output quality between different quantization levels of the same model. Less compressed formats like Q8_0 yielded better output, but Q4_K_M offered a more practical solution in terms of both performance and memory.

ollama run mistral:7b-instruct-v0.2-q4_K_M

./main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Write me a poem about artificial intelligence." -n 512

Of course, quantization comes with a cost: a potential drop in output quality. Especially in very sensitive or creative tasks, lower bit depths can sometimes lead to meaningless or erroneous outputs. Therefore, choosing a quantization level is a trade-off that depends on your project's sensitivity and available hardware. I generally prefer the most compressed format that still delivers acceptable quality, because speed and memory savings are often more critical than a slight drop in quality. This is part of the "good enough" philosophy: instead of always aiming for perfection, find the most efficient solution that gets the job done.

Limiting local LLM performance to just GPU and VRAM would be a big mistake. Factors like model loading, CPU token processing, and inference engine efficiency play critical roles in overall performance. Especially with large models, disk speed directly affects how quickly model files are loaded into VRAM. There's a world of difference between loading a 40-50GB 70B model file from an HDD versus a fast NVMe SSD. In my tests, NVMe drives reduced model load times by up to 70%.
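If you want to check how your own disk behaves, a minimal sketch like the following times a sequential read of a file. The function name is hypothetical, and the result is only indicative: a second run over the same file is usually served from the operating system's page cache and will look much faster than the disk really is.

```python
import time


def read_throughput_mb_s(path: str, chunk_bytes: int = 16 * 1024 * 1024) -> float:
    """Read a file sequentially and return the observed throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window on very small files.
    return (total / 1e6) / elapsed if elapsed > 0 else float("inf")
```

Calling read_throughput_mb_s on a GGUF file and dividing the file size by the result gives a rough lower bound on load time from that drive.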

The CPU takes on a significant workload, especially in hybrid CPU/GPU inference engines like llama.cpp. If part of the model doesn't fit into VRAM or if CPU offloading is used, the CPU's core count and speed directly impact inference speed. While integrating LLMs into the backend of my anonymous Turkish data platform, I realized how crucial it was to set the correct thread count for llama.cpp. Excessive thread usage can degrade performance due to context switching costs, while insufficient thread usage wastes CPU resources.

./main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Make me a 5-item list." -n 128 -t 8
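A starting value for -t can be derived from the core count. The sketch below uses a common heuristic, one thread per physical core, and assumes that a CPU with SMT/Hyper-Threading exposes two logical CPUs per core. Both the function name and that assumption are illustrative, so treat the result as a first guess to benchmark against, not a rule.

```python
import os
from typing import Optional


def suggest_threads(logical_cpus: Optional[int] = None, smt: bool = True) -> int:
    """Suggest a thread count: roughly one thread per physical core."""
    logical = logical_cpus or os.cpu_count() or 1
    # Assumption: with SMT enabled, two logical CPUs map to one physical core.
    physical = logical // 2 if smt else logical
    return max(1, physical)


print(f"try: -t {suggest_threads()}")
```

On a machine with 16 logical CPUs this suggests 8, which matches the -t 8 in the command above.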

Inference engines themselves offer an additional layer of optimization. llama.cpp is a popular choice that can efficiently use both CPU and GPU, with broad model support. vLLM, on the other hand, is designed more for high-performance GPUs, increasing throughput with techniques like continuous batching. Which engine you choose depends on your hardware and use case. I generally prefer llama.cpp for its simplicity and flexibility, especially in hybrid systems. By setting CPU and memory limits for a service with cgroup, I ensure that LLM inference doesn't affect other critical services. Last month, when I accidentally caused a service to be OOM-killed by writing sleep 360, I once again understood the importance of cgroup limits.
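One way to apply such limits on a systemd-based Linux system is to launch the process in a transient scope with systemd-run. The sketch below only builds the command line. The helper name and the 24G and 400% values are placeholders I chose for illustration, and it assumes cgroup v2 and sufficient privileges to create the scope.

```python
import shlex


def limited_command(cmd, memory_max="24G", cpu_quota="400%"):
    """Wrap a command so it runs in a transient scope with memory and CPU caps."""
    return [
        "systemd-run", "--scope",
        "-p", f"MemoryMax={memory_max}",  # hard memory ceiling for the scope
        "-p", f"CPUQuota={cpu_quota}",    # 400% is roughly four full cores
        *cmd,
    ]


inference = ["./main", "-m", "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf", "-n", "128"]
print(shlex.join(limited_command(inference)))
```

If the process exceeds MemoryMax, the kernel kills it inside its own cgroup instead of putting the rest of the machine under memory pressure.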

Hardware selection for local LLMs depends directly on your budget and the size of the model you want to run. You don't always have to buy the most expensive card; the important thing is to find the most efficient solution that meets your needs. My approach to this has always been "good enough": getting the best performance out of the available resources.

Here are my observations and recommendations for different budget levels:

[...] llama.cpp's multi-GPU support). At this level, costs increase significantly. While working on a larger and more complex LLM for production planning in a manufacturing company's ERP, we had to conduct a very detailed return-on-investment analysis for such hardware.

ℹ️ Evaluating the Second-Hand Market

In my opinion, exploring the second-hand market, especially for mid-range and high-end cards, can be smart. Prices can be much lower than for new cards, and with the right choice you can save a significant part of your budget. However, always check the seller's history and, if possible, test the card before buying. I faced a similar budget and hardware selection dilemma when optimizing PostgreSQL performance on a VPS: the most expensive solution is not always the best, and the important thing is to find what suits the need.

Remember, LLMs are evolving rapidly, and new optimization techniques are constantly emerging. Therefore, instead of making a huge investment initially, choosing what suits your needs and upgrading over time might be a more sensible strategy.

Running local LLMs efficiently requires not only the right hardware but also the right tools and fine-tuning. The two main tools I've used and found most beneficial in this process are ollama and llama.cpp.

ollama offers an incredibly easy interface for running local LLMs. With a single command, you can download and run popular models, and even import your own model. Its API also allows for easy integration into your other applications.

ollama run llama2:7b

ollama run mistral:latest

curl http://localhost:11434/api/generate -d '{
  "model": "mistral:latest",
  "prompt": "Why are local LLMs important?",
  "stream": false
}'
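The same request can be made from application code. This is a minimal sketch using only the Python standard library; it mirrors the curl call above and assumes the server is listening on the default localhost:11434 and that the non-streaming reply carries the text in a "response" field. The function names are my own.

```python
import json
import urllib.request


def build_generate_request(prompt, model="mistral:latest", host="http://localhost:11434"):
    """Build a POST request for the /api/generate endpoint with streaming disabled."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the generated text (needs a running ollama server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as reply:
        return json.loads(reply.read())["response"]
```

Calling generate("Why are local LLMs important?") returns the answer as a plain string once the model has finished.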

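llama.cpp, on the other hand, offers lower-level control and more optimization options. By compiling its source code, you can make hardware-specific optimizations and run different GGUF models directly. Passing a job count that matches your CPU cores, for example make -j"$(nproc)", significantly shortens compilation time. Note that recent llama.cpp releases build with CMake and name the binary llama-cli, so the make and ./main commands here apply to older checkouts.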

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j"$(nproc)" # One build job per CPU core; a bare "make -j" sets no limit on parallel jobs

./main -m models/llama-2-7b-chat.Q4_K_M.gguf -p "What are the advantages of using local LLMs?" -n 256 --gpu-layers 30

Performance monitoring and resource management are another point not to be overlooked. Following system logs with journald and limiting the resource consumption of LLM services with cgroup are critical for overall system stability. For example, if an LLM service unexpectedly consumes too much memory and gets OOM-killed, you can see this in the journald logs and adjust your cgroup settings accordingly. Similarly, using auditd to monitor specific file accesses or system calls can be useful for identifying security and performance issues, especially where an LLM's access to the file system is concerned.
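Spotting OOM kills in the logs can be automated. The sketch below scans text lines, such as the output of journalctl -k, for the kernel's out-of-memory message. The exact wording differs between kernel versions, so the pattern is an assumption that covers the two variants I am aware of, and the sample lines are made up.

```python
import re

# Assumed message shape: "Out of memory: Killed process <pid> (<name>)"
# Older kernels print "Kill process" instead of "Killed process".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) pairs for every OOM kill found in the lines."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


sample = [
    "kernel: Out of memory: Killed process 4242 (main) total-vm:41943040kB",
    "kernel: usb 1-1: new high-speed USB device",
]
print(find_oom_kills(sample))
```

Feeding it real log output, for example by piping journalctl -k into a small script that calls find_oom_kills on sys.stdin, shows which process was killed and when the limits need revisiting.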

⚠️ Caution with Resource Limiting

Care must be taken when setting resource limits with cgroup. Limits that are too low can cause LLM inference to slow down or fail entirely. Finding the right limits requires trial and error and closely monitoring journald output.

Playing with parameters like batch size and context window in engines like llama.cpp also affects performance. A larger batch size can increase throughput but also increases VRAM consumption. The longer the context window, the more past information the model can remember, but this also extends inference time and enlarges the cache held in memory. Adjusting these parameters according to your project's needs and your hardware's capacity is a practical reflection of the "good enough" philosophy.
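The memory cost of a longer context can be estimated from the size of the key/value cache. The sketch below uses the standard formula of two tensors (keys and values) per layer. The example numbers are the published Llama 2 7B shape (32 layers, 32 KV heads, head dimension 128) with 16-bit cache entries, and the function and parameter names are my own.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2, n_sequences=1):
    """Bytes needed to cache keys and values for the given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * context_tokens * n_sequences


for context in (2048, 4096, 8192):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=context)
    print(f"context {context}: ~{size / 2**30:.1f} GiB of cache")
```

The cache grows linearly with context length and with the number of sequences processed in parallel, which is why an 8GB card can hold a 7B model comfortably at a short context and run out of memory at a long one.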

Stepping into the world of local LLMs brings with it some challenges, especially on the hardware side. However, from my own experiences, I've seen that with the right knowledge and approach, it's possible to efficiently run models like 7B on an 8GB VRAM system, and even push to 70B in some cases. Throughout this process, I personally experienced the critical role of VRAM, the saving effect of quantization, and the importance of other factors like CPU and disk speed.

Remember, you don't always need the latest or most expensive hardware. The important thing is to use your available resources in the best way possible to create a solution that is suitable for your project's needs and cost-effective. With a "good enough" approach, you can get maximum efficiency from your current hardware and make smart upgrades when necessary.

source & further reading

dev.to: original article
