Here’s a number that changed how I think about running large language models: a 70B-parameter model in FP16 needs about 140GB of VRAM for its weights alone, which means at least two 80GB H100s and tens of thousands of dollars of hardware. Quantize that same model to 4-bit with AWQ or GPTQ and it drops to roughly 35–40GB, which fits on a single 48GB or 80GB GPU (such as an 80GB A100) or across two 24GB RTX 4090s.
Quantization isn’t just a nice-to-have optimization. In 2026, it’s the technique that makes local LLM deployment accessible to developers without data center budgets. Whether you’re running Llama 3.1 on a consumer GPU or deploying Mistral for production inference, understanding quantization formats is essential.
This guide breaks down the three dominant quantization approaches — GGUF, AWQ, and GPTQ — with real benchmarks, VRAM requirements, and practical recommendations for choosing the right format.
What Is LLM Quantization?
Quantization is the process of reducing the precision of a model’s weights from higher-bit formats (like 16-bit floating point) to lower-bit representations (8-bit, 4-bit, or even lower). Think of it as compression that trades a small amount of model quality for massive reductions in memory usage and often faster inference.
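To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits. It is an illustration only: the function names, the sample weights, and the single shared scale are assumptions for this example, and real formats such as GGUF, AWQ, and GPTQ use more elaborate schemes.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers using one shared scale (absmax scaling)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.82, -0.41, 0.05, -1.20, 0.33]          # made-up example values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("integers:", q)
print("max rounding error:", max_error)
```

Each weight is now stored as a small integer plus one shared scale, and the rounding error per weight is at most half the scale. That rounding error is the quality cost the rest of this guide is about.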
Here’s why it matters:
- Memory reduction: 4-bit quantization cuts VRAM requirements by roughly 75%
- Cost savings: Run 70B models on consumer GPUs instead of enterprise hardware
- Speed gains: Lower precision often enables faster token generation
- Accessibility: Deploy larger models on hardware you already own
The trade-off? A small quality degradation — typically 1–5% depending on the method and bit depth. For most practical applications, this loss is imperceptible.
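The memory figures follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below counts weights only (no KV cache or activations), and the helper name is my own.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16 = weight_memory_gb(70e9, 16)
int4 = weight_memory_gb(70e9, 4)
reduction = 1 - int4 / fp16
print(f"FP16: {fp16:.0f} GB, 4-bit: {int4:.0f} GB, reduction: {reduction:.0%}")
```

Real 4-bit files come out somewhat larger than this estimate because they also store scales and keep a few tensors at higher precision.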
The Three Dominant Formats: GGUF, AWQ, and GPTQ
As of 2026, three quantization formats dominate the local LLM landscape. Each was designed for different use cases, hardware configurations, and runtime environments.
| Format | Best For | Commonly Used With | Runtime Support |
|---|---|---|---|
| GGUF | CPU inference, Apple Silicon, flexibility | Ollama, LM Studio, llama.cpp | llama.cpp, Ollama, Kobold.cpp |
| AWQ | GPU production serving, quality preservation | vLLM, HuggingFace TGI, SGLang | vLLM, AutoAWQ, TensorRT-LLM |
| GPTQ | Maximum compression, pre-quantized models | ExLlama, HuggingFace Transformers | ExLlama, AutoGPTQ, transformers |
GGUF: The Universal Format
GGUF (the successor to the older GGML format) is the native format for llama.cpp and its ecosystem, including Ollama and LM Studio. It’s designed for maximum flexibility: it runs on CPUs, Apple Silicon, and GPUs, with support for offloading only part of the model to the GPU.
GGUF’s most widely used quantization types are the K-quants, which mix precisions across a model’s tensors according to how sensitive each one is:
- Q4_K_M: The sweet spot — ~3.8GB for 7B models, +0.0535 perplexity increase
- Q5_K_M: Higher quality — ~4.45GB for 7B models, +0.0142 perplexity increase
- Q6_K: Near-lossless — ~5.15GB for 7B models, +0.0044 perplexity increase
- Q8_0: Effectively lossless — ~6.7GB for 7B models, +0.0004 perplexity increase
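The “K” marks the K-quant family, which quantizes weights in small blocks grouped into super-blocks, each with its own scale. The mixing comes from the suffix: _S, _M, and _L select small, medium, and large mixes of precisions. In Q4_K_M, for example, most tensors are stored as 4-bit Q4_K, while some of the more sensitive attention and feed-forward tensors are kept at 6-bit Q6_K. This allocation is why K-quants achieve better quality per bit than naive 4-bit quantization.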
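One ingredient behind GGUF’s quality per bit is that weights are quantized in small blocks, each with its own scale, so a single outlier cannot wipe out the precision of everything else. The sketch below shows that effect on made-up values. It is not llama.cpp’s actual K-quant layout: the block size, helper names, and the 16-bit scale per 32 weights are assumptions for illustration.

```python
def quantize_dequantize(values, bits=4):
    """Round-to-nearest with one shared scale; returns the dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def blockwise(values, block_size, bits=4):
    """Quantize each block of `block_size` values with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def total_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx))


# Four small weights followed by four large ones (made-up values).
weights = [0.01, 0.021, -0.03, 0.04, 1.0, -2.2, 3.0, 4.0]
per_tensor_error = total_abs_error(weights, quantize_dequantize(weights))
per_block_error = total_abs_error(weights, blockwise(weights, block_size=4))
print("one scale for everything:", per_tensor_error)
print("one scale per block:     ", per_block_error)

# Storage cost if each block of 32 four-bit weights carries one 16-bit scale.
bits_per_weight = 4 + 16 / 32
print("effective bits per weight:", bits_per_weight)
```

With a single scale, the four small weights all round to zero. With per-block scales they survive, at the price of a little extra storage, which is why “4-bit” files usually work out to more than 4 bits per weight.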
AWQ: Activation-Aware Weight Quantization
AWQ takes a different approach. Instead of treating all weights equally, it observes model activations during a calibration phase to identify “salient” weight channels, the ones that matter most for output quality. Rather than storing those weights at higher precision, AWQ scales the salient channels up before quantization and scales the matching activations down by the same factor. This shrinks their relative rounding error while every weight stays at the same low bit width.
The result? AWQ typically achieves better quality at 4-bit than other methods, particularly for instruction-tuned models. It’s become the default for production GPU serving in 2026.
Key characteristics:
- Quality: Best-in-class for 4-bit quantization (1–3% degradation vs FP16)
- VRAM savings: ~70–75% reduction in weight memory vs FP16 at 4-bit
- Speed: Optimized kernels in vLLM and TensorRT-LLM
- Hardware: Built for GPU inference, primarily on NVIDIA cards (not a practical choice for CPU-only setups)
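The mechanism AWQ uses to protect salient weights is per-channel scaling: multiply a weight channel by a factor s and divide the matching activation by s, so the product is unchanged before rounding but the rounding error on that channel shrinks. Below is a toy sketch of that idea on one row of a linear layer. The numbers and the scale of 4.0 are hand-picked assumptions; the real method searches for scales using calibration data.

```python
def fake_quant(values, bits=4):
    """Round-to-nearest with one shared scale, returned as dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One output row of a linear layer and one calibration activation vector.
weights = [0.7, 0.13, -0.33, 0.5]
activations = [1.0, 10.0, 1.0, 1.0]     # channel 1 is salient: large activation
channel_scales = [1.0, 4.0, 1.0, 1.0]   # hypothetical; AWQ searches for these

reference = dot(weights, activations)

# Plain round-to-nearest quantization of the weights.
plain_error = abs(reference - dot(fake_quant(weights), activations))

# AWQ-style: scale weights up and activations down, then quantize the weights.
scaled_w = [w * s for w, s in zip(weights, channel_scales)]
scaled_x = [x / s for x, s in zip(activations, channel_scales)]
awq_error = abs(reference - dot(fake_quant(scaled_w), scaled_x))

print("output error, plain rounding:", plain_error)
print("output error, scaled channel:", awq_error)
```

In this example the scaled channel stays below the row’s largest weight, so the quantization step is unchanged and only the salient channel’s error shrinks. Scaling too aggressively would raise the step size for every other weight, which is why the scales have to be searched rather than maximized.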
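GPTQ: Post-Training Quantization for Generative Pre-trained Transformers
GPTQ was one of the first widely adopted 4-bit quantization methods. It quantizes each layer’s weights a column at a time and uses approximate second-order information (the inverse Hessian of the layer’s reconstruction error) to adjust the not-yet-quantized weights so that they compensate for the rounding error already introduced. The goal is to preserve the layer’s output after compression.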
While GPTQ has been largely superseded by AWQ for new deployments, it remains relevant because:
- Huge library of pre-quantized models on HuggingFace
- Strong ExLlama support for fast inference
- Mature tooling and community knowledge
Quality is generally comparable to AWQ, though AWQ edges ahead on instruction-following tasks. GPTQ shines when you need a specific model that only has GPTQ weights available.
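The error-compensation idea behind GPTQ can be sketched in a few lines. This is a heavily simplified illustration, not the production algorithm: it rounds to an integer grid, handles a single weight row, uses a plain matrix inverse instead of a Cholesky-based formulation, and takes a made-up 2×2 Hessian standing in for statistics gathered from calibration inputs.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quadratic_error(original, quantized, hessian):
    """Reconstruction error (q - w)^T H (q - w) under the calibration Hessian."""
    d = [q - w for q, w in zip(quantized, original)]
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n))


def gptq_like(weights, hessian):
    """Quantize weights one at a time, pushing each error onto later weights."""
    w = list(weights)
    n = len(w)
    h_inv = invert(hessian)
    quantized = []
    for i in range(n):
        q = float(round(w[i]))                  # integer grid, i.e. scale = 1
        quantized.append(q)
        err = (w[i] - q) / h_inv[i][i]
        for j in range(i + 1, n):               # compensate remaining weights
            w[j] -= err * h_inv[i][j]
        for r in range(i + 1, n):               # drop weight i from the inverse
            for c in range(i + 1, n):
                h_inv[r][c] -= h_inv[r][i] * h_inv[i][c] / h_inv[i][i]
    return quantized


weights = [0.4, 0.4]
hessian = [[1.0, 0.9], [0.9, 1.0]]   # hypothetical: two strongly correlated inputs

rtn = [float(round(w)) for w in weights]
gptq = gptq_like(weights, hessian)
rtn_error = quadratic_error(weights, rtn, hessian)
gptq_error = quadratic_error(weights, gptq, hessian)
print("round-to-nearest:", rtn, rtn_error)
print("with compensation:", gptq, gptq_error)
```

Plain rounding sends both weights to zero. With compensation, the error from rounding the first weight down is pushed onto the second, which then rounds up, and the layer’s output on correlated inputs stays much closer to the original.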
VRAM Requirements by Model Size and Format
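Here’s the practical data you need for hardware planning. Treat these figures as approximate. The FP16 and FP8 columns are close to the raw weight size (parameters × bytes per parameter), while the 4-bit columns include a small allowance for quantization metadata. In every case, budget a further ~10–20% of VRAM for the KV cache and activations, and more for long contexts.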
| Model | FP16 (BF16) | FP8 | AWQ/GPTQ (INT4) | GGUF Q4_K_M |
|---|---|---|---|---|
| Llama 3.1 8B | ~16 GB | ~8 GB | ~5 GB | ~5 GB |
| Mistral 7B | ~14 GB | ~7 GB | ~4.5 GB | ~4.5 GB |
| Llama 3.1 70B | ~140 GB | ~70 GB | ~35–40 GB | ~38–42 GB |
| Mixtral 8x7B | ~90 GB | ~45 GB | ~25 GB | ~26 GB |
| DeepSeek-R1 32B | ~64 GB | ~32 GB | ~18 GB | ~19 GB |
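Key insight: A 70B model that needs multiple 80GB data center GPUs in FP16 fits in roughly 35–40GB at INT4. That is too large for a single 24GB RTX 4090, but it is within reach of a pair of them or a single 48GB workstation card, and GGUF can offload part of the model to system RAM if you have only one 24GB GPU. That’s the power of quantization.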
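A quick way to sanity-check a hardware plan is to add an overhead margin to the weight size and compare it with the available VRAM. The sketch below assumes 35 GB of 4-bit weights for a 70B model and a 15% margin (the midpoint of the 10–20% range mentioned above). The helper name and GPU list are my own, and the two-GPU entry assumes your runtime can split a model across cards.

```python
def fits(weights_gb, vram_gb, overhead=0.15):
    """True if weights plus a fractional margin for KV cache and activations fit."""
    return weights_gb * (1 + overhead) <= vram_gb


llama_70b_int4_gb = 35   # 70e9 parameters * 4 bits / 8, in GB
gpus = {"1x 24 GB": 24, "2x 24 GB": 48, "1x 48 GB": 48, "1x 80 GB": 80}

report = {name: fits(llama_70b_int4_gb, vram) for name, vram in gpus.items()}
for name, ok in report.items():
    print(f"{name}: {'fits' if ok else 'does not fit'}")
```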
Quality Comparison: Real Benchmarks
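Quantization quality is commonly measured by the increase in perplexity relative to the full-precision baseline. Lower is better: perplexity reflects how “surprised” the model is by test data, so any increase means the quantized model predicts the text slightly worse. The deltas below are absolute values, not percentages, so they need the baseline to interpret. If the baseline perplexity is around 6, for example, an increase of 0.05 is a relative change of under 1%. The Quality Retention column is a rough heuristic (it appears to be 100% minus 100 times the delta), not a separately measured score.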
| Format | Perplexity Increase (7B) | Quality Retention | Use When |
|---|---|---|---|
| FP16 (baseline) | 0.0000 | 100% | Training, maximum quality |
| FP8 | ~0.001 | 99.9% | H100/H200, quality-critical |
| Q8_0 (GGUF) | +0.0004 | 99.96% | Near-lossless, larger models |
| Q6_K (GGUF) | +0.0044 | 99.6% | Quality-first, space available |
| Q5_K_M (GGUF) | +0.0142 | 98.6% | Balanced quality/size |
| AWQ (INT4) | ~0.02–0.03 | 97–98% | Production GPU serving |
| GPTQ (INT4) | ~0.03–0.05 | 95–97% | Pre-quantized models |
| Q4_K_M (GGUF) | +0.0535 | 94.7% | Maximum compression |
Source: llama.cpp quantization benchmarks, AWQ paper (Lin et al.), community perplexity evaluations on Mistral 7B and Llama 3.1 8B.
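For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from per-token probabilities; the probabilities are invented to show how a slightly worse prediction on one token raises the score.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Probability each model assigned to the correct next token (made-up values).
baseline_probs = [0.5, 0.25, 0.5, 0.125]
quantized_probs = [0.5, 0.25, 0.4, 0.125]   # slightly less confident on token 3

baseline = perplexity([math.log(p) for p in baseline_probs])
quantized = perplexity([math.log(p) for p in quantized_probs])
delta = quantized - baseline
print(f"baseline {baseline:.4f}, quantized {quantized:.4f}, delta +{delta:.4f}")
```

The delta column in the table is this kind of absolute difference, measured over a full evaluation set rather than four tokens.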
Performance Benchmarks: Tokens Per Second
Quality isn’t the only metric — speed matters too. Here’s how the formats compare on an RTX 4090 running Llama 3.1 8B:
| Format | Runtime | Tokens/Second | Latency (TTFT) |
|---|---|---|---|
| FP16 | vLLM | ~85 tok/s | ~45ms |
| AWQ | vLLM | ~110 tok/s | ~35ms |
| GPTQ | ExLlama | ~95 tok/s | ~40ms |
| Q4_K_M | llama.cpp | ~62 tok/s | ~55ms |
| Q5_K_M | llama.cpp | ~58 tok/s | ~58ms |
Source: SitePoint Ollama vs vLLM benchmarks 2026, community llama.cpp benchmarks.
AWQ achieves the best throughput on NVIDIA GPUs thanks to optimized kernels in vLLM. GGUF through llama.cpp is slower but runs on virtually any hardware — including CPUs and Apple Silicon.
How to Choose: Decision Framework
Use this framework to pick the right quantization format for your specific situation:
Choose GGUF If:
- You’re using Ollama, LM Studio, or llama.cpp
- You need CPU inference or Apple Silicon support
- You want flexible GPU offloading (partial layer loading)
- You’re experimenting with different models locally
Choose AWQ If:
- You’re running production inference on NVIDIA GPUs
- You’re using vLLM, SGLang, or TensorRT-LLM
- Quality is critical (instruction-following, coding tasks)
- You want the best 4-bit quality available
Choose GPTQ If:
- The model you need only has GPTQ weights available
- You’re using ExLlama for fast local inference
- You want mature, well-tested quantization
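The framework above can be written down as a small lookup. This is one possible encoding, not an official rule set: the function name, the hardware labels, and the fallback to GGUF for unknown runtimes are my own assumptions.

```python
GGUF_RUNTIMES = {"ollama", "lm studio", "llama.cpp"}
AWQ_RUNTIMES = {"vllm", "sglang", "tensorrt-llm"}
GPTQ_RUNTIMES = {"exllama"}


def choose_format(runtime, hardware, available=("gguf", "awq", "gptq")):
    """Pick a quantization format from the runtime, hardware, and what exists.

    hardware is one of "nvidia_gpu", "apple_silicon", or "cpu" (labels assumed).
    available lists the formats published for the model you want.
    """
    runtime = runtime.lower()
    if runtime in GGUF_RUNTIMES or hardware in {"cpu", "apple_silicon"}:
        preferred = "gguf"
    elif runtime in GPTQ_RUNTIMES:
        preferred = "gptq"
    elif runtime in AWQ_RUNTIMES and hardware == "nvidia_gpu":
        preferred = "awq"
    else:
        preferred = "gguf"      # assumed default for local experimentation
    if preferred in available:
        return preferred
    # Fall back to whatever weights actually exist for this model.
    return available[0] if available else None


print(choose_format("Ollama", "apple_silicon"))
print(choose_format("vLLM", "nvidia_gpu"))
print(choose_format("vLLM", "nvidia_gpu", available=("gptq",)))
```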
Finding Quantized Models on HuggingFace
The HuggingFace Hub hosts thousands of pre-quantized models. Here’s how to find them:
- GGUF: Search for “{model-name}-GGUF” or check TheBloke’s repositories
- AWQ: Search for “{model-name}-AWQ” — popular quantizers include TheBloke and cognitivecomputations
- GPTQ: Search for “{model-name}-GPTQ” — widely available for most major models
When downloading GGUF models, you’ll see filenames like model-Q4_K_M.gguf. The suffix indicates the quantization level:
- Q4_K_M: Default choice — best balance of size and quality
- Q5_K_M: Step up if you have VRAM to spare and want better quality
- Q6_K: Near-lossless, use for critical applications
- Q8_0: Effectively indistinguishable from FP16
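If you are scripting downloads, the quantization level can be pulled out of the filename. The sketch below only recognizes the tags discussed in this guide, and the pattern is my own assumption. Repositories on the Hub use other tags and naming styles that it will not match.

```python
import re

# Matches the quantization tags listed in this guide, e.g. Q4_K_M, Q6_K, Q8_0.
QUANT_TAG = re.compile(r"(Q\d_K(?:_[SML])?|Q\d_\d)\.gguf$", re.IGNORECASE)


def quant_level(filename):
    """Return the quantization tag from a GGUF filename, or None if absent."""
    match = QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


for name in ["model-Q4_K_M.gguf", "model-Q6_K.gguf", "model-Q8_0.gguf", "model.gguf"]:
    print(name, "->", quant_level(name))
```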
Key Takeaways
- Quantization is essential for running large models on consumer hardware: 4-bit cuts weight memory by roughly 75% compared with FP16
- GGUF is the most flexible format, running on CPUs, GPUs, and Apple Silicon via llama.cpp and Ollama
- AWQ offers the best 4-bit quality for NVIDIA GPU production deployments
- GPTQ remains relevant due to its massive pre-quantized model library
- Quality loss is minimal — 1–5% for most 4-bit methods, often imperceptible in practice
- Q4_K_M (GGUF) and AWQ (INT4) are the safe defaults for most use cases in 2026
FAQ
What’s the difference between GGUF and GGML?
GGML was the original format. GGUF is the successor with better extensibility, metadata support, and feature compatibility. All modern llama.cpp versions use GGUF exclusively.
Can I quantize my own models?
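Yes. For GGUF, convert the model with llama.cpp’s convert_hf_to_gguf.py script (older releases shipped convert.py) and then run llama-quantize. For AWQ, use the AutoAWQ library. For GPTQ, use AutoGPTQ. AWQ and GPTQ need a calibration dataset; GGUF K-quants do not, although an optional importance matrix computed from calibration text can improve quality at low bit widths.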
Is 4-bit quantization always better than 8-bit?
Not always. 4-bit saves more memory but can degrade quality on complex reasoning tasks. For coding, math, or long-context applications, consider Q5_K_M, Q6_K, or Q8_0 GGUF formats, or FP8 if your hardware supports it.
Does quantization affect context length?
Indirectly. Lower-bit quantization reduces the VRAM used by weights, leaving more memory for the KV cache and potentially enabling longer contexts. However, some quantized formats (especially lower-bit GGUF) may show degraded quality on very long contexts (>8K tokens).
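To see how much room the KV cache needs, multiply out keys and values across layers, KV heads, head dimension, and tokens. The sketch below assumes a Llama 3.1 8B-style shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache for a single sequence; check your model’s config for its actual values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Keys and values for every layer, KV head, and token of one sequence."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128.
cache_8k = kv_cache_bytes(32, 8, 128, 8192)
cache_32k = kv_cache_bytes(32, 8, 128, 32768)
print(f"8K context:  {cache_8k / 2**30:.1f} GiB")
print(f"32K context: {cache_32k / 2**30:.1f} GiB")
```

The cache grows linearly with context length, so memory freed by quantizing the weights translates directly into more tokens of context.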
What’s FP8 and should I use it?
FP8 is an 8-bit floating-point format with native hardware support on NVIDIA Hopper (H100/H200), Ada Lovelace, and newer GPUs. It offers nearly FP16 quality with 50% memory savings. Use it if your hardware and runtime support it and you need maximum quality with some compression.
Conclusion
LLM quantization has matured significantly. In 2026, running a 70B parameter model on consumer hardware isn’t just possible; it’s practical, whether across two 24GB GPUs or with part of the model offloaded to system RAM. The key is choosing the right format for your hardware and use case.
For most developers, start with GGUF Q4_K_M for local experimentation via Ollama or LM Studio. When you’re ready for production deployment on NVIDIA GPUs, switch to AWQ through vLLM for the best combination of speed and quality.
The bottom line? Don’t let hardware constraints limit your AI capabilities. With the right quantization approach, you can run state-of-the-art models on hardware you already own.
References
- VRLA Tech — LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026
- Decodes Future — llama.cpp vs Ollama vs vLLM: 2026 Comparison
- SitePoint — Ollama vs vLLM: Performance Benchmark 2026
- llama.cpp — Quantization Methods Discussion
- Local AI Master — GGUF vs GPTQ vs AWQ 2026
- HuggingFace — Selecting a Quantization Method
- Will It Run AI — GGUF Quantization Explained

