## Overview
Quantization stores model weights at lower precision, typically INT8 or INT4 instead of FP32/FP16, dramatically cutting memory usage and often speeding up inference with minimal quality loss.
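As a back-of-the-envelope check, weight memory is roughly parameter count times bytes per weight. A minimal sketch for a 7B-parameter model (weights only; activations, KV cache, and quantization overhead are ignored):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gb:.1f} GB")
```

This reproduces the memory comparison table further down (28, 14, 7, and 3.5 GB).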
## Quantization Types

| Format | Bits | Memory Reduction (vs FP32) | Quality Loss |
|---|---|---|---|
| FP16 | 16 | 2x | None |
| INT8 | 8 | 4x | Minimal |
| INT4 | 4 | 8x | Small |
| GPTQ | 4 | 8x | Small |
| AWQ | 4 | 8x | Very small |
| GGUF | 2-8 | Variable | Depends on level |
## BitsAndBytes (Easy)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# 4-bit quantization (NF4 weights, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
```
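A quick sanity check that the 4-bit model still generates text; this is a minimal sketch assuming the tokenizer from the same base repo:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model_4bit.device)

# Greedy decoding, just to confirm the quantized weights load and run
outputs = model_4bit.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```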
## GPTQ (Pre-quantized)

```python
from transformers import AutoModelForCausalLM

# Load a pre-quantized GPTQ checkpoint
# (requires the optimum and auto-gptq packages to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    device_map="auto"
)
```
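To confirm the savings on any model loaded through Transformers, `get_memory_footprint()` reports the in-memory size of the weights; a quick check:

```python
# Reports parameter + buffer memory in bytes; expect roughly 4 GB for a
# 4-bit 7B checkpoint, versus ~14 GB for the same model in FP16.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```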
## AWQ (Activation-aware)

```python
from awq import AutoAWQForCausalLM

# Load a pre-quantized AWQ model (requires the autoawq package)
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-v0.1-AWQ",
    fuse_layers=True
)
```
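A minimal generation sketch for the AWQ model, assuming the AutoAWQ wrapper forwards `generate()` to the underlying Transformers model (as in the AutoAWQ examples) and that a CUDA GPU is available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ")
tokens = tokenizer("Hello, world", return_tensors="pt").input_ids.cuda()

# The quantized weights live on the GPU, so inputs must be moved there too
output = model.generate(tokens, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```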
## GGUF (llama.cpp)

For CPU inference with Ollama or llama.cpp:

```bash
# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Run with llama.cpp (the binary is named llama-cli in newer builds)
./main -m mistral-7b-v0.1.Q4_K_M.gguf -p "Hello, world"
```
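If you would rather call the same GGUF file from Python, the llama-cpp-python bindings wrap llama.cpp; a minimal sketch, assuming `pip install llama-cpp-python` and the file downloaded above:

```python
from llama_cpp import Llama

# n_ctx sets the context window for the loaded model
llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

# Completion-style call; the result follows an OpenAI-like response shape
result = llm("Hello, world", max_tokens=64)
print(result["choices"][0]["text"])
```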
## GGUF Quantization Levels

| Quant | Bits | Size (7B) | Quality |
|---|---|---|---|
| Q2_K | 2.5 | 2.8 GB | Poor |
| Q4_K_M | 4.5 | 4.1 GB | Good |
| Q5_K_M | 5.5 | 4.8 GB | Very good |
| Q6_K | 6.5 | 5.5 GB | Excellent |
| Q8_0 | 8 | 7.2 GB | Near-FP16 |
## Memory Comparison (7B Model)

| Precision | VRAM Required (weights only) |
|---|---|
| FP32 | 28 GB |
| FP16 | 14 GB |
| INT8 | 7 GB |
| INT4 | 3.5 GB |
## When to Use What

- Training: FP16 or BF16
- Fine-tuning: QLoRA (4-bit base + LoRA); see the sketch after this list
- GPU inference: AWQ or GPTQ
- CPU inference: GGUF Q4_K_M or Q5_K_M
- Edge devices: GGUF Q2_K or Q3_K
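For the QLoRA route above, a minimal sketch that pairs the 4-bit `BitsAndBytesConfig` shown earlier with a LoRA adapter via `peft`; the rank, alpha, and target modules here are illustrative defaults, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Frozen 4-bit base model
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Small trainable LoRA adapter on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```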