Overview

Quantization lowers model weight precision from FP32/FP16 to INT8/INT4, dramatically reducing memory usage and often improving inference speed, with minimal quality loss at moderate bit widths.

Quantization Types

Format   Bits   Memory Reduction (vs FP32)   Quality Loss
FP16     16     2x                           None
INT8     8      4x                           Minimal
INT4     4      8x                           Small
GPTQ     4      8x                           Small
AWQ      4      8x                           Very small
GGUF     2-8    Variable                     Depends on quant level

BitsAndBytes (Easy)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization (recent transformers versions prefer passing
# BitsAndBytesConfig(load_in_8bit=True) via quantization_config instead)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
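
As a quick sanity check of the 4-bit model, the snippet below runs a short generation (an illustrative sketch; the tokenizer is assumed to load from the same Mistral repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Tokenize a prompt and move it to the model's device
inputs = tokenizer("Quantization is", return_tensors="pt").to(model_4bit.device)

# Generate a short continuation with the 4-bit model
outputs = model_4bit.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))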

GPTQ (Pre-quantized)

from transformers import AutoModelForCausalLM

# Load a pre-quantized GPTQ model
# (requires the optimum and auto-gptq packages to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    device_map="auto"
)
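
Loading a pre-quantized checkpoint is the common path, but you can also quantize a model yourself through transformers' GPTQConfig. A sketch, where the calibration dataset name "c4" and the bit width are illustrative choices:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# 4-bit GPTQ quantization using a calibration dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=gptq_config,
    device_map="auto"
)

# model.save_pretrained("mistral-7b-gptq")  # persist the quantized weights (path is illustrative)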

AWQ (Activation-aware)

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-v0.1-AWQ",
    fuse_layers=True
)
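
A minimal generation sketch with the AWQ model, assuming the tokenizer ships in the same repo and a CUDA device is available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ")

tokens = tokenizer("Quantization is", return_tensors="pt").input_ids.cuda()

# AutoAWQForCausalLM exposes generate() on the quantized model
output = model.generate(tokens, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))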

GGUF (llama.cpp)

For CPU inference with Ollama or llama.cpp:

# Download GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Run with llama.cpp (newer builds name this binary llama-cli instead of main)
./main -m mistral-7b-v0.1.Q4_K_M.gguf -p "Hello, world"

GGUF Quantization Levels

Quant    Bits   Size (7B)   Quality
Q2_K     2.5    2.8 GB      Poor
Q4_K_M   4.5    4.1 GB      Good
Q5_K_M   5.5    4.8 GB      Very good
Q6_K     6.5    5.5 GB      Excellent
Q8_0     8      7.2 GB      Near-FP16
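
If a pre-quantized GGUF isn't available, you can produce one yourself with llama.cpp's conversion and quantization tools. A sketch; the local paths are placeholders and the exact script/binary names vary between llama.cpp versions:

# Convert a local HF checkpoint to an FP16 GGUF
python convert_hf_to_gguf.py ./Mistral-7B-v0.1 --outfile mistral-7b-f16.gguf

# Quantize the FP16 GGUF down to Q4_K_M
./llama-quantize mistral-7b-f16.gguf mistral-7b-Q4_K_M.gguf Q4_K_M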

Memory Comparison (7B Model)

Precision   VRAM for weights (approx.)
FP32        28 GB
FP16        14 GB
INT8        7 GB
INT4        3.5 GB
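
These figures follow directly from parameter count times bytes per parameter; a back-of-the-envelope sketch (weights only, ignoring activations and KV cache):

# Approximate weight memory for a 7B-parameter model
params = 7e9
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{precision}: ~{gb:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB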

When to Use What

  • Training: FP16 or BF16
  • Fine-tuning: QLoRA (4-bit base + LoRA); see the sketch after this list
  • GPU inference: AWQ or GPTQ
  • CPU inference: GGUF Q4_K_M or Q5_K_M
  • Edge devices: GGUF Q2_K or Q3_K
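
For the fine-tuning case, here is a minimal QLoRA setup sketch using peft on top of the 4-bit bitsandbytes config from above; the LoRA hyperparameters and target modules are illustrative, not tuned:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized base model for k-bit training, then attach LoRA adapters
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()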

Key Resources