Overview

Quantization lowers model weight precision from FP32/FP16 to INT8/INT4, dramatically reducing memory usage and often improving inference speed, with minimal quality loss at moderate bit widths.

Quantization Types

Format   Bits   Memory Reduction (vs FP32)   Quality Loss
FP16     16     2x                           None
INT8     8      4x                           Minimal
INT4     4      8x                           Small
GPTQ     4      8x                           Small
AWQ      4      8x                           Very small
GGUF     2-8    Variable                     Depends on quant level

BitsAndBytes (Easy)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization (recent transformers versions prefer passing
# BitsAndBytesConfig(load_in_8bit=True) via quantization_config instead)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
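
As a quick sanity check of the 4-bit model, the snippet below runs a short generation (an illustrative sketch; the tokenizer is assumed to load from the same Mistral repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Tokenize a prompt and move it to the model's device
inputs = tokenizer("Quantization is", return_tensors="pt").to(model_4bit.device)

# Generate a short continuation with the 4-bit model
outputs = model_4bit.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))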

GPTQ (Pre-quantized)

from transformers import AutoModelForCausalLM

# Load a pre-quantized GPTQ model
# (requires the optimum and auto-gptq packages to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    device_map="auto"
)
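
Loading a pre-quantized checkpoint is the common path, but you can also quantize a model yourself through transformers' GPTQConfig. A sketch, where the calibration dataset name "c4" and the bit width are illustrative choices:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# 4-bit GPTQ quantization using a calibration dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=gptq_config,
    device_map="auto"
)

# model.save_pretrained("mistral-7b-gptq")  # persist the quantized weights (path is illustrative)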

AWQ (Activation-aware)

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-v0.1-AWQ",
    fuse_layers=True
)
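
A minimal generation sketch with the AWQ model, assuming the tokenizer ships in the same repo and a CUDA device is available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ")

tokens = tokenizer("Quantization is", return_tensors="pt").input_ids.cuda()

# AutoAWQForCausalLM exposes generate() on the quantized model
output = model.generate(tokens, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))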

GGUF (llama.cpp)

For CPU inference with Ollama or llama.cpp:

# Download GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Run with llama.cpp (newer builds name this binary llama-cli instead of main)
./main -m mistral-7b-v0.1.Q4_K_M.gguf -p "Hello, world"

GGUF Quantization Levels

Quant    Bits   Size (7B)   Quality
Q2_K     2.5    2.8 GB      Poor
Q4_K_M   4.5    4.1 GB      Good
Q5_K_M   5.5    4.8 GB      Very good
Q6_K     6.5    5.5 GB      Excellent
Q8_0     8      7.2 GB      Near-FP16
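
If a pre-quantized GGUF isn't available, you can produce one yourself with llama.cpp's conversion and quantization tools. A sketch; the local paths are placeholders and the exact script/binary names vary between llama.cpp versions:

# Convert a local HF checkpoint to an FP16 GGUF
python convert_hf_to_gguf.py ./Mistral-7B-v0.1 --outfile mistral-7b-f16.gguf

# Quantize the FP16 GGUF down to Q4_K_M
./llama-quantize mistral-7b-f16.gguf mistral-7b-Q4_K_M.gguf Q4_K_M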

Memory Comparison (7B Model)

Precision   VRAM for weights (approx.)
FP32        28 GB
FP16        14 GB
INT8        7 GB
INT4        3.5 GB
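
These figures follow directly from parameter count times bytes per parameter; a back-of-the-envelope sketch (weights only, ignoring activations and KV cache):

# Approximate weight memory for a 7B-parameter model
params = 7e9
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{precision}: ~{gb:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB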

When to Use What

  • Training: FP16 or BF16
  • Fine-tuning: QLoRA (4-bit base + LoRA); see the sketch after this list
  • GPU inference: AWQ or GPTQ
  • CPU inference: GGUF Q4_K_M or Q5_K_M
  • Edge devices: GGUF Q2_K or Q3_K
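
For the fine-tuning case, here is a minimal QLoRA setup sketch using peft on top of the 4-bit bitsandbytes config from above; the LoRA hyperparameters and target modules are illustrative, not tuned:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized base model for k-bit training, then attach LoRA adapters
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()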

Key Resources