
# Fine-tuning LLMs with LoRA and QLoRA

## Overview

LoRA (Low-Rank Adaptation) enables fine-tuning of large models by training only a small number of additional parameters. QLoRA adds 4-bit quantization to reduce memory further.

## Requirements

```bash
pip install transformers peft bitsandbytes accelerate datasets
```

## Loading a Model with QLoRA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```

## Configuring LoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepares a quantized model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,            # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor (effective scale is alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.11%
```

## Training

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./lora-mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # preparation sketch at the end of this post
    tokenizer=tokenizer,
)
trainer.train()
```

## Merging and Saving

```python
# Save the LoRA adapter only (a few MB)
model.save_pretrained("./lora-adapter")

# Merge the adapter into the base weights for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```

An adapter-loading sketch for inference follows the resource list below.

## Memory Requirements

| Model Size | Full Fine-tune | LoRA   | QLoRA |
|------------|----------------|--------|-------|
| 7B         | 56 GB          | 16 GB  | 6 GB  |
| 13B        | 104 GB         | 32 GB  | 10 GB |
| 70B        | 560 GB         | 160 GB | 48 GB |

## Key Resources

- PEFT Library
- QLoRA Paper
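The `train_dataset` passed to the `Trainer` above is left undefined in the post. As a minimal sketch of how it might be built (the IMDB corpus, the 512-token cutoff, and the pad-token workaround are illustrative assumptions, not from the post):

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Illustrative corpus: any dataset with a "text" column works the same way.
raw = load_dataset("imdb", split="train")

# Mistral's tokenizer has no pad token by default; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# For causal LM, the collator pads batches and copies input_ids into labels.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```

With this, pass `data_collator=data_collator` to the `Trainer` alongside `train_dataset` so each batch carries the labels the causal LM loss needs.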
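To reuse the saved adapter without merging, one option is PEFT's `AutoPeftModelForCausalLM`, which reads the base-model name from the adapter config and attaches the adapter on load. A sketch, assuming the paths from this post (the prompt and generation settings are arbitrary):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Loads the base model recorded in the adapter config, then applies the adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-adapter",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

prompt = "Explain LoRA in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```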

December 1, 2025 · 1 min · 191 words · BlogIA Team

# RAG vs Fine-Tuning: Which Strategy for Custom LLMs?

## TL;DR

Don’t choose. Use RAG for knowledge (injecting facts) and Fine-Tuning for behavior (style, format, tone). Most production systems need RAG first.

## Specifications Comparison

| Feature        | RAG (Retrieval-Augmented Generation) | Fine-Tuning         |
|----------------|---------------------------------------|---------------------|
| Primary use    | Adding knowledge                      | Changing behavior   |
| Cost           | Low (vector DB)                       | High (GPU training) |
| Updates        | Real-time                             | Requires retraining |
| Hallucinations | Reduced (grounded)                    | Possible            |

## RAG (Retrieval-Augmented Generation)

**Pros**

- ✅ Up-to-date information
- ✅ Traceable sources
- ✅ Cheaper to implement

**Cons**

- ❌ Context window limits
- ❌ Retrieval latency
- ❌ Complex architecture

## Fine-Tuning

**Pros**

- ✅ Perfect style matching
- ✅ Lower latency (no retrieval)
- ✅ Learn new tasks

**Cons**

- ❌ Static knowledge
- ❌ Catastrophic forgetting
- ❌ Expensive compute

## Verdict

Don’t choose. Use RAG for knowledge (injecting facts) and Fine-Tuning for behavior (style, format, tone). Most production systems need RAG first. …
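To make the "RAG for knowledge" half concrete, here is a minimal retrieve-then-generate sketch. The embedding model, the in-memory store, and the documents are all illustrative assumptions; production systems would use a vector database, as the comparison table notes.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative knowledge base; in production this lives in a vector DB.
docs = [
    "Our refund window is 30 days from delivery.",
    "Support is available Monday to Friday, 9am-6pm CET.",
    "Enterprise plans include a dedicated account manager.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # With normalized vectors, cosine similarity is just a dot product.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

question = "How long do customers have to request a refund?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to any LLM: the model stays frozen, and the knowledge
# lives in `docs`, so answers stay grounded and updates are real-time.
```

The design point: swapping a document in `docs` updates the system's knowledge instantly, whereas the fine-tuning column of the table requires a full retraining run for the same change.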

February 3, 2026 · 1 min · 134 words · BlogIA Team