
Fine-tuning LLMs with LoRA and QLoRA

## Overview

LoRA (Low-Rank Adaptation) enables fine-tuning of large models by training only a small number of additional parameters. QLoRA adds 4-bit quantization to reduce memory further.

## Requirements

```bash
pip install transformers peft bitsandbytes accelerate datasets
```

## Loading a Model with QLoRA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```

## Configuring LoRA

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,            # Rank
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.11%
```

## Training

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./lora-mistral",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
trainer.train()
```

## Merging and Saving

```python
# Save LoRA adapter
model.save_pretrained("./lora-adapter")

# Merge with base model for inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```

## Memory Requirements

| Model Size | Full Fine-tune | LoRA   | QLoRA |
|------------|----------------|--------|-------|
| 7B         | 56 GB          | 16 GB  | 6 GB  |
| 13B        | 104 GB         | 32 GB  | 10 GB |
| 70B        | 560 GB         | 160 GB | 48 GB |

## Key Resources

- PEFT Library
- QLoRA Paper
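Beyond merging, the adapter can also be reloaded on top of the quantized base model at inference time. A minimal sketch, assuming the `bnb_config` and paths from the examples above (the prompt text is illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the 4-bit base model (reusing bnb_config from above)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# Attach the LoRA adapter saved earlier
model = PeftModel.from_pretrained(base, "./lora-adapter")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```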

December 1, 2025 · 1 min · 191 words · BlogIA Team

Running LLMs Locally with Ollama

## Overview

Ollama makes it easy to run large language models locally: no cloud API needed, full privacy, and it works on Mac, Linux, and Windows.

## Installation

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download
```

## Running Your First Model

```bash
# Pull and run Llama 3.2
ollama run llama3.2

# Pull and run Mistral
ollama run mistral

# Pull and run a coding model
ollama run codellama
```

## Available Models

| Model     | Size   | Use Case             |
|-----------|--------|----------------------|
| llama3.2  | 1B/3B  | General purpose      |
| mistral   | 7B     | Fast, high quality   |
| codellama | 7B/13B | Code generation      |
| phi3      | 3.8B   | Efficient, Microsoft |
| gemma2    | 9B     | Google's open model  |
| qwen2.5   | 7B     | Multilingual         |

## API Usage

```python
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'mistral',
    'prompt': 'Explain machine learning in one paragraph',
    'stream': False
})
print(response.json()['response'])
```

## Using with LangChain

```python
from langchain_community.llms import Ollama

llm = Ollama(model="mistral")
response = llm.invoke("What is the capital of France?")
print(response)
```

## Custom Models (Modelfile)

```
# Modelfile
FROM mistral
SYSTEM You are a helpful coding assistant specialized in Python.
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
```

```bash
ollama create my-coder -f Modelfile
ollama run my-coder
```

## Hardware Requirements

| Model Size | RAM Required | GPU VRAM |
|------------|--------------|----------|
| 3B         | 4 GB         | 4 GB     |
| 7B         | 8 GB         | 8 GB     |
| 13B        | 16 GB        | 16 GB    |
| 70B        | 64 GB        | 48 GB    |

## Key Resources

- Ollama Website
- Model Library
- GitHub
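One detail the API example skips: the same `/api/generate` endpoint streams tokens as they are generated when `stream` is true, returning one JSON object per line. A minimal sketch, assuming an Ollama server on the default port:

```python
import json
import requests

# Stream the response token-by-token (newline-delimited JSON)
with requests.post('http://localhost:11434/api/generate', json={
    'model': 'mistral',
    'prompt': 'Explain machine learning in one paragraph',
    'stream': True
}, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get('response', ''), end='', flush=True)
        if chunk.get('done'):
            break
```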

December 1, 2025 · 1 min · 207 words · BlogIA Team

Building RAG Applications with LangChain

## Overview

RAG (Retrieval-Augmented Generation) combines document retrieval with LLM generation. Instead of relying solely on the model's training data, RAG fetches relevant context from your documents.

## Architecture

Query → Embed → Vector Search → Retrieve Docs → LLM + Context → Response

## Installation

```bash
pip install langchain langchain-community chromadb sentence-transformers
```

## Loading Documents

```python
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Single PDF
loader = PyPDFLoader("document.pdf")
docs = loader.load()

# Directory of files (recursive glob)
loader = DirectoryLoader("./docs", glob="**/*.pdf")
docs = loader.load()
```

## Splitting Documents

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)
```

## Creating Embeddings

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```

## Vector Store

```python
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search
results = vectorstore.similarity_search("What is machine learning?", k=3)
```

## RAG Chain

```python
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

llm = Ollama(model="mistral")
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

response = qa_chain.invoke({"query": "Summarize the main findings"})
print(response["result"])
```

## Production Tips

- Chunk size: 500-1000 tokens works well for most use cases
- Overlap: 10-20% overlap prevents context loss at boundaries
- Reranking: use a cross-encoder to rerank retrieved documents
- Hybrid search: combine vector search with keyword search (BM25); see the sketch below

## Key Resources

- LangChain Documentation
- ChromaDB
- RAG Paper
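To make the hybrid-search tip concrete, here is a minimal sketch using LangChain's `BM25Retriever` and `EnsembleRetriever`, reusing `chunks` and `vectorstore` from above (assumes `pip install rank_bm25`; the 0.4/0.6 weights are illustrative, not tuned):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword (BM25) retriever over the same chunks indexed in Chroma
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 3

# Blend keyword and vector rankings; weights are illustrative
hybrid = EnsembleRetriever(
    retrievers=[bm25, vectorstore.as_retriever(search_kwargs={"k": 3})],
    weights=[0.4, 0.6]
)

docs = hybrid.invoke("What is machine learning?")
```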

December 1, 2025 · 1 min · 206 words · BlogIA Team

Prompt Engineering Techniques

## Overview

Prompt engineering is the art of crafting inputs that get the best outputs from LLMs. Small changes in prompts can dramatically improve results.

## Basic Techniques

### Zero-Shot Prompting

```
Classify the sentiment of this review as positive, negative, or neutral:

"The product arrived late but works great."

Sentiment:
```

### Few-Shot Prompting

```
Classify the sentiment:

Review: "Amazing quality, fast shipping!" → Positive
Review: "Broken on arrival, terrible." → Negative
Review: "It's okay, nothing special." → Neutral
Review: "The product arrived late but works great." →
```

### Chain-of-Thought (CoT)

```
Q: A store has 23 apples. They sell 8 and receive 15 more. How many apples?

Let me think step by step:
1. Start with 23 apples
2. Sell 8: 23 - 8 = 15 apples
3. Receive 15: 15 + 15 = 30 apples
Answer: 30 apples
```

### Role Prompting

```
You are an expert Python developer with 15 years of experience.
Review this code for bugs, performance issues, and best practices:
```

```python
def process(data):
    result = []
    for i in range(len(data)):
        result.append(data[i] * 2)
    return result
```

## Structured Output

```
Extract information from this text and return as JSON:
```

...
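As a small end-to-end check, the few-shot prompt above can be sent to a local model. A minimal sketch, assuming an Ollama server as in this blog's other posts:

```python
import requests

# The few-shot prompt assembled exactly as in the example above
few_shot = '''Classify the sentiment:

Review: "Amazing quality, fast shipping!" → Positive
Review: "Broken on arrival, terrible." → Negative
Review: "It's okay, nothing special." → Neutral
Review: "The product arrived late but works great." →'''

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'mistral',
    'prompt': few_shot,
    'stream': False
})
print(response.json()['response'].strip())
```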

December 1, 2025 · 2 min · 360 words · BlogIA Team

LLM Evaluation Metrics

## Overview

Evaluating LLMs is challenging because quality is subjective. This guide covers automated metrics, benchmarks, and human evaluation approaches.

## Automated Metrics

### Perplexity

Measures how well a model predicts text. Lower is better.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

perplexity = calculate_perplexity("The quick brown fox jumps over the lazy dog.")
```

### BLEU Score

Measures n-gram overlap with reference text. Used for translation.

...
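The excerpt cuts off before the BLEU code; as a stand-in, a minimal sketch with NLTK's `sentence_bleu` (assumes `pip install nltk`; the sentences are illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when higher-order n-grams have no overlap
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```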

December 1, 2025 · 2 min · 300 words · BlogIA Team

Leveraging GPTZero to Detect Subtle Hallucinations in AI Research 🧠

Practical tutorial: using GPTZero to detect subtle hallucinations in cutting-edge AI research.

January 23, 2026 · 3 min · 614 words · BlogIA Academy

Dockerize Large Language Models for Any Language without Prebuilding Containers 🚀

Practical tutorial: Show HN: Run LLMs in Docker for any language without prebuilding containers

January 19, 2026 · 4 min · 777 words · BlogIA Academy

Upgrade Your Claude Code Workflow: Ultrathink is Deprecated & How to Enable 2x Thinking Tokens 🚀

Practical tutorial: Ultrathink is deprecated & how to enable 2x thinking tokens in Claude Code

January 19, 2026 · 4 min · 644 words · BlogIA Academy

🎡 Analyzing Breakthroughs in AI by Integrating Claude Code into RollerCoaster Tycoon

Practical tutorial: Breaking news analysis with implications: We put Claude Code in RollerCoaster Tycoon

January 18, 2026 · 4 min · 742 words · BlogIA Academy

Analyzing ChatGPT Go's Impact with Real Data and Advanced AI Techniques 🚀

Practical tutorial: Breaking news analysis with implications: Introducing ChatGPT Go, now available worldwide

January 18, 2026 · 4 min · 705 words · BlogIA Academy

Evaluating Large Language Models for Truthfulness Using Neighborhood Consistency 📊

Practical tutorial: step-by-step guide to the paper "Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency"

January 12, 2026 · 4 min · 750 words · BlogIA Academy

🌐 Crafting Engaging Multi-Party Conversations with LLMberjack: A Practical Guide 📝

Practical tutorial: step-by-step guide to the paper "LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversations"

January 8, 2026 · 5 min · 883 words · BlogIA Academy

Automate CVE Analysis with LLMs and RAG 🚀

Practical tutorial: Automate CVE analysis with LLMs and RAG

January 8, 2026 · 4 min · 672 words · BlogIA Academy

Building Claude Code-Level Performance on a Budget 🚀

Practical tutorial: Honest hands-on review with pros/cons: What hardware would it take to get Claude Code-level performance?

January 8, 2026 · 4 min · 656 words · BlogIA Academy

Fine-Tuning Mistral Large 2 on Your Data with Unsloth 🚀

Practical tutorial: Fine-tune Mistral Large 2 on your data with Unsloth

January 8, 2026 · 4 min · 701 words · BlogIA Academy

Honest Hands-on Review of Claude Code CLI Issues as of January 8, 2026 🚧

Practical tutorial: Honest hands-on review with pros/cons: Claude Code CLI was broken

January 8, 2026 · 4 min · 664 words · BlogIA Academy

Unlocking Code Generation Magic with GPT-5.2 Codex-Max 🚀

Practical tutorial: Using GPT-5.2 Codex-Max for code generation

January 8, 2026 · 4 min · 823 words · BlogIA Academy

🚀 Code Generation with Latest Coding LLMs: Streamline Your Workflow

Practical tutorial: Generate code with the latest coding LLMs

January 7, 2026 · 5 min · 990 words · BlogIA Academy

Building a Knowledge Graph from Documents with Large Language Models (LLMs) 🤖📚

Practical tutorial: Build a knowledge graph from documents with LLMs

January 7, 2026 · 4 min · 699 words · BlogIA Academy

Deploy Ollama and Run Llama 4 or Qwen 3 Locally 🚀

Practical tutorial: Deploy Ollama and run Llama 4 or Qwen 3 locally in 5 minutes

January 7, 2026 · 4 min · 732 words · BlogIA Academy