Overview

FastAPI is one of the most popular frameworks for serving ML models. It is fast, generates interactive OpenAPI documentation automatically, and handles concurrent requests efficiently thanks to its async support.

Basic Setup

pip install fastapi uvicorn pydantic
from fastapi import FastAPI
from pydantic import BaseModel
import pickle

app = FastAPI(title="ML Model API")

# Load the model once at import time, so every request reuses it
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictionInput(BaseModel):
    features: list[float]

class PredictionOutput(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionOutput)
async def predict(input: PredictionInput):
    pred = model.predict([input.features])[0]
    proba = model.predict_proba([input.features])[0].max()
    return PredictionOutput(prediction=pred, confidence=proba)
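
Loading the model at import time works, but FastAPI's lifespan hook makes startup and shutdown explicit and is the documented place for this kind of setup. A minimal sketch of the same loading logic, assuming the same model.pkl:

from contextlib import asynccontextmanager
import pickle

from fastapi import FastAPI

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once before the app starts accepting requests
    with open("model.pkl", "rb") as f:
        models["default"] = pickle.load(f)
    yield
    # Runs once on shutdown
    models.clear()

app = FastAPI(title="ML Model API", lifespan=lifespan)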

Running the Server

uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The --reload flag restarts the server on code changes; use it only during development.
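
With the server running, you can exercise the endpoint from any HTTP client. For example, with the requests library (the feature values here are made up):

import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
print(resp.json())  # e.g. {"prediction": 0.0, "confidence": 0.97}

FastAPI also serves interactive docs at http://localhost:8000/docs, where you can try the endpoint from the browser.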

Health Check Endpoint

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

Batch Predictions

class BatchInput(BaseModel):
    instances: list[list[float]]

@app.post("/predict/batch")
async def predict_batch(input: BatchInput):
    predictions = model.predict(input.instances)
    return {"predictions": predictions.tolist()}

Docker Deployment

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Build and run:

docker build -t ml-api .
docker run -p 8000:8000 ml-api
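
The Dockerfile copies a requirements.txt that is not shown above. One matching these snippets might contain the following (unpinned here for brevity; pin versions for reproducible builds):

fastapi
uvicorn[standard]
pydantic
scikit-learn  # assuming the pickled model is a scikit-learn estimator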

Production Tips

  1. Use Gunicorn: gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
  2. Add rate limiting: Use slowapi or nginx
  3. Monitor latency: Add Prometheus metrics
  4. Cache predictions: Use Redis for repeated inputs (a sketch follows this list)
  5. Validate inputs: Pydantic checks types automatically; add Field constraints (as in BatchInput above) for sizes and ranges
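
For tip 4, a minimal caching sketch using redis-py, assuming a Redis instance on localhost and reusing the model global from Basic Setup; the key is a hash of the raw feature list, and the one-hour TTL is arbitrary:

import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)  # assumed local Redis

def cached_predict(features: list[float]) -> float:
    # Key on a stable hash of the input features
    key = "pred:" + hashlib.sha256(json.dumps(features).encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return float(hit)
    pred = float(model.predict([features])[0])
    cache.setex(key, 3600, pred)  # cache for one hour
    return pred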

Key Resources