Overview

Data quality determines model quality. This guide covers labeling strategies, tools, and quality control for building reliable training datasets.

Labeling Tools

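| Tool                | Type        | Best For             |
|---------------------|-------------|----------------------|
| Label Studio        | Open source | General purpose      |
| CVAT                | Open source | Computer vision      |
| Prodigy             | Commercial  | NLP, active learning |
| Scale AI            | Managed     | Large scale          |
| Amazon SageMaker GT | Managed     | AWS integration      |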

Label Studio Setup

pip install label-studio
label-studio start

Then open http://localhost:8080 in a browser (8080 is the default port).

Labeling Guidelines

1. Create Clear Instructions

## Task: Sentiment Classification

Label each review as:
- **Positive**: Expresses satisfaction, recommendation, or praise
- **Negative**: Expresses dissatisfaction, complaints, or criticism
- **Neutral**: Factual statements without emotional content

### Examples:
- "Great product, highly recommend!" → Positive
- "Arrived broken, waste of money" → Negative
- "The package weighs 2kg" → Neutral

2. Handle Edge Cases

Document ambiguous cases upfront:

  • Mixed sentiment: Label as the dominant sentiment
  • Sarcasm: Label the intended meaning, not the literal wording
  • Questions: Label as Neutral unless clearly rhetorical

Quality Control

Inter-Annotator Agreement

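from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same items, in the same order
# (illustrative values; replace with your own annotations)
annotator1_labels = ["pos", "neg", "neu", "pos", "neg"]
annotator2_labels = ["pos", "neg", "pos", "pos", "neg"]

# Cohen's kappa: agreement corrected for chance (1.0 = perfect, 0 = chance level)
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
# Rule of thumb: > 0.8 = excellent, 0.6-0.8 = good, < 0.6 = needs review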
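
To make the score less of a black box, here is a minimal sketch of what Cohen's kappa measures: observed agreement, minus the agreement expected by chance, rescaled so that 1.0 is perfect. The function name `cohen_kappa`, the sample labels, and the choice to return 1.0 when both annotators use one identical label throughout are assumptions made for illustration; in practice `cohen_kappa_score` from scikit-learn is the better choice.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both chose the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labeled at random following their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        # Degenerate case: one label used everywhere by both annotators.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical labels for ten reviews.
reviews_a = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "neu", "pos", "neg"]
reviews_b = ["pos", "pos", "neg", "neu", "neg", "neg", "neg", "pos", "pos", "neg"]

example_kappa = cohen_kappa(reviews_a, reviews_b)
print(f"kappa = {example_kappa:.3f}")
```

In this example the annotators agree on 8 of 10 items, but because 38% agreement would be expected by chance, the kappa is noticeably lower than the raw 80% agreement rate.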

Gold Standard Sets

  • Create 50-100 “gold” examples with known correct labels
  • Randomly insert into annotation tasks
  • Flag annotators with < 90% accuracy on gold set
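
The gold-set check above can be sketched in a few lines. The data layout (one dict of item ID to label per annotator), the function names, and the annotator names are hypothetical; only the 90% threshold comes from the list above.

```python
def gold_accuracy(annotations, gold):
    """Return {annotator: accuracy} over the gold items each annotator labeled.

    annotations: {annotator: {item_id: label}}
    gold: {item_id: correct_label}
    """
    scores = {}
    for annotator, labels in annotations.items():
        graded = [item for item in labels if item in gold]
        if not graded:
            continue  # this annotator has not seen any gold items yet
        correct = sum(labels[item] == gold[item] for item in graded)
        scores[annotator] = correct / len(graded)
    return scores


def flag_annotators(scores, threshold=0.90):
    """Names of annotators whose gold-set accuracy is below the threshold."""
    return sorted(name for name, acc in scores.items() if acc < threshold)


# Hypothetical data: 10 gold items and two annotators.
gold = {f"g{i}": "Positive" for i in range(10)}
annotations = {
    "alice": {f"g{i}": "Positive" for i in range(10)},
    "bob": {
        **{f"g{i}": "Positive" for i in range(8)},
        "g8": "Negative",
        "g9": "Neutral",
        "item_42": "Neutral",  # a regular (non-gold) item, ignored when grading
    },
}

scores = gold_accuracy(annotations, gold)
flagged = flag_annotators(scores)
print(scores, flagged)
```

With only a handful of gold items per annotator the accuracy estimate is noisy, so it may be worth waiting until each annotator has seen a reasonable number of gold examples before acting on a flag.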

Review Workflow

Annotator → Review (10% sample) → Adjudication → Final Dataset
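
One way to wire up the first steps of this workflow is sketched below: draw a random 10% sample for review, then send any item where the reviewer disagrees with the annotator to adjudication. The function names, the fixed seed, and the rule "disagreement means adjudication" are assumptions for illustration, not a prescribed process.

```python
import random


def sample_for_review(item_ids, fraction=0.10, seed=0):
    """Pick a random subset of labeled items for a second-pass review."""
    item_ids = list(item_ids)
    if not item_ids:
        return []
    k = max(1, round(len(item_ids) * fraction))
    return random.Random(seed).sample(item_ids, k)


def route(annotator_labels, reviewer_labels):
    """Split reviewed items into accepted ones and ones needing adjudication."""
    accepted, needs_adjudication = {}, []
    for item_id, review_label in reviewer_labels.items():
        if annotator_labels[item_id] == review_label:
            accepted[item_id] = review_label
        else:
            needs_adjudication.append(item_id)
    return accepted, needs_adjudication


# Hypothetical batch of 200 labeled items.
annotator_labels = {i: "Positive" for i in range(200)}
review_ids = sample_for_review(annotator_labels)

# Simulated reviewer: agrees on everything except the first sampled item.
reviewer_labels = {i: annotator_labels[i] for i in review_ids}
reviewer_labels[review_ids[0]] = "Negative"

accepted, needs_adjudication = route(annotator_labels, reviewer_labels)
print(len(review_ids), len(accepted), needs_adjudication)
```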

Active Learning

Prioritize labeling uncertain examples:

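from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

# X_initial, y_initial: a small labeled seed set; X_unlabeled: the pool still to label
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,  # modAL's default, stated explicitly
    X_training=X_initial,
    y_training=y_initial
)

# Query the 10 samples the model is least certain about
query_idx, query_sample = learner.query(X_unlabeled, n_instances=10)

# Once annotators have labeled them (y_new is a placeholder), update the model:
# learner.teach(X_unlabeled[query_idx], y_new)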
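
To show what "most uncertain" means, here is a dependency-free sketch of least-confidence sampling: score each unlabeled example by one minus its highest predicted class probability, and send the top scorers to annotators. The function name `least_confident` and the probability values are made up for illustration. This is one common uncertainty measure and is offered as an approximation of the idea, not as a description of modAL's internals.

```python
def least_confident(probabilities, n_instances=10):
    """Indices of the n rows whose top class probability is lowest.

    probabilities: one list of class probabilities per unlabeled example,
    for instance a classifier's predict_proba output converted to lists.
    """
    # Uncertainty = 1 - max class probability (higher means less certain).
    uncertainty = [1.0 - max(row) for row in probabilities]
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: uncertainty[i],
        reverse=True,
    )
    return ranked[:n_instances]


# Hypothetical predicted probabilities for five unlabeled reviews
# over the classes (Positive, Negative, Neutral).
probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.10, 0.85, 0.05],  # fairly confident
    [0.34, 0.33, 0.33],  # close to uniform, the most uncertain
    [0.60, 0.30, 0.10],
]

to_label = least_confident(probs, n_instances=2)
print(to_label)
```

The near-uniform rows are selected first, which is the point of the technique: labeling effort goes to examples the current model cannot yet decide on.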

Cost Estimation

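| Task Type               | Time per Item | Cost per 1,000 Items (USD) |
|-------------------------|---------------|----------------------------|
| Binary classification   | 5 sec         | $15-25                     |
| Multi-class (5 classes) | 10 sec        | $30-50                     |
| NER tagging             | 30 sec        | $100-150                   |
| Bounding boxes          | 20 sec        | $60-100                    |
| Segmentation            | 2 min         | $300-500                   |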
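
The table figures can be turned into a rough project budget with simple arithmetic. The sketch below scales the per-1,000 price and the per-item time linearly; the function names and the `annotators_per_item` parameter (for double annotation) are assumptions, and the sketch ignores review, adjudication, and tooling overhead.

```python
def labeling_hours(n_items, seconds_per_item, annotators_per_item=1):
    """Total annotator hours needed to label n_items."""
    return n_items * seconds_per_item * annotators_per_item / 3600


def labeling_cost(n_items, cost_per_1000_low, cost_per_1000_high,
                  annotators_per_item=1):
    """(low, high) cost in USD, scaling the per-1,000 price linearly."""
    scale = n_items * annotators_per_item / 1000
    return cost_per_1000_low * scale, cost_per_1000_high * scale


# 20,000 NER items at 30 sec/item and $100-150 per 1,000 (figures from the table).
hours = labeling_hours(20_000, 30)
low, high = labeling_cost(20_000, 100, 150)
print(f"{hours:.0f} annotator-hours, ${low:,.0f}-${high:,.0f}")

# The same batch with every item labeled by two annotators,
# as needed for inter-annotator agreement checks.
low_2x, high_2x = labeling_cost(20_000, 100, 150, annotators_per_item=2)
```

Double annotation doubles both the hours and the cost, so it is often applied to a subset of items and not to the whole dataset.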

Key Resources