Data Labeling Best Practices

Overview

Data quality determines model quality. This guide covers labeling strategies, tools, and quality control for building reliable training datasets.

Labeling Tools

Tool	Type	Best For
Label Studio	Open source	General purpose
CVAT	Open source	Computer vision
Prodigy	Commercial	NLP, active learning
Scale AI	Managed	Large scale
Amazon SageMaker GT	Managed	AWS integration

Label Studio Setup

pip install label-studio
label-studio start

Then open http://localhost:8080 in a browser to reach the Label Studio UI.

Labeling Guidelines

1. Create Clear Instructions

## Task: Sentiment Classification

Label each review as:
- **Positive**: Expresses satisfaction, recommendation, or praise
- **Negative**: Expresses dissatisfaction, complaints, or criticism
- **Neutral**: Factual statements without emotional content

### Examples:
- "Great product, highly recommend!" → Positive
- "Arrived broken, waste of money" → Negative
- "The package weighs 2kg" → Neutral

2. Handle Edge Cases

Document ambiguous cases upfront:

- Mixed sentiment: Label as the dominant sentiment
- Sarcasm: Label based on true intent
- Questions: Label as Neutral unless clearly rhetorical

Quality Control

Inter-Annotator Agreement

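from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same items, in the same order
# (illustrative values; replace with real annotations)
annotator1_labels = ["pos", "neg", "neu", "pos", "neg"]
annotator2_labels = ["pos", "neg", "pos", "pos", "neg"]

kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
# Rule of thumb: > 0.8 = excellent, 0.6-0.8 = good, < 0.6 = needs review
print(f"Cohen's kappa: {kappa:.2f}")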
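
For intuition about what the score measures, kappa can be computed by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The following is a minimal pure-Python sketch; the function name `cohen_kappa` and the sample labels are illustrative, not part of any library.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; this sketch
        # treats that as perfect agreement (an assumption, not a standard).
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels only
a = ["pos", "pos", "neg", "neg", "neu", "pos"]
b = ["pos", "neg", "neg", "neg", "neu", "pos"]
kappa = cohen_kappa(a, b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 5 of 6 items, but part of that agreement is expected by chance, so kappa comes out lower than the raw agreement rate.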

Gold Standard Sets

- Create 50-100 “gold” examples with known correct labels.
- Insert them at random into annotation tasks.
- Flag annotators who score below 90% accuracy on the gold set.
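
The gold-set check can be sketched in a few lines of Python. The data layout (dictionaries keyed by annotator and item ID), the function name `flag_annotators`, and the sample records are assumptions made for illustration; a real labeling tool will export annotations in its own format.

```python
def flag_annotators(annotations, gold, threshold=0.9):
    """Return {annotator: accuracy} for annotators below `threshold` on gold items.

    annotations: {annotator_id: {item_id: label}}
    gold: {item_id: correct_label}
    """
    flagged = {}
    for annotator, labels in annotations.items():
        # Score only the gold items this annotator actually labeled.
        seen = [item for item in labels if item in gold]
        if not seen:
            continue  # no gold items seen yet, nothing to judge
        accuracy = sum(labels[i] == gold[i] for i in seen) / len(seen)
        if accuracy < threshold:
            flagged[annotator] = accuracy
    return flagged

# Hypothetical gold set and annotations
gold = {"g1": "Positive", "g2": "Negative", "g3": "Neutral", "g4": "Positive"}
annotations = {
    "alice": {"g1": "Positive", "g2": "Negative", "g3": "Neutral",
              "g4": "Positive", "x9": "Neutral"},
    "bob": {"g1": "Positive", "g2": "Positive", "g3": "Neutral",
            "g4": "Negative"},
}
flagged = flag_annotators(annotations, gold)
print(flagged)
```

With only a handful of gold items per annotator the accuracy estimate is noisy, so it may be worth waiting until an annotator has seen a reasonable number of gold items before acting on a flag.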

Review Workflow

Annotator → Review (10% sample) → Adjudication → Final Dataset
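
The review step draws a random 10% of each annotated batch. A minimal sketch of that sampling, assuming items are identified by string IDs; the function name `sample_for_review`, the `percent` and `seed` parameters, and the batch below are illustrative.

```python
import math
import random

def sample_for_review(item_ids, percent=10, seed=0):
    """Pick a random subset of labeled items to send to a reviewer."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ids = list(item_ids)
    if not ids:
        return []
    # Round up so small batches still get at least one reviewed item.
    k = max(1, math.ceil(len(ids) * percent / 100))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(ids, k)

# Hypothetical batch of 250 labeled items
batch = [f"item-{i}" for i in range(250)]
review_queue = sample_for_review(batch)
print(len(review_queue))
```

Items where the reviewer disagrees with the annotator would then go to adjudication before entering the final dataset.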

Active Learning

Prioritize labeling the examples the current model is least certain about:

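from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# X_initial, y_initial: a small labeled seed set; X_unlabeled: the unlabeled pool
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_initial,
    y_training=y_initial
)

# Query most uncertain samples
query_idx, query_sample = learner.query(X_unlabeled, n_instances=10)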
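
One common way to define "most uncertain" is least confidence: score each sample by 1 minus the probability of its most likely class, then pick the highest scores. The sketch below shows that idea in plain Python, independent of modAL; the function name `least_confident` and the probabilities are illustrative assumptions.

```python
def least_confident(probabilities, n_instances=10):
    """Return indices of the samples whose top predicted class is least certain.

    probabilities: one list of class probabilities per unlabeled sample,
    for example the rows returned by a classifier's predict_proba.
    """
    # Uncertainty = 1 - probability of the most likely class.
    uncertainty = [1.0 - max(p) for p in probabilities]
    ranked = sorted(range(len(uncertainty)),
                    key=lambda i: uncertainty[i], reverse=True)
    return ranked[:n_instances]

# Hypothetical predicted probabilities for four unlabeled samples
probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # nearly uniform, so most uncertain
]
query_idx = least_confident(probs, n_instances=2)
print(query_idx)
```

The selected items are sent to annotators, and the model is retrained on the enlarged labeled set before the next query round.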

Cost Estimation

Task Type	Time per Item	Cost per 1,000 Items (USD)
Binary classification	5 sec	$15-25
Multi-class (5 classes)	10 sec	$30-50
NER tagging	30 sec	$100-150
Bounding boxes	20 sec	$60-100
Segmentation	2 min	$300-500
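
For task types not listed, a rough estimate follows from items × time per item × hourly rate. A minimal sketch; the table does not state an hourly rate, so the $15/hour figure, the function name `estimate_cost`, and the `annotators_per_item` parameter are assumptions for illustration.

```python
def estimate_cost(n_items, seconds_per_item, hourly_rate, annotators_per_item=1):
    """Return (labor_hours, cost) for a labeling job."""
    hours = n_items * seconds_per_item * annotators_per_item / 3600
    return hours, hours * hourly_rate

# 1,000 NER items at 30 seconds each, assuming a hypothetical $15/hour rate
hours, cost = estimate_cost(n_items=1000, seconds_per_item=30, hourly_rate=15.0)
print(f"{hours:.1f} hours, ${cost:.2f}")
```

Labeling each item with more than one annotator (for agreement checks) multiplies the cost accordingly, and review or adjudication time comes on top of this figure.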