## Overview
Data quality determines model quality. This guide covers labeling strategies, tools, and quality control for building reliable training datasets.
## Labeling Tools
| Tool | Type | Best For |
|---|---|---|
| Label Studio | Open source | General purpose |
| CVAT | Open source | Computer vision |
| Prodigy | Commercial | NLP, active learning |
| Scale AI | Managed | Large scale |
| Amazon SageMaker GT | Managed | AWS integration |
## Label Studio Setup

```bash
pip install label-studio
label-studio start
```

Then open http://localhost:8080 in your browser.
## Labeling Guidelines

### 1. Create Clear Instructions
```markdown
## Task: Sentiment Classification

Label each review as:

- **Positive**: Expresses satisfaction, recommendation, or praise
- **Negative**: Expresses dissatisfaction, complaints, or criticism
- **Neutral**: Factual statements without emotional content

### Examples

- "Great product, highly recommend!" → Positive
- "Arrived broken, waste of money" → Negative
- "The package weighs 2kg" → Neutral
```
### 2. Handle Edge Cases

Document ambiguous cases up front:

- **Mixed sentiment**: label as the dominant sentiment
- **Sarcasm**: label based on the true intent, not the literal wording
- **Questions**: label as Neutral unless clearly rhetorical
## Quality Control

### Inter-Annotator Agreement
```python
from sklearn.metrics import cohen_kappa_score

# Compare two annotators who labeled the same items in the same order
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
# kappa > 0.8: excellent agreement; 0.6-0.8: good; < 0.6: guidelines need review
```
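With three or more annotators, pairwise Cohen's kappa helps distinguish one weak annotator from guidelines that confuse everyone. A minimal sketch, using hypothetical annotator names and labels:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from three annotators on the same ten items
annotations = {
    "ann_a": [1, 1, 0, 2, 1, 0, 0, 2, 1, 0],
    "ann_b": [1, 1, 0, 2, 0, 0, 0, 2, 1, 0],
    "ann_c": [1, 0, 0, 2, 1, 0, 1, 2, 1, 0],
}

# Report kappa for every annotator pair: one consistently low annotator
# suggests retraining them; uniformly low scores suggest unclear guidelines.
for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```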
### Gold Standard Sets

- Create 50-100 “gold” examples with known correct labels
- Randomly insert them into regular annotation tasks
- Flag annotators scoring below 90% accuracy on the gold set
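The flagging step above can be sketched in a few lines; the gold labels and annotator answers here are hypothetical placeholders:

```python
# Hypothetical gold set: item IDs mapped to their known correct labels
GOLD = {"item_1": "Positive", "item_2": "Negative", "item_3": "Neutral",
        "item_4": "Positive", "item_5": "Negative"}

def gold_accuracy(answers: dict) -> float:
    """Fraction of gold items the annotator labeled correctly."""
    correct = sum(answers.get(item) == label for item, label in GOLD.items())
    return correct / len(GOLD)

def flag_annotators(all_answers: dict, threshold: float = 0.9) -> list:
    """Return IDs of annotators whose gold-set accuracy falls below the threshold."""
    return [ann for ann, answers in all_answers.items()
            if gold_accuracy(answers) < threshold]
```

In practice the gold set should be large enough (the 50-100 examples above) that a single slip does not push an annotator below the threshold.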
### Review Workflow

Annotator → Review (10% sample) → Adjudication → Final Dataset
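The 10% review sample should be drawn at random so annotators cannot predict which items will be checked. A minimal sketch (function name and seed are illustrative):

```python
import random

def sample_for_review(item_ids, fraction=0.10, seed=0):
    """Randomly pick ~10% of annotated items for second-pass review."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    k = max(1, round(len(item_ids) * fraction))
    return rng.sample(list(item_ids), k)

picked = sample_for_review(range(100))
```

Items where the reviewer disagrees with the annotator then go to adjudication before entering the final dataset.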
## Active Learning

Prioritize labeling the examples the current model is least certain about:
```python
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# Seed the learner with a small labeled set
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_initial,
    y_training=y_initial,
)

# Query the most uncertain unlabeled samples for annotation
query_idx, query_sample = learner.query(X_unlabeled, n_instances=10)
```
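modAL wraps this pattern, but the underlying least-confidence query is only a few lines of plain scikit-learn. A self-contained sketch with synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for a small seed set and a pool of unlabeled data
rng = np.random.default_rng(0)
X_initial = rng.normal(size=(20, 4))
y_initial = np.array([0] * 10 + [1] * 10)
X_unlabeled = rng.normal(size=(200, 4))

# Train on the seed set, then rank unlabeled points by the probability
# of their most likely class: low max-probability means high uncertainty.
model = RandomForestClassifier(random_state=0).fit(X_initial, y_initial)
confidence = model.predict_proba(X_unlabeled).max(axis=1)
query_idx = np.argsort(confidence)[:10]  # ten least-confident samples
```

After annotating the queried samples, retrain on the enlarged labeled set and repeat.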
## Cost Estimation
| Task Type | Time per Item | Cost per 1,000 Items (USD) |
|---|---|---|
| Binary classification | 5 sec | $15-25 |
| Multi-class (5 classes) | 10 sec | $30-50 |
| NER tagging | 30 sec | $100-150 |
| Bounding boxes | 20 sec | $60-100 |
| Segmentation | 2 min | $300-500 |
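The table above translates directly into a rough budgeting helper. A minimal sketch; the rates are the midpoints of the quoted ranges, and the function name is illustrative:

```python
# Task -> (seconds per item, USD per 1,000 items; midpoint of the quoted range)
RATES = {
    "binary": (5, 20),
    "multiclass": (10, 40),
    "ner": (30, 125),
    "bbox": (20, 80),
    "segmentation": (120, 400),
}

def estimate(task: str, n_items: int) -> dict:
    """Return rough annotation hours and cost for a labeling job."""
    sec_per_item, usd_per_1000 = RATES[task]
    return {
        "hours": n_items * sec_per_item / 3600,
        "cost_usd": n_items * usd_per_1000 / 1000,
    }
```

For example, 10,000 NER items works out to roughly 83 annotation hours and $1,250 at the midpoint rate, before review and adjudication overhead.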