🧭 Purpose of Evaluation
- Evaluation is essential for iterating on and improving models over time.
- Generative model evaluation is difficult because outputs are open-ended.
- Human evaluation remains the most reliable method: domain experts review and score outputs.
📊 Building a Good Test Dataset
A quality evaluation set should be:
- Accurate → thoroughly checked for correctness.
- Generalized → covers varied cases relevant to your use case.
- Unseen → not part of the training data (a quick overlap check is sketched after this list).
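To make the "unseen" requirement concrete, a quick overlap check between training and test data can catch leaked examples. This is a minimal sketch; the `prompt` field name and the lowercase/whitespace normalization rule are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch: verify test examples are unseen, i.e. absent from training data.
# The "prompt" field name and the normalization rule are illustrative assumptions.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't hide overlap."""
    return " ".join(text.lower().split())

def find_train_test_overlap(train_examples, test_examples):
    """Return test examples whose prompt also appears in the training set."""
    train_prompts = {normalize(ex["prompt"]) for ex in train_examples}
    return [ex for ex in test_examples if normalize(ex["prompt"]) in train_prompts]

train = [{"prompt": "Summarize: quarterly sales rose 4%."}]
test = [
    {"prompt": "Summarize: quarterly sales rose 4%."},  # leaked from training
    {"prompt": "Summarize: churn dropped in Q3."},      # genuinely unseen
]

leaked = find_train_test_overlap(train, test)
print(f"{len(leaked)} test example(s) overlap with training data")
```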
🧮 Quantitative & Comparative Evaluation
- Elo ranking: compares models pairwise, like chess tournaments (a rating-update sketch follows this list).
- Open LLM Leaderboard (runs EleutherAI's evaluation harness): combines multiple academic benchmarks, such as:
  - ARC – reasoning over grade-school science questions.
  - HellaSwag – common-sense understanding.
  - MMLU – broad academic knowledge.
  - TruthfulQA – tests factual accuracy.
- These metrics are useful for comparing base models, but not always relevant to specific fine-tuning tasks.
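As a rough illustration of the Elo idea above, the sketch below updates two ratings from a handful of pairwise preference judgments. The starting ratings, K-factor, and outcomes are made up for illustration.

```python
# Minimal sketch of Elo-style pairwise ranking for model comparison.
# The comparison outcomes below are hypothetical, purely for illustration.

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one comparison.

    score_a is 1.0 if model A's output was preferred, 0.0 if model B's was,
    and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1.0 - expected_a
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - expected_b)
    return rating_a, rating_b


ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Hypothetical judgments from human (or LLM) judges: 1.0 = model_a preferred.
comparisons = [1.0, 1.0, 0.0, 0.5, 1.0]

for outcome in comparisons:
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], outcome
    )

print(ratings)  # model_a ends slightly above model_b
```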
🧠 Error Analysis
- Categorize and study common or critical mistakes before and after fine-tuning.
- Helps identify which data improvements yield the biggest gains.
- Typical error types (see the tallying sketch after this list):
  - Typos / misspellings
  - Excessive length or verbosity
  - Repetitive outputs
- These can be mitigated through better data curation, prompt templates, or stop tokens.
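One way to make error analysis systematic is to tag each output with heuristic error categories and count them before and after fine-tuning. The word-count threshold and repetition ratio below are assumptions for illustration; typo detection would need a spell-checker and is omitted.

```python
# Minimal sketch: tally heuristic error categories across model outputs.
# The word-count threshold and repetition ratio are illustrative assumptions.
from collections import Counter

def categorize(output: str, max_words: int = 100) -> list:
    """Return the list of error labels that apply to one output."""
    errors = []
    words = output.split()
    if len(words) > max_words:
        errors.append("too_long")
    # Crude repetition check: low ratio of unique words to total words.
    if words and len(set(words)) / len(words) < 0.5:
        errors.append("repetitive")
    return errors

outputs = [  # hypothetical generations from the test set
    "The report covers Q3 revenue. The report covers Q3 revenue. The report covers Q3 revenue.",
    "Revenue grew 4% in Q3, driven by the new subscription tier.",
]

error_counts = Counter(err for out in outputs for err in categorize(out))
print(error_counts.most_common())  # compare these counts before vs. after fine-tuning
```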
🧩 Evaluation in Practice
- Run models on test datasets using batching for efficiency.
- For generative tasks, exact string matching is often too strict; other options include:
  - Embedding similarity (semantic closeness).
  - LLM-based grading (ask another model to assess output quality).
- Always call model.eval() during evaluation to disable dropout (a batched-evaluation sketch follows this list).
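Putting these points together, here is a minimal sketch of a batched evaluation loop run in eval mode under torch.no_grad(), scored by embedding similarity instead of exact string matching. The model names (gpt2, all-MiniLM-L6-v2), prompts, references, and batch size are placeholders; LLM-based grading would use the same loop but send each output to a judge model rather than an embedding comparison.

```python
# Minimal sketch: batched generation in eval mode, scored by embedding similarity.
# Model names, prompts, references, and batch size are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()                                       # disable dropout for evaluation

scorer = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model for scoring

prompts = ["Summarize: sales rose 4% in Q3.", "Summarize: churn fell in Q2."]
references = ["Sales grew 4% in Q3.", "Churn decreased in Q2."]

outputs = []
batch_size = 2
with torch.no_grad():                              # no gradients needed at eval time
    for i in range(0, len(prompts), batch_size):
        batch = tokenizer(prompts[i:i + batch_size], return_tensors="pt", padding=True)
        generated = model.generate(
            **batch, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id
        )
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

# Exact match is too strict for open-ended text; compare meaning instead.
emb_out = scorer.encode(outputs, convert_to_tensor=True)
emb_ref = scorer.encode(references, convert_to_tensor=True)
scores = util.cos_sim(emb_out, emb_ref).diagonal() # similarity of each output to its reference
print(scores.tolist())
```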
🧾 Benchmarks vs. Real-World Evaluation
- Example: a fine-tuned model scored 0.31 on ARC vs. 0.36 for the base model, despite performing better on its company-specific task.
  → Benchmarks test general ability, not domain success.
- Focus your evaluation on your real use case, not just public leaderboards.
✅ Key Takeaways
- Iterative improvement depends on careful evaluation and data refinement.
- Use human judgment + task-specific tests over rigid metrics.
- Benchmarks (like ARC) are good for general comparisons, but custom evaluation sets best reflect real-world success.