🧭 Purpose of Evaluation
- Evaluation is essential for iterating on and improving models over time.
- Generative model evaluation is difficult because outputs are open-ended.
- Human evaluation remains the most reliable method: domain experts review and score outputs.
📊 Building a Good Test Dataset
A quality evaluation set should be:
- Accurate → thoroughly checked for correctness.
- Generalized → covers varied cases relevant to your use case.
- Unseen → not part of the training data (a quick overlap check is sketched after this list).
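To make the "unseen" requirement concrete, a quick overlap check between training and test data can catch leaked examples. This is a minimal sketch; the `prompt` field name and the lowercase/whitespace normalization rule are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch: verify test examples are unseen, i.e. absent from training data.
# The "prompt" field name and the normalization rule are illustrative assumptions.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't hide overlap."""
    return " ".join(text.lower().split())

def find_train_test_overlap(train_examples, test_examples):
    """Return test examples whose prompt also appears in the training set."""
    train_prompts = {normalize(ex["prompt"]) for ex in train_examples}
    return [ex for ex in test_examples if normalize(ex["prompt"]) in train_prompts]

train = [{"prompt": "Summarize: quarterly sales rose 4%."}]
test = [
    {"prompt": "Summarize: quarterly sales rose 4%."},  # leaked from training
    {"prompt": "Summarize: churn dropped in Q3."},      # genuinely unseen
]

leaked = find_train_test_overlap(train, test)
print(f"{len(leaked)} test example(s) overlap with training data")
```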
🧮 Quantitative & Comparative Evaluation
- Elo ranking: compares models pairwise, like chess tournaments (a rating-update sketch follows this list).
- Open LLM Leaderboard (runs EleutherAI's evaluation harness): combines multiple academic benchmarks, such as:
  - ARC – reasoning over grade-school science questions.
  - HellaSwag – common-sense understanding.
  - MMLU – broad academic knowledge.
  - TruthfulQA – tests factual accuracy.
- These metrics are useful for comparing base models, but not always relevant to specific fine-tuning tasks.
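As a rough illustration of the Elo idea above, the sketch below updates two ratings from a handful of pairwise preference judgments. The starting ratings, K-factor, and outcomes are made up for illustration.

```python
# Minimal sketch of Elo-style pairwise ranking for model comparison.
# The comparison outcomes below are hypothetical, purely for illustration.

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one comparison.

    score_a is 1.0 if model A's output was preferred, 0.0 if model B's was,
    and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1.0 - expected_a
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - expected_b)
    return rating_a, rating_b


ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Hypothetical judgments from human (or LLM) judges: 1.0 = model_a preferred.
comparisons = [1.0, 1.0, 0.0, 0.5, 1.0]

for outcome in comparisons:
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], outcome
    )

print(ratings)  # model_a ends slightly above model_b
```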
🧠 Error Analysis
- Categorize and study common or critical mistakes before and after fine-tuning.
- Helps identify which data improvements yield the biggest gains.
- Typical error types (see the tallying sketch after this list):
  - Typos / misspellings
  - Excessive length or verbosity
  - Repetitive outputs
- These can be mitigated through better data curation, prompt templates, or stop tokens.
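One way to make error analysis systematic is to tag each output with heuristic error categories and count them before and after fine-tuning. The word-count threshold and repetition ratio below are assumptions for illustration; typo detection would need a spell-checker and is omitted.

```python
# Minimal sketch: tally heuristic error categories across model outputs.
# The word-count threshold and repetition ratio are illustrative assumptions.
from collections import Counter

def categorize(output: str, max_words: int = 100) -> list:
    """Return the list of error labels that apply to one output."""
    errors = []
    words = output.split()
    if len(words) > max_words:
        errors.append("too_long")
    # Crude repetition check: low ratio of unique words to total words.
    if words and len(set(words)) / len(words) < 0.5:
        errors.append("repetitive")
    return errors

outputs = [  # hypothetical generations from the test set
    "The report covers Q3 revenue. The report covers Q3 revenue. The report covers Q3 revenue.",
    "Revenue grew 4% in Q3, driven by the new subscription tier.",
]

error_counts = Counter(err for out in outputs for err in categorize(out))
print(error_counts.most_common())  # compare these counts before vs. after fine-tuning
```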
🧩 Evaluation in Practice
- Run models on test datasets using batching for efficiency.
- For generative tasks, exact string matching is often too strict; other options include:
  - Embedding similarity (semantic closeness).
  - LLM-based grading (ask another model to assess output quality).
- Always call model.eval() during evaluation to disable dropout (a batched-evaluation sketch follows this list).
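Putting these points together, here is a minimal sketch of a batched evaluation loop run in eval mode under torch.no_grad(), scored by embedding similarity instead of exact string matching. The model names (gpt2, all-MiniLM-L6-v2), prompts, references, and batch size are placeholders; LLM-based grading would use the same loop but send each output to a judge model rather than an embedding comparison.

```python
# Minimal sketch: batched generation in eval mode, scored by embedding similarity.
# Model names, prompts, references, and batch size are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()                                       # disable dropout for evaluation

scorer = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model for scoring

prompts = ["Summarize: sales rose 4% in Q3.", "Summarize: churn fell in Q2."]
references = ["Sales grew 4% in Q3.", "Churn decreased in Q2."]

outputs = []
batch_size = 2
with torch.no_grad():                              # no gradients needed at eval time
    for i in range(0, len(prompts), batch_size):
        batch = tokenizer(prompts[i:i + batch_size], return_tensors="pt", padding=True)
        generated = model.generate(
            **batch, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id
        )
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

# Exact match is too strict for open-ended text; compare meaning instead.
emb_out = scorer.encode(outputs, convert_to_tensor=True)
emb_ref = scorer.encode(references, convert_to_tensor=True)
scores = util.cos_sim(emb_out, emb_ref).diagonal() # similarity of each output to its reference
print(scores.tolist())
```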
🧾 Benchmarks vs. Real-World Evaluation
- Example: a fine-tuned model scored 0.31 on ARC vs. 0.36 for the base model, despite performing better on its company-specific task.
  → Benchmarks test general ability, not domain success.
- Focus your evaluation on your real use case, not just public leaderboards.
✅ Key Takeaways
- Iterative improvement depends on careful evaluation and data refinement.
- Use human judgment + task-specific tests over rigid metrics.
- Benchmarks (like ARC) are good for general comparisons, but custom evaluation sets best reflect real-world success.