Evaluation

 

🧭 Purpose of Evaluation

  • Evaluation is essential for iterating and improving models over time.

  • Generative model evaluation is difficult because outputs are open-ended.

  • Human evaluation remains the most reliable method — domain experts review and score outputs.


📊 Building a Good Test Dataset

A quality evaluation set should be:

  • Accurate → thoroughly checked for correctness.

  • Generalized → covers the variety of cases relevant to your use case.

  • Unseen → not part of the training data (a quick deduplication check, sketched below, helps verify this).
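
A quick way to check the "unseen" property is to compare evaluation examples against the training set after light normalization. The sketch below is a minimal illustration in plain Python; the field name `question` and the normalization rules are assumptions for the example, not something prescribed by these notes.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide duplicates."""
    return " ".join(text.lower().split())

def find_leaked_examples(train_examples, eval_examples, key="question"):
    """Return eval examples whose normalized text also appears in the training set."""
    train_set = {normalize(ex[key]) for ex in train_examples}
    return [ex for ex in eval_examples if normalize(ex[key]) in train_set]

# Toy data: the first eval question is a near-duplicate of a training question.
train_data = [{"question": "What is the capital of France?", "answer": "Paris"}]
eval_data = [
    {"question": "what is the capital of  France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

leaked = find_leaked_examples(train_data, eval_data)
print(f"{len(leaked)} eval example(s) also appear in the training data")  # -> 1
```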


🧮 Quantitative & Comparative Evaluation

  • Elo ranking: compares models pairwise, like chess ratings (a minimal rating-update sketch appears below).

  • Open LLM Leaderboard (built on EleutherAI's LM Evaluation Harness): aggregates multiple academic benchmarks, such as:

    • ARC – reasoning over grade-school science questions.

    • HellaSwag – commonsense reasoning about everyday situations.

    • MMLU – broad academic knowledge.

    • TruthfulQA – measures whether the model avoids reproducing common falsehoods.

These metrics are useful for comparing base models, but not always relevant to specific fine-tuning tasks.
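
As a concrete illustration of pairwise ranking, here is a minimal Elo update in plain Python: two models are judged head-to-head on the same prompt, and the winner's rating rises at the loser's expense. The starting rating of 1000 and K-factor of 32 are conventional but arbitrary choices for this sketch.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison."""
    score_a = 1.0 if a_won else 0.0
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: start both models at 1000; model A wins 3 of 4 head-to-head judgments.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_won in [True, True, False, True]:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_won
    )
print(ratings)
```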


🧠 Error Analysis

  • Categorize and study common or critical mistakes before and after fine-tuning.

  • Helps identify which data improvements yield the biggest gains.

  • Typical error types (a simple tagging sketch follows this list):

    • Typos / misspellings

    • Excessive length or verbosity

    • Repetitive outputs

  • These can be mitigated through better data curation, prompt templates, or stop tokens.
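
As a starting point for that tagging, the sketch below applies simple heuristics for two of the error types above, excessive length and repetition. The thresholds are arbitrary assumptions and would need tuning for a real task.

```python
from collections import Counter

def tag_errors(output: str, max_words: int = 150, repeat_window: int = 8) -> list[str]:
    """Attach coarse error labels to a single model output."""
    tags = []
    words = output.split()
    if len(words) > max_words:
        tags.append("too_long")
    # Flag repetition: the same N-word window appearing more than once.
    windows = [" ".join(words[i:i + repeat_window]) for i in range(len(words) - repeat_window + 1)]
    if len(windows) != len(set(windows)):
        tags.append("repetitive")
    return tags

# Count error categories across a batch of outputs to see which problems dominate.
outputs = [
    "The answer is 42. The answer is 42. The answer is 42.",
    "Paris.",
]
counts = Counter(tag for out in outputs for tag in tag_errors(out))
print(counts)  # e.g. Counter({'repetitive': 1})
```

Running the same tagging before and after fine-tuning makes it easy to see which error categories actually shrank.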


🧩 Evaluation in Practice

  • Run models on test datasets using batching for efficiency.

  • For generative tasks, exact string matching is often too strict; other options include:

    • Embedding similarity (semantic closeness).

    • LLM-based grading (ask another model to assess output quality).

  • Always call model.eval() during evaluation to disable dropout, and wrap the loop in torch.no_grad() so gradients aren't tracked (see the sketch below).
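
Putting these pieces together, here is a minimal batched evaluation sketch: the model is put in eval mode, generation runs without gradient tracking, and outputs are scored by embedding similarity rather than exact match. It assumes a Hugging Face causal LM plus the sentence-transformers library; the model names, prompts, batch size, and generation settings are placeholders, not the setup from these notes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

model_name = "EleutherAI/pythia-410m"  # placeholder; substitute your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # many causal LMs ship without a pad token
tokenizer.padding_side = "left"             # left-pad so batched generation lines up
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic evaluation

prompts = ["Q: What is the capital of France?\nA:", "Q: What is 2 + 2?\nA:"]
references = ["Paris", "4"]

generated = []
batch_size = 2
with torch.no_grad():  # no gradients needed at eval time
    for i in range(0, len(prompts), batch_size):
        batch = tokenizer(prompts[i:i + batch_size], return_tensors="pt", padding=True)
        out = model.generate(**batch, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
        # Strip the prompt tokens so only the generated continuation is scored.
        new_tokens = out[:, batch["input_ids"].shape[1]:]
        generated.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))

# Exact match is too strict for free-form text; score semantic closeness instead.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
gen_emb = embedder.encode(generated, convert_to_tensor=True)
ref_emb = embedder.encode(references, convert_to_tensor=True)
scores = util.cos_sim(gen_emb, ref_emb).diagonal()
print([round(float(s), 3) for s in scores])
```

For LLM-based grading, the same loop applies, except each generated answer is sent to a separate judge model with a rubric prompt instead of being embedded.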


🧾 Benchmarks vs. Real-World Evaluation

  • Example: fine-tuned model scored 0.31 on ARC vs 0.36 for base, despite performing better on its company-specific task.

  • Takeaway → benchmarks test general ability, not domain success.

  • Focus your evaluation on your real use case, not just public leaderboards.


✅ Key Takeaways

  • Iterative improvement depends on careful evaluation and data refinement.

  • Use human judgment + task-specific tests over rigid metrics.

  • Benchmarks (like ARC) are good for general comparisons, but custom evaluation sets best reflect real-world success.