Getting started: practical fine-tuning steps
- Define the task → collect input–output pairs; if data is scarce, synthesize examples with prompts/templates.
- Start small → try a 400M–1B parameter model to gauge baseline performance.
- Vary the dataset size to measure the returns from additional examples.
- Evaluate → inspect what works and what doesn’t; iterate by collecting targeted new data.
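The synthesis step above can be sketched minimally. This is an illustrative assumption, not code from the lesson: all names (`templates`, `seed_examples`, `synthesize`) are hypothetical, and a real pipeline would use an LLM rather than string templates to diversify examples.

```python
import random

# Hypothetical prompt/completion templates for a summarization task.
templates = [
    ("Summarize: {text}", "{summary}"),
    ("TL;DR of '{text}'?", "{summary}"),
]

# A tiny seed set; in practice these are the scarce hand-labeled examples.
seed_examples = [
    {"text": "The meeting moved to 3pm on Friday.", "summary": "Meeting now Fri 3pm."},
]

def synthesize(seed, n=4):
    """Expand scarce labeled data into input-output training pairs."""
    pairs = []
    for _ in range(n):
        src = random.choice(seed)
        prompt_t, target_t = random.choice(templates)
        pairs.append({"input": prompt_t.format(**src),
                      "output": target_t.format(**src)})
    return pairs

dataset = synthesize(seed_examples)
```

Each resulting pair is ready to feed into a supervised fine-tuning loop as a prompt/target example.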
Scaling difficulty
- Reading vs. writing tasks: writing tasks (chat, emails, code) are harder because they produce more output tokens, so they usually need larger models.
- Composite/agentic tasks (multiple abilities in one go) further increase difficulty → consider bigger models.
Compute & hardware realities
- Training needs far more memory than inference (gradients + optimizer states on top of the weights).
- Example: a single V100 (16 GB) can run inference on a ~7B model, but can typically only train around ~1B parameters without tricks.
- To fit larger models or speed up training: use more GPUs or efficiency methods.
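A back-of-the-envelope memory estimate makes the V100 example concrete. The bytes-per-parameter figures below are a common rule of thumb (fp16 weights for inference; weights + gradients + two Adam moments ≈ 16 bytes/param for mixed-precision training), not numbers given in the lesson, and they ignore activation memory:

```python
# Rough per-parameter memory cost (rule-of-thumb assumption, excludes activations):
#   inference in fp16  -> ~2 bytes/param
#   mixed-precision training with Adam -> ~16 bytes/param
#   (fp16 weights + fp16 grads + fp32 master weights + 2 fp32 Adam moments)
BYTES_PER_PARAM = {"infer_fp16": 2, "train_mixed": 16}

def memory_gb(n_params, mode):
    """Estimate GPU memory in GB for a model of n_params parameters."""
    return n_params * BYTES_PER_PARAM[mode] / 1e9

print(memory_gb(7e9, "infer_fp16"))   # 14.0 -> a 7B model fits a 16 GB V100 for inference
print(memory_gb(7e9, "train_mixed"))  # 112.0 -> training 7B is far out of reach
print(memory_gb(1e9, "train_mixed"))  # 16.0 -> ~1B params is roughly the training ceiling
```

This is why the same card that serves a 7B model can only train ~1B parameters without tricks like gradient checkpointing, offloading, or PEFT.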
PEFT (Parameter-Efficient Fine-Tuning) & LoRA
- PEFT adapts big models by training small add-on weights while freezing the base weights.
- LoRA (Low-Rank Adaptation):
  - Replaces full weight updates with trainable low-rank matrices in selected layers.
  - Reported effects (example cited in the lesson): up to 10,000× fewer trainable parameters and ~3× less GPU memory vs. full fine-tuning.
  - Slight accuracy drop vs. full fine-tuning, but the same inference latency (adapters can be merged into the base weights).
  - Great for multi-tenant setups: train a separate LoRA adapter per customer/task and swap or merge them at inference.
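The low-rank update and the merge trick can be shown in a few lines of NumPy. This is an illustrative sketch of the math, not the API of any LoRA library; the dimensions and init choices are assumptions (though zero-initializing `B` so the adapter starts as a no-op mirrors the LoRA paper):

```python
import numpy as np

d_out, d_in, r = 512, 512, 8                 # r << d: the low-rank bottleneck
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, init 0
                                             # so the delta starts at zero

x = rng.standard_normal(d_in)
y_adapter = W @ x + B @ (A @ x)              # forward pass with adapter attached

# Merging for inference: fold B @ A into W, so latency matches the base model.
W_merged = W + B @ A
y_merged = W_merged @ x
assert np.allclose(y_adapter, y_merged)

# Trainable params: 2*r*d for the adapter vs d*d for a full update.
print((A.size + B.size) / W.size)            # 0.03125 -> 32x fewer at this size
```

The parameter savings grow with layer width: the adapter costs `2*r*d` parameters against `d*d` for a full update, so larger layers (and smaller `r`) push the ratio toward the extreme reductions quoted above.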
- Bottom line: start with a smaller model and simple tasks, iterate with data and evaluation, then scale model size and complexity. Use PEFT/LoRA to train efficiently when hardware is tight or tasks multiply.