The short version
- Fine-tuning continues training a pretrained LLM on a domain dataset.
- Modern fine-tuning uses LoRA/QLoRA for parameter-efficient adaptation.
- RAG and prompt engineering solve most problems; fine-tuning is the specialist's tool.
- Data preparation is where most of the engineering effort actually goes.
The longer explanation
What fine-tuning does
A pretrained LLM has been trained on a broad corpus and has developed broad capabilities. Fine-tuning continues that training on a narrower, curated dataset so the model develops capabilities specific to a target domain, task, or style. The base model's capabilities do not disappear; they are specialized.
The three categories of fine-tuning that matter in practice:
- Supervised fine-tuning (SFT). Train on input-output pairs. The model learns the specific mapping. This is the most common enterprise fine-tuning path.
- Instruction fine-tuning. A flavor of SFT focused on following task-specific instructions. Often used for domain-specific assistants.
- Preference fine-tuning (RLHF, DPO, and related). Train against preference data — "response A is better than response B" — to shape model behavior. Common for safety and style alignment.
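Concretely, SFT training data is just input-output pairs. A minimal sketch of the record shape, serialized as JSON Lines (the field names here are illustrative, not any specific framework's schema):

```python
import json

# Illustrative SFT records: each pairs a prompt with the desired
# completion. Field names are hypothetical, not a fixed schema.
examples = [
    {
        "instruction": "Summarize the support ticket in one sentence.",
        "input": "Customer reports login fails after a password reset on mobile.",
        "output": "Login failure following a password reset on the mobile app.",
    },
]

# JSON Lines: one JSON object per line, a common interchange format
# for fine-tuning datasets.
jsonl = "\n".join(json.dumps(e) for e in examples)
```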
LoRA and QLoRA
Full fine-tuning updates every parameter in the model. For a 70B-parameter model, the weights alone occupy roughly 140 GB in 16-bit precision; add gradients and optimizer states and the training footprint exceeds 1 TB of GPU memory. Most enterprises do not have that infrastructure readily available.
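The arithmetic behind that footprint, assuming mixed-precision AdamW training (16-bit weights and gradients, plus a 32-bit master copy and two 32-bit optimizer moments per parameter; activation memory comes on top):

```python
params = 70e9                      # 70B parameters

weights_fp16 = params * 2          # 16-bit weights: 140 GB
grads_fp16 = params * 2            # 16-bit gradients: 140 GB
adam_fp32 = params * (4 + 4 + 4)   # fp32 master weights + two moments: 840 GB

total_tb = (weights_fp16 + grads_fp16 + adam_fp32) / 1e12
print(f"{total_tb:.2f} TB")        # 1.12 TB, before activations
```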
LoRA (Low-Rank Adaptation) inserts small adapter matrices into the model and trains only those. The base model weights stay frozen. The adapter weights are typically tens of megabytes, depending on rank and which layers are adapted. The training workload drops by an order of magnitude, and for most tasks the results are comparable to full fine-tuning.
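The parameter arithmetic explains why the adapters stay so small. For one weight matrix W, LoRA trains two low-rank matrices A and B and applies W + B·A; the shape and rank below are assumptions for the sketch, not values from any particular model:

```python
d_out, d_in = 4096, 4096           # illustrative weight-matrix shape
r = 16                             # LoRA rank (a common small choice)

full_params = d_out * d_in         # parameters in frozen W
lora_params = r * (d_in + d_out)   # A is r x d_in, B is d_out x r

print(full_params // lora_params)  # 128x fewer trainable parameters
```

Summed over every adapted layer, those low-rank matrices are what ends up in the tens-of-megabytes adapter file.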
QLoRA goes further: it quantizes the frozen base model to 4-bit precision, further reducing GPU memory requirements. A 70B model whose full fine-tune needs a multi-GPU cluster can be QLoRA fine-tuned on a single 80 GB H100.
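The weight-memory arithmetic for the frozen base, ignoring quantization block constants, adapter weights, and activations (so real numbers run somewhat higher):

```python
params = 70e9

base_fp16_gb = params * 2 / 1e9    # 140 GB: does not fit one 80 GB GPU
base_4bit_gb = params * 0.5 / 1e9  # 35 GB: leaves headroom on one 80 GB GPU

print(base_fp16_gb, base_4bit_gb)
```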
Both are production-ready. Open-weight models (Llama, Mistral, Qwen, Gemma) support them; the tooling (Hugging Face PEFT, Axolotl, and others) is mature.
When fine-tuning earns its keep
- Specific output format the model does not produce reliably with prompt engineering alone.
- Domain vocabulary and style that the base model treats as out-of-distribution.
- Latency-sensitive workloads where baking behavior into the weights beats paying for it in the prompt on every request.
- Cost-sensitive high-volume workloads where a smaller fine-tuned model outperforms a larger base model at lower cost per inference.
- Tasks where prompt engineering has hit a ceiling after systematic iteration.
For the first enterprise AI workload, fine-tuning rarely earns its keep. For the fifth or tenth, it often does.
The cost structure
Compute cost is the less important part. Data preparation (curating, cleaning, and formatting the training data) is where most of the engineering effort goes. A 10,000-example fine-tuning dataset might cost $500 in compute to train on but $50,000 to prepare properly, especially if the examples require domain-expert review.
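A sketch of the kind of hygiene pass that preparation starts with, assuming records shaped as input-output pairs (field names and the length threshold are illustrative; real pipelines add near-duplicate detection and domain-expert review on top):

```python
def clean(examples, max_chars=4000):
    """Drop incomplete, over-long, and exactly duplicated pairs."""
    seen, kept = set(), []
    for ex in examples:
        prompt = ex.get("input", "").strip()
        completion = ex.get("output", "").strip()
        if not prompt or not completion:
            continue                          # incomplete pair
        if len(prompt) + len(completion) > max_chars:
            continue                          # likely pasted noise
        key = (prompt, completion)
        if key in seen:
            continue                          # exact duplicate
        seen.add(key)
        kept.append(ex)
    return kept
```

Checks like these are the cheap floor; label auditing by domain experts is where the $50,000 actually goes.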
Evaluation is the other expensive line item. A fine-tuned model needs to be evaluated against production scenarios, and building the evaluation suite often takes as much effort as building the training set.
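A minimal skeleton of such a suite, scoring any model callable against held-out scenarios by exact match (production evaluation usually layers rubric-based or graded checks on top; the names here are illustrative):

```python
def exact_match_rate(predict, eval_set):
    """Fraction of held-out scenarios where the model's output
    exactly matches the expected answer, after whitespace trim."""
    hits = sum(
        1 for case in eval_set
        if predict(case["input"]).strip() == case["expected"].strip()
    )
    return hits / len(eval_set)
```

Because the base and fine-tuned models sit behind the same callable interface, the same suite scores both, which is exactly the before/after comparison the fine-tune has to win.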
How Thoughtwave approaches this
We recommend fine-tuning only when it is the right tool. For most engagements, prompt engineering plus RAG plus model-switching (to a different base model) solves the problem without fine-tuning. When fine-tuning is called for — specific output format, domain specialization for a cost-sensitive workload, or a production behavior the base model cannot produce reliably — we use LoRA or QLoRA on open-weight models.
For the deeper context on model selection and deployment, see our LLM Deployment Services and the accelerators portfolio.