AI Engineering7 min read

Fine-Tuning vs Prompting: When Each Is Worth the Investment

We spent $40K fine-tuning a model for document classification. Three weeks after delivery, a well-structured system prompt with twelve few-shot examples matched the fine-tuned model's accuracy within 2 percentage points. This is the framework we use now before any fine-tuning conversation starts.

The client had a procurement document classification problem. Thousands of supplier invoices, purchase orders, delivery notes, and credit memos flowing in daily — each needing to be categorised accurately before routing to the right AP workflow. They were convinced they needed fine-tuning. We agreed too quickly. Spent six weeks on data preparation, another two on training runs and evaluation, then deployment. Total cost: $40K in fees and compute, not counting internal client time.

Three weeks into production, a junior engineer on our team — testing something unrelated — ran the same classification task using GPT-4o with a system prompt containing the document taxonomy and twelve labelled examples. 94.2% accuracy. The fine-tuned model was hitting 96.1%. A gap of 1.9 percentage points that did not matter operationally and cost $200 to discover.

We got this wrong. Not because fine-tuning is never the answer — it sometimes is. But because we skipped the step that should always come first: proving that prompt engineering enterprise use cases cannot solve the problem before you spend the money.

$40Kspent fine-tuning a model whose results a $200 system prompt matched within 2 percentage points

What fine-tuning actually does — and what it doesn't

Fine-tuning adjusts the model's weights using your labelled examples, making it better at a specific task or style. That is what it does. The misconceptions are what it does not do — and these misconceptions drive most bad fine-tuning decisions.

Fine-tuning does not give the model new knowledge. It adjusts how the model responds, not what it knows. If the base model has no understanding of your industry domain, fine-tuning on 2,000 examples will not teach it your domain — it will teach it to format responses the way your examples do while still reasoning from its pretrained knowledge.

Fine-tuning does not reliably suppress hallucinations. It can reduce hallucinations on the narrow task it was trained for. It will not generalise that caution to related tasks. We have seen fine-tuned models be extremely accurate on the training distribution and confidently wrong on anything slightly outside it. At least with prompting, the model's general reasoning and calibration are intact.

What fine-tuning does well: consistent output structure at high volume, adopting a specific writing style or tone, improving performance on a very narrow and well-defined task where you have thousands of clean labelled examples. These are real strengths. They are just narrower than most enterprise AI discussions assume.

When prompting wins — which is most of the time

In most enterprise cases, a combination of prompting and retrieval beats fine-tuning. Several specific situations where this is clearly true:

When you need current information. Fine-tuning bakes knowledge in at training time. Anything that has changed since your training data was assembled — new products, updated policies, recent contracts — is invisible to the fine-tuned model unless you inject it through context. RAG solves this. Fine-tuning does not.

When the task is complex and multi-step. Fine-tuning helps with narrow, well-defined tasks. Multi-step reasoning — extract the line items, check against the PO, validate against the supplier master, flag discrepancies — benefits from the model's full reasoning capability, which prompting with chain-of-thought instructions preserves. Fine-tuning a model on multi-step tasks tends to teach it to mimic the output format of your training examples rather than actually reason through the steps.

When you have fewer than 1,000 quality labelled examples. Fine-tuning on small datasets overfits. The model learns the training examples, not the task. We have seen clients present 400 labelled examples as "enough to fine-tune on" and the resulting model performs worse than a well-prompted base model because it has overfit to idiosyncrasies in those 400 examples.

When the task will evolve. Prompts change in minutes. A fine-tuned model requires data collection, retraining, evaluation, and redeployment — weeks, and another budget conversation. If you are working in a domain where requirements shift quarterly, prompting is the only thing that keeps pace.

When you need to explain the output to an auditor. A system prompt is readable. Anyone can look at it and understand why the model is being asked to classify a document a certain way. Fine-tuned behaviour is opaque — the classification logic is distributed across millions of weight updates and cannot be explained in plain language. For regulated industries, that opacity has real consequences.

When fine-tuning is genuinely the answer

Fine-tuning is not always the wrong choice. There are real cases where it is the right one.

High-throughput, narrow classification at latency constraints. If you are classifying 50,000 documents per day and need sub-100ms responses, a fine-tuned smaller model can be both faster and cheaper than running a full-capability model with a long few-shot prompt. The economics make sense at volume — but you have to actually be at that volume.

Low-resource languages where the base model underperforms. This is a real case we have encountered. Arabic-language enterprise tasks in the Gulf — processing Arabic invoices, classifying Arabic legal documents, generating Arabic-language responses with correct formal register — often perform meaningfully better with a model fine-tuned on domain-specific Arabic text. The base models are trained predominantly on English data, and their Arabic capability, while reasonable, leaves room for improvement on specialised tasks. When to fine-tune an LLM in a multilingual environment is a different calculation than in English-only contexts.

Model distillation. You have a GPT-4o class model performing well on a task. You want to run the same task faster and cheaper by distilling that capability into a smaller model. Fine-tuning the smaller model on the larger model's outputs is a legitimate approach — you are not trying to teach the small model new reasoning, you are transferring a specific capability from a larger model at lower cost per inference.

Consistent, branded output style. If your product requires a very specific output style — a particular prose tone, a specific formatting pattern, a house writing style that needs to be consistent across millions of responses — fine-tuning can encode that style more reliably than a style guide in a system prompt.

The LLM fine-tuning cost math nobody runs upfront

LLM fine-tuning cost is almost always understated in initial conversations. Here is what a realistic budget looks like for a GPT-4o class fine-tuning project:

Data preparation: 3–6 weeks of internal time plus external review to assemble, clean, and label 2,000–5,000 high-quality examples. At $150/hour for a data scientist, that is $18–36K before any training starts. Training runs: multiple iterations to dial in hyperparameters, each run costing $500–$2,000 depending on model size and dataset. Evaluation: building and running the eval set, comparing against baselines. Deployment: hosting a fine-tuned model endpoint. Budget $40–80K total for the first version.

Then the ongoing costs that never appear in the initial proposal: retraining as your data distribution drifts — typically $15–30K/year for mid-market use cases. Evaluation maintenance. Endpoint hosting.

Prompting, by comparison: $0 to start, $0 to update. The engineering cost of writing and iterating a good system prompt is measured in days, not months.

The fine-tune break-even only makes sense when the inference volume is high enough for the fine-tuned model's lower per-call cost to recoup the upfront investment — and when that volume is sustained across multiple years. Most enterprise deployments in year one do not hit that threshold. Run the numbers with actual projected volume before the conversation, not after.

The three questions to ask before approving fine-tuning

We use three gates now. All three must be cleared before a fine-tuning project starts.

Our take

Run these three gates in order before any fine-tuning budget is approved. If you cannot clear all three, build the prompted system first and revisit in six months with real performance data.

One: Have we tried few-shot prompting with 10+ examples in the system prompt, with a structured chain-of-thought instruction? Not a simple prompt — a real, engineered prompt with worked examples, explicit output schemas, and edge case handling. If the answer is no, start there.

Two: Is there a retrieval or context-injection approach that achieves the same goal? Fine-tuning is often proposed when the real need is for the model to reference specific business information. Retrieval gives you that at a fraction of the cost. If you have not ruled out RAG or context injection, you have not ruled out prompting.

Three: Do we have at least 2,000 high-quality, human-reviewed labelled examples, and are we confident the task definition will not change substantially in the next 12 months? If either of those conditions is not met, fine-tuning is premature.

If you cannot answer yes to all three, do not fine-tune yet. Build the prompted system, get it into production, measure it, and revisit the question in six months with real performance data.

The status symbol problem

Fine-tuning has become a status symbol in enterprise AI. "We fine-tuned our model" sounds more sophisticated than "we wrote a really good prompt." It signals investment, technical depth, and commitment. In boardroom conversations and vendor pitches, it lands differently than prompting.

This is the wrong way to make the call. The outcome is the only thing that matters. We have shipped systems with detailed 2,000-word system prompts containing explicit taxonomies, worked examples, and edge case handling that outperform fine-tuned alternatives in production. Nobody notices. Nobody cares. The system classifies documents accurately, routes them correctly, and has a 0.3% error rate. That is the metric that matters, not which training method produced it.

The $40K fine-tuning project we described at the top is not unusual. We have seen similar situations at multiple clients. The pattern is consistent: fine-tuning is proposed before prompting is genuinely tried, the decision is made for reasons that include prestige, and the result is a more expensive, less maintainable system that performs marginally better than the prompted alternative would have.

Fine-tuning is not a technical decision — it is increasingly a political one. The companies that get this right are the ones that insist on proving prompting cannot do the job before a dollar moves toward training data.