People keep asking me whether they should fine-tune or just use better prompts. And honestly, the answer is almost always: try better prompts first. But when that stops working — when you need a specific tone, a specific format, or a domain the base model just doesn't get no matter how many examples you stuff into the system message — that's when fine-tuning makes sense. Microsoft Foundry Portal actually makes this less painful than it used to be, but there are things nobody tells you upfront.
Right now Foundry supports fine-tuning on GPT-4o, GPT-4o-mini, the whole GPT-4.1 family including nano, o4-mini, and even some open-source models like Ministral-3B, Llama-3.3-70B, and Qwen-32B. The open-source ones are still in public preview though, so don't build your production pipeline around them just yet. GPT-5 fine-tuning exists too, but it's in private preview — you need to apply for access.
There are three customization methods:
Supervised Fine-Tuning (SFT): The standard one — you give it input/output pairs and the model learns your pattern.
Direct Preference Optimization (DPO): For when you want the model to prefer certain responses over others, only available on GPT-4o and GPT-4.1 models.
Reinforcement Fine-Tuning (RFT): Uses grader signals, and that is only for o4-mini and GPT-5.
Most people will use SFT. That's fine. Start there.
The data part where everyone messes up
Your training data needs to be JSONL format, UTF-8 encoded, under 512 MB per file. The format follows the Chat Completions API structure — system message, user message, assistant message. The minimum is 10 examples, but let me be very clear: 10 examples will do basically nothing. Microsoft recommends starting with 50 well-crafted examples, and honestly even that feels low to me. We did a fine-tuning project at OZ for a legal document classifier and didn't see real improvement until we hit about 300 examples. Quality matters more than quantity though. I have seen teams throw 5,000 sloppy examples at a model and get worse results than someone with 200 carefully curated ones.
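A cheap sanity check before you upload saves a failed job later. Here is a minimal sketch of the kind of validation I'd run locally — the 512 MB and 10-example limits come from the requirements above; the specific checks are my own choices, not Microsoft's validator.

```python
# Quick pre-upload sanity check for a JSONL training file -- a sketch,
# not an official validator. Limits mirror the documented requirements.
import json
import os

MAX_BYTES = 512 * 1024 * 1024   # per-file limit
MIN_EXAMPLES = 10               # hard minimum; aim much higher in practice

def check_training_file(path: str) -> int:
    """Return the number of examples, raising AssertionError on problems."""
    assert os.path.getsize(path) < MAX_BYTES, "file exceeds 512 MB"
    count = 0
    with open(path, encoding="utf-8") as f:        # must be UTF-8
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            example = json.loads(line)             # each line is one JSON object
            roles = [m["role"] for m in example["messages"]]
            # Chat Completions shape: the model learns from assistant turns
            assert "assistant" in roles, f"line {i}: no assistant message"
            count += 1
    assert count >= MIN_EXAMPLES, f"only {count} examples (minimum is 10)"
    return count
```

Run it on every file before upload; catching a malformed line locally is free, catching it after the job starts is not.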
One thing I actually like — you can use weighted messages in multi-turn conversations. Set weight: 0 on assistant responses you don't want the model to learn from, and weight: 1 on the ones you do. This is useful when your training data has a back-and-forth where the first response is mediocre but the follow-up after user correction is what you actually want. Most people skip this feature and they shouldn't.
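To make the weighting concrete, here is a sketch of one multi-turn training example in that shape — the conversation content is invented for illustration, but the weight fields follow the pattern described above.

```python
# One multi-turn JSONL example using message weights -- content is
# invented for illustration; the weight fields follow the documented shape.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a contract-review assistant."},
        {"role": "user", "content": "Summarize the termination clause."},
        # Mediocre first attempt: weight 0 keeps it as context
        # but tells training not to learn from it.
        {"role": "assistant", "weight": 0,
         "content": "The clause is about termination."},
        {"role": "user", "content": "Too vague. What is the notice period?"},
        # The corrected answer is the behavior we actually want: weight 1.
        {"role": "assistant", "weight": 1,
         "content": "Either party may terminate with 30 days' written notice."},
    ]
}

line = json.dumps(example)  # one line of your JSONL file
```

The nice part is you don't have to throw away imperfect conversations from production logs; you keep the bad turn for context and only train on the good one.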
GPT-4o and GPT-4.1 also support vision fine-tuning. You can include image URLs in your training data. I haven't tested this extensively yet but the format is straightforward — just add an image_url content block alongside your text.
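For reference, here is a sketch of what one vision training example looks like — the image URL and the Q&A content are placeholders I made up, but the image_url content block alongside text is the format described above.

```python
# Shape of one vision fine-tuning example -- the URL and the Q&A
# content are placeholders; only the structure matters here.
import json

vision_example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "What document type is shown?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample-invoice.png"}},
        ]},
        {"role": "assistant", "content": "This is an invoice."},
    ]
}

jsonl_line = json.dumps(vision_example)  # one line of your JSONL file
```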
The costs that show up later
Here is where it gets interesting. Foundry gives you three training types: Standard, Global, and Developer. Standard keeps your data in the training region — guaranteed residency. Global is cheaper and faster but your data can be processed anywhere. Developer tier is the cheapest, but your job can get preempted and there is no SLA. For experimentation, Developer is fine. For production training, I would go Standard or Global depending on your compliance situation.
But the training cost is not the part that surprises people. The part that surprises people is the hosting cost. Once you deploy your fine-tuned model, you pay hourly hosting fees whether anyone is calling the API or not. It is not pay-per-token anymore for the hosting part. And if nobody calls your deployment for 15 consecutive days, Microsoft auto-deletes it. The model stays, you can redeploy, but the deployment itself is gone. We found this out the hard way when a staging environment went quiet over Eid holidays and our deployment vanished. Not a big deal to redeploy but it confused the team.
The actual fine-tuning process in the portal is pretty straightforward — pick your base model, choose your method, upload your JSONL file, configure hyperparameters if you want or leave them on auto. I'd say leave them on auto for your first run; the defaults are sensible. You get training loss curves, token accuracy metrics, and checkpoints after each epoch. The last three checkpoints are saved and each one is independently deployable, which is nice because sometimes epoch 2 is better than epoch 4.
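If you'd rather kick the job off from code than click through the portal, here is a rough sketch using the openai Python SDK against Azure OpenAI. The endpoint, API version, file ID, and model snapshot are placeholders — check your own resource for the real values, and note that omitting hyperparameters is what leaves them on auto.

```python
# Sketch of creating an SFT job programmatically -- endpoint, file ID,
# and model snapshot below are placeholders, not real values.
import os

job_params = {
    "model": "gpt-4.1-2025-04-14",   # base model snapshot (placeholder)
    "training_file": "file-abc123",  # ID returned when you uploaded the JSONL
    # No "hyperparameters" key: epochs, batch size, and learning rate
    # stay on auto, which is what I'd do on a first run anyway.
}

if os.environ.get("AZURE_OPENAI_API_KEY"):   # only runs with real credentials
    from openai import AzureOpenAI
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-21",
    )
    job = client.fine_tuning.jobs.create(**job_params)
    print(job.id, job.status)
```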
One feature I genuinely appreciate is the ability to fine-tune a model that was already fine-tuned. Continuous fine-tuning. So you train version one, deploy it, collect more data from production, then fine-tune that fine-tuned model with your new examples. The model name gets long and ugly — something like gpt-4.1-2025-04-14.ft-b044a9d3cf9c4228 — but it works and it means you don't start from scratch every time.
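Those long names do encode something useful: everything before the first .ft- suffix is the base model snapshot. A tiny helper I use to recover it — my own convenience function, not an official API.

```python
# Recover the base snapshot from a fine-tuned model name -- my own
# convenience helper, based on the ".ft-<hash>" naming pattern above.
def base_model_of(ft_name: str) -> str:
    """Strip the '.ft-<hash>' suffix (or suffixes, after continuous
    fine-tuning) to get back the base model snapshot."""
    return ft_name.split(".ft-", 1)[0]
```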
You can also pause training jobs mid-run if the metrics look wrong, which saves you from burning through compute on a bad configuration. And there is a model copy feature in preview that lets you move fine-tuned models across regions within the same tenant, which is useful if you trained in North Central US but need to deploy in Sweden Central.
My biggest complaint is region availability. Fine-tuning is only available in North Central US and Sweden Central, plus East US2 for some models. If your resources are in West Europe or Southeast Asia, you are out of luck for Standard training and need to use the Global type instead.