The 2026 Guide to Microsoft Foundry Models: Choosing the Right LLM


Khawar Habib · January 12, 2026 · 6 min read

The Microsoft Foundry model catalog is getting crowded, with the GPT-5 family alone splitting into a dozen confusing versions like "Pro" and "Codex." While everyone focuses on OpenAI, the real story is that non-OpenAI models like Llama 4 and DeepSeek-R1 are now legitimate, cost-effective competitors available directly in the portal.

So I opened the Foundry portal last week and — honestly? The model catalog itself has become a problem. GPT-5.4, GPT-5.3-codex, GPT-5.2, GPT-5.1, GPT-5, GPT-5-mini, GPT-5-nano, GPT-5-chat, GPT-5-pro. That's just the GPT-5 family. Then you've got the o-series reasoning models, DeepSeek, Grok, Llama, Mistral, Cohere, Moonshot's Kimi, and whatever Black Forest Labs is doing with image generation. I counted, and it's basically 50+ models sitting there. You know what the funny part is? Half my team didn't even know Microsoft sells non-OpenAI models directly now.

Let me break down what I actually learned after spending — well, too much time — testing these for real projects.

The GPT-5 lineup is confusing on purpose

Here's the thing. GPT-5.4 just dropped with a 1 million token context window. That's massive. But you need registration to even access it. Same with GPT-5.3-codex. Microsoft is gating the newest stuff behind approval forms, which makes sense, right?

But the naming drives me crazy. GPT-5.2-codex is different from GPT-5.1-codex, which is different from GPT-5.3-codex. Each has slightly different context limits, different output caps, different deployment options. GPT-5.1's reasoning_effort defaults to none — I found this out the hard way when my chain-of-thought prompts were returning flat answers. Took me an hour to figure out I needed to set it manually.
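That default bit me once, so now I pin reasoning_effort explicitly on every call instead of trusting the model. A minimal sketch of how I assemble the request kwargs; the deployment name and accepted values here are assumptions based on the OpenAI-style API, so check the docs for your specific model:

```python
# Sketch: always set reasoning_effort explicitly rather than relying on the
# model's default. Deployment name "gpt-5.1" is a placeholder.

def build_chat_request(deployment: str, prompt: str,
                       reasoning_effort: str = "medium") -> dict:
    """Assemble kwargs for an OpenAI-style chat.completions.create call.

    GPT-5.1's reasoning_effort reportedly defaults to 'none', so we pass
    a value every time instead of trusting the default.
    """
    return {
        "model": deployment,                   # your Foundry deployment name
        "reasoning_effort": reasoning_effort,  # e.g. 'none'|'low'|'medium'|'high'
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_chat_request("gpt-5.1", "Walk through this step by step: ...")
# client.chat.completions.create(**request)   # client = AzureOpenAI(...)
```

Building the kwargs in one place also means that when the next model ships with yet another default, there's exactly one line to change.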

For most production work, I'm still on GPT-4.1. A million token context, 32K output, available in basically every deployment type. It's stable, it's GA, and I don't need to fill out a form. The known issue with tool definitions over 300K tokens — yeah, that bit me once, but who's sending 300K tokens of function definitions? If you are, we need to talk about your architecture.

GPT-4o-mini remains the workhorse for cost-sensitive stuff. 128K context, fast, cheap. I've been using it for classification tasks and it handles those fine.

Where the non-OpenAI models actually matter

This part, nobody talks about enough.

DeepSeek-R1 and the new V3.2-Speciale — these are real competitors now. 128K context, reasoning capabilities, and they're available through Global Standard deployment. I tested DeepSeek-R1 on a RAG pipeline we built for a client and — it was good. Not GPT-5 good, but for the price difference? Makes sense for internal tools.

Grok surprised me. xAI's grok-4.1-fast models don't even need registration. 128K context in and out. The grok-code-fast-1 with 256K context is interesting for code review scenarios. I haven't tested it enough to have a strong opinion, but the early results I'm seeing look promising.

Meta's Llama-4-Maverick is the one I keep coming back to, though. 1 million token context, 12 languages, and you can fine-tune it. For multilingual projects — and I work on a lot of those — I've tested it extensively and it holds up.

Mistral's Document AI models are sort of in their own category. They take PDFs directly, up to 30 pages. For document processing pipelines this is actually cleaner than the OCR-then-LLM approach I was doing before.

And then there's Microsoft's own model-router. It basically picks the right model for your request — routes between GPT-4.1, o4-mini, GPT-5 automatically. I haven't used it in production yet. The idea of not knowing which model handled your request feels wrong for anything where you need consistency. But for prototyping? Could save time.
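If you do experiment with model-router, at least make the routing visible: OpenAI-style responses carry a `model` field, so a tiny tally tells you which underlying model actually served each request. A sketch, assuming the OpenAI-compatible response shape:

```python
from collections import Counter


# Sketch: tally which model handled each routed request. Assumes the
# OpenAI-style response shape where the served model is in a `model` field.
class RouteTally:
    def __init__(self):
        self.counts = Counter()

    def record(self, response: dict) -> str:
        """Pull the served model out of a response dict and count it."""
        model = response.get("model", "unknown")
        self.counts[model] += 1
        return model
```

Run your prototype traffic through this for a day and you'll know whether the router's choices line up with what you'd have picked by hand, which is the real question before trusting it anywhere near production.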

So how do you actually pick

Look, it depends on the use case. That's the boring answer, but it's true.

  • Need reasoning? The o-series — o3, o4-mini — these are purpose-built for it. o3-pro if budget isn't a concern. DeepSeek-R1 if it is.

  • Need long context? GPT-4.1 for stability. Kimi-K2.5 from Moonshot gives you 262K; it's decent for research-type tasks, but I wouldn't put it in a customer-facing product yet.

  • Need to save money? Global Batch deployment cuts costs by roughly 50%. Not every model supports it, but GPT-4.1 and GPT-4o do. If your workload isn't real-time, this is free money basically.
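For the batch route, the input is a JSONL file with one request per line. A rough sketch of building that file, assuming the OpenAI-style batch request format that Azure's Batch API mirrors; field names and the endpoint path may differ for your deployment, so verify against the current docs:

```python
import json


# Sketch: build the JSONL input for a Global Batch job, one request per line.
# Format assumed from the OpenAI-style batch API; deployment name is a placeholder.
def make_batch_lines(prompts, deployment="gpt-4.1"):
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",          # your key for matching results back
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": deployment,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

You upload the resulting file, create the batch job, and collect results within the processing window; the custom_id is how you join outputs back to your inputs, so make it something meaningful.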

Deployment type matters more than people think. Global Standard is the default and it works. But if you're in the EU and data residency matters, Data Zone Standard only supports a subset of models. Check before you commit to a model that isn't available in your zone.

One more thing. Fine-tuning is available for GPT-4.1, GPT-4o-mini, o4-mini, and some open-source models like Llama and Qwen-32B. If you're building something specialized, this matters more than picking the biggest model. I've seen fine-tuned GPT-4o-mini outperform base GPT-5 on domain-specific tasks. Not always, but enough times that I stopped automatically reaching for the latest model.

The model catalog keeps growing. Microsoft is adding providers fast — Cohere, Moonshot, and xAI all showed up in just the last few months. My advice? Pick two models. One smart, one cheap. Build your abstractions so you can swap. Because three months from now this list will look different again.
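That two-model setup can be as thin as a tier lookup that callers go through instead of hard-coding model names. The model names below are examples, not recommendations; the point is that swapping providers later becomes a one-line change:

```python
# Sketch of the "one smart, one cheap" abstraction: callers ask for a tier,
# not a model name. Model names here are illustrative placeholders.
MODEL_TIERS = {
    "smart": "gpt-5",        # reasoning-heavy, customer-facing work
    "cheap": "gpt-4o-mini",  # classification, internal tools
}


def resolve_model(tier: str) -> str:
    """Map a capability tier to whatever deployment currently backs it."""
    if tier not in MODEL_TIERS:
        raise ValueError(f"unknown tier {tier!r}; use one of {sorted(MODEL_TIERS)}")
    return MODEL_TIERS[tier]
```

When DeepSeek or Llama becomes the better cheap option, you update the dict and nothing downstream notices.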

Tags: AI Models, LLM, Foundry, Microsoft, Azure Foundry


About the Author


Khawar Habib

Microsoft MVP | AI Engineer

Software & AI Engineer specializing in Microsoft Azure, .NET, and cutting-edge AI technologies.
