The Great AI Inference Migration: Why Every Company Is Switching Models Every 90 Days
Model switching costs dropped to near zero. 68% of enterprises now use three or more LLM providers. Average model tenure is 87 days and shrinking. The model layer is commoditizing faster than anyone predicted, and the real lock-in is moving to the orchestration layer that sits above it.
By Raj Patel, AI & Infrastructure · Mar 10, 2026
Frequently Asked Questions
Why are enterprises switching AI models so frequently?
Enterprises are switching primary LLM providers approximately every 87 days because the combination of standardized APIs, commoditized inference pricing, and rapid model quality convergence has eliminated meaningful switching costs. OpenAI-compatible API formats are now supported by virtually every model provider, meaning a migration that once required weeks of engineering can be completed in hours. Meanwhile, new model releases from Anthropic, Google, Meta, and DeepSeek arrive every 6-10 weeks, each offering better performance-per-dollar ratios than its predecessor. According to Flexera's 2026 State of AI report, 68% of enterprises now use three or more LLM providers simultaneously, and 41% maintain active contracts with five or more. The rational strategy is no longer to pick a winner but to continuously route traffic to the best available model for each task.
What are model routing and orchestration layers, and why do they matter?
Model routing and orchestration layers are software platforms that sit between an application and multiple LLM providers, automatically directing each inference request to the optimal model based on cost, latency, quality, and availability. Key players include OpenRouter, LiteLLM, Portkey, Martian, and Unify. These platforms matter because they are becoming the new lock-in point in the AI stack. While switching between GPT-4o and Claude Sonnet is now trivial at the API level, migrating away from an orchestration layer that handles routing logic, fallback chains, cost optimization, rate limit management, and observability is far more difficult. OpenRouter processes over 3 billion tokens per day across 200+ models. LiteLLM has 22,000+ GitHub stars and is embedded in thousands of production applications. The orchestration layer is capturing the durable value that model providers are losing.
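The core mechanics of an orchestration layer, priority ordering, health checks, and fallback chains, can be illustrated with a minimal sketch. This is not how OpenRouter or LiteLLM are actually implemented; the provider names and stub functions below are hypothetical stand-ins for real OpenAI-compatible endpoints.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # prompt -> completion
    healthy: bool = True

class FallbackRouter:
    """Try providers in priority order; skip unhealthy ones,
    fall through on errors, and record which provider answered."""
    def __init__(self, providers: list[Provider]):
        self.providers = providers
        self.last_used: str | None = None

    def complete(self, prompt: str) -> str:
        errors = []
        for p in self.providers:
            if not p.healthy:
                continue
            try:
                result = p.call(prompt)
                self.last_used = p.name
                return result
            except Exception as e:  # rate limit, timeout, outage...
                errors.append((p.name, repr(e)))
        raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical stub providers standing in for real API clients.
def flaky(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")

def stable(prompt: str) -> str:
    return f"echo: {prompt}"

router = FallbackRouter([
    Provider("primary-frontier", flaky),
    Provider("secondary-mid-tier", stable),
])
print(router.complete("hello"))  # falls through to the second provider
```

The lock-in the article describes lives in exactly this kind of logic: once fallback chains, health state, and routing policy accumulate in one gateway, replacing it means re-verifying every failure path.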
How much can companies save with model arbitrage strategies?
Model arbitrage, the practice of routing each query to the cheapest model that meets a quality threshold, can reduce inference costs by 40-72% without measurable quality degradation for most workloads. A typical enterprise strategy routes simple classification and extraction tasks to lightweight models like GPT-4o mini or Claude Haiku at $0.25-$0.80 per million tokens, medium-complexity reasoning to mid-tier models like Claude Sonnet or Gemini 1.5 Pro at $3-$15 per million tokens, and only escalates complex multi-step reasoning to frontier models like GPT-4o, Claude Opus, or Gemini Ultra at $15-$75 per million tokens. Martian's production data shows that 62% of enterprise queries can be handled by models costing less than $1 per million input tokens. The remaining 38% require mid-tier or frontier models, with frontier escalations accounting for only 15-20% of total query volume by count.
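A tiered arbitrage policy like the one described can be sketched in a few lines. The price table and task labels below are hypothetical, loosely drawn from the per-million-token ranges quoted above; real prices change monthly and real routers classify queries with a model, not a lookup table.

```python
# Hypothetical tiers: (name, USD per million input tokens, tasks handled).
TIERS = [
    ("lightweight", 0.50,  {"classify", "extract", "tag"}),
    ("mid-tier",    5.00,  {"summarize", "draft", "translate"}),
    ("frontier",   30.00,  {"plan", "prove", "multi_step"}),
]

def route(task: str) -> tuple[str, float]:
    """Return the cheapest tier whose capability set covers the task."""
    for name, price, tasks in TIERS:
        if task in tasks:
            return name, price
    return "frontier", TIERS[-1][1]  # unknown tasks escalate to be safe

def estimated_cost(task: str, input_tokens: int) -> float:
    """Estimated input cost in USD for routing this task."""
    _, price_per_million = route(task)
    return price_per_million * input_tokens / 1_000_000

# Routing 2M classification tokens: $1.00 on the lightweight tier
# versus $60.00 if everything went to a frontier model.
print(estimated_cost("classify", 2_000_000))
```

The 40-72% savings figure falls out of exactly this asymmetry: when most query volume lands on the cheapest tier, the blended per-token price drops sharply even though frontier escalations remain expensive.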
Is the AI model layer really commoditizing like cloud compute did?
The structural parallels to cloud computing commoditization are strong but imperfect. Like cloud compute in 2010-2015, AI models are converging on standardized interfaces (the OpenAI API format is the equivalent of the S3 API), pricing is falling 10-15x per year, and multi-provider strategies are becoming the default. However, unlike cloud compute, model capabilities still differ meaningfully at the frontier. Claude Opus outperforms competitors on extended reasoning and code generation, GPT-4o leads on certain multimodal tasks, and Gemini has advantages in long-context processing. The commoditization is happening fastest at the lower and mid tiers, where open-source models like Llama 4 and DeepSeek V3 have reached quality parity with proprietary alternatives from 12 months ago. At the frontier, differentiation still exists but the window is narrowing to 3-6 months rather than the 12-18 months it was in 2023.
How are OpenAI, Anthropic, and Google responding to model commoditization?
Each major provider is pursuing a different strategy to maintain pricing power as the model layer commoditizes. OpenAI is moving aggressively into the application layer with ChatGPT Enterprise, custom GPTs, and platform features like memory and file storage that create workflow lock-in beyond the model itself. Anthropic is emphasizing safety, reliability, and enterprise compliance, positioning Claude as the model procurement teams choose when risk tolerance is low. Google is leveraging vertical integration, bundling Gemini with Google Cloud, Workspace, and its advertising stack to make the model a loss leader that drives platform revenue. All three have cut prices by 60-85% over the past 18 months, with GPT-4o-level capability now available at roughly 1/10th the price OpenAI charged for GPT-4 at its March 2023 launch. The price war is accelerating as open-source models close the quality gap.
What should enterprise AI teams do to prepare for a multi-model world?
Enterprise AI teams should implement four structural changes. First, adopt a model-agnostic abstraction layer from day one. Whether using OpenRouter, LiteLLM, Portkey, or a custom gateway, every LLM call should pass through a routing layer that decouples application logic from any specific provider. Second, establish a continuous model evaluation pipeline that benchmarks new releases against production workloads within 48 hours of launch. Companies running quarterly evaluations are already falling behind. Third, negotiate contracts that reflect the new reality: shorter terms (6-12 months maximum), volume-based pricing with no minimums, and explicit provisions for multi-provider deployments. Fourth, invest in prompt portability. The biggest hidden switching cost is not the API integration but the prompt engineering. Teams that structure prompts as data, version-controlled and model-parameterized, can migrate between providers in hours rather than weeks.
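The fourth point, prompts as data, is the least intuitive, so here is a minimal sketch of what "version-controlled and model-parameterized" can mean in practice. The registry format, prompt ID, and model names are all hypothetical; the idea is simply that prompts live in a data file with per-model overrides rather than hardcoded strings.

```python
import json

# Hypothetical prompt registry, as it might sit in a version-controlled
# JSON file: a template, defaults, and per-model parameter overrides.
PROMPTS = json.loads("""
{
  "summarize": {
    "version": "2.1.0",
    "template": "Summarize the following in {max_words} words:\\n{text}",
    "defaults": {"max_words": 100},
    "model_overrides": {
      "claude-sonnet": {"max_words": 80}
    }
  }
}
""")

def render(prompt_id: str, model: str, **kwargs) -> str:
    """Render a prompt for a given model: defaults, then model
    overrides, then call-site arguments, in increasing precedence."""
    spec = PROMPTS[prompt_id]
    params = {
        **spec["defaults"],
        **spec["model_overrides"].get(model, {}),
        **kwargs,
    }
    return spec["template"].format(**params)

print(render("summarize", "claude-sonnet", text="..."))
```

With this structure, migrating providers means adding an override block and re-running the evaluation suite, rather than hunting down prompt strings scattered through application code.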