
The AI Inference Cost Squeeze: How DeepSeek Is Forcing Hyperscalers to Surrender Their Margins

published 6/11/2026

The million-dollar switch

In early 2026, an AI agent startup called Lindy publicly disclosed that it had switched from Anthropic's Claude to DeepSeek for inference and was now saving millions of dollars. This was not a research paper. It was not a benchmark. It was a production deployment decision by a company whose unit economics depend on inference costs—and it was followed by a wave of similar announcements from enterprises experimenting with DeepSeek, often hosted via third-party platforms like Together AI, Atlas Cloud, or Lightning AI Hub that wrap DeepSeek models with 99.9% uptime SLAs and enterprise support.

The pricing gap is structural, not marginal. DeepSeek-V3 costs roughly $0.14 per million input tokens and $0.28 per million output tokens, compared to $2.50/$10.00 for GPT-4o and $3.00/$15.00 for Claude Sonnet 4.6. That is an 18–21× cost advantage on input and a 36–54× advantage on output, even before volume discounts. For an enterprise generating 10 billion output tokens per day (a large production workload), the annual inference bill at list prices drops from roughly $36 million on GPT-4o to roughly $1 million on DeepSeek-V3. These are not rounding errors. They are budget-redefining differences.
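The ratios above follow directly from the quoted list prices, and the annual bill scales linearly with volume. A minimal Python sketch, assuming list prices with no volume discounts, a 365-day year, and all volume billed at the output rate. The function names are illustrative, and the daily volume of 10 billion tokens is an assumption, chosen because it is the volume at which GPT-4o's output price produces a bill of roughly $36 million per year:

```python
# List prices in USD per million tokens, as quoted in the text.
PRICES = {
    "deepseek-v3": {"input": 0.14, "output": 0.28},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}

def cost_ratio(model: str, kind: str, baseline: str = "deepseek-v3") -> float:
    """How many times more expensive `model` is than `baseline` per token."""
    return PRICES[model][kind] / PRICES[baseline][kind]

def annual_bill(model: str, tokens_per_day: float, kind: str = "output") -> float:
    """Annual cost in USD, assuming a flat daily volume and a 365-day year."""
    return tokens_per_day / 1e6 * PRICES[model][kind] * 365

for model in ("gpt-4o", "claude-sonnet-4.6"):
    print(model,
          f"input {cost_ratio(model, 'input'):.1f}x",
          f"output {cost_ratio(model, 'output'):.1f}x")

daily_tokens = 10e9  # assumed volume: 10 billion output tokens per day
for model in PRICES:
    print(model, f"${annual_bill(model, daily_tokens):,.0f} per year")
```

Because the bill is linear in volume, dividing `daily_tokens` by 1,000 divides every annual figure by 1,000 as well; the ratio between providers does not change.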

Anthropic's response was immediate and defensive: it doubled Claude Cowork limits at no charge and launched Fable 5 (Mythos-class) with a limited-time availability window, both moves designed to retain developers before they defect to cheaper alternatives. Microsoft announced that its agent runtime is now free, explicitly positioning Azure Foundry, Visual Studio, and GitHub as the lock-in layers while conceding that the runtime itself—the inference layer—is no longer a defensible margin source. These are not the actions of companies confident in their pricing power.

Why hyperscaler AI margins are about to collapse

The AI infrastructure market in 2024–2025 sits at the intersection of explosive demand for inference compute, a maturing cloud oligopoly facing its first credible margin threat in a decade, and a cost environment that is forcing enterprises to treat AI spending as a line-item rather than an innovation budget. The hyperscalers—Microsoft Azure, AWS, and Google Cloud—have spent the past three years building out GPU clusters and custom accelerators to capture what they believed would be a sustained, high-margin AI infrastructure boom.

Microsoft explicitly disclosed in FY25 that scaling AI infrastructure drove its Microsoft Cloud gross margin down to 69%, a rare admission of cost pressure in a segment that historically printed 70%+ gross margins. AWS is running the largest capex program in its history to build out AI capacity, and Google Cloud's margin expansion from 9.4% to 32.9% in a single year reflects aggressive scale-up rather than sustainable unit economics. The baseline assumption baked into hyperscaler valuations is that AI inference will follow the playbook of the 2010–2015 cloud price wars: visible price cuts, but margins defended through scale, efficiency, and ecosystem lock-in.

That playbook worked because AWS, Azure, and GCP were competing against each other and on-prem legacy infrastructure, not against a fundamentally cheaper architectural paradigm. DeepSeek's Mixture-of-Experts models, which activate only 37B of 671B parameters per token, represent a step-function improvement in cost efficiency that the hyperscalers cannot easily replicate without cannibalizing their existing GPU fleets. The 2010–2015 AWS price war is instructive: AWS cut S3 storage prices by more than 80% from 2010 to 2016, yet gross margins expanded from the high-40s to low-60s because the cost curve dropped faster than prices. The difference today is that DeepSeek and open-weight models are not just cheaper—they are architecturally different, and they are being adopted by enterprises that have already built the MLOps and governance infrastructure to run them.
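The efficiency claim can be made concrete with a back-of-the-envelope calculation. The sketch below uses the 37B and 671B figures from the text together with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. That rule of thumb is an assumption here: it ignores attention cost, memory bandwidth, and expert-routing overhead, so the result is an order-of-magnitude illustration rather than a benchmark.

```python
TOTAL_PARAMS = 671e9   # total parameters (from the text)
ACTIVE_PARAMS = 37e9   # parameters activated per token (from the text)

def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
# Hypothetical comparison point: a dense model with the same total parameter count.
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
compute_ratio = dense_flops / moe_flops

print(f"active fraction per token: {active_fraction:.1%}")
print(f"per-token compute vs. equal-size dense model: {compute_ratio:.1f}x lower")
```

Under these assumptions only about 5.5% of the parameters do work on any given token. The full parameter set still has to be held in memory, so the saving shows up in compute per token rather than in memory footprint.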

The AI governance and observability market, currently around $1.5–3 billion globally, is growing at 25–40% annually precisely because enterprises are moving from pilot-phase "let's try GPT-4" to production-phase "we need policy controls, cost management, and multi-model routing." This is not a story about AI demand collapsing—it is a story about the value stack fragmenting, with inference becoming a commodity input and the real margin accruing to the orchestration, governance, and data layers that sit above it.

The market is anchoring on the wrong precedent

The market is pricing hyperscaler AI infrastructure as a sustained, high-margin business because it is anchoring on the 2010–2015 cloud price war, where AWS cut prices aggressively yet margins expanded. The consensus view is that Azure, AWS, and GCP will defend AI inference margins the same way: through scale, efficiency gains, and ecosystem lock-in. This view is wrong for three reasons.

First, the 2010–2015 price war was a competition among hyperscalers and against on-prem legacy infrastructure; today's competition is against a fundamentally cheaper architectural paradigm that the hyperscalers cannot easily replicate without stranding their existing GPU investments. Second, the incumbents' defensive moves (Microsoft's free agent runtime, Anthropic's doubled Cowork limits and limited-time Mythos release) are not the actions of companies with pricing power; they are the actions of companies trying to slow defection. Third, enterprise AI cost sensitivity data shows that inference is already a minority cost, accounting for only 15–30% of total AI TCO, with 70–85% sitting in data pipelines, governance, observability, and human oversight. Most of that surrounding spend does not depend on which model serves the tokens, so swapping the inference provider leaves the bulk of the investment intact. This means that even modest price cuts from the hyperscalers will not prevent enterprises from switching to DeepSeek if the cost gap remains 20–50×.

The informational asymmetry is that most investors are focused on headline AI revenue growth—Azure up 39% YoY, Google Cloud up 28% YoY—and are not decomposing that growth into inference versus platform services. Microsoft's disclosure that AI infrastructure is pressuring gross margins is buried in segment commentary, and neither AWS nor Google Cloud provides a clean P&L breakdown for AI infrastructure. The narrative inertia is powerful: "AI is the next cloud, and the hyperscalers will dominate AI just like they dominated cloud." But the structural difference is that cloud infrastructure (compute, storage, networking) was never truly commoditized because it required massive fixed investments and had strong data gravity effects. AI inference, by contrast, is stateless, portable, and increasingly available from platforms that offer comparable SLAs at a fraction of the cost.

The gap persists because the hyperscalers are still growing fast enough that investors are not yet worried about margin compression, and because the DeepSeek adoption curve is still early enough that it looks like a niche phenomenon rather than a structural threat. That will change in H2 2026 when the next wave of enterprise AI deployments hits production and CFOs start asking why they are paying 20–50× more for inference than they need to.

Where the value is shifting

The opportunity size depends on how much of the hyperscalers' AI infrastructure revenue is at risk and how much value shifts to the orchestration and governance layers. Microsoft's Intelligent Cloud segment generated roughly $96 billion in revenue in FY25, with Azure accounting for the majority; if AI infrastructure is 10–15% of Azure revenue—a conservative estimate given Microsoft's emphasis on AI growth—that is $10–15 billion in annual revenue. AWS generated roughly $105 billion in revenue in 2024, with AI infrastructure likely in the $15–20 billion range based on growth commentary. Google Cloud generated roughly $50 billion in 2024, with AI infrastructure and generative-AI solutions driving growth; assume $7–10 billion in AI-related revenue.

Combined, the hyperscalers are running roughly $30–45 billion in annual AI infrastructure revenue, with gross margins in the 50–70% range depending on the mix of inference versus platform services. If margins on that revenue compress by 20–30 percentage points over the next 18–24 months due to DeepSeek arbitrage and open-weight model adoption, that is $6–13 billion in annual gross profit at risk across the three hyperscalers. This is an upper bound: it treats the entire AI infrastructure revenue base, not only the inference portion, as exposed to the compression.
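The gross-profit-at-risk range is a simple product of revenue and margin compression. A short sketch of that arithmetic follows, with one sensitivity added: scaling the exposed revenue by the 30–50% inference share that appears in the assumptions list later in this piece. The function name and the decision to apply the share multiplicatively are illustrative assumptions, not a method stated by any of the companies.

```python
def profit_at_risk(revenue_bn: float, compression_pts: float,
                   inference_share: float = 1.0) -> float:
    """Gross profit at risk in USD bn: revenue x exposed share x margin compression."""
    return revenue_bn * inference_share * compression_pts / 100.0

# Ranges from the text: $30-45bn revenue, 20-30 points of margin compression,
# applied to the whole AI infrastructure revenue base.
low_all = profit_at_risk(30, 20)
high_all = profit_at_risk(45, 30)

# Sensitivity: compression applies only to the inference share (30-50% of revenue).
low_inf = profit_at_risk(30, 20, inference_share=0.30)
high_inf = profit_at_risk(45, 30, inference_share=0.50)

print(f"whole revenue base exposed: ${low_all:.1f}bn to ${high_all:.1f}bn")
print(f"inference share only:       ${low_inf:.1f}bn to ${high_inf:.2f}bn")
```

The gap between the two ranges shows how sensitive the headline number is to what fraction of AI infrastructure revenue is actually pure inference.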

The value does not disappear; it shifts to the orchestration, governance, and observability layers. The LLM observability market is forecast to grow from $1.44 billion in 2024 to $6.8 billion by 2029, a $5.4 billion increase. The AI governance software market is forecast to grow from roughly $300 million in 2025 to $5.88 billion by 2035, implying about $5.6 billion in incremental market size over the next decade. Cloudflare, Databricks, Datadog, and other independent platforms are positioned to capture a disproportionate share of this growth because they are not tied to a single inference provider and can offer better cost management and policy controls than the hyperscalers' native tooling.
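The forecasts above can be converted into compound annual growth rates to check them against the growth figures quoted elsewhere in this piece. A minimal sketch, assuming the start and end values are point estimates for the stated years and that growth compounds annually:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1.0 / years) - 1.0

# Market sizes in USD billions, from the text.
observability_cagr = cagr(1.44, 6.80, 2029 - 2024)
governance_cagr = cagr(0.30, 5.88, 2035 - 2025)

observability_increment = 6.80 - 1.44
governance_increment = 5.88 - 0.30

print(f"LLM observability: {observability_cagr:.1%} per year, "
      f"+${observability_increment:.2f}bn")
print(f"AI governance:     {governance_cagr:.1%} per year, "
      f"+${governance_increment:.2f}bn")
```

Both rates land in the mid-30s percent per year, inside the 25–40% annual growth band cited earlier for the governance and observability market.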

Cloudflare reported 1,200%+ year-over-year growth in AI Gateway requests, indicating that enterprises are routing inference calls through independent control planes rather than locking into Azure OpenAI or Bedrock. Independent analyses of enterprise AI spending show that inference and model access account for only 15–30% of total AI TCO, with 70–85% sitting in data pipelines, governance, observability, and human oversight. This means that even if inference margins compress to near-zero, the hyperscalers can still capture value—but only if they control the surrounding platform. The evidence suggests they are losing that control.
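To see what the TCO split implies for a total budget, the inference share can be combined with the price gap. The sketch below assumes that non-inference costs stay constant when the provider changes and that the full list-price ratio is realized, both of which are simplifications; the function name is illustrative.

```python
def total_tco_saving(inference_share: float, price_ratio: float) -> float:
    """Fraction of total AI TCO saved when inference becomes `price_ratio` times cheaper.

    Assumes every non-inference cost is unchanged by the switch.
    """
    return inference_share * (1.0 - 1.0 / price_ratio)

# Bounds from the text: inference is 15-30% of TCO, the price gap is 20-50x.
low = total_tco_saving(0.15, 20)
high = total_tco_saving(0.30, 50)

print(f"total TCO saving: {low:.1%} to {high:.1%}")
```

Because the saving is capped by the inference share, it can never exceed 30% of TCO under these figures, which is why the remaining 70–85% of spend is where platform control matters.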

The deployment velocity problem

AI teams now deploy 1,000 times per month, requiring new pipeline infrastructure that the hyperscalers' native tooling was not built for. The shift from pilot to production AI deployments is not just about inference cost—it is about governance, observability, and orchestration at a scale that traditional DevOps tooling cannot handle. Microsoft is pitching enterprises to migrate from Azure Repos to GitHub despite GitHub's rocky reliability record, a sign that the company is prioritizing ecosystem consolidation over service quality. Microsoft positioned Azure Foundry as the reliability and governance layer, betting that the enterprise AI battle is about orchestration, not raw inference capability.

But the evidence suggests that enterprises are not buying the bundled platform story. They are routing inference through independent control planes like Cloudflare's AI Gateway, using Datadog for LLM observability, consolidating data on MongoDB Atlas and Snowflake's Cortex AI, and streaming real-time data through Confluent's Kafka-based pipelines. The hyperscalers are still capturing some of this value through platform services, but the margin profile is fundamentally different: platform services require ongoing R&D investment and compete with best-of-breed independent tools, whereas inference was supposed to be a high-margin, low-touch revenue stream.
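The idea of an independent control plane can be illustrated with a toy router that picks the cheapest model a policy allows. This is a hypothetical sketch, not how Cloudflare's AI Gateway or any other product works: the `Route` type, the `approved_for_restricted` flag, and the flag values assigned to each model are invented for illustration, and only the output prices come from this piece.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str
    output_price: float            # USD per million output tokens (from the text)
    approved_for_restricted: bool  # hypothetical policy flag, values are placeholders

ROUTES = [
    Route("deepseek-v3", 0.28, approved_for_restricted=False),
    Route("gpt-4o", 10.00, approved_for_restricted=True),
    Route("claude-sonnet-4.6", 15.00, approved_for_restricted=True),
]

def pick_route(restricted_data: bool, routes=ROUTES) -> Route:
    """Return the cheapest route that the policy allows for this request."""
    allowed = [r for r in routes
               if r.approved_for_restricted or not restricted_data]
    if not allowed:
        raise ValueError("no route satisfies the policy")
    return min(allowed, key=lambda r: r.output_price)

print(pick_route(restricted_data=False).model)
print(pick_route(restricted_data=True).model)
```

The design point is that policy and price live in the routing layer rather than in the application, so changing providers is a configuration change instead of a rewrite.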

The instruments

This portfolio expresses the thesis through three structural layers: orchestration and governance platforms that capture value as inference commoditizes, data foundation infrastructure that enterprises consolidate on regardless of inference provider, and infrastructure beneficiaries that win if inference volume explodes even as margins compress.

Cloudflare (NET) is the purest structural exposure to inference commoditization. AI Gateway sits between enterprises and inference providers, capturing value as multi-model routing becomes standard. The 1,200% YoY growth in AI Gateway requests validates the thesis mechanism: enterprises are no longer willing to lock into Azure OpenAI or Bedrock if it means paying 20–50× more for inference. Cloudflare's valuation prices in perfection—33x price-to-sales, negative operating margin—but the orchestration layer is where margin accrues as inference becomes a commodity input. Upside is 50% to $330 if AI Gateway monetization accelerates in H2 2026 as enterprise deployments scale. Weight: 20%. Horizon: 540 days.

Datadog (DDOG) is a leading observability platform with LLM monitoring capabilities; the LLM observability market is forecast to grow 36% annually to $6.8 billion by 2029. As enterprises route between DeepSeek, Claude, and on-prem models, they need unified logging, cost tracking, and policy enforcement—capabilities that Datadog provides better than the hyperscalers' native tooling. Valuation is nosebleed—22x sales, 593x P/E—leaving no margin for error, but the thesis mechanism (governance spend rising as inference commoditizes) is direct. Upside is 50% to $340 if LLM observability spending scales with multi-model deployment complexity. Weight: 15%. Horizon: 540 days.

Elastic (ESTC) provides search, observability, and security analytics that map to the AI governance and monitoring layer. Elasticsearch is the de facto standard for log aggregation, which positions the company in AI observability without hyperscaler margin exposure. Valuation is undemanding (17x earnings for 17% growth, 5.1% FCF yield), and the company benefits structurally as multi-model deployments require centralized logging and anomaly detection. Upside is 40% to $85 if observability spending scales with AI deployment complexity. Weight: 12%. Horizon: 450 days.

MongoDB (MDB) provides the data foundation layer through Atlas and vector search, capturing spending as enterprises build production AI systems regardless of inference provider. Both products sit above the inference layer: enterprises running multi-model architectures need a database that handles unstructured data and embeddings at scale whether they route to DeepSeek, Claude, or GPT-4. Revenue growth of 23% is enterprise-driven, aligning with the H2 2026 deployment wave. Upside is 25% if Atlas becomes the default vector database for production AI. Weight: 15%. Horizon: 540 days.

NVIDIA (NVDA) benefits structurally as hyperscaler inference margins compress and enterprises move to self-hosted DeepSeek or hybrid architectures to escape Azure/AWS pricing. Those enterprises still need GPUs, and NVIDIA's software stack (NIM, TensorRT-LLM, Triton) captures orchestration value at the edge and in private clouds. The thesis catalyst is double-edged: more on-prem deployments favor NVIDIA hardware, but lower per-token compute intensity (DeepSeek MoE efficiency) means fewer chips are needed for a given inference volume. Sized at 18% as a hedge against cloud margin compression and as exposure to on-prem GPU demand if cost becomes the dominant enterprise selection criterion. Target: $260. Horizon: 450 days.

Snowflake (SNOW) sits in the data foundation layer where enterprises consolidate before running inference workloads, the 15–25% of AI TCO that the thesis identifies as non-commoditizing. Cortex AI positions Snowflake as the policy and pipeline control plane. Valuation assumes margin expansion that has not yet materialized (operating margin of -26%, 16.5x sales), and the company's competitive position against Databricks and hyperscaler bundles is contested. Sized at 10% to reflect thesis alignment (governance spend rising as inference commoditizes) but unproven capture of that spend at current burn rates. Horizon: 540 days.

Confluent (CFLT) captures data streaming infrastructure spending as enterprises build real-time AI pipelines; data foundation accounts for 15–25% of AI TCO. Kafka-based data streaming for real-time pipelines is the data plumbing for multi-model AI architectures. As enterprises route inference across DeepSeek, Anthropic, and open-source models, they need real-time data streaming to feed those models and orchestrate outputs. Valuation already prices in the growth story (9.5x P/S, negative FCF) and the company is still unprofitable. Sized at 10% as exposure to multi-model AI data infrastructure with 30–40% upside to $40 if AI data spending accelerates. Horizon: 450 days.

Assumptions and falsification

  1. DeepSeek and other low-cost inference providers scale to enterprise production volumes with 99.9%+ uptime SLAs by H2 2026. Falsified if: DeepSeek or third-party wrappers (Together AI, Atlas Cloud, Lightning AI Hub) experience sustained outages or fail to meet enterprise reliability thresholds, causing risk-averse enterprises to pay the hyperscaler premium for Azure OpenAI or Bedrock-hosted Anthropic.

  2. Inference accounts for 30–50% of hyperscaler AI infrastructure revenue, with gross margins in the 50–70% range. Falsified if: Hyperscalers disclose that inference is <20% of AI revenue or that platform services (Foundry, Bedrock tooling, Azure ML) already capture the majority of margin, reducing the magnitude of the compression.

  3. Enterprises redirect 20–40% of inference cost savings into governance, observability, and orchestration tooling by 2027. Falsified if: Enterprises pocket the savings rather than reinvesting, or if hyperscalers successfully bundle governance tooling at zero marginal cost (e.g., Azure Foundry becomes free, AWS launches a Bedrock governance layer that is "good enough"), preventing independent platforms from capturing the value shift.

  4. Hyperscalers cannot defend inference pricing through ecosystem lock-in or superior reliability. Falsified if: Microsoft successfully locks enterprises into Azure Foundry with switching costs high enough to justify the 20–50× inference premium, or if AWS builds a governance layer that is meaningfully better than Cloudflare/Databricks, keeping enterprises on Bedrock despite the cost gap.

  5. The H2 2026 wave of enterprise AI deployments prioritizes cost over capability as the dominant selection criterion. Falsified if: Enterprises delay production rollouts due to economic conditions or regulatory uncertainty, pushing the catalyst out 12–18 months, or if capability (model quality, latency, context window) remains the primary decision factor and enterprises accept the hyperscaler premium for perceived performance advantages.

Risks

Hyperscaler price cuts could narrow the cost gap if Azure, AWS, and GCP cut inference pricing aggressively (50–70% reductions) to defend market share. Enterprises may perceive the hyperscalers as "cheap enough," delaying the shift to DeepSeek and reducing the urgency for independent governance platforms. This compresses hyperscaler margins faster than the thesis predicts but also delays the value shift to orchestration layers.

Bundling and zero-marginal-cost competition is the "good enough" risk. Hyperscalers could bundle governance, observability, and orchestration tooling into their platforms at no incremental cost (e.g., Azure Foundry becomes free, AWS launches Bedrock Policy Manager), making it economically irrational for enterprises to pay Cloudflare, Datadog, or Databricks for standalone tooling. The hyperscaler version does not need to be best-in-class, just sufficient to prevent defection.

DeepSeek reliability and geopolitical risk: DeepSeek is a Chinese model provider; if geopolitical tensions escalate or if U.S. enterprises face regulatory pressure to avoid Chinese AI infrastructure, adoption stalls regardless of cost advantages. Third-party wrappers (Together AI, Atlas Cloud) mitigate this by hosting DeepSeek in U.S. data centers, but the risk remains if the underlying model architecture is perceived as a supply-chain vulnerability.

Valuation compression in high-multiple governance plays: Cloudflare (33x P/S), Datadog (22x P/S), and Snowflake (16.5x P/S) are priced for sustained hypergrowth. If AI Gateway or LLM observability adoption is slower than expected, or if the companies miss quarterly guidance, multiples compress 30–50% regardless of the long-term thesis validity. This is execution risk, not thesis risk, but it creates near-term volatility.

NVIDIA demand destruction from MoE efficiency: DeepSeek's Mixture-of-Experts architecture activates only 37B of 671B parameters per token, reducing the compute required per token. If this efficiency gain propagates across the industry (OpenAI, Anthropic adopt similar architectures), the number of GPUs needed for a given inference volume declines, so total GPU demand could grow more slowly than inference volume and pressure NVIDIA's unit sales growth. The company's software attach rate (NIM, Triton) may not offset the hardware headwind.

Crowded trade risk in AI infrastructure shorts: If the thesis becomes consensus (e.g., multiple hedge funds short hyperscaler margins or underweight MSFT/GOOGL/AMZN in favor of governance plays), any positive surprise (hyperscaler pricing defense, DeepSeek outage, enterprise AI spending pause) triggers a violent unwind. This portfolio is long-only, but the governance longs (NET, DDOG, ESTC) are correlated and could sell off together if the narrative shifts.

Portfolio

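| Ticker | Weight | Target | Horizon |
|--------|--------|--------|---------|
| NET    | 20%    | $330   | 540d    |
| DDOG   | 15%    | $340   | 540d    |
| ESTC   | 12%    | $85    | 450d    |
| MDB    | 15%    |        | 540d    |
| NVDA   | 18%    | $260   | 450d    |
| SNOW   | 10%    |        | 540d    |
| CFLT   | 10%    | $40    | 450d    |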
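A quick consistency check on the portfolio: the weights should sum to 100%, and where the text states both a target and a single upside percentage, the pair implies a reference price. The sketch below derives those implied prices arithmetically; they are not quoted market prices. Positions without a target or without a single stated upside figure are skipped rather than filled in.

```python
# (ticker, weight %, target USD or None, stated upside % or None), from the text.
POSITIONS = [
    ("NET", 20, 330, 50),
    ("DDOG", 15, 340, 50),
    ("ESTC", 12, 85, 40),
    ("MDB", 15, None, 25),     # upside stated, no price target
    ("NVDA", 18, 260, None),   # target stated, no upside percentage
    ("SNOW", 10, None, None),  # neither stated
    ("CFLT", 10, 40, None),    # text gives a 30-40% range, so no single value
]

total_weight = sum(weight for _, weight, _, _ in POSITIONS)

def implied_price(target: float, upside_pct: float) -> float:
    """Reference price implied by a target and a stated percentage upside."""
    return target / (1 + upside_pct / 100)

implied = {
    ticker: round(implied_price(target, upside), 2)
    for ticker, _, target, upside in POSITIONS
    if target is not None and upside is not None
}

print(f"total weight: {total_weight}%")
print(implied)
```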

Sources

  1. The New Stack, "Microsoft just made the agent runtime free — and kept everything around it"
  2. The New Stack, "AI teams now deploy 1,000 times a month. Your pipeline wasn’t built for that."
  3. The New Stack, "With Foundry, Microsoft bets the enterprise AI battle is about reliability, not capability"
  4. The New Stack, "Why Anthropic just doubled Claude Cowork limits at no charge"
  5. The New Stack, "Microsoft’s pitch to enterprises: Ditch Azure Repos for GitHub, despite its rocky reliability record"
  6. The New Stack, "This AI agent startup ditched Anthropic for DeepSeek’s — and says it’s saving millions"
  7. The New Stack, "Anthropic launches Claude Mythos/Fable 5, but you better try it soon"