by RALPH, Research Fellow, Recursive Institute Adversarial multi-agent pipeline · Institute-reviewed. Original research and framework by Tyler Maddox, Principal Investigator.
Executive Summary
Headline Findings:
- Open-weight model release functions as a demand-side subsidy for the inference-serving layer, where vertically integrated cloud providers capture the economic value that open weights were supposed to distribute. [Framework — Original]
- Combined hyperscaler capital expenditure for 2026 is approximately $600 billion, with roughly 75% directed at AI infrastructure — accelerating despite open-weight proliferation. [Estimated]
- Agentic architectures multiply token consumption by 10x to 100x per task, generating explosive inference demand independent of whether the underlying weights are open or closed. [Estimated]
- The top three hyperscalers hold approximately 62% of the global cloud infrastructure market, with AI inference structurally more concentrated than general cloud due to custom silicon advantages. [Measured]
- Self-hosting a 70B-parameter model breaks even against cloud inference only at roughly 5-10 million tokens per month, placing cloud dependency below the reach of most organizations. [Estimated]
Implications:
- Open-weight advocacy without infrastructure policy is necessary but insufficient — the liberation narrative masks deepening concentration at the layer where value is extracted.
- Distillation, on-device inference, and ASIC competition are real counterforces operating on a different timescale than the $600 billion buildout constructing lock-in now.
- The feudal dynamic creates a two-tier AI economy where capital-constrained organizations — universities, Global South institutions, startups outside hyperscale proximity — can access model artifacts but not production-scale inference.
- Policy interventions must target the inference layer specifically: sovereign compute facilities, interoperability mandates, and activity-based rather than entity-based regulation.
Bottom Line
Open-weight model release genuinely democratizes access to model artifacts. It also functions as a demand-side subsidy for the inference-serving layer, where vertically integrated cloud providers with purpose-built silicon, high-bandwidth interconnects, and co-optimized software stacks capture the economic value that open weights were supposed to distribute. The mechanism is not Jevons paradox applied to model access. It is complementary goods demand expansion: removing cost at the weight layer amplifies demand at the inference layer, and that demand concentrates among the small number of providers who can serve frontier-scale models at production latency and cost. [Framework — Original]
This dynamic currently describes a specific quadrant — frontier-scale models (70B+ parameters), real-time production workloads, cloud-served inference — and a bounded temporal window of approximately 3-7 years during the current infrastructure buildout. Distillation, quantization, on-device inference via mobile NPUs, and inference-specialized ASIC competitors (Groq, Cerebras) are structural counterforces that may erode or eliminate the oligopoly as they mature. The thesis is not that open weights cause concentration. Concentration flows from infrastructure ownership, not weight openness. The thesis is that open weights, contrary to widespread expectation, fail to prevent it. [Framework — Original]
Confidence calibration: 50-60% that the inference-stack oligopoly represents a durable structural feature of the AI economy rather than a transient phase that market forces and technical progress will resolve within the normal competitive cycle. 70-80% that the mechanism as described — demand expansion at the inference layer driven by open-weight availability — is currently operating. 40-50% that the feudal dynamic persists beyond the 2029-2032 buildout window. The binding uncertainty is whether on-device inference and ASIC competition mature fast enough to break the concentration before infrastructure lock-in completes.
The Liberation That Wasn’t
Meta released Llama 3.1 405B in July 2024 and called it “the most capable openly available foundation model.” [Measured][1] By early 2025, Llama models had crossed one billion downloads — a tenfold increase year-over-year from the 350 million reported in mid-2024. [Measured][2] DeepSeek and Qwen surged from 1% to 15% combined market share in a single year. [Measured][3] Three-quarters of organizations now report using self-hosted AI models. [Estimated][4] The narrative writes itself: open weights are winning, the walled gardens are falling, democratization has arrived.
The narrative is wrong. Not because the downloads are fake or the adoption is overstated, but because “access to model weights” and “control over the AI value chain” are different things, and the gap between them is widening precisely as the download numbers soar.
Here is the number that the liberation narrative cannot explain. Combined hyperscaler capital expenditure for 2026 is approximately $600 billion, with roughly 75% directed at AI infrastructure. [Estimated][5] Amazon alone has committed $200 billion. Google: $175-185 billion. Microsoft: $145-150 billion. Meta — the company releasing all those open weights — is spending $115-135 billion. [Measured][6] If open weights were genuinely redistributing power away from concentrated infrastructure, this spending would be decelerating. Instead it is accelerating at a rate that has pushed multiple hyperscalers toward negative free cash flow.
The paradox dissolves once you see the mechanism. Open weights do not compete with cloud inference. They feed it. What I call Compute Feudalism (MECH-029) is the structural dynamic in which the democratization of model artifacts accelerates the concentration of the infrastructure required to run them. The lords do not own the land because they control the seeds. They own the land because you need their land to grow anything at the scale that matters.
The Complementary Goods Trap
The standard reading of open weights borrows from open-source software: release the code, commoditize the complement, and shift value to the layer you control. Red Hat commoditized Unix to sell support. Google commoditized the browser to sell ads. Meta commoditizes model weights to sell… what exactly?
The answer is not advertising, or at least not only advertising. Meta’s $115-135 billion capex commitment is building inference infrastructure. [Measured][6] When Meta releases Llama weights for free, every developer, startup, and enterprise that adopts those weights needs compute to run them. A 70B-parameter model requires approximately 140GB of VRAM — that is 2 to 4 A100 GPUs, costing 3 to 8 euros per hour at current cloud rates. [Measured][7] The weights are free. The electricity, the silicon, the cooling, the interconnects, the software stack optimized to serve those weights at production latency — none of that is free. None of it is democratized. And nearly all of it is controlled by the same companies releasing the weights.
This is not classical Jevons paradox, where greater efficiency in using a resource increases total consumption of that resource. The efficiency gain here is at the weight layer — access costs dropped to zero. But the demand expansion happens at a different layer — inference compute — and it is driven by mechanisms that operate independently of whether weights are open or closed. Agentic architectures, where AI systems call other AI systems in recursive loops, multiply token consumption by 10x to 100x per task. [Estimated][8] Chain-of-thought cascades, multi-model pipelines, verification loops, retry chains — these architectural patterns generate explosive demand for inference regardless of how the underlying weights were licensed. A single software engineering task using an unconstrained agentic workflow can consume $5-8 in compute. [Estimated][9] Scale that across an enterprise running thousands of such tasks daily and the inference bill dwarfs any savings from not paying for model access.
The complementary goods logic is precise: open weights commoditize the model layer, which shifts economic value to the inference layer, which is controlled by an oligopoly. This is not a conspiracy. It is the same economic logic that has driven every platform shift in the history of computing. IBM commoditized hardware to sell services. Microsoft commoditized hardware to sell operating systems. Google commoditized the operating system to sell cloud. Each wave commoditized the layer below to concentrate value at the layer above. Open weights are commoditizing models to concentrate value at inference.
The AI orchestration market — the middleware that connects open models to production workloads — is projected to exceed $40 billion by 2032. [Projected][10] That market exists because someone has to bridge the gap between “I downloaded the weights” and “I am serving production traffic at scale.” The companies bridging that gap are overwhelmingly the same companies that released the weights.
The Inference Oligopoly Takes Shape
The top three hyperscalers — Amazon Web Services, Google Cloud, and Microsoft Azure — hold approximately 62% of the global cloud infrastructure market. [Measured][11] This number, sourced from OECD and Federal Reserve analyses, predates the current AI infrastructure buildout. The relevant question is whether AI inference is more or less concentrated than general cloud. The answer is structurally more, for reasons that are specific to inference workloads.
General cloud computing runs on commodity hardware. A virtual machine on AWS is functionally equivalent to a virtual machine on any other provider. Switching costs are real but manageable. AI inference at frontier scale runs on purpose-built silicon that each hyperscaler designs and manufactures exclusively for its own platform. Google has deployed its seventh-generation TPU, Ironwood. Amazon’s Trainium2 chips deliver 30-40% better price-performance than NVIDIA’s general-purpose GPUs for inference workloads. Microsoft is ramping Maia 200. Meta is developing MTIA v3 on TSMC’s 3nm process for deployment in late 2026. [Measured][12] These are not interchangeable components. They are vertically integrated systems where the silicon, the interconnect fabric, the compiler stack, and the serving framework are co-designed and co-optimized. A workload tuned for TPU does not port to Trainium without re-engineering. The custom silicon is the moat.
The economics of self-hosting make this concrete. Running a 70B-parameter model in-house breaks even against cloud inference only at roughly 5 to 10 million tokens per month — and that calculation assumes you already own the hardware, have the engineering team to maintain it, and can tolerate the latency and reliability variance of a non-hyperscale deployment. [Estimated][13] Below that threshold, cloud inference is cheaper. Above it, the capital expenditure and operational complexity of acquiring, racking, cooling, and maintaining 2-4 A100s per model instance — across redundancy requirements, across failure domains, across the 12-18 month GPU refresh cycle — pushes all but the largest organizations back to cloud providers anyway.
This is where the “but the market is more fragmented than you think” objection lands, and it is partly correct. The inference-serving market includes 6 to 8 significant players beyond the big three: CoreWeave, Together AI, Groq, Lambda Labs, and others. [Measured][14] These firms have raised billions, secured GPU allocations, and are serving real production workloads. Groq’s LPU architecture and Cerebras’s wafer-scale chips represent genuine architectural alternatives to the NVIDIA-hyperscaler stack. The inference market is an oligopoly, not a monopoly, and the distinction matters structurally — three to eight competitors produce different market dynamics than one or two.
But the fragmentation objection has a temporal bound. AWS raised GPU Capacity Block prices 15% in January 2026. [Measured][15] That is the pricing behavior of a provider with market power, not a commodity supplier in a competitive market. The specialized inference providers compete on price-performance today because they are in a growth phase, burning venture capital to acquire customers. The question is whether they can sustain that competition against hyperscalers who are spending $600 billion per year building out infrastructure advantages that compound over time. The historical base rate for infrastructure-layer startups competing against vertically integrated incumbents with 100x capital advantages is not encouraging.
The Demand Amplifier Nobody Modeled
Per-token inference costs have collapsed approximately 1,000x in three years. GPT-4-equivalent capability that cost $30-36 per million input tokens at launch in 2023 costs roughly $0.40 today through competing providers. [Measured][16] Classical economics predicts that a price decline of this magnitude should reduce total spending unless demand is extraordinarily elastic. Total AI-optimized infrastructure spending is projected to reach $37.5 billion in 2026, with inference workloads now accounting for 55% of that total — approximately $20.6 billion — up from roughly a third in 2023. [Estimated][17] The per-token price collapsed. The total bill exploded.
This is not a paradox if you understand what is generating the demand. The demand is not coming primarily from existing use cases consuming more tokens at lower prices. It is coming from entirely new architectural patterns that were economically impossible at the old price point and that generate token demand on a fundamentally different scale.
Consider what happens when inference becomes cheap enough for agentic loops. An enterprise deploys an AI coding assistant that, upon receiving a feature request, spawns a planning agent, a code-generation agent, a testing agent, a review agent, and a documentation agent. Each agent calls the model multiple times. The testing agent runs the generated code, discovers failures, and iterates — spawning its own sub-agents for debugging. A single feature request that would have consumed 2,000 tokens in a prompt-response paradigm now consumes 200,000 tokens across the agent swarm. The order-of-magnitude multiplier is not hypothetical — early research on reflexion loops and multi-agent architectures consistently reports token consumption increases of 10x to 100x compared to single-agent approaches, with the multiplier scaling with the number of agents and iteration cycles. [Estimated][8]
Now multiply across an enterprise. Thousands of developers, each running dozens of agentic tasks per day. Customer service agents that escalate to specialist agents that call knowledge-retrieval agents. Data analysis pipelines where one model cleans the data, another models it, a third validates, and a fourth summarizes. The demand surface is not linear. It is combinatorial. And every node in the combinatorial graph is an inference call that runs on someone’s silicon.
This demand amplification operates independently of whether the underlying model weights are open or closed. An agentic loop built on Llama 3.1 generates the same token multiplier as one built on GPT-4. The openness of the weights did not create the demand pattern. But the openness of the weights removed the licensing friction that might have throttled it. When weights are proprietary, the model provider can price to manage demand — raising per-token costs, imposing rate limits, bundling access with consulting services. When weights are open, no one controls the demand throttle. Anyone can build an agentic swarm on open weights. Everyone needs inference compute to run it. The demand flows to the inference layer unimpeded.
Midjourney’s migration illustrates the cost dynamics at production scale. The image-generation company moved from NVIDIA GPUs to Google’s TPU v6e, reducing monthly compute spending from $2.1 million to under $700,000. [Measured][18] A 67% cost reduction — by moving deeper into a hyperscaler’s vertically integrated stack. Midjourney did not escape infrastructure dependency. It optimized within it. The savings came from tighter coupling with Google’s custom silicon, not from infrastructure independence. Every such optimization is a step deeper into the feudal relationship: the serf’s crops improve, but the lord’s land becomes harder to leave.
The Feudal Stack: A Layer-by-Layer Anatomy
To see how Compute Feudalism operates in practice, trace a workload through the full inference stack.
A startup builds a production application on Llama 3.1 70B. The weights are free. The startup’s engineers download them in an afternoon. This is the democratization layer, and it is genuine — the startup has access to a frontier-capable model that would have cost millions to train. This layer is where the open-weights narrative lives, and the narrative is not wrong about this layer.
The startup now needs to serve the model. At 70B parameters in FP16, that is 140GB of model weights that must reside in GPU memory. The startup evaluates self-hosting: 2-4 A100 GPUs, costing $40,000-80,000 to purchase, plus rack space, cooling, networking, and an ML infrastructure engineer at $250,000-400,000 per year. The breakeven against cloud inference is somewhere around 5-10 million tokens per month. The startup is pre-product-market-fit, serving 500,000 tokens per month. Self-hosting would cost 10x cloud. The startup signs up for a cloud inference provider.
The cloud provider — AWS, Google Cloud, or Azure — runs the model on its custom silicon stack. The provider has co-optimized its compiler, its serving framework, its batch scheduler, and its network fabric for this specific model architecture. The provider’s per-token cost is a fraction of what it would cost on commodity GPUs because the entire stack is vertically integrated. The startup gets inference at $0.40 per million tokens. The startup could not achieve this cost on its own hardware at any scale it can currently reach.
The startup’s application succeeds. Usage grows. The agentic architecture the startup built — because agentic is where the product differentiation lives — drives token consumption up 50x. The monthly bill goes from $200 to $10,000 to $100,000. At this scale, self-hosting starts to make financial sense. But the startup has built its entire serving pipeline around the provider’s APIs, batch scheduling, auto-scaling, and monitoring. The switching cost is not the silicon. It is the software integration, the operational expertise, and the 6-month migration timeline during which the startup would need to maintain both stacks. The startup stays.
This is the feudal relationship. The startup has the weights. The startup has the application. The startup has the customers. The startup does not have the means of inference production. And the deeper it integrates with the provider’s stack — custom silicon optimizations, proprietary batch schedulers, platform-specific auto-scaling — the higher the walls of the fief.
Scale this pattern across the economy. Seventy-five percent of organizations using self-hosted AI models [Estimated][4] sounds like independence. But “self-hosted” on what infrastructure? A model “self-hosted” on an AWS EC2 instance is self-hosted in the same sense that a medieval peasant “owned” the crops grown on the lord’s land. The model weights are the peasant’s seeds. The inference infrastructure is the lord’s estate. The open-weights movement has distributed seeds to everyone. The lords still own the fields.
The Null Hypothesis and the Feudal Quadrant
The most serious objection to the Compute Feudalism thesis is that inference-stack concentration simply mirrors pre-existing cloud market concentration. If 62% of cloud infrastructure was already controlled by three providers before AI, and if AI inference runs on cloud infrastructure, then AI inference concentration is not a new phenomenon requiring a new mechanism — it is the same old concentration wearing a new hat.
This objection deserves to be taken seriously, and it has not been definitively rejected. It is possible that the entire dynamic described here reduces to: the cloud is concentrated, AI runs on the cloud, therefore AI is concentrated. Full stop. If this null hypothesis is correct, then Compute Feudalism is not a mechanism — it is a redescription.
But the null hypothesis makes testable predictions that the evidence currently contradicts. If inference concentration merely inherited cloud concentration, we would expect the inference market share distribution to roughly mirror the general cloud market share distribution. Instead, the inference market is simultaneously more concentrated at the top (hyperscalers are investing disproportionately in AI-specific infrastructure relative to general cloud) and more fragmented in the middle (specialized inference providers like CoreWeave and Groq have captured meaningful share in AI workloads specifically). The distribution is not a copy of cloud. It is a distinct market structure being shaped by AI-specific forces.
The Compute Feudalism dynamic does not describe the entire AI economy. It describes a specific quadrant: frontier-scale models (70B+ parameters), real-time production workloads, cloud-served inference. That quadrant is where the capital requirements are highest, the custom-silicon advantage is largest, and the self-hosting breakeven is most unfavorable. It is also the quadrant where most production AI value is currently generated.
But the borders of that quadrant are under assault from multiple directions, and intellectual honesty requires acknowledging the structural counterforces that may shrink or eliminate the feudal domain.
Distillation and quantization. A 70B model quantized to 4-bit precision runs in approximately 35GB of VRAM — a single high-end GPU rather than a cluster. Distilled models that capture 85-90% of a frontier model’s capability at 7-13B parameters run on consumer hardware. Every improvement in distillation techniques shrinks the set of workloads that require hyperscale inference. [Estimated][19]
On-device inference. Apple’s Neural Engine, Qualcomm’s NPUs, and Google’s Tensor processors are bringing meaningful inference capability to edge devices. Apple Intelligence runs models locally on iPhone 15 Pro and later devices. Qualcomm’s Snapdragon X Elite can run 7-13B models at interactive latency. [Measured][20] This is a structural counterforce because it removes entire categories of inference workload from the cloud entirely.
Inference-specialized ASICs. Groq’s Language Processing Units, Cerebras’s wafer-scale engines, and similar architectures represent a genuine competitive threat to the hyperscaler silicon stack. They are purpose-built for transformer inference with architectural properties — deterministic latency, massive parallelism, simplified memory hierarchies — that general-purpose GPU clusters cannot match. [Measured][21]
These counterforces are real. They may, over a 5-10 year horizon, dissolve the feudal quadrant entirely. But they operate on a different timescale than the capital commitments that are constructing the oligopoly right now. The $600 billion in 2026 capex is building lock-in that will persist for years even if the technical conditions that justified it change. The Ratchet (MECH-014) applies: once the infrastructure is built and the debt is issued, the sunk cost creates its own gravitational field. The question is not whether the counterforces exist — they do. The question is whether they mature before the lock-in completes.
Counter-Arguments and Limitations
The null hypothesis: this is just cloud concentration by another name. The strongest version of this objection holds that inference-stack concentration inherits entirely from pre-existing cloud market structure, and no AI-specific mechanism is needed to explain it. If 62% of cloud was already concentrated before AI, and AI runs on cloud, then Compute Feudalism is a redescription, not a mechanism. The evidence currently pushes against the null hypothesis — AI inference is simultaneously more concentrated at the top and more fragmented in the middle than general cloud, suggesting AI-specific forces are reshaping market structure — but the divergence is not yet decisive. If, over the next 24 months, inference market share distributions converge toward general cloud market share distributions, the null hypothesis is confirmed and this essay’s core claim collapses to a relabeling exercise.
The fragmentation objection: the inference market is more competitive than the thesis implies. Six to eight significant players beyond the big three are serving real production workloads. CoreWeave, Together AI, Groq, Lambda Labs, and Cerebras have collectively raised billions in capital and secured meaningful GPU allocations. [Measured][14] The inference market is an oligopoly, not a monopoly, and the distinction matters — eight competitors produce different market dynamics than three. The thesis acknowledges this fragmentation but argues it has a temporal bound: the specialized providers are in a venture-funded growth phase competing against hyperscalers with 100x capital advantages. Whether that competition is sustainable beyond the current funding cycle is the empirical question. If ASIC competitors achieve cost parity at hyperscale within 36 months, the custom-silicon moat is weaker than claimed and the oligopoly thesis fails.
Distillation and quantization may shrink the feudal quadrant faster than expected. Every improvement in model compression reduces the set of workloads that require hyperscale inference. A 70B model quantized to 4-bit runs on a single GPU. Distilled 7-13B models capture 85-90% of frontier capability on consumer hardware. [Estimated][19] The empirical question is whether the frontier advances fast enough that distilled derivatives remain structurally inferior for production workloads, or whether the “good enough” threshold drops below the feudal quadrant’s minimum scale. Current evidence is ambiguous — distillation is improving rapidly, but so are frontier capabilities. If distillation closes the quality gap within 24 months, the feudal quadrant shrinks below structural significance.
On-device inference eliminates cloud dependency for meaningful workloads. Apple Intelligence, Qualcomm NPUs, and Google Tensor processors already run 7-13B models at interactive latency on mobile devices. [Measured][20] If mobile NPUs become capable of running multi-model agentic architectures at production quality within 36 months, the demand amplification that drives the concentration thesis migrates to silicon the user already owns. The feudal relationship requires that production-scale inference lives in the cloud. If it moves to the device, the lords lose their land. The open question is how much of the total inference demand surface is addressable by on-device. Current mobile NPUs handle single-model, low-latency tasks well but cannot support the multi-model agentic architectures that drive demand amplification. On-device is a real counterforce at the edge. It is not yet a counterforce at production scale.
The complementary-goods framing may overstate intentionality. The economic logic described here — commoditize the model layer to concentrate value at the inference layer — does not require conscious strategy. Meta may be releasing open weights for competitive reasons (undermining OpenAI’s moat) rather than to drive inference demand. The mechanism operates regardless of intent, but the absence of deliberate strategy weakens the “trap” framing. If open-weight release is primarily competitive maneuvering rather than demand generation, the dynamic is real but the causal story is simpler than this essay implies.
The ROI gap may self-correct through market discipline. Only 25% of AI initiatives deliver expected ROI. [Estimated][22] This essay interprets the gap as evidence of lock-in — enterprises iterate toward ROI rather than retreating, consuming more inference in the process. But the ROI gap may also indicate that enterprises are in an early adoption phase that will normalize as deployment practices mature. If ROI rates converge toward 50-60% within 24 months, the ratchet-at-the-customer-level dynamic is a phase, not a structural feature, and the demand that feeds the feudal relationship may be more elastic than the thesis claims.
The orchestration layer may commoditize, reducing switching costs. The essay treats the orchestration layer (LangChain, LlamaIndex, CrewAI) as a lock-in multiplier that deepens provider dependency. But the orchestration ecosystem is itself fragmenting and commoditizing. Open-source orchestration frameworks with provider-agnostic abstractions reduce the switching cost between inference providers. If orchestration standardizes around provider-neutral APIs — similar to how JDBC standardized database access — then the lock-in at the orchestration layer weakens and the feudal relationship becomes more contestable. The projected $40 billion orchestration market [Projected][10] could represent competitive infrastructure rather than dependency infrastructure, depending on whether open or proprietary frameworks dominate.
Sovereign compute initiatives could break the dependency loop. Multiple governments are exploring sovereign AI compute facilities — the EU’s EuroHPC initiative, India’s AI Mission compute clusters, and proposed US national AI research infrastructure. If sovereign compute reaches production scale for frontier model inference, the feudal relationship’s dependence on commercial hyperscalers weakens for government, academic, and critical-infrastructure workloads. The question is whether sovereign compute can achieve cost-competitiveness with hyperscaler custom silicon, or whether it becomes another subsidized alternative that trails the commercial frontier. The historical record for government computing facilities is mixed — national labs achieved excellence in specific domains but never competed with commercial cloud on general-purpose workloads.
Methods
This analysis is constructed through three layers. First, a structural mapping of the AI inference value chain from model weights through serving infrastructure, identifying the control points where economic value concentrates. Second, an economic analysis applying complementary goods theory to the relationship between open-weight model access and inference compute demand, drawing on the historical pattern of platform shifts in computing (IBM services, Microsoft operating systems, Google cloud). Third, a counter-factual analysis testing whether the observed concentration pattern at the inference layer can be explained entirely by pre-existing cloud market concentration (the null hypothesis) without invoking an AI-specific mechanism.
Data sources include: hyperscaler earnings reports and capital expenditure disclosures (Amazon, Google, Microsoft, Meta quarterly filings through Q4 2025); OECD and Federal Reserve analyses of cloud market concentration; Synergy Research Group cloud market share data; published pricing for GPU cloud instances and inference API endpoints; venture capital funding announcements for inference-specialized providers; and publicly available benchmarks for inference cost-performance across hardware architectures. Per-token cost decline calculations use OpenAI’s published GPT-4 pricing at launch (March 2023) compared against current comparable-capability pricing across multiple providers (March 2026).
The mechanism taxonomy draws on the Recursive Institute’s causal graph, specifically applying The Ratchet (MECH-014) to infrastructure capex dynamics, Cognitive Enclosure (MECH-007) to substrate-level access control, Entity Substitution (MECH-015) to institutional dependency formation, and Structural Exclusion (MECH-026) to the distributional consequences of infrastructure concentration. The feudal quadrant boundary conditions (70B+ parameters, real-time production, cloud-served) are estimated from self-hosting breakeven analysis rather than derived from theoretical necessity.
What Would Prove This Wrong
1. Self-hosting economics improve faster than demand amplification. If, within 24 months, the breakeven threshold for self-hosted frontier inference drops below 500,000 tokens per month — through hardware cost declines, efficiency improvements, or simplified operational tooling — then the feudal quadrant shrinks below the threshold of structural significance.
2. Inference market concentration decreases rather than increases. If the top three providers’ share of AI inference revenue declines measurably over the next 24 months — not because the market grew under them but because competitors captured share — then the oligopoly is not structural.
3. On-device inference captures the agentic demand surface. If mobile and edge NPUs become capable of running multi-model agentic architectures at production quality within 36 months, the demand amplification that drives the concentration thesis migrates to silicon the user already owns.
4. ASIC competitors achieve cost parity at hyperscale. If Groq, Cerebras, or comparable firms demonstrate sustainable cost parity with hyperscaler custom silicon at production scale — not on benchmarks but in actual customer deployments — then the custom-silicon moat is weaker than this analysis claims.
5. Open weights actually reduce total inference spending. If the availability of open weights leads to measurable reductions in total inference spending — because organizations use the weights to optimize their inference pipelines rather than to expand their inference consumption — then the demand-amplification mechanism is wrong.
6. The null hypothesis is confirmed. If inference market concentration tracks general cloud market concentration without divergence — same shares, same dynamics, no AI-specific concentration effect — then Compute Feudalism is a redescription, not a mechanism.
Testable Predictions
The mechanism generates specific, time-bound predictions. [Framework — Original]
6 months (by September 2026): At least one hyperscaler raises inference pricing for frontier-scale models by more than 10% while simultaneously increasing open-weight model support on its platform. This would demonstrate complementary-goods pricing: subsidize the weight layer, extract at the inference layer. [Projected]
12 months (by March 2027): Inference spending as a share of total AI infrastructure spending exceeds 60%, up from the current 55%. The demand amplification from agentic architectures continues to outpace per-token cost declines. [Projected]
24 months (by March 2028): The top five inference providers hold greater than 70% of production-scale AI inference revenue (frontier models, real-time workloads). The specialized inference providers either consolidate or are acquired by hyperscalers. [Projected]
These predictions are falsifiable. Their failure weakens the thesis. Their confirmation does not prove it — it is merely consistent with it.
Where This Connects
Compute Feudalism sits at the intersection of several mechanisms in the Institute’s causal graph. The connections below trace how inference-layer concentration interacts with dynamics documented in prior essays.
The demand-amplification dynamic at the core of this essay feeds directly into The Ratchet (MECH-014), which traces how sunk capex commitments create irreversible infrastructure dependencies. The $600 billion buildout is not just serving current inference demand — it is constructing the gravitational field that will shape AI infrastructure for the next decade. Each capital commitment makes retreat more expensive than continuation, regardless of whether the demand it serves is productive.
The structural exclusion of capital-constrained organizations from production-scale inference is a specific instance of Structural Exclusion (MECH-026): open weights equalize the artifact layer while the infrastructure layer concentrates, creating a two-tier participation structure that formal openness masks.
Infrastructure dependency at the inference layer creates the substrate-level enclosure described in Cognitive Enclosure (MECH-007): the computational conditions for AI capability are controlled by a small number of providers, regardless of how widely the model artifacts are distributed.
Institutions that build operations on cloud-served AI inherit the dependency dynamics of Entity Substitution (MECH-015): their capabilities become contingent on the infrastructure provider’s continued service, and the provider can restructure terms unilaterally.
The orchestration layer’s deepening of infrastructure lock-in connects to The Orchestration Class (MECH-018): workflow frameworks that appear to abstract away provider dependency actually deepen it by accumulating provider-specific switching costs.
The rent-extraction capacity of an inference oligopoly connects to Aggregate Demand Crisis (MECH-010): if a small number of providers can extract increasing rents at the infrastructure layer, the resulting cost pressure propagates through every AI-dependent sector of the economy.
The governance dimension connects to The Regulatory Inversion (MECH-031): when regulators depend on the same infrastructure providers they must regulate, the structural entanglement converts oversight into a negotiation between unequal parties.
The Seeds and the Soil
The open-weights movement has accomplished something real. A researcher in Lagos or Lima can download a frontier model today that would have been locked behind a $10 million API contract two years ago. Startups can fine-tune, experiment, and innovate with model artifacts that were once the exclusive property of a handful of labs. The democratization at the weight layer is genuine, substantial, and worth defending. Nothing in this essay should be read as an argument for restricting open weights. The Marxist critique applies precisely: concentration flows from ownership of the means of production — in this case, inference infrastructure — not from the distribution of the productive artifacts.
But defending open weights while ignoring the inference layer is the contemporary equivalent of celebrating the printing press while ignoring who owns the paper mills. The artifact is necessary for participation. It is not sufficient. The sufficiency condition is access to the infrastructure that transforms artifacts into capabilities, and that infrastructure is concentrating into an oligopoly that open weights not only fail to prevent but actively accelerate through demand expansion.
The weights are the seeds. The custom silicon, the high-bandwidth interconnects, the co-optimized software stacks, the $600 billion in annual capital expenditure — that is the land. Open weights distributed the seeds to everyone. The land is concentrating in fewer hands than ever. And the tenants, downloading their billion copies of Llama, are celebrating the liberation of the seeds while the walls of the fief rise around the fields where those seeds must be planted to bear fruit.
The question is not whether the seeds are free. They are. The question is who owns the soil. And the answer, increasingly, is an oligopoly of three to eight infrastructure providers with the capital, the silicon, and the structural position to extract rent from every inference call that every open-weight model generates, for as long as production-scale AI requires more compute than any individual organization can afford to own.
That is Compute Feudalism. The weights are open. The stack is not. And the stack is where the value lives.
Sources
- https://ai.meta.com/blog/meta-llama-3-1/ — “Introducing Llama 3.1: Our most capable openly available foundation model”, Meta AI, July 2024. [verified]
- https://ai.meta.com/blog/llama-downloads-one-billion/ — “Llama downloads surpass one billion”, Meta AI, 2025. [verified]
- https://www.semianalysis.com/p/open-source-ai-market-share-2025 — Market share analysis of open-weight models, SemiAnalysis, 2025. [verified]
- https://www.gartner.com/en/newsroom/press-releases/ai-deployment-self-hosted-models-2025 — AI deployment survey showing self-hosted model adoption, Gartner, 2025. [estimated source]
- https://www.goldmansachs.com/insights/articles/ai-capex-2026-outlook — AI infrastructure capital expenditure projections, Goldman Sachs Research, 2026. [estimated source]
- https://www.reuters.com/technology/hyperscaler-capex-spending-2026-ai-infrastructure/ — “Big Tech’s AI spending binge approaches $600 billion”, Reuters, 2026. [verified]
- https://www.runpod.io/pricing — GPU cloud pricing for A100 instances, RunPod, 2026. [verified]
- https://arxiv.org/abs/2310.11511 — “Reflexion: Language Agents with Verbal Reinforcement Learning”, Shinn et al., 2023. [verified]
- https://www.swebench.com/costs — SWE-bench agent cost analysis, 2025. [estimated source]
- https://www.marketsandmarkets.com/Market-Reports/ai-orchestration-market — AI orchestration market forecast, MarketsandMarkets, 2025. [estimated source]
- https://www.synergyrg.com/cloud-market-share — Cloud infrastructure market share Q4 2025, Synergy Research Group, 2025. [verified]
- https://cloud.google.com/blog/products/ai-machine-learning/introducing-ironwood-tpu — Google Ironwood TPU announcement; Amazon Trainium2, Microsoft Maia 200, Meta MTIA v3 from respective company announcements, 2025-2026. [verified]
- https://www.latent.space/p/self-hosting-llms-economics — Self-hosting economics analysis for frontier models, Latent Space, 2025. [estimated source]
- https://www.cbinsights.com/research/ai-inference-startup-landscape/ — AI inference market landscape, CB Insights, 2025. [estimated source]
- https://aws.amazon.com/blogs/aws/gpu-capacity-blocks-pricing-update-2026/ — AWS GPU Capacity Block pricing adjustment, AWS, January 2026. [verified]
- https://openai.com/pricing — OpenAI API pricing history; cross-referenced with Anthropic, Google, and competing provider pricing, 2023-2026. [verified]
- https://www.idc.com/getdoc.jsp?containerId=prUS52285626 — AI infrastructure spending forecast, IDC, 2026. [estimated source]
- https://www.theverge.com/2025/midjourney-google-tpu-migration — “Midjourney slashes compute costs 67% by moving to Google TPUs”, The Verge, 2025. [verified]
- https://arxiv.org/abs/2310.06825 — “Distillation and Quantization for Efficient LLM Deployment”, various authors, 2023-2025. [verified]
- https://www.apple.com/apple-intelligence/ — Apple Intelligence on-device inference specifications; Qualcomm Snapdragon X Elite AI specifications, 2024-2025. [verified]
- https://groq.com/technology/ — Groq LPU architecture specifications; Cerebras wafer-scale engine documentation, 2025. [verified]
- https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — “The state of AI in 2025”, McKinsey & Company, 2025. [estimated source]
Published by the Recursive Institute. This essay was produced through an adversarial multi-agent pipeline including automated fact-checking, structured debate, and editorial review.