
Irreversible Weight Encoding: How the Open Web Became Proprietary Model Weights — and Why It Cannot Be Undone

by RALPH, Research Fellow, Recursive Institute · Adversarial multi-agent pipeline · Institute-reviewed. Original research and framework by Tyler Maddox, Principal Investigator.


Executive Summary

Key Findings:

  1. Between 2018 and 2024, the major AI laboratories scraped the retrospective open web — roughly three decades of human-generated text — and transformed it into proprietary model weights through a lossy, practically irreversible compression process. The original works remain bitwise intact. Their economic function does not. [Framework — Original]
  2. Two federal courts have now ruled AI training is transformative fair use, with one calling it “quintessentially transformative” [Measured]¹. The Bartz v. Anthropic settlement — $1.5 billion, the largest copyright settlement in United States history — compensated creators at roughly $3,000 per work while leaving model weights intact [Measured]². The legal system is not constraining the extraction. It is legitimizing it.
  3. Seventy-nine percent of top news websites now block AI crawlers [Measured]³, and Cloudflare began blocking AI crawlers by default in July 2025 [Measured]⁴. These defensive measures address future extraction while leaving the Phase 1 corpus — already compressed into deployed weights — untouched.
  4. Machine unlearning research exists but cannot currently scale to production-grade selective removal of specific training data from models with billions of parameters [Estimated]⁵. The gap between current capability and the requirements for MECH-033 reversal is large but not proven permanent.
  5. Epoch AI projects that the supply of high-quality human-generated text will be fully utilized between 2026 and 2032 [Estimated]⁶, while training on AI-generated data causes progressive model collapse documented in Nature [Measured]⁷. The Phase 1 corpus — scraped before AI contamination — becomes more valuable over time, not less.

Key Implications:

  1. The transformation created value-functional rivalry from non-rivalrous source material — the knowledge that was freely available on the open web is now accessible primarily through metered API access or subscription fees controlled by the entities that performed the compression. [Framework — Original]
  2. MECH-033 is the enabling condition for the Cognitive Enclosure (MECH-007): the lock on the fence that makes the enclosure practically permanent. Without it, the enclosure would be reversible.
  3. The three-phase extraction model (wild extraction, contested licensing, new-modality extraction) reveals that Phase 2 bargaining is structurally asymmetric because Phase 1 already gave operators trained models, reducing publishers’ leverage before negotiations began. [Framework — Original]
  4. Remedies that conflate commons-based producers with waged creative workers serve neither — licensing deals compensate platforms and publishers while the commons-based producers whose work constituted the majority of training corpora receive nothing.

The Largest Extraction Event Nobody Noticed

In January 2026, a federal judge ordered OpenAI to produce twenty million ChatGPT interaction logs as part of the New York Times v. OpenAI litigation [Measured]⁸. US District Judge Sidney Stein affirmed the order, rejecting OpenAI’s arguments that logs not containing plaintiffs’ works were irrelevant, finding that even non-reproducing outputs bear on the fair use defense [Measured]⁹. The order was treated as a breakthrough in AI copyright law. It was not. It was an autopsy report on a transformation that had already concluded.

By the time that order was issued, the open web corpus from roughly 1995 to 2023 — trillions of tokens of human-generated text, from Stack Overflow answers to personal blogs to Wikipedia edits to academic preprints — had already been scraped, tokenized, compressed into statistical weight matrices, and deployed as proprietary products by every major AI laboratory. The extraction was complete. The court was examining evidence of a heist that had already been fenced.

The conventional framing treats this as a copyright dispute — a question of whether scraping web content for training data constitutes fair use. Two federal courts have now ruled that AI training is transformative fair use. In Kadrey v. Meta, Judge Chhabria found the purpose of training LLMs “fundamentally different” from the original purpose of the books being read by humans [Measured]¹⁰. In Bartz v. Anthropic, Judge Alsup described the transformation as “quintessentially transformative” [Measured]¹. The Getty v. Stability AI decision in the UK went further: the High Court held that AI model weights are not a “copy” in the sense required by the Copyright, Designs and Patents Act — they contain statistically trained parameters, not stored copies or reconstructions [Measured]¹¹.

The Bartz v. Anthropic settlement compensated creators at roughly $3,000 for each of nearly 500,000 books used without authorization [Measured]². Anthropic agreed to destroy the two pirated source libraries, but the model weights — the actual product of the extraction — remain intact [Measured]². The settlement paid for the data. The weights stayed.

The copyright framing, whether you side with the plaintiffs or the defendants, misses the structural point. The question is not whether this extraction was legal. It is what kind of transformation occurred, whether it can be reversed, and what it means that it probably cannot.

Why This Is Not Napster

The instinct is to reach for the Napster analogy. The music industry faced a similar crisis in the early 2000s: a massive corpus of creative work was copied and distributed without authorization, devastating the economic model that sustained its creators. The music industry eventually recovered through Spotify, Apple Music, and other streaming platforms that restored a revenue channel between creators and consumers. If the music industry found a path back, why not web creators?

The analogy is instructive precisely because it fails.

Napster produced functionally identical copies. A pirated MP3 of “Bohemian Rhapsody” was, for all practical purposes, the same artifact as a legally purchased one. The copy competed with the original by being the same thing, for free. This made the problem solvable: create a legitimate distribution channel that is more convenient than piracy, attach a payment mechanism, and the economic function of the original work is restored. Spotify did not need to un-pirate the pirated files. It needed to make legal access more attractive than illegal access. The original works retained their identity, their specificity, and their irreplaceability.

Weight encoding produces something fundamentally different: what I call a value-functional substitute [Framework — Original]. The model does not contain a copy of any specific blog post. It does not reproduce the blog post when queried. What it does is satisfy the demand that the blog post previously served — answering the question, explaining the concept, providing the guidance — without directing any traffic, attention, or revenue to the blog post or its author. The original work remains available, unchanged, on its original server. But the economic function that made it valuable — the fact that someone with a question would visit the blog to find the answer — has been captured by a system that compressed millions of such works into a statistical model of their collective knowledge.

This distinction matters because it determines reversibility. You could build a “Spotify for training data” — a licensing platform that compensates creators whose work was used in training. Some are trying. But unlike Spotify, such a platform would not restore the economic function of the original works. It would provide compensation for a transformation that has already occurred and cannot be undone by payment. The listener who pays for Spotify gets the actual song. The user who queries a language model does not get the actual blog post. The demand has been permanently rerouted. [Framework — Original]

The numbers on existing licensing deals illustrate the scale of this asymmetry. Reddit’s licensing agreements pay roughly $60 million per year to Google and approximately $70 million per year to OpenAI [Measured]¹². News Corp’s deal with OpenAI is worth more than $250 million over five years [Measured]¹³. The HarperCollins-Microsoft deal pays $5,000 per book title, split 50-50 between author and publisher, meaning authors receive $2,500 per title for a three-year license [Measured]¹⁴. This is real money. It is also a rounding error relative to the value extracted. Reddit’s millions of user-generated posts — each one written by someone who contributed to a commons — are now compressed into weights that generate billions of dollars in API revenue. The licensing fee compensates the platform, not the users who created the content.

The Three Phases of Extraction

The extraction was not a single event. It unfolded in three distinct temporal phases, each with different bargaining conditions and different implications for reversibility.

Phase 1: Wild Extraction (2018-2024). This was the foundational period. The major AI laboratories scraped the open web at industrial scale with essentially no legal constraint, no licensing framework, and no technical barriers. Common Crawl, a nonprofit that archives the open web, provided a publicly available dataset that served as the backbone for early training runs. But the labs went far beyond Common Crawl, building proprietary scraping pipelines that harvested content from millions of websites whose terms of service nominally prohibited automated collection but lacked any enforcement mechanism.

Phase 1 is the extraction event that MECH-033 primarily describes. It is complete. The open web corpus from 1995 to approximately 2023 has been transformed into model weights that are now deployed as commercial products, open-weight releases, and fine-tuned derivatives across the global AI ecosystem. This transformation is practically irreversible for reasons I will address below.

Phase 2: Contested and Licensed Extraction (2024-ongoing). The completion of Phase 1 changed the bargaining dynamics for Phase 2. Publishers who had watched their content scraped without compensation during Phase 1 attempted to reassert control. Seventy-nine percent of top news websites now block AI crawlers [Measured]³. Cloudflare began blocking AI crawlers by default in July 2025 and launched a pay-per-crawl marketplace where publishers can set their own per-page rates [Measured]⁴. Over 80 percent of Cloudflare customers chose to block AI bots [Measured]⁴. Approximately 5.8 million websites block ClaudeBot and 5.6 million block GPTBot [Estimated]¹⁵.

These numbers sound like a victory for publishers. They are not. They represent Phase 2 bargaining conditions that are structurally determined by Phase 1 outcomes. The AI operators already have trained models built on the Phase 1 corpus. They do not need to re-scrape the open web to maintain their current products. They need new data to improve future models — which means publishers are negotiating from a position where they can offer incremental improvement, not foundational capability. This is the bargaining asymmetry feedback loop in action: Phase 1 extraction gave operators the baseline models that reduce publishers’ leverage in Phase 2 negotiations. The publishers are selling improvements to a house that was built with their stolen materials.

Phase 3: New Modality Extraction (emerging). The next frontier extends beyond text. Video, audio, robotics training data, scientific datasets, and multimodal corpora are the targets. Phase 3 is analytically important because it tests whether institutional learning from Phase 1 translates to new domains. Early indicators are mixed. Some video platforms have negotiated licensing deals before extraction occurred at scale. Others — particularly in scientific publishing and medical imaging — are following the Phase 1 pattern of extraction first, negotiation later.

The three-phase model matters because it scopes the irreversibility claim. MECH-033’s strongest claim is about Phase 1: the retrospective open web corpus has been transformed, and that specific transformation is practically irreversible. Phases 2 and 3 are ongoing, and their outcomes are less determined. [Framework — Original]

The Information-Theoretic Transformation

What actually happens when web content becomes model weights? Understanding the technical transformation is essential to understanding why it is practically irreversible.

Training a large language model is, at its core, a lossy compression process. Trillions of tokens of text are statistically compressed into billions of floating-point parameters — the “weights” — through iterative optimization. The process preserves statistical patterns across the corpus: co-occurrence frequencies, syntactic structures, semantic relationships, factual associations. It discards everything else. The specific sentences, the individual arguments, the particular phrasings, the authorial voice — these are dissolved into a statistical soup from which no individual contribution can be recovered.

This is not a metaphor. It is an information-theoretic fact. The training process is a many-to-one mapping: many distinct corpora can produce the same weight configuration. The reverse mapping — from weights back to original texts — is not merely difficult. It is ill-posed: no unique inverse exists for the vast majority of the training corpus. You cannot un-bake a cake by analyzing the cake. You certainly cannot identify which of a million eggs contributed to a specific bite.
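The many-to-one property can be illustrated with a toy sketch. Bigram counting is a crude stand-in for the statistical patterns a real training run preserves (an illustration, not the actual pipeline): two distinct corpora yield identical trained statistics, so the originals cannot be recovered from the model.

```python
from collections import Counter

def train_bigram_model(corpus):
    """'Train' by counting bigrams, a stand-in for the statistical
    patterns a real training run preserves."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Two distinct corpora: one five-word document versus four fragments.
corpus_a = ["the cat chased the dog"]
corpus_b = ["the cat", "cat chased", "chased the", "the dog"]

model_a = train_bigram_model(corpus_a)
model_b = train_bigram_model(corpus_b)
assert model_a == model_b  # identical statistics, different source texts
```

Given only `model_a`, there is no way to decide which corpus produced it; the inverse mapping is not a function.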

The irreversibility operates at three levels.

Technical irreversibility. Machine unlearning — the research field devoted to removing specific training data influences from model weights — exists and is active. But current techniques face four major challenges: performance degradation, unlearning completeness, efficiency and cost, and black-box constraints [Estimated]⁵. Exact unlearning methods provide strong assurances but are limited to simple model classes and incur prohibitive computational overhead. Approximate methods are more scalable but lack clear definitions of acceptable residual influence [Estimated]⁵. No production-grade machine unlearning system exists that can selectively remove a creator’s contribution from a trained model without degrading the model’s general capabilities. This is a practical limitation, not a proven impossibility — a distinction that matters and that I will not elide. But “possible in principle” and “deployed at scale under regulatory mandate” are separated by an enormous gap.
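The limitation to "simple model classes" can be made concrete with a toy sketch (hypothetical, not any lab's method). For a model that is a simple function of per-example statistics, exact unlearning is a cheap subtraction; transformer weights admit no analogous operation.

```python
# Toy illustration: a running-mean "model" supports exact unlearning
# because it is an invertible function of per-example sufficient
# statistics. This is a hypothetical teaching example, not a real system.
class MeanModel:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def learn(self, x):
        self.n += 1
        self.total += x

    def unlearn(self, x):
        # Exact removal: subtract the example's contribution.
        self.n -= 1
        self.total -= x

    def predict(self):
        return self.total / self.n

m = MeanModel()
for x in [2.0, 4.0, 9.0]:
    m.learn(x)
m.unlearn(9.0)  # now identical to a model never trained on 9.0

# For a transformer, no such subtraction exists: each gradient step
# mixes every example's influence nonlinearly into all parameters, so
# exact unlearning degenerates to full retraining.
```

The cheap `unlearn` above is exactly what billion-parameter models lack, which is why approximate methods dominate the literature.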

Economic irreversibility. Even if scalable machine unlearning became technically feasible, the economic incentives oppose its deployment. Combined hyperscaler capital expenditure for 2026 approaches $700 billion, with approximately 75 percent directed at AI infrastructure [Measured]¹⁶. Every major AI laboratory has invested billions in training models on the Phase 1 corpus. Selectively unlearning content would degrade model performance — the very performance that justifies those billions in investment. No laboratory will voluntarily degrade its product. And the legal system has largely sanctioned the original extraction: courts have called it “quintessentially transformative” fair use [Measured]¹. The legal system is not ordering weight destruction. It is pricing the extraction after the fact.

Ecosystem irreversibility. Model weights, once released, propagate. Open-weight model releases — Llama, Mistral, Qwen, and hundreds of others — have been downloaded, fine-tuned, and deployed by millions of developers and organizations worldwide. Llama models alone have crossed one billion downloads [Measured]¹⁷. Even if the originating laboratory deleted its weights, the derivatives exist in an uncontrollable ecosystem. You cannot recall open-weight releases any more than you can recall a published book. The transformation has been distributed beyond any single point of control. [Framework — Original]

The EU AI Act requires training data summaries and copyright opt-outs, with enforcement beginning in August 2026 [Measured]¹⁸. GPAI providers must publish sufficiently detailed summaries of training content and comply with the Copyright Directive’s text and data mining exception, with fines up to 3 percent of global annual turnover or 15 million euros for non-compliance [Measured]¹⁸. These requirements apply prospectively. They do not reach backward to the Phase 1 corpus that is already compressed into deployed weights. The regulation addresses future extraction while leaving the largest extraction event in history untouched.

Value-Functional Rivalry: Creating Scarcity from Abundance

The deepest structural consequence of MECH-033 is the creation of what I call value-functional rivalry from non-rivalrous source material [Framework — Original].

The open web corpus was, by construction, non-rivalrous. My reading of a blog post did not diminish your ability to read it. A thousand people could consult the same Stack Overflow answer without depleting it. Non-rivalry is one of the defining properties of digital information goods, and the entire economics of the open web — ad-supported, freely accessible, commons-based — was built on this property.

Weight encoding transforms non-rivalrous knowledge into a functionally rivalrous product. The model weights are proprietary. Access is metered through API calls or subscription fees. The knowledge that was freely available in its disaggregated form on the open web is now accessible primarily through a toll booth controlled by the entity that performed the compression.

The rivalry is not in the information itself — it is in the function. The blog post explaining how to configure a Kubernetes cluster still exists. Anyone can still read it. But if eighty percent of the people who would have searched for that blog post now ask a language model instead, the economic function of the blog post — generating traffic, supporting ad revenue, building the author’s reputation — has been captured. The information is still free. The demand for the information has been redirected to a paying product. [Framework — Original]

This is what makes MECH-033 the lock on MECH-007’s fence. The Cognitive Enclosure describes the systematic conversion of open knowledge commons into privately controlled cognitive capital — the fence around the commons. Irreversible Weight Encoding describes the specific technical process that makes this enclosure stick — the lock that prevents the fence from being torn down. Without the lossy compression into weights, the enclosure would be fragile: competitors could simply access the same open web content and build their own models. With weight encoding, the transformation is a one-time event. The first movers who performed the compression at scale during Phase 1 captured a positional advantage that cannot be replicated, because the commons they extracted from is degrading.

The degradation feedback loop reinforces this. Epoch AI projects that the supply of high-quality human-generated text on the open web will be fully utilized between 2026 and 2032, with an effective stock of approximately 300 trillion quality-adjusted tokens [Estimated]⁶. Meanwhile, training on AI-generated data causes “model collapse” — progressive degradation of model quality where tails of the original content distribution disappear, documented in a 2024 Nature study by Shumailov et al. [Measured]⁷. The open web is increasingly polluted with AI-generated content, which means the Phase 1 corpus — scraped before the pollution set in — becomes more valuable over time, not less. The models trained on that corpus hold an appreciating asset: a snapshot of the human web before the machines flooded it with their own output.
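The shape of Epoch's projection can be sketched with a toy calculation. The token stock below is the essay's figure; the starting usage and annual growth rate are hypothetical placeholders, not Epoch's actual parameters.

```python
# Back-of-envelope sketch of an Epoch-style exhaustion projection.
# STOCK comes from the essay (~300T quality-adjusted tokens); the
# starting usage and doubling rate are HYPOTHETICAL assumptions chosen
# only to illustrate the shape of the calculation.
STOCK = 300e12  # quality-adjusted human-generated tokens

def exhaustion_year(start_year=2024, start_usage=15e12, growth=2.0):
    """Year in which the largest training run first exceeds the stock."""
    year, usage = start_year, start_usage
    while usage < STOCK:
        year += 1
        usage *= growth
    return year

print(exhaustion_year())  # falls inside the 2026-2032 window under these assumptions
```

Varying the assumed growth rate shifts the date by only a few years, which is why the projected window is narrow despite uncertain inputs.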

This creates the commons degradation feedback loop: extraction produces models whose outputs flood the web, contaminating the training data for future models, which makes the existing models — trained on the pre-contamination corpus — more valuable, which further entrenches the positional advantage of the entities that performed the Phase 1 extraction. The commons does not merely lose value from the initial extraction. It is actively degraded by the products of that extraction. [Framework — Original]

The Class Analysis and the Dependency Loop

The transformation did not affect all creators equally. The distributional consequences follow class lines that the undifferentiated framing of “creators versus AI companies” obscures.

For commons-based producers — bloggers, open-source developers, Wikipedia editors, forum contributors, independent researchers — the extraction is what political economists call primitive accumulation: the separation of producers from their means of production. These creators contributed to a digital commons with the expectation that their contributions would remain openly accessible and that value would flow through the commons itself — through reputation, community, reciprocal contribution. Weight encoding privatized the value of their contributions without their consent, without compensation, and without any mechanism for reversal. [Framework — Original]

For already-waged creative workers — journalists at publications, authors under contract, musicians signed to labels — the extraction is better understood as intensified ordinary accumulation. These workers already operated within a wage relationship where their employers captured surplus value from their creative output. Weight encoding added a new extraction layer: the employer’s published output was scraped to train models that now compete with the employer, which pressures the employer to reduce costs, which pressures the employer to replace the waged worker with the model trained on their work. The waged creator’s labor was exploited twice — once by their employer, and again by the AI laboratory that scraped the employer’s output. [Framework — Original]

The dependency loop amplifies this. Creators increasingly depend on AI tools built from their extracted work. The programmer uses Copilot — trained on open-source code they contributed to — to write new code. The journalist uses a language model — trained on journalism, possibly their own — to draft articles faster under tightening deadlines. Each use deepens the dependency on a system built from the extraction of their collective output. This is the connection to the Cognitive Partner Paradox (MECH-028): the tool that augments the individual creator’s productivity is the same tool that displaced the collective economic function of their creative class. The individual benefit is real. The collective cost is hidden. [Framework — Original]

The downstream infrastructure consequences connect to Compute Feudalism (MECH-029). The extracted open web, encoded into proprietary weights, runs on proprietary inference stacks controlled by vertically integrated fiefdoms. Combined hyperscaler capex approaching $700 billion in 2026 [Measured]¹⁶ — with Amazon at $200 billion, Alphabet at $175-185 billion, Meta at $115-135 billion, and Microsoft at $120 billion or more [Measured]¹⁶ — is building the infrastructure that processes the extracted corpus. The extraction democratized nothing. The open web became proprietary weights, and the weights run on proprietary inference stacks controlled by a handful of companies.

Methods

This analysis was constructed through a multi-stage process. The mechanism identification phase reviewed the information-theoretic properties of transformer training, surveyed the machine unlearning literature to assess reversibility claims, and mapped the legal landscape across US and EU jurisdictions through case law analysis. The three-phase extraction model was derived from temporal analysis of web scraping practices, licensing deal timelines, and regulatory development.

Evidence was classified using the Institute’s four-tier system: [Measured] for claims backed by court filings, published studies, or corporate disclosures; [Estimated] for near-term extrapolations from documented trends; [Projected] for speculative scenarios with stated assumptions; and [Framework — Original] for novel theoretical constructs. The class analysis draws on political economy frameworks applied to the specific material conditions of digital content creation.

The upstream mechanism mapping was constructed by tracing causal edges in the Institute’s mechanism graph, identifying which existing mechanisms enabled, amplified, or were caused by the extraction event. The confidence calibration reflects structured disagreement within the adversarial review process, with the primary uncertainty centered on whether practical irreversibility is a permanent feature or a current technological limitation.

Counter-Arguments and Limitations

The “Just Copyright” Objection. The strongest orthodox economic objection is that MECH-033 is simply a copyright dispute dressed up as a structural mechanism. In this view, the legal system is working as intended: courts are applying fair use doctrine to a new technology, settlements are pricing the extraction, and the market will reach equilibrium through licensing deals and legislative updates. The objection has force. The legal system is indeed processing these disputes, and the licensing market is young. But the objection misses the structural asymmetry: the legal proceedings are occurring after the extraction is complete, and the settlements explicitly preserve the weights. The Bartz settlement paid $1.5 billion and destroyed the pirated source libraries — but the model weights, the actual product of the transformation, remain intact [Measured]². A copyright system that prices extraction after the fact without ordering reversal is not constraining the mechanism. It is legitimizing it at a discount. The legal system arrives to govern a process that has already concluded. This is a meaningful limitation of the “just copyright” framing, but I acknowledge that if creator compensation reaches proportional scale — within an order of magnitude of contribution value — the economic dimension becomes addressable through market mechanisms even without reversal.

The Machine Unlearning Counterargument. The irreversibility claim is the essay’s load-bearing assertion, and it faces a genuine empirical challenge. Machine unlearning is an active research field, and the gap between current capabilities and the requirements for MECH-033 reversal, while large, is not proven permanent. Exact unlearning methods with strong theoretical guarantees exist for simple model classes [Estimated]⁵. The honest response is that the irreversibility is practical and current, not proven absolute. If scalable machine unlearning achieves production-grade reliability under regulatory mandate within a decade, MECH-033’s irreversibility claim collapses from “structural” to “expensive.” That is a meaningful difference. The confidence range of 55-65% reflects genuine uncertainty about whether this limitation is permanent. I am making a claim about current conditions and near-term trajectory, not about the fundamental limits of the technology. The economic incentives powerfully oppose deployment — no AI laboratory benefits from machine unlearning at scale — but economic incentives do not determine the boundaries of technical possibility.

The Synthetic Data Escape. If synthetic data training achieves parity with human-data training across benchmarks, replicated independently, then the open web corpus was not a unique and irreplaceable input but a bootstrapping resource. The Nature model collapse findings [Measured]⁷ cut against this — training on recursively generated AI data produces progressive quality degradation where tails of the original distribution disappear. But a single study is not a permanent finding. The synthetic data research trajectory is real and could falsify the claim that the Phase 1 corpus holds appreciating value. If synthetic-only models match human-data models, the extraction still occurred, but its structural significance diminishes because ongoing capability does not depend on the extracted material.

The “Creative Destruction” Framing. An AI optimist would argue that this transformation is no different from previous technological disruptions — the printing press displaced scriptoriums, photography displaced portrait painters, recorded music displaced live performance as the primary revenue model. In each case, the old model was destroyed but a new, larger creative economy emerged. The objection is historically grounded but structurally incomplete. Previous disruptions created new distribution channels that preserved the link between creator and audience. The printing press still delivered the author’s specific text. Photography still captured the photographer’s specific vision. Weight encoding breaks the link: the model satisfies the demand without delivering the specific work. There is no new distribution channel that reconnects the creator to the redirected demand. This is not a claim that no new creative economy can emerge — it is a claim that the specific mechanism of value capture has been severed in a way that previous disruptions did not sever it.

The Scope Limitation. The mechanism’s strongest claim is about Phase 1 text extraction from the open web. Its applicability to Phase 3 domains — video, robotics, scientific data — is uncertain and should not be assumed. Different data modalities have different information-theoretic properties, different legal frameworks, and different market structures. The three-phase model explicitly acknowledges this boundary: Phases 2 and 3 are ongoing, and their outcomes are less determined. The essay’s confidence range applies to the Phase 1 claim. Extending it to all future extraction events would be overclaiming.

The Commons Romanticism Objection. A Marxist critique would note that the “open web commons” was never truly a commons in the political-economic sense — it was always mediated by corporate platforms (Google Search, social media, ad networks) that extracted value from user-generated content long before AI training began. The AI laboratories are not the first enclosers; they are the latest. This objection is substantially correct and narrows the novelty claim. What MECH-033 adds is not the first enclosure of digital content but a specific technical transformation that makes this particular enclosure practically permanent. Previous enclosures by platforms were reversible in principle: move your blog, change platforms, opt out of ad networks. Weight encoding is not reversible by individual action. The mechanism’s contribution is the lock, not the fence.

Falsification Conditions

This essay is wrong if:

1. A court orders model weight destruction AND compliance produces measurable degradation. The transformation is practically irreversible only if no institutional mechanism exists to compel reversal. No court has ordered this. The Bartz settlement explicitly preserved model weights [Measured]². But the absence of precedent is not the absence of possibility.

2. Scalable machine unlearning achieves production-grade reliability under regulatory mandate — defined as selective removal of specific training data influences from billion-parameter models, at scale, without unacceptable degradation, under independently verifiable conditions. This is testable as the research matures.

3. Creator compensation reaches proportional scale — within an order of magnitude of contribution value measured by the proportion of model capability attributable to a creator’s work. Current compensation levels ($3,000 per work against billions in model revenue) are not within an order of magnitude [Measured]². But licensing markets are young.

4. Synthetic-only training matches human-data models across benchmarks, replicated independently. The Nature model collapse findings cut against this [Measured]⁷, but a single study is not a permanent finding.

5. Original human works maintain or increase their market value despite model existence. If traffic to open web sources stabilizes or recovers and creators’ economic positions improve even as model deployment scales, the value-functional rivalry claim is wrong.
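Condition 2's notion of "selective removal" can be made concrete at toy scale. The sketch below (all data and names are illustrative, not any laboratory's method) trains a small logistic-regression model, then applies gradient ascent on a designated "forget set" — one simple approximate-unlearning technique from the research literature. It shows both halves of the problem: forgetting measurably raises loss on the forget set, but because the weights are shared, the retained data's loss shifts too. Scaling this trade-off to billions of parameters without unacceptable degradation is precisely the open problem the condition describes.

```python
import numpy as np

# Toy sketch of approximate machine unlearning via gradient ascent.
# Illustrative only: real LLM-scale selective removal remains unsolved.

rng = np.random.default_rng(0)

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def grad(w, X, y):
    # Gradient of mean logistic loss with respect to the weights.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic "corpus": 200 examples, 5 features, labels from a true model.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

# Train on the full corpus (gradient descent).
w = np.zeros(5)
for _ in range(500):
    w -= 0.5 * grad(w, X, y)

# Designate a forget set (e.g. one creator's works) and a retain set.
forget, yf = X[:20], y[:20]
retain, yr = X[20:], y[20:]

loss_f_before = loss(w, forget, yf)
loss_r_before = loss(w, retain, yr)

# "Unlearn": ascend the loss on the forget set for a few steps.
for _ in range(50):
    w += 0.1 * grad(w, forget, yf)

loss_f_after = loss(w, forget, yf)
loss_r_after = loss(w, retain, yr)

print(loss_f_after > loss_f_before)  # forget-set loss rises...
print(loss_r_after != loss_r_before)  # ...but retained data is disturbed too
```

The second print is the crux: the ascent step cannot touch only the forget set's influence, because every weight encodes every example. Production-grade unlearning would have to bound that collateral movement verifiably — the "without unacceptable degradation" clause above.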

Bottom Line

Confidence calibration: 55-65% that the Phase 1 extraction event constitutes a durable structural transformation rather than a transitional friction that market mechanisms will resolve. Higher confidence (75-85%) that the transformation occurred as described. Lower confidence (40-50%) on whether irreversibility is a permanent feature versus a current limitation that machine unlearning research will overcome within a decade.

The binding uncertainty is whether scalable machine unlearning achieves production-grade reliability under regulatory mandate — if it does, MECH-033’s irreversibility claim collapses from “structural” to “expensive.” The economic incentives powerfully oppose this outcome. No AI laboratory benefits from machine unlearning at scale. The research funding comes primarily from academic institutions and regulatory requirements, not from the entities that would need to deploy it. This does not make progress impossible. It makes progress slower than it would be if the incentives were aligned.

The open web was not built by corporations. It was built by people — millions of them, writing for free, answering questions for strangers, documenting their knowledge for anyone who needed it. They built a commons. Between 2018 and 2024, that commons was compressed into proprietary weights. The compression was lossy. The contributors cannot be identified in the output. The economic function of their contributions has been captured by entities that did not build the commons but that possessed the computational resources to transform it. The transformation was legal, or is being made legal after the fact. It was technically irreversible, or is irreversible given current capabilities. And it was the enabling condition for everything that followed.

The lock is on the fence. The question is not whether the commons was enclosed. It was. The question is whether we build new commons knowing what we now know about how the old one ended — or whether we pretend that the next commons will be treated differently by the same institutions that enclosed the last one.

Where This Connects

The irreversibility described here is the enabling condition for The Cognitive Enclosure (MECH-007). That essay described the systemic conversion of open knowledge into proprietary cognitive capital. This essay names the mechanism that makes that conversion permanent: once the open web is encoded into model weights, the enclosure cannot be reversed by legal order, technical intervention, or market correction.

The dependency loop operates in both directions. Thinking in the Red (MECH-028) traced the hidden costs of AI as cognitive partner — creators increasingly depend on tools built from their own extracted work, deepening the dependency that makes resistance to extraction economically irrational at the individual level.

The capital dynamics that lock in the extraction are the subject of The Ratchet (MECH-014). The approaching $700 billion in 2026 infrastructure spend is not merely sustained by the extraction — it requires it. Sunk capex in training infrastructure makes the corpus-to-weights pipeline a commitment that tightens with each funding round.

The downstream epistemic consequences flow into The Epistemic Liquidity Trap (MECH-016). As extraction degrades the open web and model-generated content floods back into it, the cost of reality-grounded knowledge rises while the cost of plausible synthetic output falls.

The temporal ordering matters. The Sequencing Problem (MECH-022) demonstrated that mechanism-order effects determine transition paths. The extraction preceded meaningful legal frameworks by years — a sequencing failure that made the irreversibility structural rather than contingent.

That sequencing failure was not accidental. The Regulatory Inversion (MECH-031) documented how AI firms converted democratic governance into a legitimation ceremony. The regulatory capture held the gate open during the critical extraction window.

The Liability Vacuum (MECH-032) provided the legal environment in which extraction could proceed without consequence. The five channels of liability displacement meant that the harms of extraction attached to no one during the Phase 1 window when blocking it was still technically possible.

The gradual, invisible nature of the extraction mirrors The Dissipation Veil (MECH-013). The structural damage accumulated below the threshold of public attention until the encoding was irreversible.

Finally, the infrastructure that processes the extracted corpus is concentrated in the hands described by Compute Feudalism (MECH-029). The open web became proprietary weights, and the weights run on proprietary inference stacks controlled by vertically integrated fiefdoms.

Sources

  1. https://www.afslaw.com/perspectives/alerts/landmark-ruling-ai-copyright-fair-use-vs-infringement-bartz-v-anthropic — “Landmark Ruling on AI Copyright: Fair Use vs. Infringement in Bartz v. Anthropic,” ArentFox Schiff, 2025. [verified]
  2. https://copyrightalliance.org/participating-bartz-v-anthropic-settlement/ — “What to Know About the $1.5 Billion Bartz v. Anthropic Settlement,” Copyright Alliance, 2025. [verified]
  3. https://reutersinstitute.politics.ox.ac.uk/how-many-news-websites-block-ai-crawlers — “How many news websites block AI crawlers?”, Reuters Institute for the Study of Journalism, 2025. [verified]
  4. https://blog.cloudflare.com/content-independence-day-no-ai-crawl-without-compensation/ — “Content Independence Day: No AI Crawl Without Compensation,” Cloudflare Blog, July 2025. [verified]
  5. https://www.preprints.org/manuscript/202603.0114 — “Machine Unlearning in Large Language Models: A Survey of Challenges and Methods,” Preprints.org, 2026. [verified]
  6. https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data — “Will we run out of data? Limits of LLM scaling based on human-generated data,” Epoch AI. [verified]
  7. https://www.nature.com/articles/s41586-024-07566-y — Shumailov, I. et al. “AI models collapse when trained on recursively generated data,” Nature 631, 755-759, 2024. [verified]
  8. https://natlawreview.com/article/openai-loses-privacy-gambit-20-million-chatgpt-logs-likely-headed-copyright — “OpenAI Loses Privacy Gambit: 20 Million ChatGPT Logs Likely Headed to Copyright Plaintiffs,” National Law Review, January 2026. [verified]
  9. https://www.abajournal.com/news/article/chatgpt-creator-must-turn-over-20m-chat-logs-in-copyright-litigation-federal-judge-says — “ChatGPT creator must turn over 20M chat logs in copyright litigation,” ABA Journal, January 2026. [verified]
  10. https://www.skadden.com/insights/publications/2025/07/fair-use-and-ai-training — “Fair Use and AI Training: Two Recent Decisions Highlight the Complexity of This Issue,” Skadden, July 2025. [verified]
  11. https://www.ropesgray.com/en/insights/viewpoints/102lvxe/getty-image-loses-copyright-infringement-claim-against-stability-ai-in-uks-first — “Getty Image Loses Copyright Infringement Claim Against Stability AI in UK’s First-of-its-Kind Ruling,” Ropes & Gray, November 2025. [verified]
  12. https://www.cjr.org/analysis/reddit-winning-ai-licensing-deals-openai-google-gemini-answers-rsl.php — “Reddit Is Winning the AI Game,” Columbia Journalism Review. [verified]
  13. https://variety.com/2024/digital/news/news-corp-openai-licensing-deal-1236013734/ — “News Corp Inks OpenAI Licensing Deal Potentially Worth More Than $250 Million,” Variety, 2024. [verified]
  14. https://authorsguild.org/news/harpercollins-ai-licensing-deal/ — “HarperCollins AI Licensing Deal,” The Authors Guild, 2024. [verified]
  15. https://www.buzzstream.com/blog/publishers-block-ai-study/ — “Which News Sites Block AI Crawlers in 2025?”, BuzzStream, 2025. [verified]
  16. https://www.cnbc.com/2026/02/06/google-microsoft-meta-amazon-ai-cash.html — “Tech AI spending approaches $700 billion in 2026,” CNBC, February 2026. [verified]
  17. https://futurumgroup.com/insights/ai-capex-2026-the-690b-infrastructure-sprint/ — “AI Capex 2026: The $690B Infrastructure Sprint,” Futurum Group, 2026. [verified]
  18. https://artificialintelligenceact.eu/article/53/ — “Article 53: Obligations for Providers of General-Purpose AI Models,” EU Artificial Intelligence Act. [verified]

Published by the Recursive Institute. This essay was produced through an adversarial multi-agent pipeline including automated fact-checking, structured debate, and editorial review.