The second web · why every URL is becoming two URLs

A user clicks a product link in their browser. The browser asks for text/html, gets HTML, paints pixels.

Twenty seconds later ChatGPT, asked the same question, fetches the same URL. It receives the same HTML, runs a simplifier, and throws away the nav, footer, scripts, modals, the recommendations carousel, and twenty trackers. What survives is a few paragraphs of prose, an image alt attribute, and maybe a price.

Both clients asked for the same thing. Both got the same response. One was useful as designed. The other was useful only after destruction.

Every URL is becoming two URLs.

The browser still wants the polished page. The AI agent wants something else: clean prose, structured fields, machine-stable identifiers. We have been giving both consumers the same response and hoping the parser is smart enough.

It is — but barely. The fragility shows up everywhere: outdated price quotes, hallucinated stock status, mixed-up product variants, citations to obsolete URLs. These aren't model failures. They're parser failures. We made AI agents extract content that was designed for eyeballs.

01 / what AI agents do with HTML

Watch what an LLM-powered fetch actually retains:

Strips: scripts, styles, iframes, tracking pixels, hidden inputs, navigation menus, footers, cookie banners, modal dialogs, "you may also like" carousels, ad slots, social-share widgets, video players, comment threads, recommendation rails.
Keeps: heading text, paragraph text, list items, link anchors, image alt attributes, table cells, occasional structured-data blocks if the parser remembers to check.

Depending on the page, agent extraction strips 70–95% of the bytes you served. The remaining 5–30% is the only signal that survived to model context.

Now ask: when you wrote that page, were you writing for the 5–30% or the 70–95%? If you wrote for the 70–95% (the polished design, the recommendations, the conversion machinery), the AI agent saw an accidental subset of the actual signal.
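
To make the strip/keep split above concrete, here is a minimal sketch of a simplifier built on Python's standard html.parser. The DROPPED list, the sample page, and the function names are illustrative assumptions, not a description of what any particular agent runs, and the depth tracking assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Subtrees a simplifier is assumed to discard wholesale (illustrative list).
DROPPED = {"script", "style", "nav", "footer", "aside", "iframe", "noscript"}
# Void elements have no closing tag, so they must not change nesting depth.
VOID = {"img", "br", "hr", "input", "meta", "link"}


class Simplifier(HTMLParser):
    """Collects visible text and image alt attributes outside dropped subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if tag == "img" and self.skip_depth == 0:
                alt = dict(attrs).get("alt")
                if alt:
                    self.kept.append(alt)
            return
        if tag in DROPPED or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.kept.append(text)


def simplify(html: str) -> list[str]:
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return parser.kept


def retained_fraction(html: str) -> float:
    """Share of the served bytes that survive simplification."""
    kept = " ".join(simplify(html))
    return len(kept.encode("utf-8")) / len(html.encode("utf-8"))


# Hypothetical page, far smaller than a real product page.
PAGE = """<html><head><style>.x{color:red}</style>
<script>track("view")</script></head><body>
<nav><a href="/">Home</a><a href="/sale">Sale</a></nav>
<h1>Coffee Classic Candle</h1>
<img src="candle.jpg" alt="Candle in amber glass">
<p>Hand-poured 8 oz coconut-soy blend.</p>
<footer><p>Terms</p></footer>
</body></html>"""

print(simplify(PAGE))
print(round(retained_fraction(PAGE), 2))
```

Only the heading, the alt text, and the one paragraph come back. Running retained_fraction over your own templates is a quick way to see how much of each page an extractor of this kind would keep.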

02 / HTML is the wrong shape for machines

Three reasons HTML resists clean machine extraction:

Visual hierarchy ≠ semantic hierarchy. A <div class="text-3xl"> is visually a heading. To a parser it's a div. Atomic spacing classes, design tokens, and component-driven markup are illegible to extractors.
Client-side rendering hides content. SPAs, lazy-loaded sections, intersection-observer hydration, all-data-in-Redux: the signal often only exists after JS runs. Most LLM fetchers do not run JavaScript. A 5MB React bundle's worth of "content" is invisible.
The page is mostly chrome. Even on a perfectly server-rendered page, the actual content is a small island in a sea of nav, sidebars, related-content rails, ads, footers, modals, and tracking. The signal-to-noise ratio is hostile.

JSON-LD helps — it's the closest thing to a structured payload riding alongside HTML. But it's optional, often incomplete, and hard to keep in sync with what the page actually displays.
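
As a rough illustration of the JSON-LD point, the sketch below pulls application/ld+json blocks out of a page using only the standard library. The page is hypothetical and the class and function names are made up for this example; note that the visually styled div carries no heading semantics, while the JSON-LD block carries the typed fields.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self.buffer)))
            except json.JSONDecodeError:
                pass  # malformed block; a real pipeline would log it


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page: the "heading" is only a styled div.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Coffee Classic Candle",
 "offers": {"@type": "Offer", "price": 48, "priceCurrency": "USD"}}
</script></head>
<body><div class="text-3xl">Coffee Classic Candle</div></body></html>"""

product = extract_jsonld(PAGE)[0]
print(product["name"], product["offers"]["price"])
```

The catch named above still applies: nothing in this extraction tells you whether the price in the block matches the price the page displays.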

One source URL: /products/coffee-classic-candle

Consumer · browser

Request:
GET /products/coffee-classic-candle
Accept: text/html

Response: 200 OK · text/html
Polished page. Nav, hero, product gallery, color picker, reviews carousel, related items, footer, scripts, modals.

Consumer · AI agent (follows <link rel="alternate">)

Request:
GET /products/coffee-classic-candle/llms.md
Accept: text/markdown

Response: 200 OK · text/markdown
Structured payload. YAML frontmatter (Product schema, stable @id, dateModified), clean prose, variant table. No chrome.

Same source. Two consumers. The discovery mechanism is <link rel="alternate" type="text/markdown"> — exactly how RSS, hreflang, and AppLinks have always advertised alternate representations.
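
On the publishing side, the two declarations might be produced like this. This is a sketch under assumptions: the function names are hypothetical, and expressing the alternate's canonical through an HTTP Link header and keeping it out of search with X-Robots-Tag are one plausible mechanism, not something this article's demos are confirmed to use.

```python
from html import escape


def alternate_link_tag(markdown_url: str) -> str:
    """<link> element for the <head> of the HTML page."""
    href = escape(markdown_url, quote=True)
    return f'<link rel="alternate" type="text/markdown" href="{href}" />'


def markdown_response_headers(canonical_url: str) -> dict[str, str]:
    """Headers for the markdown response (assumed mechanism, see lead-in)."""
    return {
        "Content-Type": "text/markdown; charset=utf-8",
        # Points back at the HTML page as the canonical representation.
        "Link": f'<{canonical_url}>; rel="canonical"',
        # Markdown has no <meta> tag, so indexing hints travel as a header.
        "X-Robots-Tag": "noindex",
    }


print(alternate_link_tag("/products/coffee-classic-candle/llms.md"))
print(markdown_response_headers("https://example.com/products/coffee-classic-candle"))
```

The HTML page points forward to the markdown, and the markdown response points back to the HTML page, so either one is enough to find the other.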

03 / RSS solved this in 2002

When someone publishes a blog post, they put a magic line in the HTML head:

<link rel="alternate" type="application/rss+xml" href="/feed.xml" />

Browsers ignore it. RSS readers follow it. Both consumers visit the same URL but they end up with different shapes of the same content. The pattern is older than RSS:

hreflang uses <link rel="alternate" hreflang="es"> to point Spanish browsers to a Spanish version.
AMP used <link rel="amphtml"> to point mobile crawlers to a stripped-down page.
AppLinks used <meta property="al:ios:url"> to redirect from web URLs to native apps.

Every time the consumer split — international users, mobile users, native-app users — we used the same primitive: declare an alternate, let the consumer choose. We never needed a new spec. We needed a new payload.

For AI: declare a markdown alternate, let agents fetch it.

<link rel="alternate" type="text/markdown" href="/llms.md" />

That's the whole spec. RSS readers respected it for twenty years. Why wouldn't AI agents?
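
From the agent's side, following that declaration takes very little code. The sketch below is one way an agent could discover the markdown alternate; AlternateFinder and find_markdown_alternate are hypothetical names, and no specific agent is claimed to work this way.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class AlternateFinder(HTMLParser):
    """Records href values of <link rel="alternate"> with a wanted MIME type."""

    def __init__(self, wanted_type: str = "text/markdown"):
        super().__init__()
        self.wanted_type = wanted_type
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        if "alternate" in rels and mime == self.wanted_type and a.get("href"):
            self.hrefs.append(a["href"])


def find_markdown_alternate(page_url: str, html: str) -> Optional[str]:
    """Returns the absolute URL of the first markdown alternate, if any."""
    finder = AlternateFinder()
    finder.feed(html)
    finder.close()
    return urljoin(page_url, finder.hrefs[0]) if finder.hrefs else None


# Hypothetical head with both an RSS and a markdown alternate.
HEAD = """<head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml" />
<link rel="alternate" type="text/markdown" href="/products/coffee-classic-candle/llms.md" />
</head>"""

print(find_markdown_alternate("https://example.com/products/coffee-classic-candle", HEAD))
```

A page that declares only an RSS feed returns None, and the agent falls back to reading the HTML.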

04 / llms.txt is a manifest, not the answer

Jeremy Howard's /llms.txt proposal solves site-level discovery: a manifest at the root of the site that tells AI consumers where the important content lives. Useful. Necessary. Not sufficient.

llms.txt is a directory. The bifurcation pattern is per-page content delivery. Both layers belong:

Site-level (llms.txt): here are my key pages, ranked, with summaries.
Per-page (rel=alternate): here is this page in machine-readable form.

An agent answering a specific question doesn't want a site map. It wants the page. The page should announce its own machine-readable form, the same way every blog already announces its RSS feed.
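
For the site-level layer, here is a small sketch of reading a manifest. It assumes the layout used in the llms.txt proposal as commonly described (H2 sections containing markdown link lists, each link optionally followed by a colon and a note); the sample manifest and its URLs are invented for illustration.

```python
import re

# "- [Title](url): optional note"
LINK_LINE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Maps each H2 section name to a list of (title, url, note) tuples."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_LINE.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["note"] or "")
                )
    return sections


# Hypothetical manifest.
MANIFEST = """# Cafuné Atelier

> Hand-poured candles.

## Products

- [Coffee Classic Candle](https://example.com/products/coffee-classic-candle/llms.md): 8 oz coconut-soy candle

## Optional

- [About](https://example.com/about)
"""

for section, links in parse_llms_txt(MANIFEST).items():
    print(section, links)
```

The manifest tells an agent where to start. The per-page alternate is still what it reads once it gets there.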

05 / what goes in the markdown variant

The markdown payload should look nothing like a CMS dump. It's a published artifact. Three layers:

---
"@context": "https://schema.org"
"@type": "Product"
"@id": "https://example.com/products/coffee-candle#sku-8910995"
name: "Coffee Classic Candle"
brand: { "@type": "Brand", name: "Cafuné Atelier" }
offers:
  "@type": "Offer"
  price: 48
  priceCurrency: "USD"
  availability: "https://schema.org/InStock"
dateModified: "2026-05-08T18:00:00Z"
---

# Cafuné Atelier Coffee Classic Candle

Hand-poured 8 oz coconut-soy blend. 50-hour burn time.

## Variants
- Espresso (color=000)
- Vanilla Bean (color=001)
- Cardamom (color=002)

YAML frontmatter — JSON-LD in friendlier clothes. Schema.org type, stable @id, dateModified, all the structured fields a model needs to identify the entity.
Prose — the actual content, cleaned. No nav, no chrome, no recommendations, no UI. Just headings, paragraphs, lists. Markdown is the format LLMs were trained on most heavily; they read it natively.
Provenance metadata. Demo flag if it's a demo, publisher info, last refresh, source quality notes. Tell the model what it's looking at and how stale it might be.
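
A sketch of how the three layers could be rendered from one record. It relies on the fact that JSON values are generally accepted by YAML parsers as flow syntax, so json.dumps handles quoting without a YAML library; render_markdown, the PRODUCT dict, and placing _demo in the frontmatter are assumptions for illustration, not the demos' actual code.

```python
import json

# Hypothetical record, mirroring the example frontmatter above.
PRODUCT = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": "https://example.com/products/coffee-candle#sku-8910995",
    "name": "Coffee Classic Candle",
    "brand": {"@type": "Brand", "name": "Cafuné Atelier"},
    "offers": {
        "@type": "Offer",
        "price": 48,
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
    "dateModified": "2026-05-08T18:00:00Z",
    "_demo": True,  # provenance flag
}

VARIANTS = [("Espresso", "000"), ("Vanilla Bean", "001"), ("Cardamom", "002")]


def render_markdown(entity: dict, title: str, prose: str, variants: list) -> str:
    lines = ["---"]
    for key, value in entity.items():
        # JSON-encoded keys and values double as YAML flow syntax.
        lines.append(f"{json.dumps(key)}: {json.dumps(value, ensure_ascii=False)}")
    lines += ["---", "", f"# {title}", "", prose.strip(), "", "## Variants"]
    lines += [f"- {name} (color={code})" for name, code in variants]
    return "\n".join(lines) + "\n"


print(render_markdown(
    PRODUCT,
    "Cafuné Atelier Coffee Classic Candle",
    "Hand-poured 8 oz coconut-soy blend. 50-hour burn time.",
    VARIANTS,
))
```

The nested brand and offers objects come out as single-line flow mappings rather than the indented block style shown above. Both are valid YAML, and a YAML library would give finer control over the layout.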

Two demos on this site are working examples:

Product page — Nordstrom-style URL with ?color=000&origin=…&breadcrumb=…. The HTML renders the variant; the markdown is full Product schema with all variants enumerated and each query parameter classified (variant vs tracking).
Person page — Spokeo-style profile. The HTML is a profile card. The markdown is the same profile, but as Person schema with an address graph, source-quality notes, and a _demo: true flag.

Both pairs are reachable today. Open both in tabs.
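
The parameter classification in the product demo could be sketched roughly as follows. Which keys count as variant and which as tracking is an assumption made here for illustration, the origin and breadcrumb values are placeholders, and dropping the whole query string for the canonical URL follows the rule that filtered variants canonicalize to the entity.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

VARIANT_PARAMS = {"color"}                   # assumed: selects the displayed variant
TRACKING_PARAMS = {"origin", "breadcrumb"}   # assumed: attribution only


def classify_params(url: str) -> dict:
    """Sorts each query parameter into variant, tracking, or unknown."""
    result = {"variant": {}, "tracking": {}, "unknown": {}}
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if key in VARIANT_PARAMS:
            bucket = "variant"
        elif key in TRACKING_PARAMS:
            bucket = "tracking"
        else:
            bucket = "unknown"
        result[bucket][key] = value
    return result


def canonical_url(url: str) -> str:
    """Entity URL with query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Placeholder values for origin and breadcrumb.
URL = ("https://example.com/products/coffee-classic-candle"
       "?color=001&origin=placeholder&breadcrumb=placeholder")

print(classify_params(URL))
print(canonical_url(URL))
```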

06 / publish entities, not pages

The first instinct after reading this far is to bulk-convert every URL on a site into a markdown alternate. That's the wrong instinct. Most pages aren't entities. They don't need a feed.

An entity page is the canonical home of a stable, identifiable thing. The question to ask before adding an alternate: would an AI answer ever cite this page as a source?

Strong yes — alternate makes sense:

Product pages (/products/coffee-candle)
Person profiles (/people/john-smith-atlanta-ga)
Location pages (/restaurants/the-laundry-yountville)
Organization pages (/companies/anthropic)
Article / explainer pages (/notes/the-second-web)
Definition / glossary pages (/glossary/canonicalization)
Recipe / how-to pages (/recipes/coffee-candle-pour)

Weak no — alternate is wasted effort:

Homepage — its job is brand, not citation
Category and listing pages — navigation, not entities
Search-result pages — ephemeral; AI does this work itself
Tracking-bound landing pages
Filtered / faceted variants of an entity — these canonicalize to the entity, not alongside it

This mirrors how schema-markup decisions already get made today. You don't put Product schema on the homepage. You don't put Recipe schema on a category listing. The same logic applies to bifurcation: only the canonical page of each entity earns the alternate.

Stop counting URLs. Start counting entities.

If the search layer moves up to the AI — and the trajectory says it will, on a one-to-two-year horizon — the work compresses. You stop optimizing the site as a search engine (category pages, internal nav, faceted listings, related-content rails). You start publishing entities. Each entity has:

One canonical URL
One HTML representation (for browsers)
One markdown + JSON-LD representation (for AI agents)
One title that disambiguates variants

That's the work. The rest of the site — the chrome around the entities — doesn't need an alternate at all. It exists for navigation and brand. AI doesn't care.

A site with 100,000 URLs and 8,000 distinct entities should publish 8,000 markdown alternates, not 100,000. The other 92,000 are paths to entities, not the entities themselves.
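
Counting entities instead of URLs can start as a crude path filter. The sketch below assumes entity pages live directly under the path prefixes listed in this section; real sites will need their own rules, and entity_key is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlsplit

# Prefixes taken from the "strong yes" list above; one slug segment each (assumed).
ENTITY_PREFIXES = (
    "/products/", "/people/", "/restaurants/", "/companies/",
    "/notes/", "/glossary/", "/recipes/",
)


def entity_key(url: str) -> Optional[str]:
    """Canonical path if the URL is an entity page, else None."""
    path = urlsplit(url).path.rstrip("/")
    for prefix in ENTITY_PREFIXES:
        slug = path[len(prefix):]
        if path.startswith(prefix) and slug and "/" not in slug:
            return path
    return None


def entities(urls: list) -> list:
    """Distinct entity paths among the crawled URLs."""
    return sorted({key for key in map(entity_key, urls) if key})


# Hypothetical crawl sample.
URLS = [
    "https://example.com/",
    "https://example.com/candles",
    "https://example.com/search?q=coffee",
    "https://example.com/products/coffee-candle",
    "https://example.com/products/coffee-candle?color=001",
    "https://example.com/products/coffee-candle/",
    "https://example.com/glossary/canonicalization",
]

found = entities(URLS)
print(len(URLS), "URLs,", len(found), "entities:", found)
```

Seven URLs collapse to two entities here, which is the same compression as 100,000 URLs to 8,000 alternates, only smaller.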

07 / where search lives, now and next

The deepest implication of bifurcation isn't "publish two formats." It's that the search layer moves up. Today, every website implements a mini search engine inside itself — category pages, faceted nav, internal search bar, recommendation rails, sort controls. All of that infrastructure exists because the site has to do its own retrieval. The site is a search engine wearing brand clothes.

User intent: "best coffee candles under $50"

Today · search inside the site

Google SERP
↓
site homepage
↓
category page
↓
faceted filters
↓
internal search
↓
★ product page (browse · decide · buy)

Discovery, decision, transaction: all on the site.

Next · search at the model layer

AI engine (ChatGPT · Perplexity · Claude)
↓ retrieval
fetches entity feeds (rel=alternate · llms.txt)
↓ grounding
evidence-fits candidates (does each source support the claim?)
↓ citation
★ answer with cited source (browse · decide happen here)
↓ optional
entity page (transaction only)

AI does discovery and decision. The site is the transaction terminal.

Where search lives shifts up the stack. The retrieval mechanisms a site once owned — category nav, faceted filters, internal search — move to the AI layer. What remains: a catalog of canonical entities, served in two formats. The starred row is where the user spends their attention; notice it moves.

In the AI-mediated future, the AI is the search engine. The site is just a catalog of canonical entities. The retrieval features that used to live inside the site move up to the model layer. What's left on the site is the entity itself — in two formats — plus a transaction surface.

This is what AI engines mean when they talk about grounding. The model retrieves candidate sources, scores each one's evidence-fit against the claim it's about to make, picks the strongest two or three, and assembles an answer that cites them. Once that pipeline runs at the model layer, the site doesn't need to host its own search engine. It needs to publish a catalog of citable entities, and get out of the way.
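
The shape of that pipeline (score, filter, pick the top few) can be shown with a deliberately crude toy. Term overlap stands in for evidence-fit here purely for illustration; it is an assumption of this sketch and not how any named engine scores sources, and the candidate texts are invented.

```python
import re


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evidence_fit(claim: str, source_text: str) -> float:
    """Toy score: fraction of the claim's terms that appear in the source."""
    claim_terms = tokens(claim)
    if not claim_terms:
        return 0.0
    return len(claim_terms & tokens(source_text)) / len(claim_terms)


def pick_sources(claim: str, candidates: dict, k: int = 3, threshold: float = 0.5) -> list:
    """Returns up to k candidate URLs that clear the threshold, best first."""
    scored = [(evidence_fit(claim, text), url) for url, text in candidates.items()]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored[:k]]


# Hypothetical candidates: an entity feed and a brand page.
CANDIDATES = {
    "https://example.com/products/coffee-classic-candle/llms.md":
        "Coffee Classic Candle price 48 USD InStock",
    "https://example.com/about":
        "Our story: a candle studio",
}

print(pick_sources("Coffee Classic Candle costs 48 USD", CANDIDATES))
```

Even with a scorer this naive, the entity feed that states its fields plainly clears the bar and the brand page does not.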

Stop building the search engine. Start publishing the catalog.

The job of the site shrinks. The job of the SEO doesn't disappear — it shifts. Each entity needs to be reachable, well-described, and disambiguated. That is the canonicalization rule applied at site scale: one canonical URL per entity, one HTML representation, one markdown alternate, one title that disambiguates variants. Build the catalog. Let the AI do the search.

08 / objections

"This is cloaking." No. Cloaking serves different content to bots than to humans at the same URL. Bifurcation serves the same content in two formats from clearly distinct URLs. The HTML page declares the alternate; the alternate declares its canonical. Anyone (Google, ChatGPT, you) can read both and see they describe the same entity.
"AI agents won't follow alternate links." Some don't yet. So the markdown URL also works standalone: reachable directly, listed in llms.txt and in sitemap.xml. A capable agent finds it. A naive agent reads the HTML. Both work.
"This duplicates effort." Only if you're writing twice. The right setup has one source of truth (typed data layer, CMS, database) and renders both views from it. The two demos here both pull from a single _data.ts.
"Search engines will see this as duplicate content." Different MIME types. The text/markdown URL is noindex for search (for a non-HTML response, that is typically an X-Robots-Tag: noindex header); it exists for agents. The HTML URL is the canonical for search. The rel="alternate" declaration is exactly the disambiguation Google asks for.
"What about the gap between formats?" The markdown version is allowed to be smaller. Omit trust signals, social proof, conversion CTAs, recommendations. Focus on the answerable claim. That's the entire point.

09 / SEO splits in two

The work splits cleanly:

HTML SEO stays where it has always been: site architecture, internal links, schema, Core Web Vitals, content cluster strategy, conversion design.
AI-content SEO is something else: Schema.org fluency in markdown, freshness signals, machine-stable identifiers, evidence-fitness scoring, alternate-link discoverability.

The roles overlap on canonicalization (same entity boundaries either way — see One canonicalization rule) and on freshness (both formats reflect the same source). They diverge on layout, design, and conversion plumbing.

If you're an SEO who codes — you ship both. If you're an SEO who specs — you spec both and have engineering ship them off the same source.

The browser still wants the polished page. The AI agent wants something else.

— / closing

The web didn't bifurcate because we wanted it to. It bifurcated because the consumer split. Browsers used to be the only readers. Then crawlers. Now models. Each new consumer asked "is the existing format right for me?" and each time the honest answer was almost.

For browsers: rich HTML. For crawlers: HTML plus structured data. For models: HTML's chrome budget is hostile, so — markdown plus structured frontmatter, advertised the way RSS taught us to advertise.

Two formats. One source. One declaration in the head. The infrastructure has been here for twenty years.

Open the demos:

/mockups/product-page — try ?color=001 and watch both formats update.
/mockups/person-page — try the bare URL and the .md alternate side by side.

The pattern is real. The implementation is small. The only thing missing is the convention.

May 8, 2026 · Dasara Kushi
