One canonicalization rule for cannibalization, duplicate content, and crawl budget

I worked on a people-search site for 14 years. Most catalog sites at scale have the same problem: they publish more URLs than search intent justifies. The waste shows up as URL cannibalization — multiple pages competing for the same query, the site competing with itself, ranking signal pooling in the wrong place.

The standard fix is to set canonical tags from filtered URLs back to the parent. Some sites do the opposite — every facet self-canonical. Both are wrong half the time, because the right answer depends on something the template doesn't know.

The right question is simpler than the canonical decision suggests:

For this URL, where does search intent actually live?

Search intent lives at entities, not properties. If a URL identifies a distinct entity, it earns its own canonical. If it's only a property of an entity represented elsewhere, canonical it to the entity and let the title handle query-match.

Apply that recursively at every facet level. That's the whole principle. The rest is implementation.

01 / collapse · One entity, many properties

/mac/lip-liner/whirl
/mac/lip-liner/pomegranate
/mac/lip-liner/spice
/mac/lip-liner/velvet

↓ rel="canonical"

/mac/lip-liner

Title differentiates: "MAC Lip Liner Whirl", "MAC Lip Liner Pomegranate"…
Same product. Different shade. One canonical.

02 / split · Same string, many entities

query string · "John Smith"

↓ disambiguates to

/john-smith/atlanta-ga
/john-smith/tucson-az
/john-smith/new-york-ny

Each is self-canonical. Title disambiguates further: "John Smith — Atlanta, GA".
Different person. Different canonical.

Same rule, opposite outcomes. The variant is either a property of one entity (collapse to entity, title handles the property) or a different entity sharing the same string (split into separate canonicals, title disambiguates which one).

01 / domain one · people-search

A site indexing 100M+ records ends up with thousands of John Smiths. URLs look like:

/john-smith
/john-smith/california
/john-smith/california/los-angeles

The naive answer is to canonical the filtered URLs to the root. That's correct sometimes and wrong other times — because where search intent lives depends on the name itself.

For an uncommon name (Rajesh Chatraptra), one record each in Maine, New York, and California usually means relocation, not three people. Entity resolution collapses them into one entity. Root identifies that entity. The states become biographical properties: different angles on the same Rajesh, not separate Rajeshes. All state pages canonical to root.

collapse · people-search · One person, three addresses

records

Rajesh Chatraptra · 12 Pine St · Bangor, ME · 2014–2017
Rajesh Chatraptra · 418 Atlantic Ave · Brooklyn, NY · 2017–2021
Rajesh Chatraptra · 2200 Kettner Blvd · San Diego, CA · 2021–now

↓ entity resolution · overlapping address-history graph

one entity · Rajesh Chatraptra (ME → NY → CA)

↓ rel="canonical"

state URLs

/rajesh-chatraptra/maine
/rajesh-chatraptra/ny
/rajesh-chatraptra/california

↓ all canonical to root

/rajesh-chatraptra

Title handles the property: "Rajesh Chatraptra — addresses in Maine, New York, California"
One person relocating across three states, not three people. One entity. One canonical. State pages still serve query-match for "Rajesh Chatraptra Maine" via title — without splitting authority across three URLs.

The uncommon-name case. n=1 in three states is statistically relocation, not coincidence — so entity resolution merges the records into one person, and canonicalization reflects that. The states are biographical properties of one entity, not three distinct entities sharing a string.

For a common name (John Smith), the same pattern doesn't collapse. One John Smith in Tucson, one John Smith in NY: these are not plausibly the same person. They're different entities at different cities. The root is an aggregator of many distinct John Smiths, and the recursion has to drill down until it identifies a specific one.

That's the unifying mechanic underneath everything in this article:

Recursively disambiguate until an entity is identified. Stop. Everything below that depth becomes a property of the entity identified there.

Rajesh: entity identified at root (one person globally, scattered states) → all state/city facets are properties → all canonical to root.
John Smith Tucson, John Smith NY: entity not identified at root (many distinct people) → recurse down → each city URL identifies a specific entity → each self-canonical.
John Smith Ohio with n=200: entity not identified at the state level either (still 200 distinct people) → state is an aggregator entity → self-canonical → keep recursing into cities.

The naive version of this rule is to count records at each facet and canonical to root whenever the count drops to one. That gets it almost right. The case it gets wrong is worth working through.

The Toronto case

John Smith in Toronto, Ontario, Canada. John Smith in Toronto, Ohio, USA. Both pages have one record at the facet. Naive count says: both canonical to root.

But these are two distinct people at the same name × city string: different entities, different professions, different biographies. Canonical-to-root would collapse two distinct entity surfaces into one disambiguation listing, and neither would have a canonical URL to rank for "John Smith Toronto Canada" or "John Smith Toronto Ohio."

Each entity needs its own canonical. The fix isn't a special case; it's reading the right signal. Count was a proxy for entity identity that silently failed when a common name carried multiple distinct entities scattered sparsely across facets. The corrected condition: state pages canonical to root only when there's one entity globally for the name. If multiple distinct entities exist at the same name, each entity URL is its own canonical surface.

e_global = distinct_entities(name)         // entities after resolution
e_facet  = distinct_entities(name, state)  // entities at this facet

if e_facet == 0                          → don't generate
if e_global == 1                         → all facet URLs canonical to root
                                           (one entity total; facets are properties)
if e_facet == 1 and e_global >= 2        → state URL self-canonical
                                           (distinct entity scoped to this facet)
if e_facet >= 2                          → state URL self-canonical
                                           → recurse into cities

The corrected rule reads off entity identity, not raw count. That's the case the count rule got wrong, and the case the entity/property frame gets right by construction.
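The decision table above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the `Decision` enum and the function name `decide_state_url` are hypothetical, and the sketch assumes both counts come from an entity-resolution step that has already run.

```python
from enum import Enum


class Decision(Enum):
    SKIP = "don't generate"
    CANONICAL_TO_ROOT = "canonical to root"
    SELF_CANONICAL = "self-canonical"
    SELF_CANONICAL_RECURSE = "self-canonical, recurse into cities"


def decide_state_url(e_global: int, e_facet: int) -> Decision:
    """Map entity counts for (name) and (name, state) to a canonical decision."""
    if e_facet < 0 or e_facet > e_global:
        # A facet cannot hold more entities than exist for the name overall.
        raise ValueError("expected 0 <= e_facet <= e_global")
    if e_facet == 0:
        return Decision.SKIP
    if e_global == 1:
        # One entity total: every facet is a property of it.
        return Decision.CANONICAL_TO_ROOT
    if e_facet == 1:
        # Several entities share the name, but only one lives at this facet.
        return Decision.SELF_CANONICAL
    # Two or more entities at this facet: aggregator page, keep drilling down.
    return Decision.SELF_CANONICAL_RECURSE
```

With these definitions, the Toronto case is `decide_state_url(2, 1)`, which lands on self-canonical rather than canonical-to-root.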

The template doesn't decide this. The data does.

02 / domain two · e-commerce variants

The same question shows up in any product catalog. A camera or phone comes in multiple SKUs: body-only, body-with-lens, refurbished, different storage tiers, different colors, different bundles. URLs could be:

/leica-q3
/leica-q3-body-only
/leica-q3-with-lens
/leica-q3-refurbished

Same predicate as people-search: does this URL identify a different entity from its parent canonical? The signals change; the question doesn't.

MAC Lip Liner Whirl. The shade is a property of the line entity — same product, different angle. Canonical to the entity; per-shade title for query-match.
iPhone 14 256GB. Apple treats this as a property (one product page, configurable) — canonical to the line entity. Best Buy treats it as a distinct SKU entity (separate inventory, separate price tracking) — self-canonical at the variant. Both are correct. The rule respects whichever entity granularity the catalog has chosen.
Leica Q3 black vs. silver. Color is a descriptive property of the same body — same price, same warranty, same SKU pool. Canonical to the entity.
Leica Q3 refurbished. Different price, different warranty, different stock pool — a separate commercial entity. Self-canonical.
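One way to make the variant test concrete, as a sketch only: compare the commercial attributes the bullets above mention (price, warranty, stock pool) and call two URLs the same entity when all three match. The `Variant` class, the field names, the `/leica-q3-silver` URL, and the price and warranty figures are all hypothetical illustration values, and a real catalog may draw the boundary on different attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    url: str
    price_cents: int      # hypothetical figures, for illustration only
    warranty_months: int
    stock_pool: str


def same_commercial_unit(a: Variant, b: Variant) -> bool:
    """Assumed test: same price, same warranty, same stock pool => same entity."""
    return (a.price_cents, a.warranty_months, a.stock_pool) == (
        b.price_cents, b.warranty_months, b.stock_pool)


def canonical_for_variant(variant: Variant, line: Variant) -> str:
    """Canonical to the line page for properties; self-canonical for distinct units."""
    return line.url if same_commercial_unit(variant, line) else variant.url


line = Variant("/leica-q3", 600_000, 24, "new")
silver = Variant("/leica-q3-silver", 600_000, 24, "new")
refurbished = Variant("/leica-q3-refurbished", 500_000, 12, "refurb")

print(canonical_for_variant(silver, line))
print(canonical_for_variant(refurbished, line))
```

Swapping the tuple in `same_commercial_unit` is how the Apple and Best Buy models would differ: one catalog leaves storage out of the comparison, the other includes it.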

Structurally analogous to people-search:

MAC Whirl                  ≅  Rajesh Chatraptra in Maine
                              (property of one entity → canonical to entity)

iPhone 14 256GB (Best Buy) ≅  John Smith Ohio with n=200
                              (distinct entity → self-canonical)

People-search asks "is this a different person?" E-commerce asks "is this a different commercial unit?" Same binary, same recursion, same canonical decision. The catalog model determines what counts as an entity. The rule doesn't impose that — it just respects the model.

Where the entity boundary sits is itself a decision.

The bullets above treat the entity boundary as if it falls out of the catalog data — which it usually does, but not always. Sometimes the boundary is a deliberate SEO call driven by demand.

"Red prom dress" is the canonical case. Year one, treat color as a property: parent line page, all colors canonical upward. Year three, GSC shows ~6,200 monthly impressions for "red prom dress" at average position 14 — striking distance for top-three placement. Promote color from property to entity: a dedicated /dresses/prom/red, self-canonical, title disambiguates further ("Red Prom Dresses for 2026 — 84 styles"). The rule fires identically before and after; what moved is the entity boundary, not the rule.

The rule takes "what is an entity?" as input. It does not compute it. The decision is the SEO team's, and it's driven by demand signals as much as by catalog model. The catalog model decides where the boundary can sit; demand decides where it should.
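A demand check like the "red prom dress" call can be written down as a simple threshold test. This is a sketch under stated assumptions: the function name, the 5,000-impression floor, and the position range of 4 to 20 are hypothetical defaults chosen for illustration, not values from any tool or from the case above.

```python
def should_promote(monthly_impressions: int,
                   avg_position: float,
                   min_impressions: int = 5000,
                   striking_range: tuple = (4.0, 20.0)) -> bool:
    """Flag a property (e.g. color) as a candidate for promotion to an entity.

    Both thresholds are assumptions: enough demand to justify a page, and an
    average position that is within reach but not already in the top three.
    """
    lowest, highest = striking_range
    enough_demand = monthly_impressions >= min_impressions
    within_reach = lowest <= avg_position <= highest
    return enough_demand and within_reach
```

Under these defaults, 6,200 impressions at average position 14 qualifies. The output is only a candidate flag; the promotion itself remains a team decision.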

03 / the principle generalizes

The rule isn't really about people-search or e-commerce. It's a property of any catalog with a faceted URL hierarchy:

real estate · /properties/austin/east-side/under-500k
jobs · /engineer/python/remote/senior
travel · /hotels/lisbon/by-the-river
reviews · /restaurants/portland/thai
local services · /plumber/brooklyn/emergency

For each URL in any of these hierarchies, the same binary applies: does this URL identify an entity, or a property of one already represented elsewhere? What counts as an entity is domain-specific — listing, role, hotel, restaurant. The decision isn't. The template doesn't know the answer. The data does.

04 / formula

If you want a single-line statement of the rule, here it is. For any node u in a faceted URL tree:

canonical(u) =
    u                       if intent_lives_at(u)
    canonical(parent(u))    otherwise

The recursion is the load-bearing part. If intent doesn't live at u, the canonical isn't the parent automatically — it's whatever the parent's canonical resolves to. So /john-smith/wyoming/cheyenne with one record canonicals to /john-smith/wyoming, where the first stable entity bucket is identified.

The predicate has one job: ask whether u identifies a different entity from its parent canonical.

intent_lives_at(u) ≜  ¬ same_entity_as(u, canonical(parent(u)))

Same entity → u is a property of the parent's entity (a different angle, a filtered view, a variant). Canonical to parent; let the title handle query-match.

Different entity → u has its own ranking target. Self-canonical.
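The recursion can be sketched directly in Python. Two assumptions are baked in that the formula leaves open: a URL's parent is found by dropping its last path segment, and the root of the tree is always its own canonical. The `ENTITY` lookup is a hypothetical stand-in for whatever entity resolution produces.

```python
from typing import Callable, Optional


def parent(url: str) -> Optional[str]:
    """Drop the last path segment; a single-segment URL is the root (no parent)."""
    segments = url.strip("/").split("/")
    if len(segments) <= 1:
        return None
    return "/" + "/".join(segments[:-1])


def canonical(url: str, same_entity_as: Callable[[str, str], bool]) -> str:
    """canonical(u) = u if intent lives at u, else canonical(parent(u))."""
    p = parent(url)
    if p is None:
        return url  # assumed base case: the root is self-canonical
    parent_canonical = canonical(p, same_entity_as)
    if same_entity_as(url, parent_canonical):
        return parent_canonical  # property of an entity already represented
    return url  # different entity: intent lives here


# Hypothetical entity ids per URL, standing in for real resolution output.
ENTITY = {
    "/rajesh-chatraptra": "rajesh",
    "/rajesh-chatraptra/maine": "rajesh",
    "/john-smith": "john-smith-listing",
    "/john-smith/wyoming": "john-smith-wy",
    "/john-smith/wyoming/cheyenne": "john-smith-wy",
}


def same_entity(u: str, v: str) -> bool:
    return ENTITY[u] == ENTITY[v]


print(canonical("/rajesh-chatraptra/maine", same_entity))
print(canonical("/john-smith/wyoming/cheyenne", same_entity))
```

The second call stops at `/john-smith/wyoming` rather than climbing to the root, because the comparison is always against the parent's resolved canonical, not the parent URL itself.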

Domains differ only in how they evaluate same_entity_as — what counts as an entity depends on the catalog model:

People-search. Entity = distinct person. Different person at the facet ⇒ different entity.
E-commerce. Entity = whatever the catalog treats as a commercial unit. Apple treats iPhone 14 as the entity (storage is a property). Best Buy treats iPhone 14 256GB as the entity (per-SKU inventory). Both are correct — the rule respects whichever model the catalog uses.
Real estate, jobs, travel. Entity = listing, role, hotel, restaurant: whatever your catalog's commercial granularity is. The rule doesn't impose entity granularity; it respects what you've decided your entities are.

The formula separates two concerns:

Direction: where does intent live? Universal recursive structure. Doesn't change between domains.
Detection: does this URL identify an entity? Domain-specific. Swap the entity definition, keep the rest.

That's why the same rule fits everywhere with no rewrite: only the entity definition changes. One predicate, one recursion, many domains.

05 / tools, not theorems

The rule decides direction — where intent lives — not aggressiveness. There are several tools for actually pruning:

Self-canonical — keep the URL; intent lives here
Canonical to parent — pool authority upward, URL still crawlable
noindex, follow — drop from index but preserve crawl depth
301 redirect — strongest consolidation, URL eliminated entirely
410 Gone — full deletion when the URL has no value at all
robots.txt disallow — block crawl entirely

For most facet pages, canonical is the right tool — it's reversible and lets the URL exist for direct-URL access. But the rule itself isn't tied to canonical. It's tied to where intent lives, and the tool depends on how aggressively you want to prune.
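For reference, here is roughly what each tool emits, sketched as a Python function that returns the HTTP status, headers, head tags, and robots.txt lines for one URL. The `Tool` enum, the `signal_for` function, and the `https://example.com` origin are hypothetical; the tags, status codes, and `Disallow` syntax themselves are standard.

```python
from enum import Enum
from html import escape

BASE = "https://example.com"  # hypothetical origin; canonical hrefs should be absolute


class Tool(Enum):
    SELF_CANONICAL = "self-canonical"
    CANONICAL_TO_PARENT = "canonical to parent"
    NOINDEX_FOLLOW = "noindex, follow"
    REDIRECT_301 = "301 redirect"
    GONE_410 = "410 Gone"
    ROBOTS_DISALLOW = "robots.txt disallow"


def signal_for(tool: Tool, path: str, target: str = "") -> dict:
    """Describe the response for `path` under the chosen pruning tool."""
    if tool in (Tool.CANONICAL_TO_PARENT, Tool.REDIRECT_301) and not target:
        raise ValueError("this tool needs a target path")
    out = {"status": 200, "headers": {}, "head": [], "robots_txt": []}
    if tool is Tool.SELF_CANONICAL:
        out["head"].append(f'<link rel="canonical" href="{escape(BASE + path)}">')
    elif tool is Tool.CANONICAL_TO_PARENT:
        out["head"].append(f'<link rel="canonical" href="{escape(BASE + target)}">')
    elif tool is Tool.NOINDEX_FOLLOW:
        out["head"].append('<meta name="robots" content="noindex, follow">')
    elif tool is Tool.REDIRECT_301:
        out["status"] = 301
        out["headers"]["Location"] = BASE + target
    elif tool is Tool.GONE_410:
        out["status"] = 410
    elif tool is Tool.ROBOTS_DISALLOW:
        # Fragment only: a real robots.txt also needs a User-agent line above it.
        out["robots_txt"].append(f"Disallow: {path}")
    return out
```

Keeping the tool as a separate argument mirrors the split in the text: the rule picks the target, and the tool is a second, independent choice about how hard to prune.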

One caveat worth noting: Google treats <link rel="canonical"> as a suggestion, not a directive. It can ignore your tag if it disagrees with it. The rule says where authority should flow. Google still decides where it actually flows. In practice the alignment is high but not perfect, which is another reason this is a heuristic, not a theorem.

06 / two outputs, not one

The rule operates on two layers. Canonicals identify entities. Titles, snippets, and on-page content identify properties of entities. They use different signals because they answer different questions.

Ulta is a clean example. For "MAC Lip Liner Whirl," they publish a per-shade URL with a per-shade title, a per-shade swatch, and shade-specific reviews — but with a rel=canonical pointing to the consolidated line page. Two layers, both correct:

Canonical → entity. The shade is a property of the MAC Lip Liner line. Same entity as the parent canonical. The canonical points at the entity (root).
Title → property. The shade name is the property the searcher used. The per-shade page exposes the shade title, swatch, and reviews — the property surface for that query.

Both queries get the right answer because the layers don't compete:

For "MAC lip liner" (entity query), the line page surfaces — authority pools there, title matches there.
For "MAC Whirl" (property query), the per-shade page surfaces — its title carries the property token, and Google routes the query to the URL whose title matches it. Authority still belongs to the entity, but the index entry that matches the query lives on the property page.

Canonicals identify entities. Titles identify properties of entities. Both layers are doing what they're supposed to.

This resolves the apparent contradiction with "titles affect ranking." Titles rank at the property layer (for property-specific queries). Canonicals rank at the entity layer (by pooling authority). They operate on different surfaces. Google reconciles them at query time: rel=canonical is a suggestion, so when a property-page title is a much stronger match, Google can route the query there even if the canonical points elsewhere.

Takeaway: don't conflate the entity layer with the property layer. Canonicals are for entities. Titles, snippets, schema, and on-page content are for properties. Treating both as one decision loses surface area. Treating them as two lets entity queries land on entities, property queries land on properties, and authority concentrate where it belongs.

07 / what i'd do differently today

Three things, looking back.

Count was a proxy, not the thing itself.

The people-search rule used name-string-count. Today I'd condition on entity disambiguation rather than name string — two records with the same name but distinct employers, distinct LinkedIn profiles, or distinct address histories should count as two entities, not two records of one. The count usually got this right but not always, and the proxy is worth replacing.

The resolution itself is cheap when records carry address history — which they almost always do in aggregated people-search data. The signal isn't name rarity or a Bayesian prior; it's direct overlap. Build a graph where each record is a node and an edge exists between two records whose addresses overlap (current address in the other's history, or shared prior addresses). Take connected components. Each component is an entity.

graph G:
  nodes: records with name = N
  edge(r1, r2): r1.current ∈ r2.history  OR
                r2.current ∈ r1.history  OR
                |r1.history ∩ r2.history| ≥ threshold

e_global = count(connected_components(G))
e_facet  = count(connected_components touching state s)

For Rajesh Chatraptra in Maine, NY, and California, each record's address history references the others — every pair has an edge → one component → one entity → all states canonical to root. For John Smith Tucson and John Smith NY, histories don't overlap → two components → two entities → each city self-canonical. The Toronto case resolves the same way: disjoint histories, two components.

No Bayes factors required when the data carries the linkage signal directly. The recursion in the rule operates on the component count; the component count comes from a single graph traversal. That's how cheap it is.
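The graph construction above fits in a short Python sketch using union-find for the connected components. Several things here are assumptions for illustration: the `Record` fields, the default `threshold` of 2, the Rajesh history sets, and the pairwise comparison, which is fine within a single name bucket but would need blocking for very large ones.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Record:
    record_id: str
    state: str
    current: str                      # current address
    history: frozenset = frozenset()  # prior addresses


def linked(r1: Record, r2: Record, threshold: int) -> bool:
    """Edge test from the pseudocode: address overlap in either direction."""
    return (r1.current in r2.history
            or r2.current in r1.history
            or len(r1.history & r2.history) >= threshold)


def resolve_entities(records: list, threshold: int = 2) -> list:
    """Group same-name records into entities via union-find."""
    root = {r.record_id: r.record_id for r in records}

    def find(x: str) -> str:
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for r1, r2 in combinations(records, 2):
        if linked(r1, r2, threshold):
            root[find(r1.record_id)] = find(r2.record_id)

    components = {}
    for r in records:
        components.setdefault(find(r.record_id), []).append(r)
    return list(components.values())


def entity_counts(records: list, state: str, threshold: int = 2) -> tuple:
    """Return (e_global, e_facet) for one name and one state facet."""
    components = resolve_entities(records, threshold)
    e_facet = sum(1 for c in components if any(r.state == state for r in c))
    return len(components), e_facet


bangor, brooklyn, san_diego = "12 Pine St", "418 Atlantic Ave", "2200 Kettner Blvd"
rajesh = [
    Record("r1", "ME", bangor),
    Record("r2", "NY", brooklyn, frozenset({bangor})),
    Record("r3", "CA", san_diego, frozenset({bangor, brooklyn})),
]
print(entity_counts(rajesh, "ME"))
```

The two counts returned here are exactly the `e_global` and `e_facet` inputs the decision table expects.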

The rule should have run at more facet levels.

It worked at state and city. There were facets — employer, school, age range — where extending the same logic would have been correct, and the implementation didn't reach that far.

AI Overviews add a layer the rule doesn't address.

The rule still governs the indexable layer, which is still what AI Overviews (AIO) consume. But which entity gets cited in the AI summary is a different question from which URL ranks as a blue link, and the rule doesn't reach that question on its own. The rule remains the foundation; AIO sits on top.

— / the principle, named

URL hierarchies should match search-intent hierarchies, not catalog hierarchies. A facet URL keeps its own canonical only when it identifies an entity the parent canonical doesn't already represent. Otherwise it's a property — and properties are handled with the title, not the canonical.

Canonicals identify entities. Title handles properties.

Same principle whether the facet is geography, variant, role, or anything else. The same recursion at every depth: stop when you've identified the entity; everything below is property.

Practical heuristic: if the facet change is biography, canonical up. If the facet change is identity, stay separate.

Three second-order benefits fall out for free. The same rule prevents three distinct failure modes:

Cannibalization. Property URLs canonical to their entity ⇒ authority concentrates at the entity ⇒ URLs don't compete with each other for the same query.
Duplicate/thin content. Every canonical URL has substantive content: the entity itself. If a URL would have had only thin or derivative content, it shouldn't have been its own canonical in the first place.
Crawl budget. Googlebot's crawl signal concentrates on entities, not on the long tail of property variants. Programmatic sites with millions of URLs stop wasting crawl on pages that don't deserve indexing: the canonical signal tells Google where to spend, and where not to.

All three failure modes have the same root cause: treating properties as if they were entities. One rule fixes all three.

Step back from those three failure modes and look at what else falls out. Filtered URLs canonical to the entity. Only entities in the XML sitemap. Internal-link blocks target entities. Crawl budget concentrates on entities. Schema specificity sits at entity nodes, not facet leaves. The decision is the entity. Everything else is respecting it.
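As one example of "everything else respecting the decision", an entities-only XML sitemap falls out of the canonical mapping almost for free. A minimal sketch, assuming a hypothetical `canonical_of` dict (URL path to resolved canonical path) and a hypothetical `https://example.com` origin; only self-canonical URLs are emitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_of: dict, base: str) -> str:
    """Emit a <urlset> containing only URLs that are their own canonical."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, target in sorted(canonical_of.items()):
        if path != target:
            continue  # property URL: its entity is listed instead
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = base + path  # sitemap <loc> values must be absolute URLs
    return ET.tostring(urlset, encoding="unicode")


canonical_of = {
    "/mac/lip-liner": "/mac/lip-liner",
    "/mac/lip-liner/whirl": "/mac/lip-liner",
    "/mac/lip-liner/spice": "/mac/lip-liner",
}
sitemap_xml = build_sitemap(canonical_of, "https://example.com")
print(sitemap_xml)
```

The same filter (`path == target`) could drive internal-link blocks and schema placement, which is the sense in which the entity decision propagates.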

Apply that recursively across a catalog of millions of URLs, and what looks like a canonical-tag policy starts looking like the design of an indexable ontology. The work itself stays small: pick the entities; let the rest follow.

The formulation isn't original. Google's duplicate-content guidance points the same direction, and the established SEO literature describes the same pattern through pagination, view-all pages, and faceted navigation discussions. What's missing in the literature is the binary: entity, or property of one. That's the question the rule actually answers, and that's the part worth naming.

May 5, 2026 · Dasara Kushi
