Methodology
How we measure AI search visibility
An honest, public explanation of what the scores mean, how we calculate them, and — importantly — what they cannot tell you. Our methodology is the thing we’re most proud of; it deserves to be in the open.
Last updated: June 2026 · The field changes fast, so we re-verify key statistics quarterly.
What we measure
GEO Tool measures two different things, and it’s important to understand both:
- AI Readiness (the free audit). A deterministic, heuristic score based on signals we extract directly from your public page — robots.txt rules, structured data, content extractability, freshness signals, and technical health. This runs in seconds and costs nothing because it requires no live AI API calls. It tells you whether your page is positioned to be found by AI crawlers and parsers.
- Measured Visibility (paid tier). Live grounded queries sent to the actual AI engines (ChatGPT, Perplexity, Gemini, and Claude) to measure whether your brand actually appears in answers. This is a retrieval-based measurement, not a training-memory check. See “Measured Visibility (live grounded queries)” below for the full detail.
The distinction matters. A Readiness score tells you about the technical prerequisites for AI visibility; Measured Visibility tells you the actual outcome. Think of Readiness as “is the door unlocked?” and Measured Visibility as “are people actually walking through it?”
The six readiness signals
The Readiness score aggregates six signal categories. Each is independently measurable from your public page with no AI API call.
AI Crawler Access
High
What: Whether the major AI crawlers can reach your content at all.
How we measure it: We parse your robots.txt for explicit Disallow rules targeting GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended. If a crawler is blocked, that engine cannot index your page — it is categorically excluded from being cited, regardless of content quality.
Evidence: Consensus: crawler access is a binary gate. 87% of ChatGPT citations match Bing’s top-10 results (Seer Interactive, 2025), so Bing indexing — which requires OAI-SearchBot access — is the most critical single signal for ChatGPT visibility.
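The robots.txt check described above can be sketched with Python's standard-library `urllib.robotparser`. This is an illustration under stated assumptions, not GEO Tool's actual implementation: the function name `blocked_ai_crawlers` and the sample robots.txt are hypothetical. Unlike a strict "explicit Disallow" check, the standard parser also applies wildcard (`User-agent: *`) rules to each crawler.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in the methodology text.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def blocked_ai_crawlers(robots_txt: str, page_url: str) -> list[str]:
    """Return the AI crawlers that robots.txt forbids from fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, page_url)]


# Hypothetical robots.txt: GPTBot is blocked, every other agent is allowed.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_ai_crawlers(SAMPLE_ROBOTS, "https://example.com/pricing")
print(blocked)
```

A crawler that appears in the returned list would be treated as blocked for the corresponding engine.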
Structured Data
High for Gemini, Medium for others
What: JSON-LD schema markup that lets AI engines parse your content as structured facts.
How we measure it: We scan your page's JSON-LD blocks for FAQPage, Article, HowTo, Organization, and BreadcrumbList types. FAQPage is the highest-priority type: a 2026 study of 1,508 real estate websites found FAQPage schema present on 6.2% of ChatGPT-visible sites versus only 0.8% of non-visible sites.
Evidence: Probable: controlled study (Schanbacher, 2026, n=1,508). Other schema types show mixed or no correlation in independent studies — do not over-index on schema as a silver bullet.
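As a rough sketch of the JSON-LD scan described above, the following uses only the Python standard library to pull `application/ld+json` blocks out of a page and report which of the five target types are present. The class and function names (`JsonLdCollector`, `detect_schema_types`) and the sample HTML are hypothetical. The handling of `@graph` and nested nodes is an assumption about how such a scan might work.

```python
import json
from html.parser import HTMLParser

# Schema types named in the methodology text.
TARGET_TYPES = {"FAQPage", "Article", "HowTo", "Organization", "BreadcrumbList"}


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def collect_types(node, found):
    """Walk a parsed JSON-LD value and record every @type string it contains."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)


def detect_schema_types(html: str) -> set[str]:
    """Return the target schema types declared anywhere in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    found = set()
    for raw in collector.blocks:
        try:
            collect_types(json.loads(raw), found)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the whole scan.
    return found & TARGET_TYPES


SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "FAQPage", "mainEntity": [{"@type": "Question", "name": "What is GEO?"}]},
  {"@type": "Organization", "name": "Example Co"}
]}
</script>
</head><body><p>Hello</p></body></html>
"""

found_types = detect_schema_types(SAMPLE_HTML)
```

Malformed JSON-LD is skipped silently here. A production audit would probably want to report it as an issue instead.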
Extractability
Highest for Claude, High for others
What: How easily an AI can identify and quote the most relevant passage from your page.
How we measure it: We look for answer-first structure (direct answer in the first ~200 words), heading count, FAQ-style Q&A pairs, lists and tables, and Flesch readability. Pages that answer the query upfront get cited from their first third of content 44% of the time (practitioner research). Readability in the 50–70 Flesch range is optimal for AI citation.
Evidence: Probable: practitioner consensus plus the Princeton GEO paper finding that content structure signals correlate with citation lift (n=10,000 queries, Perplexity).
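The readability part of this signal can be illustrated with the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below is an approximation only: the syllable counter is a crude vowel-group heuristic chosen for brevity, and the function names are hypothetical rather than taken from GEO Tool.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count vowel groups, discount a trailing 'e'."""
    lowered = word.lower()
    groups = re.findall(r"[aeiouy]+", lowered)
    count = len(groups)
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str):
    """Return the Flesch Reading Ease score, or None for empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def in_target_band(score, low=50.0, high=70.0) -> bool:
    """Check a score against the 50-70 range cited in the text."""
    return score is not None and low <= score <= high


score = flesch_reading_ease("The cat sat. The dog ran.")
```

Very short, simple sentences score well above 70, so the sample text here falls outside the target band. Real pages would be scored on their full extracted body text.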
Authority and Factual Density
Highest for Perplexity, High for Gemini
What: How well the page signals its own trustworthiness through factual content.
How we measure it: We count statistics and percentages (numerical claims increase citation probability by ~41% per the Princeton GEO paper), outbound links to authoritative domains, in-text citations, and overall word count (a proxy for depth). These proxy for E-E-A-T signals, which appear in 96% of Google AI Overview citations.
Evidence: Proven: Princeton GEO paper (statistics: +41%, citations: +34%, n=10,000 queries). E-E-A-T correlation from Wellows study (practitioner, 2025). Brand mentions correlation with AIO: r=0.664 — the strongest known predictor we can measure from the page itself.
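Counting statistics and percentages, as described above, might look like the sketch below. The regular expressions, the function name `factual_density`, and the per-100-words normalisation are all assumptions made for illustration. Outbound-link and in-text citation counting are left out to keep the example short.

```python
import re

# Assumed patterns: a percentage is a number followed by '%',
# a numeric claim is any number, with optional decimal or thousands separators.
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def factual_density(text: str) -> dict:
    """Count simple numeric signals in a block of page text."""
    word_count = len(text.split())
    numbers = NUMBER_RE.findall(text)
    percentages = PERCENT_RE.findall(text)
    per_100_words = 100 * len(numbers) / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "numeric_claims": len(numbers),
        "percentages": len(percentages),
        "numeric_claims_per_100_words": per_100_words,
    }


stats = factual_density("Revenue grew 41% in 2025, up from 34.5% a year earlier.")
```

Note that every percentage is also counted as a numeric claim, so the two counts overlap by design in this sketch.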
Freshness
Highest for Perplexity
What: Whether your page tells AI engines how recent its information is.
How we measure it: We check for datePublished and dateModified in JSON-LD, sitemap lastmod, and HTML meta date tags. We compute how many days ago the most recent signal was. Perplexity's retrieval pipeline applies an aggressive recency filter — pages without freshness signals are treated as potentially stale.
Evidence: Consensus: Perplexity explicitly documents freshness as a ranking factor. No quantified study isolates the citation lift from freshness signals alone, but practitioner consensus is strong.
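The "days since the most recent signal" calculation can be sketched as follows. It assumes the date strings have already been extracted from JSON-LD, the sitemap, and meta tags, and that they are in ISO 8601 form. The dictionary keys and function names are hypothetical, and dates without a timezone are treated as UTC.

```python
from datetime import datetime, timezone


def parse_signal(value: str):
    """Parse an ISO 8601 date or datetime string; return None if it is not parseable."""
    try:
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # Assumption: naive dates are UTC.
    return parsed


def days_since_freshest(signals: dict, now: datetime):
    """Return whole days since the most recent date signal, or None if there is none."""
    parsed = (parse_signal(v) for v in signals.values() if v)
    dates = [d for d in parsed if d is not None]
    if not dates:
        return None
    return (now - max(dates)).days


# Hypothetical extracted signals for one page.
page_signals = {
    "datePublished": "2026-01-10",
    "dateModified": "2026-05-01T12:00:00Z",
    "sitemap_lastmod": None,
}
age_days = days_since_freshest(page_signals, datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc))
```

A `None` result corresponds to a page with no freshness signals at all, which the text above says may be treated as potentially stale.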
Technical Health
High for ChatGPT, Medium for others
What: Whether the page is actually fetchable and parseable by an AI crawler.
How we measure it: We check for HTTPS, a valid HTTP 200 status, viewport meta tag (mobile-friendliness proxy), and whether the page depends heavily on JavaScript rendering. AI crawlers typically fetch raw HTML without executing JavaScript — JS-heavy pages may appear nearly blank to them.
Evidence: Consensus on HTTPS and JS rendering. Page speed effect: pages with FCP under 0.4s receive 6.7 ChatGPT citations versus 2.1 for FCP over 1.13s in one practitioner study — but this has not been independently replicated and should be treated as indicative only.
Note on llms.txt
We flag the absence of an /llms.txt file as an informational (not critical) issue. Current evidence is against its importance: Ahrefs’ May 2026 study of 137,000 domains found 97% of llms.txt files received zero traffic from any AI bot. We include the signal because the spec might gain adoption — but we will not overstate its current effect.
Per-engine scoring
Each AI engine weights the six signal categories differently because they have genuinely different retrieval architectures. Rather than presenting a single “AI score,” we compute a per-engine Readiness score and then average them for the overall headline number.
| Signal | ChatGPT | Perplexity | Gemini | Claude |
|---|---|---|---|---|
| Crawler Access | 25% | 20% | 20% | 20% |
| Structured Data | 15% | 10% | 25% | 20% |
| Extractability | 20% | 15% | 15% | 30% |
| Authority | 15% | 25% | 20% | 15% |
| Freshness | 10% | 25% | 10% | 5% |
| Technical | 15% | 5% | 10% | 10% |
ChatGPT weights crawler access highly because 87% of its citations come from Bing’s top-10, and Bing requires OAI-SearchBot access. Perplexity weights freshness and authority most because its six-stage RAG pipeline aggressively filters for recency and factual density. Gemini most values structured data and authority signals, reflecting Google’s E-E-A-T framework. Claude puts extractability first because it is the most selective engine, with 97.8% of citations from established, clearly structured sources.
Any engine that is blocked in robots.txt receives a 55% penalty on its blended score (the score is multiplied by 0.45), reflecting that blocked engines cannot index or cite the page at all.
The overall Readiness score is the arithmetic mean of the four per-engine scores, each clamped to [0, 100].
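The weighting, penalty, clamping, and averaging steps above can be put together in a short sketch. The weights and the 0.45 multiplier come from the text. The assumption that each signal is scored 0 to 100 and shared across engines, along with the signal keys and function names, is ours for illustration.

```python
# Per-engine weights from the table above, as fractions that sum to 1.0.
WEIGHTS = {
    "ChatGPT": {"crawler_access": 0.25, "structured_data": 0.15, "extractability": 0.20,
                "authority": 0.15, "freshness": 0.10, "technical": 0.15},
    "Perplexity": {"crawler_access": 0.20, "structured_data": 0.10, "extractability": 0.15,
                   "authority": 0.25, "freshness": 0.25, "technical": 0.05},
    "Gemini": {"crawler_access": 0.20, "structured_data": 0.25, "extractability": 0.15,
               "authority": 0.20, "freshness": 0.10, "technical": 0.10},
    "Claude": {"crawler_access": 0.20, "structured_data": 0.20, "extractability": 0.30,
               "authority": 0.15, "freshness": 0.05, "technical": 0.10},
}

BLOCKED_MULTIPLIER = 0.45  # A 55% penalty for engines blocked in robots.txt.


def engine_score(signals: dict, weights: dict, blocked: bool) -> float:
    """Weighted blend of signal scores, penalised if blocked, clamped to [0, 100]."""
    blended = sum(weight * signals[name] for name, weight in weights.items())
    if blocked:
        blended *= BLOCKED_MULTIPLIER
    return max(0.0, min(100.0, blended))


def readiness(signals: dict, blocked_engines=frozenset()):
    """Return (per-engine scores, overall score as their arithmetic mean)."""
    per_engine = {
        engine: engine_score(signals, weights, engine in blocked_engines)
        for engine, weights in WEIGHTS.items()
    }
    overall = sum(per_engine.values()) / len(per_engine)
    return per_engine, overall


# Hypothetical page scoring 80 on every signal, with ChatGPT's crawler blocked.
signals = {name: 80.0 for name in WEIGHTS["ChatGPT"]}
per_engine, overall = readiness(signals, blocked_engines={"ChatGPT"})
```

In this example the blocked engine scores 80 × 0.45 = 36, the other three score 80, and the headline number is (36 + 80 + 80 + 80) / 4 = 69.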
Measured Visibility (live grounded queries)
Readiness scores are a necessary but not sufficient condition for AI visibility. Measured Visibility goes further: we send real queries to the actual AI engines and measure whether your brand appears in the grounded, cited answers.
Why grounded queries, not training-memory probes
For all major consumer AI engines (ChatGPT Search, Perplexity, Google AI Overviews, Copilot, and Claude with web search), answers are grounded in live web retrieval at query time. Training memory decides whether an AI “knows” a brand exists; live retrieval decides whether it cites that brand in answers. GEO is primarily a retrieval problem.
We do not use base-model prompting (asking the AI what it “knows” about a brand without web search). That tests frozen training memory, is highly non-deterministic, prone to hallucination for less-known brands, and does not reflect what users actually see in grounded consumer engines.
Query set design
We build a curated query set covering three buckets: discovery queries (category-based, no brand names), comparison queries (evaluation-stage), and brand-direct queries. The mix is roughly 60/30/10 — discovery is weighted highest because that is where new audience is captured. A minimum of 30 queries per audit provides meaningful signal; the full standard audit uses 50.
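To make the bucket arithmetic concrete, the sketch below splits a query budget across the three buckets. The text says the mix is only roughly 60/30/10, so the exact shares, the bucket keys, and the largest-remainder rounding are assumptions made for this example.

```python
# Assumed exact shares in percent; the methodology describes the mix as approximate.
QUERY_MIX = {"discovery": 60, "comparison": 30, "brand_direct": 10}


def allocate_queries(total: int, mix: dict = QUERY_MIX) -> dict:
    """Split `total` queries across buckets, rounding by largest remainder."""
    counts = {bucket: total * share // 100 for bucket, share in mix.items()}
    remainders = {bucket: total * share % 100 for bucket, share in mix.items()}
    leftover = total - sum(counts.values())
    # Give any queries lost to rounding to the buckets with the largest remainders.
    for bucket in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[bucket] += 1
    return counts


standard_audit = allocate_queries(50)
minimum_audit = allocate_queries(30)
```

Under these assumptions, a standard 50-query audit would contain 30 discovery, 15 comparison, and 5 brand-direct queries, and a 30-query audit would contain 18, 9, and 3.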
Repetitions and statistical treatment
A single run of a single query is nearly meaningless. LLM non-determinism means the same query can produce different citations across runs, even at temperature=0 (batch-size variance at the GPU level cannot be eliminated). The Princeton GEO paper found 60–100 repetitions are needed to stabilise visibility percentages.
We run a minimum of 5 repetitions per query per engine. Citation rate is calculated as the fraction of runs where the brand was cited (brandCited = true). We use the Wilson score interval to compute a 95% confidence interval on this proportion.
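For readers who want to reproduce the interval, here is a minimal sketch of the citation rate and its Wilson score interval. The formula is the standard one with z = 1.96 for 95% confidence. The function names and the sample run data are hypothetical, with each run represented by its `brandCited` flag.

```python
import math


def wilson_interval(cited_runs: int, total_runs: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for a citation proportion."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    if not 0 <= cited_runs <= total_runs:
        raise ValueError("cited_runs must be between 0 and total_runs")
    p = cited_runs / total_runs
    z2 = z * z
    denominator = 1 + z2 / total_runs
    center = (p + z2 / (2 * total_runs)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total_runs + z2 / (4 * total_runs ** 2)) / denominator
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


def citation_rate(brand_cited_flags: list) -> float:
    """Fraction of runs in which the brand was cited."""
    return sum(brand_cited_flags) / len(brand_cited_flags)


# Hypothetical results for one query on one engine: 3 of 5 runs cited the brand.
runs = [True, False, True, True, False]
rate = citation_rate(runs)
low, high = wilson_interval(sum(runs), len(runs))
```

With only 5 repetitions the interval is wide: a 60% citation rate is consistent with a true rate anywhere from roughly 23% to 88%, which is why the interval is reported alongside the point estimate.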
API vs. consumer UI fidelity
We use each engine’s programmatic API with web search enabled. This is the industry standard and the only scalable approach, but it introduces known gaps:
- Consumer ChatGPT.com may include personalization and browsing history that APIs do not replicate.
- Gemini API with Google Search grounding and Google AI Overviews share only ~38.5% of top cited domains — a structural gap we explicitly disclose on every Google score.
- Perplexity documents potential divergence between its API and consumer UI.
We flag these limitations in every report that includes a Measured Visibility score. We believe transparency about measurement limitations is more valuable than false precision.
Confidence bands and volatility
AI visibility scores are estimates, not facts. Three sources of volatility affect any measurement:
- LLM non-determinism. Even at temperature=0, API responses vary due to GPU-level batch-size variance. A point-in-time score with no repetitions may vary ±30% in a rerun. Our Measured Visibility engine uses multiple repetitions and reports Wilson-score confidence intervals so you can see how stable each number is.
- Model and algorithm updates. OpenAI, Google, and Perplexity update their models continuously, often without public announcements. A score from six weeks ago may not reflect current reality. We show the measurement date on every score and flag scores older than 30 days.
- Geographic and personalization variation. The same query produces different answers across locations, languages, and account types. Our measurements are run from standardised locations. Results for other geographies may differ.
The Readiness score (free audit) is much more stable than Measured Visibility because it tests deterministic signals — robots.txt, structured data, HTML content — not stochastic AI outputs. A Readiness score change reflects a real change to your page.
What this does and doesn't tell you
We think intellectual honesty about measurement limits is what separates a useful tool from a convincing-looking score generator. Here is what this tool explicitly cannot tell you:
We do not measure organic traffic from AI.
AI citation and AI-referral traffic are different things. A brand can be heavily cited and still see flat referral traffic — AI platforms increasingly answer in-UI without generating external clicks (Digiday, 2025). Our score tells you about citation probability, not traffic volume.
Citation rate does not guarantee revenue impact.
AI referral traffic converts at higher rates than traditional organic (multiple studies find 5–23× higher conversion than Google organic), but the absolute volume remains under 1% of web traffic. A high visibility score is a quality-of-positioning signal, not a revenue forecast.
Readiness scores are about prerequisites, not guarantees.
A Readiness score of 90/100 means your page has the right technical signals in place. It does not guarantee you will be cited. Off-site factors — Reddit presence, brand mentions across the web, Wikipedia/Wikidata entity recognition, review site profiles — are major citation drivers we cannot measure from your page alone.
We cannot measure what we cannot crawl.
Our audit fetches your public page as an AI crawler would. If content is behind authentication, loaded entirely via JavaScript without server rendering, or on subdomains we haven't been given, we cannot assess it.
Scores are not comparable across tools.
GEO Tool's score of 65 cannot be compared to a 65 from Otterly, Peec, Profound, or any other tool. Every tool uses a different query set, different repetition counts, different engine mixes, and different weighting. There is no industry standard. Our score is calibrated and internally consistent — it can track your improvement over time, but it is not an industry absolute.
Research and sources
Our methodology is built on published academic research and primary practitioner studies. The field moves fast; we re-verify key statistics quarterly. Evidence strength labels: Proven = controlled study or replicated across multiple studies; Probable = single study or strong practitioner consensus with plausible mechanism; Folklore = widely repeated, no study.
Princeton / IIT Delhi / Georgia Tech GEO paper (Aggarwal et al., KDD 2024)
Statistics addition +41%, cite sources +34%, quotation addition +28% citation lift. n=10,000 queries on a Bing Chat-style system, validated on Perplexity.
Proven
Google AI Overviews measurement study (arxiv:2605.14021, June 2026)
AIO trigger rate 13.7% overall, 64.7% for question-form queries. AIO-cited domains more credible than first-page organic (0.732 vs 0.645). 30% of AIO citations do not appear in top organic results.
Proven
Ahrefs llms.txt study (May 2026, 137,000 domains)
97% of llms.txt files received zero AI bot traffic. 96% of requests were from automated audit tools, not AI engines.
Proven
Seer Interactive ChatGPT/Bing correlation study (2025)
87% of ChatGPT Search citations match Bing's top-10 organic results.
Proven
Wellows E-E-A-T and AIO study
96% of AIO citations from high E-E-A-T sources. Named authors 2.3× more likely cited. Domain authority correlation with AIO: r=0.18. External brand mentions correlation: r=0.664.
Probable
SE Ranking Reddit study
Domains with 50+ Reddit mentions show citation index 22.5 vs 10.4 for 11–50 mentions.
Probable
Schanbacher 2026 FAQ schema study (n=1,508 real estate websites)
FAQPage schema on 6.2% of ChatGPT-visible sites vs 0.8% of non-visible sites.
Probable
OtterlyAI llms.txt 90-day experiment
Only 0.1% of AI crawler requests touched /llms.txt.
Proven
Semrush 2026 AI Visibility Index (126M US AI search prompts)
45% of marketing leaders cannot accurately measure their AI visibility. Scaled from 2,500 to 126M prompts between Jan–Apr 2026.
Proven
Superprompt.com conversion analysis (12M website visits)
AI referral traffic converts at 14.2% vs Google organic at 2.8%.
Probable
See your score now
Run a free audit and get your Readiness score across ChatGPT, Gemini, Perplexity, and Claude in under 30 seconds. No signup required.