Blog · Development · 06/09/2026

Optimizing a website for SEO and AI search: when ChatGPT, Claude, and Perplexity read your pages too

Alongside Googlebot, crawlers feeding ChatGPT, Claude, Perplexity, and Google's AI Overviews now read website content directly to synthesize answers, not to rank pages. This post covers the concrete technical changes: robots.txt for AI crawlers, JSON-LD structured data, writing 'quotable' content, and why Core Web Vitals still matter.

Development · SEO · AI Search · Structured Data · 8 min read

By KonexForge Engineering Team

For years, "optimizing a website" was nearly synonymous with "optimizing for Google": sitemaps, meta tags, backlinks, and Core Web Vitals all aimed at one goal, a position on the search engine results page (SERP). But since ChatGPT, Perplexity, Claude, and Google's AI Overviews started answering users' questions directly, with or without citing sources, a new kind of "reader" has emerged: AI crawlers, which gather content not to rank pages but to synthesize it into a single answer. This article breaks down the concrete technical changes a website (especially a technical blog, product documentation, or a services page) needs to consider while traditional search engines and AI answer engines coexist.

Two kinds of "readers": search crawlers and AI crawlers

Googlebot and Bingbot have long rendered JavaScript through a headless Chromium before indexing, so a single-page app can still be indexed correctly, just more slowly than static HTML. The newer group of crawlers, such as GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot, works differently: most fetch only the raw HTML with a simple HTTP request, don't execute JavaScript, and use the content for one of two purposes: model training, or real-time retrieval to answer a question (RAG-style: the model finds a few relevant pages and synthesizes a cited answer). Google-Extended and Applebot-Extended often appear on the same lists, but they are robots.txt tokens rather than separate crawlers: they control whether pages already fetched by Googlebot or Applebot may be used for AI training.

The direct consequence: a page that only renders content via client-side JavaScript (e.g., a React SPA without server-side rendering) might still get indexed correctly by Googlebot after a few seconds of rendering, but for many AI crawlers, that same request returns an almost-empty HTML shell — nothing to "read" or cite. This is why server-side rendering or static export, already important for traditional SEO, becomes even more essential at this stage — not an advanced optimization, but a precondition for content to exist in the eyes of AI crawlers.
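A quick way to see what a non-rendering crawler gets is to count the words present in the raw HTML before any script runs. The sketch below is a minimal illustration using only Python's standard library; the two sample pages and the `visible_word_count` helper are assumptions made for this example, not the behavior of any specific crawler.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects the text present in raw HTML, ignoring script/style/template."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Number of words a crawler sees if it never executes JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


# Hypothetical client-rendered shell: content only appears after /app.js runs.
SPA_SHELL = """<!doctype html><html><head><title>Blog</title></head>
<body><div id="root"></div><script src="/app.js"></script></body></html>"""

# Hypothetical server-rendered page: content is in the first response.
SSR_PAGE = """<!doctype html><html><head><title>Blog</title></head>
<body><main><h1>Optimizing for AI search</h1>
<p>AI crawlers usually read the initial HTML response only.</p></main></body></html>"""

print(visible_word_count(SPA_SHELL), visible_word_count(SSR_PAGE))
```

To run the same check against a real page, fetch the URL without a browser (for example with `urllib.request`) and pass the decoded response body to `visible_word_count`. A count close to zero suggests the content depends on client-side rendering.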

robots.txt: deciding who gets to read your content

robots.txt remains the first control point — the difference is the list of user-agents to manage is now much longer. Alongside the familiar Googlebot/Bingbot, a website today may need to explicitly declare policies for:

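GPTBot, ChatGPT-User: OpenAI's agents. GPTBot collects training data; ChatGPT-User fetches pages on demand when ChatGPT answers a user's question
ClaudeBot, Claude-Web: Anthropic's crawlers (Claude-Web is an older token; Anthropic's current documentation lists ClaudeBot, Claude-User, and Claude-SearchBot)
PerplexityBot: Perplexity AI's crawler, used for real-time retrieval to answer questions
Google-Extended: a robots.txt token, not a separate crawler, that controls whether content fetched by Google's crawlers may be used for Gemini model training and grounding. It does not affect Google Search, and AI Overviews follow the regular Googlebot and snippet controls
CCBot, Bytespider, Amazonbot: crawlers gathering data for Common Crawl, ByteDance, and Amazon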

Allowing or disallowing each bot isn't purely a technical decision — it's a business decision. Allowing means your content can appear as a citation in AI-generated answers, increasing discoverability through a new channel — but it can also be used as free training data, and users may never click through to the original page (zero-click). Blocking protects the content but means the page disappears from that channel entirely.

There's no universally "correct" configuration for every website. A technical blog that wants to be cited as a reference source should generally stay open to retrieval crawlers (PerplexityBot, ChatGPT-User); a product with proprietary content that needs protection might block training-focused crawlers (GPTBot, CCBot) while staying open to retrieval ones. This decision should be explicit in robots.txt and reviewed periodically — not left at a default.
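As an illustration of the "block training, allow retrieval" policy described above, the sketch below embeds a sample robots.txt in a Python string and checks it with the standard library's `urllib.robotparser`. The grouping of bots, the `/admin/` path, and the `example.com` URLs are assumptions chosen for the example, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Sample policy (hypothetical): training crawlers blocked, retrieval allowed.
ROBOTS_TXT = """\
# Training-focused crawlers: blocked
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Retrieval crawlers: allowed
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

# Everyone else, including Googlebot and Bingbot
User-agent: *
Disallow: /admin/
"""


def policy_for(robots_txt: str, agents, url: str) -> dict:
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "CCBot", "ChatGPT-User", "PerplexityBot", "Googlebot"]
for agent, allowed in policy_for(
    ROBOTS_TXT, AGENTS, "https://example.com/blog/post"
).items():
    print(f"{agent:15} {'allow' if allowed else 'block'}")
```

Python's parser applies the rules of a group in file order, while Google documents a most-specific-match rule, so the two can disagree on groups that mix overlapping Allow and Disallow lines. Keeping each group simple, as above, avoids that ambiguity. robots.txt is also advisory: it states a policy, and compliance depends on the crawler.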

A new convention taking shape is llms.txt — a Markdown file at the domain root, similar to sitemap.xml but written for LLMs: it summarizes the site's structure, lists important pages with short descriptions, and helps a model quickly understand "what this page is about" without crawling the entire site. This is still a proposal not officially committed to by the major AI providers, but the implementation cost is essentially zero — and if it becomes a widely adopted standard, sites that already have it will have a head start.
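A minimal sketch of generating such a file is shown below. It assumes the layout described in the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing Markdown link lists); the site name, URLs, and descriptions are placeholders invented for the example.

```python
def build_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render an llms.txt document.

    `sections` maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in pages:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content: replace with the site's real pages.
EXAMPLE = build_llms_txt(
    "Example Site",
    "Engineering blog and service pages for a software studio.",
    {
        "Blog": [
            (
                "SEO and AI search",
                "https://example.com/blog/seo-ai-search",
                "How AI crawlers read pages and what to change.",
            ),
        ],
        "Services": [
            (
                "Development",
                "https://example.com/services/development",
                "Scope and process of web development work.",
            ),
        ],
    },
)
print(EXAMPLE)
```

The output would be served as a static file at `/llms.txt`, so it can be regenerated at build time alongside sitemap.xml.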

Structured data — the common language between traditional SEO and AI

JSON-LD following schema.org has long helped Google show rich results: star ratings, prices, and breadcrumbs right on the SERP. It is at least as useful for AI answer engines: structured data is directly machine-readable, so a system that parses it can pull the exact author name, publish date, breadcrumb, or list of Q&A pairs straight from a JSON-LD block instead of inferring them from a complex piece of HTML.

Four schema types are most valuable for a content/services website:

Organization: the business's name, logo, and official URL, helping both search engines and AI correctly identify the "entity" behind the content
BreadcrumbList: navigation structure, helping a model understand where a page sits within the site and what topic it relates to
Article/BlogPosting: headline, author, publish date, and modification date, the fields an answer engine needs to attribute a source and judge how current it is
FAQPage: for Q&A-style content, explicit question-and-answer pairs are easy to extract and quote, although Google now shows FAQ rich results only for a limited set of authoritative sites
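As a concrete illustration of the Article/BlogPosting type, the sketch below builds a JSON-LD block with Python's `json` module and wraps it in the script tag that goes into the page's HTML. The function name and all argument values are placeholders, and the property set is a small assumed subset of what schema.org allows.

```python
import json


def blog_posting_jsonld(headline, author_name, date_published,
                        date_modified, url, publisher_name) -> str:
    """Return a <script type="application/ld+json"> block for a blog post."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "author": {"@type": "Organization", "name": author_name},
        "datePublished": date_published,  # ISO 8601 date
        "dateModified": date_modified,    # ISO 8601 date
        "mainEntityOfPage": url,
        "publisher": {"@type": "Organization", "name": publisher_name},
    }
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "<" so a value containing "</script>" cannot close the tag early.
    payload = payload.replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only.
print(blog_posting_jsonld(
    "Example post title",
    "Example Engineering Team",
    "2026-01-15",
    "2026-01-20",
    "https://example.com/blog/example-post",
    "Example Co",
))
```

Generating the block from the same data that renders the visible page (title, byline, dates) is one way to keep the two from drifting apart.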

One often-overlooked point: structured data must match the content visible on the page. A JSON-LD block declaring information that doesn't appear on the page (or vice versa) not only violates Google's guidelines, but also creates misleading data for AI — risking being cited with incorrect information about yourself.
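This kind of mismatch can be caught automatically. The sketch below is a simplified check, assuming each JSON-LD block is a single object with `headline` and `author.name` properties: it extracts those values from a page and reports any that do not appear in the body text. The sample page and the `missing_from_page` helper are invented for the example; real pages may use arrays or `@graph`, which this sketch ignores.

```python
import json
from html.parser import HTMLParser


class PageScan(HTMLParser):
    """Separates JSON-LD blocks from the text a reader can see."""

    def __init__(self):
        super().__init__()
        self._mode = "text"   # "text", "jsonld", or "skip"
        self._buffer = []
        self.jsonld = []      # parsed JSON-LD objects
        self.text = []        # visible text chunks

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            is_jsonld = dict(attrs).get("type") == "application/ld+json"
            self._mode = "jsonld" if is_jsonld else "skip"
            self._buffer = []
        elif tag in ("style", "title"):
            self._mode = "skip"

    def handle_endtag(self, tag):
        if tag == "script" and self._mode == "jsonld":
            block = json.loads("".join(self._buffer))
            if isinstance(block, dict):
                self.jsonld.append(block)
        if tag in ("script", "style", "title"):
            self._mode = "text"

    def handle_data(self, data):
        if self._mode == "jsonld":
            self._buffer.append(data)
        elif self._mode == "text":
            self.text.append(data)


def missing_from_page(html: str) -> list:
    """List JSON-LD fields whose values are absent from the visible text."""
    scan = PageScan()
    scan.feed(html)
    scan.close()
    visible = " ".join(" ".join(scan.text).split()).lower()
    missing = []
    for block in scan.jsonld:
        claims = {"headline": block.get("headline")}
        author = block.get("author")
        if isinstance(author, dict):
            claims["author.name"] = author.get("name")
        for field, value in claims.items():
            if isinstance(value, str) and value.lower() not in visible:
                missing.append(field)
    return missing


# Hypothetical page: the JSON-LD author differs from the visible byline.
PAGE = """<html><head><title>Post</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting",
 "headline": "Optimizing for AI search",
 "author": {"@type": "Person", "name": "Jane Doe"}}
</script></head>
<body><h1>Optimizing for AI search</h1><p>By the Example Team</p></body></html>"""

print(missing_from_page(PAGE))
```

A check like this fits naturally into a build or CI step, where a non-empty result can fail the build before the page is published.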

Writing "citable" content — for people and machines alike

AI answer engines tend to quote individual sentences or passages that make sense standing alone — not an entire long article. A few writing principles that improve the odds of being cited correctly:

Answer the question directly in the first 1-2 sentences of each section, then explain in detail afterward — don't bury the main answer at the end after a long lead-in
Phrase headings as questions or phrases a user might ask an AI (e.g., "when should you use X", "how does X differ from Y"), so that each heading matches the way a model splits an answer into parts
Each paragraph should carry exactly one idea or claim — paragraphs mixing multiple ideas are hard to quote intact without losing context
Bullet lists and comparison tables are easier for a model to parse into discrete criteria than continuous prose — especially useful for decision-criteria content

This isn't writing for machines instead of people — an article with clear structure that answers the question directly is always easier for a human reader too. AI answer engines simply reward content that was already written well in that sense.

Performance and Core Web Vitals are still the foundation

AI crawlers are less sensitive to render time than traditional search crawlers, but that doesn't mean performance no longer matters. Three reasons it remains foundational:

Core Web Vitals (LCP, INP, CLS) still feed into Google's ranking systems for traditional SERPs, the channel that still delivers the majority of traffic in the short and medium term
When a user clicks through from an AI-answer citation to read more, a slow-loading page or a jumpy layout (high CLS) creates a bad first impression at exactly the moment conversion potential is highest
Crawl budget — a site that responds slowly or errors frequently gets crawled less often and less deeply by every type of crawler, including AI crawlers

Static export combined with edge hosting (Cloudflare Pages, Vercel Edge, Netlify Edge) delivers near-instant time-to-first-byte because there's no cold start or database query at request time, a benefit that applies equally to real users, Googlebot, and AI crawlers. In an internal dev portal we built for an engineering team, every pull request gets its own deploy preview with Lighthouse CI running automatically against it, so performance regressions are caught at review time rather than after they've shipped to production.
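To make the review-time gate concrete, the sketch below rates a set of Core Web Vitals values against Google's published thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25), which are assessed at the 75th percentile of page loads. It is a generic illustration, not Lighthouse CI's own configuration; the function names and sample numbers are assumptions.

```python
# (good_max, poor_min) per metric, from Google's published thresholds.
THRESHOLDS = {
    "lcp_ms": (2500, 4000),
    "inp_ms": (200, 500),
    "cls": (0.1, 0.25),
}


def rate(metric: str, value: float) -> str:
    """Classify one metric value as good, needs improvement, or poor."""
    good_max, poor_min = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value <= poor_min:
        return "needs improvement"
    return "poor"


def failing_metrics(p75: dict) -> list:
    """Names of metrics (sorted) whose 75th-percentile value is not 'good'."""
    return sorted(name for name, value in p75.items() if rate(name, value) != "good")


# Hypothetical field data for one page.
sample = {"lcp_ms": 2300, "inp_ms": 350, "cls": 0.3}
for name, value in sample.items():
    print(name, value, rate(name, value))
print("failing:", failing_metrics(sample))
```

A CI job could fail the pull request when `failing_metrics` returns a non-empty list. Note that INP needs real user interactions, so lab runs typically rely on a proxy metric and field data remains the reference.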

Trust signals in the eyes of AI — not so different from E-E-A-T

Google's quality guidelines use the E-E-A-T framework (Experience, Expertise, Authoritativeness, Trustworthiness) to assess content credibility. AI answer engines, when deciding which sources to cite for an answer, appear to rely on very similar signals, since these systems are themselves judged on whether their answers are accurate and well-sourced:

Clear, consistent author and organization information across pages — Organization schema, an About page, real contact details
Content with clear publish and modification dates — especially important for technical topics, where outdated information can be wrong
Being cited or linked to from other sources — a kind of consensus that both search ranking and AI retrieval treat as a quality signal

There's no shortcut for this section — it's the cumulative result of consistently publishing accurate, well-structured content over time.

Minimum technical checklist

Server-side rendering or static export — content must be present in the initial HTML response, not dependent on JavaScript running afterward
sitemap.xml and robots.txt kept up to date, with explicit policies for each AI crawler group
JSON-LD for Organization, BreadcrumbList, and the relevant content type (Article, FAQPage, Product...) on every applicable page
Heading hierarchy (h1 → h2 → h3) that reflects the actual logical structure, with no skipped levels or headings used purely for styling
Core Web Vitals within good thresholds — measured with real-world data (CrUX, Search Console), not just local Lighthouse runs
A clear canonical URL for every page, avoiding duplicate content that confuses both traditional indexing and retrieval
(Optional, low-cost) llms.txt — a site-structure summary for LLMs, worth tracking as this convention gains wider adoption among major providers
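Two of the checklist items, heading hierarchy and canonical URL, are easy to verify mechanically. The sketch below is a minimal audit using Python's standard library; the `audit` helper and the sample pages are assumptions for illustration, and it checks only for skipped heading levels and the number of canonical links.

```python
import re
from html.parser import HTMLParser


class HeadingAudit(HTMLParser):
    """Records heading levels in document order and any canonical links."""

    def __init__(self):
        super().__init__()
        self.levels = []
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.levels.append(int(tag[1]))
        elif tag == "link":
            attr = dict(attrs)
            if (attr.get("rel") or "").lower() == "canonical" and attr.get("href"):
                self.canonicals.append(attr["href"])


def audit(html: str) -> list:
    """Return a list of human-readable problems found in the page."""
    parser = HeadingAudit()
    parser.feed(html)
    parser.close()
    problems = []
    # A heading may go deeper by at most one level at a time.
    for prev, cur in zip(parser.levels, parser.levels[1:]):
        if cur > prev + 1:
            problems.append(f"h{prev} followed by h{cur} skips a level")
    if len(parser.canonicals) != 1:
        problems.append(
            f"expected one canonical link, found {len(parser.canonicals)}"
        )
    return problems


# Hypothetical pages for illustration.
GOOD = ('<head><link rel="canonical" href="https://example.com/post"></head>'
        "<h1>A</h1><h2>B</h2><h3>C</h3><h2>D</h2>")
BAD = "<h1>A</h1><h3>C</h3>"

print(audit(GOOD))
print(audit(BAD))
```

Moving back up the hierarchy (h3 to h2) is valid and is not flagged; only jumps that go deeper by more than one level are reported.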

Conclusion

Optimizing for AI answer engines isn't a separate SEO category requiring a whole new toolkit — most of the foundation (SSR/static export, structured data, performance, well-structured content) is exactly what good technical SEO has always required. The difference is that robots.txt now needs a more deliberate decision about this new group of crawlers, and content needs to be written so each passage can stand on its own and still be correct. This is the technical work we build in from day one for every website in the Development layer at KonexForge — not a separate audit done after the site is already in production.
