Technical SEO audit checklist for large websites in 2026
Running a technical SEO audit on a large website requires more than a crawl tool and a list of errors. At 100,000 pages or more, the methodology has to change – segmented crawls, log file analysis, rendering diagnostics, and AI crawler checks all become essential steps rather than optional extras. This checklist covers each phase in order, with the specific checks that matter most at scale.
Key Takeaways
Large-site audits require six distinct phases – skipping any one of them produces an incomplete picture
Log file analysis and JavaScript rendering are the two phases most commonly missed, and the most likely to surface critical issues
Every check should be evaluated at template level, not URL level, for sites above 100,000 pages
AI crawler readiness is now a required checklist item, not an advanced add-on
Before you start: what to gather
A technical SEO audit is only as thorough as the access behind it. Before running a single crawl, confirm you have the following:
Google Search Console access – full property, not read-only
Server log files – minimum 30 days, ideally 90; Apache, Nginx, or CDN-level logs
CMS and front-end stack documentation – is the site server-rendered, client-rendered, or hybrid?
Development environment access – staging URL, robots.txt confirmation, and release cycle details
Any previous audit reports – to baseline against prior findings
Agencies that begin a large-site audit without log file access are working with an incomplete data set from the start.
The audit checklist
#1 Crawl setup and architecture
Configure the crawl correctly before running it. On a large site, crawl configuration determines the quality of everything downstream.
Segment the crawl by page type – product pages, category pages, blog posts, and supporting pages should be treated as separate populations from the outset
Set crawl depth limits by section – avoid over-crawling low-value URL spaces such as filtered navigation or internal search
Configure crawl rendering mode correctly – do not default to rendered crawl; raw HTML and rendered output should be captured separately
Confirm robots.txt is not blocking key sections – check at the start, not after the crawl completes
Verify XML sitemap accuracy – compare sitemap contents against crawl output; orphaned pages and missing URLs are both significant findings
Map internal link architecture – identify crawl depth by page type and flag pages more than four clicks from the homepage
Identify redirect chains – flag any chains of two or more hops and pages receiving significant internal links through redirects (a detection sketch follows this list)
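The sketch below shows one way to surface redirect chains from a crawler export. It assumes a CSV with address and redirect_url columns – the file name and column names are illustrative, so adjust them to whatever your crawl tool actually exports.

```python
import csv

def load_redirects(path):
    """Load a crawl export mapping each redirecting URL to its target.
    Column names 'address' and 'redirect_url' are illustrative."""
    redirects = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("redirect_url"):
                redirects[row["address"]] = row["redirect_url"]
    return redirects

def resolve_chain(url, redirects, max_hops=10):
    """Follow redirects from a URL and return the full hop list.
    Stops on loops or after max_hops to avoid infinite cycles."""
    chain, seen = [url], {url}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in seen:  # redirect loop detected
            break
        seen.add(nxt)
    return chain

redirects = load_redirects("crawl_redirects.csv")
for url in redirects:
    chain = resolve_chain(url, redirects)
    if len(chain) > 2:  # two or more hops = a chain worth flattening
        print(f"{len(chain) - 1} hops: {' -> '.join(chain)}")
```

Chains that revisit a URL are loops and should be treated as the highest-severity variant of this finding.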
#2 Log file analysis
Log files reveal what search engines are actually doing on the site, which is often significantly different from what a crawl tool shows.
Segment bot traffic – isolate Googlebot, Bingbot, and AI crawlers (GPTBot, ClaudeBot, Google-Extended) from human traffic (a log-parsing sketch follows this list)
Map crawl frequency by page type – identify which templates are over-crawled and which are being ignored
Cross-reference crawl frequency against organic performance – high-value pages with low crawl frequency are a priority finding
Identify crawl budget waste – parameter variants, session IDs, and faceted navigation pages consuming crawl budget without ranking value
Review response code distribution as seen by bots – 5xx errors in particular may not surface in crawl tools but will appear in log data
Flag crawl anomalies – sudden drops or spikes in crawl frequency often indicate an infrastructure or configuration change
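As a starting point for bot segmentation, a sketch along these lines parses a combined-format access log and breaks hits down by bot, status code, and path. It assumes Apache/Nginx combined log format – CDN logs will need a different parser – and user-agent matching alone can be spoofed, so production analysis should verify Googlebot via reverse DNS.

```python
import re
from collections import Counter

# Combined log format: IP - - [time] "GET /path HTTP/1.1" status bytes "referer" "user-agent"
LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

BOTS = ["Googlebot", "bingbot", "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

hits = Counter()   # (bot, status) pairs
paths = Counter()  # (bot, path) pairs

with open("access.log") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        path, status, ua = m.groups()
        for bot in BOTS:
            # NOTE: UA substring matching can be spoofed; verify
            # Googlebot via reverse DNS before drawing conclusions
            if bot in ua:
                hits[(bot, status)] += 1
                paths[(bot, path)] += 1

for (bot, status), n in sorted(hits.items()):
    print(f"{bot:16} {status} {n}")
print("\nTop crawled paths:")
for (bot, path), n in paths.most_common(10):
    print(f"{bot:16} {n:6} {path}")
```

The status-code breakdown is where bot-only 5xx errors surface, and the top-paths list is the fastest route to spotting crawl budget waste on parameter variants.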
#3 Indexation and canonicalisation
Indexation issues at scale are almost always template-level problems. Evaluate findings by page type, not individual URL.
Reconcile index coverage in Search Console against crawl data – identify pages that should be indexed but are not, and pages that are indexed but should not be
Audit canonical tag implementation across all page types – check for canonical conflicts, self-referencing errors, and canonical chains (a spot-check sketch follows this list)
Review noindex usage – confirm that noindex is intentional on every page type where it appears
Check hreflang implementation for multi-region sites – validate return tags, language code accuracy, and canonical alignment
Identify thin and duplicate content at template level – flag page types where content duplication is structural rather than incidental
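A lightweight canonical spot check can be scripted against a sample of URLs per template. The sketch below assumes requests and BeautifulSoup are available; sample_urls and the example.com addresses are placeholders for a real per-template sample.

```python
import requests
from bs4 import BeautifulSoup

def canonical_of(url, timeout=10):
    """Fetch a URL and return its canonical href, or None if absent."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "audit-script/1.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    link = soup.find("link", rel="canonical")
    return link["href"] if link and link.has_attr("href") else None

# Illustrative sample: a handful of URLs per template, not the full crawl
sample_urls = {
    "product":  ["https://example.com/p/123", "https://example.com/p/456"],
    "category": ["https://example.com/c/shoes"],
}

for template, urls in sample_urls.items():
    for url in urls:
        canon = canonical_of(url)
        if canon is None:
            print(f"[{template}] MISSING canonical: {url}")
        elif canon != url:
            print(f"[{template}] points elsewhere: {url} -> {canon}")
```

Any template where every sampled URL canonicalises elsewhere is a template-level finding, not a URL-level one.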
#4 JavaScript rendering diagnostics
Essential for any site using a JavaScript framework. Do not assume that a rendered crawl is equivalent to what Googlebot sees.
Capture raw HTML for a representative sample across each major page type
Render the same URLs via headless browser and compare output – see the comparison sketch after this list
Identify content present in rendered DOM but absent from raw HTML – this content is at risk of not being indexed
Check internal links specifically in raw HTML – links rendered only via JavaScript may not be followed
Assess structured data presence in raw HTML versus rendered output
Flag pages where JavaScript execution latency may cause Googlebot to abandon rendering before completion
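One way to run the raw-versus-rendered comparison is to diff the links extracted from each version of the page. The sketch below uses requests for the raw fetch and Playwright for rendering – an approximation of Googlebot's rendering pass, not a replica – and the example URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def links_in(html):
    """Extract the set of link targets from an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}

def js_only_links(url):
    # Raw HTML: what a non-rendering crawler sees on first fetch
    raw = requests.get(url, timeout=15).text

    # Rendered DOM: approximates (does not replicate) Googlebot rendering
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        rendered = page.content()
        browser.close()

    return links_in(rendered) - links_in(raw)

for href in sorted(js_only_links("https://example.com/category/shoes")):
    print("JS-only link:", href)
```

The same diff approach applies to body content and structured data: anything present only in the rendered output is the at-risk population.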
#5 Core Web Vitals and performance
Performance problems at scale follow templates, so measure field data per page type rather than relying on homepage tests.
Pull CrUX field data from Search Console by page type – do not rely on lab data alone for large sites (an API query sketch follows this list)
Run Lighthouse or PageSpeed Insights on representative URLs across each template
Assess LCP, CLS, and INP at template level – homepage-only testing is insufficient
Attribute performance issues to root causes – third-party scripts, unoptimised images, render-blocking resources, or server response time
Identify page types where poor CWV scores correlate with weaker organic performance
Estimate implementation effort for each performance fix – separate quick wins from structural changes
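Field data can also be pulled programmatically from the public CrUX API rather than read page by page in Search Console. A minimal sketch, assuming a CrUX API key from Google Cloud; the template-to-URL mapping is illustrative, and CrUX returns no record for URLs with insufficient traffic.

```python
import requests

API_KEY = "YOUR_CRUX_API_KEY"  # obtained from the Google Cloud console
ENDPOINT = f"https://chromeuxreport.googleapis.com/v1/records:queryRecord?key={API_KEY}"

CWV = ("largest_contentful_paint",
       "cumulative_layout_shift",
       "interaction_to_next_paint")

def crux_p75(url, form_factor="PHONE"):
    """Return p75 field values for LCP, CLS, and INP for one URL,
    or None if CrUX has no record (common on low-traffic pages)."""
    resp = requests.post(ENDPOINT, json={"url": url, "formFactor": form_factor})
    if resp.status_code != 200:
        return None
    metrics = resp.json()["record"]["metrics"]
    return {name: m["percentiles"]["p75"]
            for name, m in metrics.items() if name in CWV}

# One representative URL per template, not the homepage alone
for template, url in {"product": "https://example.com/p/123",
                      "category": "https://example.com/c/shoes"}.items():
    print(template, crux_p75(url))
```

Running this across one representative URL per template gives the template-level view the checklist calls for, at a fraction of the manual effort.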
#6 AI crawler readiness
As AI Overviews, ChatGPT, and Perplexity become significant traffic sources, AI crawler readiness belongs in every large-site audit. Agencies such as SUSO Digital now include this as a standard audit phase rather than a separate engagement.
Confirm robots.txt is not blocking AI crawlers – check for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended specifically (a check script follows this list)
Review content structure for direct answerability – key pages should answer their primary query in the first two sentences, not after an introduction
Check for FAQ schema and definition blocks – these are high extraction surfaces for LLMs generating direct answers
Assess entity consistency – brand name, services, and key topics should be referenced consistently across all indexed pages
Review off-site citation volume – LLMs weight authoritative external references when deciding whether to cite a source
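The robots.txt check for AI crawlers is straightforward to script with Python's standard-library robots parser. The sketch below tests a handful of representative URLs – the URL list is illustrative, and note that Google-Extended is a robots.txt control token rather than a distinct fetching crawler.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Representative URLs per template; extend for a real audit
test_urls = [
    "https://example.com/",
    "https://example.com/p/123",
    "https://example.com/blog/some-guide",
]

for bot in AI_CRAWLERS:
    blocked = [u for u in test_urls if not rp.can_fetch(bot, u)]
    status = "BLOCKED on " + ", ".join(blocked) if blocked else "allowed"
    print(f"{bot:16} {status}")
```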
Prioritising findings at scale
A large-site audit will produce hundreds of findings. The prioritisation framework matters as much as the checklist itself. Score every issue across two axes before building the roadmap:
Impact: How significantly does this issue affect crawlability, indexation, ranking, or AI citation? Issues affecting entire page templates score higher than isolated URL-level problems.
Implementation effort: How much development resource is required to fix? Separate configuration changes (low effort) from template-level structural changes (high effort) from platform-level rewrites (very high effort).
Prioritise high-impact, low-effort fixes first. High-impact, high-effort fixes belong in the roadmap with a clear business case attached. Low-impact findings should be documented but not allowed to displace priority work.
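A minimal sketch of that scoring model, assuming a 1–5 scale on each axis; the example findings and bucket thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    name: str
    impact: int  # 1-5: template-wide issues score high, isolated URLs low
    effort: int  # 1-5: config change = 1, template rework = 3, platform rewrite = 5

findings = [
    Finding("Canonical conflicts on product template", impact=5, effort=2),
    Finding("Redirect chains in main navigation", impact=4, effort=1),
    Finding("Internal links rendered only via JavaScript", impact=5, effort=4),
    Finding("Missing alt text on archive pages", impact=1, effort=2),
]

# Quick wins first: sort by impact descending, then effort ascending
roadmap = sorted(findings, key=lambda f: (-f.impact, f.effort))
for f in roadmap:
    bucket = ("quick win" if f.impact >= 4 and f.effort <= 2
              else "roadmap" if f.impact >= 4
              else "backlog")
    print(f"[{bucket:9}] impact={f.impact} effort={f.effort}  {f.name}")
```

The bucket labels map directly onto the guidance above: quick wins go into the current sprint, roadmap items get a business case, and backlog items are documented without displacing priority work.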
Turning the checklist into results
A completed checklist produces a findings inventory. What converts that inventory into organic and AI search improvement is a prioritised roadmap, developer-ready specifications, and a re-audit once priority fixes are deployed.
The most common failure point in technical SEO is not the audit itself – it is the gap between findings and implementation. Treat the checklist as the start of the process, not the end of it.
FAQs
How long does a large-site technical SEO audit take?
For a site with 100,000–500,000 pages, a thorough audit covering all six phases takes six to ten weeks. Compressing this timeline usually means skipping log file analysis or rendering diagnostics – the two phases most likely to surface critical issues.
Which phase do agencies most commonly skip?
Log file analysis is the most frequently omitted phase, usually because the agency lacks the tooling or access to conduct it properly. JavaScript rendering diagnostics are the second most commonly skipped. Both are non-negotiable for large sites.
Do I need a technical SEO audit if my site was audited last year?
Yes, if the site has undergone any significant change – a template update, URL restructure, CMS migration, or new market launch. For large sites without major changes, a lighter quarterly crawl health review can bridge the gap between full annual audits.
Should AI crawler readiness be its own audit, or part of the main technical audit?
Part of the main audit. The technical conditions required for AI crawlers to access and cite content overlap significantly with traditional technical SEO requirements – crawlability, content structure, and entity signals all apply to both. Treating GEO as a separate engagement usually means duplicating work that a well-structured technical audit already covers.