What Is an XML Sitemap?
An XML sitemap is a file that lists all the important URLs on a website in a format search engines can read — helping Google discover, crawl, and understand the site's structure, especially for large sites, new sites, or pages with limited internal linking.
Why It Matters
Google discovers pages by following links. If every page on your site is linked from at least one other page, Google can theoretically find everything. But in practice, some pages have weak internal linking, some are new and not yet linked widely, and some are buried deep in the site hierarchy. An XML sitemap provides a direct route — a complete list of URLs that Google should know about.
For large sites, sitemaps are essential. An ecommerce store with 10,000 products needs Google to know about all of them. A news site publishing 50 articles per day needs Google to discover them quickly. Without a sitemap, Google relies entirely on crawling links, which may miss pages or discover them slowly.
How It Works
XML sitemaps follow the Sitemaps protocol:
- URL list — The sitemap contains `<url>` entries, each with a `<loc>` tag specifying the full URL. Each entry can optionally include `<lastmod>` (last modification date), `<changefreq>` (how often the page changes), and `<priority>` (relative importance) — though Google has said it ignores `<changefreq>` and `<priority>`, and uses `<lastmod>` only when it is consistently accurate.
- Sitemap index — Large sites split URLs across multiple sitemaps (max 50,000 URLs or 50 MB uncompressed per file) and list them in a sitemap index file. The index is the file submitted to Google Search Console.
- Auto-generation — Most CMS platforms and static site generators create sitemaps automatically. The sitemap should update when pages are added, removed, or significantly modified.
- Submission — The sitemap URL is submitted to Google Search Console and referenced in robots.txt. Google then uses it alongside its normal crawling to discover and prioritise URLs.
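Putting the pieces above together, a minimal sitemap looks like this (the example.com URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/products/widget</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/launch-post</loc>
    <lastmod>2024-04-18</lastmod>
  </url>
</urlset>
```

A site that splits its URLs across multiple files lists them in a sitemap index instead:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

The robots.txt reference is a single line, e.g. `Sitemap: https://www.example.com/sitemap.xml`.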
Common Mistakes
The most common mistake is including URLs in the sitemap that should not be indexed — 404 pages, redirected URLs, noindex pages, and duplicate URLs without proper canonicals. The sitemap should be a curated list of indexable, canonical URLs. Google treats sitemap inclusion as a signal of importance, so polluting it with low-quality URLs undermines that signal.
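That curation step can be sketched as a simple pre-publication filter. This is an illustrative sketch, not a real API — the page records and their field names (`status`, `noindex`, `canonical`) are assumptions about what a crawler might hand you:

```python
def sitemap_candidates(pages):
    """Keep only indexable, canonical, live URLs for the sitemap.

    Each page is a dict with illustrative fields:
    url, status (HTTP status), noindex (robots meta), canonical (canonical URL).
    """
    keep = []
    for page in pages:
        if page["status"] != 200:
            continue  # drop 404s and redirected URLs
        if page["noindex"]:
            continue  # drop pages excluded from the index
        if page["canonical"] != page["url"]:
            continue  # drop duplicates pointing at another canonical
        keep.append(page["url"])
    return keep

pages = [
    {"url": "https://example.com/a", "status": 200, "noindex": False, "canonical": "https://example.com/a"},
    {"url": "https://example.com/old", "status": 301, "noindex": False, "canonical": "https://example.com/old"},
    {"url": "https://example.com/draft", "status": 200, "noindex": True, "canonical": "https://example.com/draft"},
    {"url": "https://example.com/b?ref=x", "status": 200, "noindex": False, "canonical": "https://example.com/b"},
]
print(sitemap_candidates(pages))  # only https://example.com/a survives
```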
The other mistake is neglecting <lastmod> dates or setting them all to the current date. Google uses lastmod to prioritise crawling — recently modified pages get crawled sooner. If every page shows today's date, the signal is meaningless. Accurate lastmod dates help Google allocate crawl budget efficiently.
How I Use This
My SEO automation audits sitemap quality — checking for excluded important pages, included noindex pages, broken URLs, and inaccurate lastmod dates. For Astro-built sites like this one, the sitemap generates automatically at build time with only indexable pages included. The advanced SEO audit cross-references the sitemap against the crawl to identify discrepancies.
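The cross-referencing step reduces to set arithmetic over two URL lists. A minimal sketch — the function name and report keys are my own, not part of any real audit tool:

```python
def sitemap_crawl_discrepancies(sitemap_urls, crawled_urls):
    """Compare the sitemap against a site crawl.

    Returns URLs listed in the sitemap that the crawl never reached
    (possible orphans or stale entries) and crawled URLs missing
    from the sitemap (possible coverage gaps)."""
    sitemap, crawled = set(sitemap_urls), set(crawled_urls)
    return {
        "in_sitemap_not_crawled": sorted(sitemap - crawled),
        "crawled_not_in_sitemap": sorted(crawled - sitemap),
    }

report = sitemap_crawl_discrepancies(
    ["https://example.com/", "https://example.com/pricing", "https://example.com/old-page"],
    ["https://example.com/", "https://example.com/pricing", "https://example.com/blog/"],
)
print(report["in_sitemap_not_crawled"])  # ['https://example.com/old-page']
print(report["crawled_not_in_sitemap"])  # ['https://example.com/blog/']
```

Each bucket suggests a different fix: the first points at URLs to remove from the sitemap (or to link internally), the second at pages the sitemap should probably include.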
Related Terms
Crawl Budget
Crawl budget is the number of pages a search engine will crawl on your site within a given timeframe — determined by your server's capacity and the perceived value of your content. Managing crawl budget ensures Google spends its limited crawling resources on the pages that matter.
Robots.txt
Robots.txt is a text file at the root of a website that tells search engine crawlers which pages or sections they are allowed or disallowed from crawling — controlling how search engines access and discover content on the site.
Technical SEO
Technical SEO is the foundation layer of search engine optimisation — the crawlability, indexability, site speed, and structural elements that determine whether search engines can find, understand, and rank your pages.