Technical SEO

What Is an XML Sitemap?

An XML sitemap is a file that lists all the important URLs on a website in a format search engines can read — helping Google discover, crawl, and understand the site's structure, especially for large sites, new sites, or pages with limited internal linking.

Why It Matters

Google discovers pages by following links. If every page on your site is linked from at least one other page, Google can theoretically find everything. But in practice, some pages have weak internal linking, some are new and not yet linked widely, and some are buried deep in the site hierarchy. An XML sitemap provides a direct route — a complete list of URLs that Google should know about.

For large sites, sitemaps are essential. An ecommerce store with 10,000 products needs Google to know about all of them. A news site publishing 50 articles per day needs Google to discover them quickly. Without a sitemap, Google relies entirely on crawling links, which may miss pages or discover them slowly.

How It Works

XML sitemaps follow the Sitemaps protocol:

  1. URL list — The sitemap contains <url> entries, each with a <loc> tag specifying the full, absolute URL. Each entry may optionally include <lastmod> (last modification date), <changefreq> (how often the page changes), and <priority> (relative importance) — though Google has said it ignores <changefreq> and <priority>, relying on <lastmod> instead.
  2. Sitemap index — Large sites split URLs across multiple sitemap files (each limited to 50,000 URLs and 50 MB uncompressed) and list those files in a sitemap index. The index is the single file submitted to Google Search Console.
  3. Auto-generation — Most CMS platforms and static site generators create sitemaps automatically. The sitemap should update when pages are added, removed, or significantly modified.
  4. Submission — The sitemap URL is submitted to Google Search Console and referenced in robots.txt. Google then uses it alongside its normal crawling to discover and prioritise URLs.
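The pieces above look like this in practice — a minimal sitemap with one URL entry, and a sitemap index referencing it (filenames and URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
</urlset>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2024-05-01</lastmod>
  </sitemap>
</sitemapindex>
```

The robots.txt reference is a single line, e.g. `Sitemap: https://example.com/sitemap-index.xml`.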

Common Mistakes

The first is including URLs in the sitemap that should not be indexed — 404 pages, redirected URLs, noindex pages, duplicate URLs without proper canonicals. The sitemap should be a curated list of indexable, canonical URLs. Google treats sitemap inclusion as a signal of importance, so polluting it with low-quality URLs undermines that signal.
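A minimal sketch of that curation step, assuming each page's status, noindex flag, and canonical URL are already known from a crawl (the field names here are hypothetical, not from any specific tool):

```python
# Hypothetical filter: keep only sitemap-worthy URLs.
# Assumes crawl data is available as dicts with illustrative keys.

def sitemap_eligible(pages):
    """Return URLs that are 200-status, indexable, and self-canonical."""
    eligible = []
    for page in pages:
        if page["status"] != 200:                 # drop 404s and redirects
            continue
        if page["noindex"]:                       # drop noindexed pages
            continue
        if page["canonical"] != page["url"]:      # drop non-canonical duplicates
            continue
        eligible.append(page["url"])
    return eligible

pages = [
    {"url": "https://example.com/", "status": 200, "noindex": False,
     "canonical": "https://example.com/"},
    {"url": "https://example.com/old", "status": 301, "noindex": False,
     "canonical": "https://example.com/new"},
    {"url": "https://example.com/draft", "status": 200, "noindex": True,
     "canonical": "https://example.com/draft"},
]

print(sitemap_eligible(pages))  # only the homepage survives
```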

The other mistake is neglecting <lastmod> dates or setting them all to the current date. Google uses lastmod to prioritise crawling — recently modified pages get crawled sooner. If every page shows today's date, the signal is meaningless. Accurate lastmod dates help Google allocate crawl budget efficiently.
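One way to keep lastmod honest is to derive it from each page's source-file modification time rather than the build timestamp. A small sketch, assuming the site's pages map to files on disk (the helper name is illustrative):

```python
# Sketch: emit <lastmod> values from file mtimes instead of "now".
import os
import tempfile
from datetime import datetime, timezone

def lastmod_for(path):
    """Return the file's mtime as a W3C datetime string for <lastmod>."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%S+00:00"
    )

# Demo on a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile() as f:
    stamp = lastmod_for(f.name)
print(stamp)
```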

How I Use This

My SEO automation audits sitemap quality — checking for excluded important pages, included noindex pages, broken URLs, and inaccurate lastmod dates. For Astro-built sites like this one, the sitemap generates automatically at build time with only indexable pages included. The advanced SEO audit cross-references the sitemap against the crawl to identify discrepancies.
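The cross-reference step amounts to set arithmetic over the two URL lists — a sketch with illustrative URLs:

```python
# Compare sitemap URLs against URLs found by crawling internal links.
sitemap_urls = {"https://example.com/", "https://example.com/pricing",
                "https://example.com/retired"}
crawled_urls = {"https://example.com/", "https://example.com/pricing",
                "https://example.com/blog/new-post"}

# In the sitemap but never reached by crawling: possibly orphaned or stale.
orphaned = sitemap_urls - crawled_urls
# Crawlable but absent from the sitemap: discovery relies on links alone.
missing = crawled_urls - sitemap_urls

print(sorted(orphaned))  # ['https://example.com/retired']
print(sorted(missing))   # ['https://example.com/blog/new-post']
```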
