
What is Robots.txt?

Robots.txt is a plain-text file at the root of a website that tells search engine crawlers which paths they may and may not crawl — controlling how search engines access and discover content on the site.

Why It Matters

Not everything on a website should be crawled by search engines. Admin pages, internal search results, staging environments, duplicate filtered views, and user account pages all waste crawl budget if Google tries to access them. Robots.txt is the first file a search engine reads when it visits a site — it sets the rules for what the crawler is allowed to do.

For large sites, robots.txt is a critical crawl budget management tool. An ecommerce site with thousands of filter combinations can generate millions of crawlable URLs from parameter-based pages. Without robots.txt blocking these, Google spends its crawl budget on worthless URL variations instead of the actual product and category pages that should be indexed.
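As a sketch, parameter-based filter pages can be blocked with wildcard rules. The parameter names below (?color=, ?sort=) are hypothetical; Google supports the * wildcard in paths, though not every crawler does:

```
User-agent: *
# Block hypothetical faceted-navigation parameters
Disallow: /*?color=
Disallow: /*?sort=
Disallow: /*&sort=
```

Rules like these should be checked against the site's real URL patterns before use, since one overly broad wildcard can block legitimate pages.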

How It Works

Robots.txt uses simple directives:

  1. User-agent — Specifies which crawler the rules apply to. User-agent: * applies to all crawlers. User-agent: Googlebot applies only to Google. Different rules can target different crawlers.
  2. Disallow — Prevents crawling of specified paths. Disallow: /admin/ blocks the entire admin directory. Disallow: /search? blocks internal search result pages. Compliant crawlers will not request these URLs.
  3. Allow — Permits crawling of specific paths within a disallowed directory. Allow: /admin/public/ within a disallowed /admin/ lets Google access public admin pages.
  4. Sitemap — Points to the XML sitemap location. Sitemap: https://example.com/sitemap.xml ensures crawlers can find the sitemap regardless of other directives.
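Put together, a minimal robots.txt using all four directives might look like this (the domain and paths are illustrative):

```
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /search?

Sitemap: https://example.com/sitemap.xml
```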

Common Mistakes

Using robots.txt to hide pages from Google's index. Robots.txt prevents crawling, not indexing. If other sites link to a disallowed page, Google may index the URL anyway — it just cannot see the content, resulting in a blank listing. To prevent indexing, use a noindex meta tag or X-Robots-Tag header. The page must be crawlable for Google to see the noindex directive.
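For an indexable-but-excluded page, the directive goes on the page itself rather than in robots.txt:

```html
<!-- In the page's <head>: tell crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the HTTP response header `X-Robots-Tag: noindex`, set in the server configuration.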

Another common mistake is a misconfigured robots.txt that blocks important content. A single wrong Disallow rule can prevent Google from crawling the entire site, and CSS or JavaScript files blocked by robots.txt prevent Google from rendering pages correctly. Always verify robots.txt changes with a validator — such as the robots.txt report in Google Search Console — before deploying.
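One lightweight way to sanity-check rules before deploying is Python's standard-library robots.txt parser. This is a sketch with hypothetical rules; note that urllib.robotparser evaluates rules in file order, whereas Google uses longest-path matching, so results can differ for overlapping Allow/Disallow rules:

```python
from urllib import robotparser

# Hypothetical rules mirroring the directives above
RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Spot-check representative URLs before deploying the file
print(rp.can_fetch("*", "https://example.com/products/widget"))    # True: not blocked
print(rp.can_fetch("*", "https://example.com/admin/settings"))     # False: under /admin/
print(rp.can_fetch("*", "https://example.com/admin/public/help"))  # True: Allow matches first
print(rp.site_maps())  # Sitemap URLs declared in the file (Python 3.8+)
```

Running checks like these against a list of the site's most important URLs catches accidental blocking before it reaches production.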

How I Use This

My SEO automation audits robots.txt configuration — checking for overly broad disallow rules, blocked resources that prevent rendering, and missing sitemap references. The advanced SEO audit cross-references robots.txt rules against the site's actual URL structure to identify crawl budget waste and unintentional blocking.
