What is Robots.txt?
Robots.txt is a text file at the root of a website that tells search engine crawlers which pages or sections they may and may not crawl, controlling how search engines access and discover content on the site.
Why It Matters
Not everything on a website should be crawled by search engines. Admin pages, internal search results, staging environments, duplicate filtered views, and user account pages all waste crawl budget if Google tries to access them. Robots.txt is the first file a search engine reads when it visits a site — it sets the rules for what the crawler is allowed to do.
For large sites, robots.txt is a critical crawl budget management tool. An ecommerce site with thousands of filter combinations can generate millions of crawlable URLs from parameter-based pages. Without robots.txt blocking these, Google spends its crawl budget on worthless URL variations instead of the actual product and category pages that should be indexed.
How It Works
Robots.txt uses simple directives:
- User-agent — Specifies which crawler the rules apply to. `User-agent: *` applies to all crawlers; `User-agent: Googlebot` applies only to Google. Different rules can target different crawlers.
- Disallow — Prevents crawling of specified paths. `Disallow: /admin/` blocks the entire admin directory; `Disallow: /search?` blocks internal search result pages. The crawler will not request these URLs.
- Allow — Permits crawling of specific paths within a disallowed directory. `Allow: /admin/public/` within a disallowed `/admin/` lets Google access public admin pages.
- Sitemap — Points to the XML sitemap location. `Sitemap: https://example.com/sitemap.xml` ensures crawlers can find the sitemap regardless of other directives.
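Taken together, the directives combine into a single file served at the site root (e.g. https://example.com/robots.txt). A minimal sketch using the paths from the list above:

```
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /search?

Sitemap: https://example.com/sitemap.xml
```

Google resolves conflicts by applying the most specific (longest) matching rule, so the `Allow: /admin/public/` line overrides the broader `Disallow: /admin/` for URLs inside that subdirectory.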
Common Mistakes
Using robots.txt to hide pages from Google's index. Robots.txt prevents crawling, not indexing. If other sites link to a disallowed page, Google may index the URL anyway — it just cannot see the content, resulting in a blank listing. To prevent indexing, use a noindex meta tag or X-Robots-Tag header. The page must be crawlable for Google to see the noindex directive.
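For reference, the two forms of the noindex directive look like this (illustrative snippets: the meta tag goes in the page's HTML head, while the header is set by the server and also works for non-HTML files such as PDFs):

```
<meta name="robots" content="noindex">

X-Robots-Tag: noindex
```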
The second mistake is a misconfigured robots.txt that blocks important content. A single overly broad Disallow rule can prevent Google from crawling the entire site, and blocking CSS or JavaScript files prevents Google from rendering pages correctly. Always verify robots.txt changes with a testing tool, such as the robots.txt report in Google Search Console, before deploying.
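One way to spot-check a change before it ships is Python's standard-library robots.txt parser. A minimal sketch, assuming the hypothetical rules below; note that Python's parser applies rules in file order (first match wins), which is why the Allow line precedes the broader Disallow here:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; substitute your own draft.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Verify that important pages stay crawlable and blocked paths stay blocked.
print(parser.can_fetch("*", "https://example.com/products/widget"))   # True
print(parser.can_fetch("*", "https://example.com/admin/settings"))    # False
print(parser.can_fetch("*", "https://example.com/admin/public/faq"))  # True
```

Running checks like these against a list of known-important URLs catches an accidental site-wide Disallow before it reaches production.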
How I Use This
My SEO automation audits robots.txt configuration — checking for overly broad disallow rules, blocked resources that prevent rendering, and missing sitemap references. The advanced SEO audit cross-references robots.txt rules against the site's actual URL structure to identify crawl budget waste and unintentional blocking.
Related Terms
Crawl Budget
Crawl budget is the number of pages a search engine will crawl on your site within a given timeframe — determined by your server's capacity and the perceived value of your content. Managing crawl budget ensures Google spends its limited crawling resources on the pages that matter.
Indexation
Indexation is the process by which search engines discover, crawl, process, and store web pages in their index — making them eligible to appear in search results. A page that is not indexed cannot rank, regardless of its content quality or optimisation.
Technical SEO
Technical SEO is the foundation layer of search engine optimisation — the crawlability, indexability, site speed, and structural elements that determine whether search engines can find, understand, and rank your pages.
XML Sitemap
An XML sitemap is a file that lists all the important URLs on a website in a format search engines can read — helping Google discover, crawl, and understand the site's structure, especially for large sites, new sites, or pages with limited internal linking.