How to Create a robots.txt File
A complete guide to robots.txt — what it does, how to write one correctly, how to block AI bots, and how to configure it for WordPress, Shopify, and other platforms.
What is robots.txt?
A robots.txt file is a plain text file you place at the root of your website that tells web crawlers — search engines, AI scrapers, and other automated bots — which pages they are allowed to access. It follows the Robots Exclusion Protocol, a standard that almost every major crawler respects.
When a bot visits your site for the first time, it checks https://yourdomain.com/robots.txt before crawling anything else. If your file says a path is off-limits, a compliant bot will skip it.
A robots.txt file is made up of one or more "records". Each record starts with a User-agent line that identifies which bot the rules apply to, followed by Allow and Disallow directives that specify which paths the bot can or cannot access.
Here is the simplest possible robots.txt — one that lets all crawlers access everything:
User-agent: *
Allow: /
And here is one that blocks all crawlers from the entire site:
User-agent: *
Disallow: /
In practice, most robots.txt files are more nuanced — allowing search engines while blocking specific paths, or preventing AI scrapers from accessing your content.
Why robots.txt matters for SEO
Robots.txt plays a role in three areas that directly affect how your site performs in search:
Crawl budget management
Search engines allocate a limited amount of crawl activity to each site — this is called your crawl budget. If Googlebot spends time crawling low-value pages (session IDs, faceted navigation, internal search results), it may not get to your important content. By blocking those pages in robots.txt, you direct crawl budget to pages that actually need to rank.
This matters most for large sites with thousands of pages. If your site has under a few hundred pages, crawl budget is rarely a concern.
Preventing duplicate content indexing
Duplicate content — the same page accessible via multiple URLs — can dilute your rankings. Common culprits include filter combinations on e-commerce sites, URL parameters like ?ref= or ?sort=, and print versions of pages. Blocking these from crawlers keeps your index clean.
Protecting sensitive directories
Areas like /admin/, /wp-admin/, /checkout/, and staging environments should not appear in search results. Disallowing them in robots.txt keeps crawlers out, though for truly sensitive content you should also require authentication — robots.txt is not a security measure.
How to create a robots.txt file
The fastest way is to use the Robots.txt Generator — configure your rules visually and download the file in seconds. Here is how:
Choose your crawler
Select which bot your rules apply to. Use * (wildcard) to target all crawlers, or pick a specific bot from the preset list — Googlebot, Bingbot, GPTBot, and many more. You can add multiple rule groups for different bots.
Set Allow and Disallow paths
Add the paths you want to block or allow. Use / to refer to the entire site, or specify paths like /admin/, /private/, or /wp-admin/. When an Allow rule and a Disallow rule both match a URL, compliant crawlers apply the most specific (longest) matching rule, and Google resolves ties in favour of Allow.
Tip: use a trailing slash (/admin/ not /admin) to ensure the rule applies to the directory and everything inside it.
Add your sitemap URL
Including your sitemap URL in robots.txt helps search engines discover it automatically. Enter your sitemap URL — typically https://yourdomain.com/sitemap.xml — and the generator will add the correct Sitemap: directive at the end of the file.
Generate, copy, and upload
Click Generate to produce your robots.txt content. Copy it to your clipboard or download the file, then upload it to the root of your website. The file must be accessible at https://yourdomain.com/robots.txt — not in a subdirectory.
Tip: visit https://yourdomain.com/robots.txt in your browser to confirm the file is live and looks correct.
Manual creation
You can also create a robots.txt file in any plain text editor. Save it as robots.txt (not robots.txt.txt) with UTF-8 encoding and no byte-order mark. The format is straightforward:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
Each blank line separates a new record. Comments start with #. The Sitemap: directive goes at the end and applies globally, not to a specific user-agent.
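If you generate the file as part of a build or deploy script, the same format can be written out programmatically. Here is a minimal Python sketch; the paths and sitemap URL are placeholders:

```python
from pathlib import Path

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

# Use "utf-8", not "utf-8-sig": the latter would prepend a byte-order mark,
# which can confuse crawlers when they parse the first line.
Path("robots.txt").write_text(rules, encoding="utf-8")
```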
Common use cases
WordPress sites
WordPress generates some paths that should generally be blocked from crawlers — the admin panel, login page, and dynamically generated URLs that create duplicate content. A standard WordPress robots.txt looks like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
The /wp-admin/admin-ajax.php allow rule is important — this endpoint is used by many WordPress themes and plugins for front-end functionality, so blocking it can break things for real users.
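You can sanity-check rules like these with Python's built-in urllib.robotparser before deploying. One caveat: the stdlib parser resolves conflicts first-match-wins rather than by Google's longest-match rule, so the Allow line is listed first in this sketch to get the same outcome (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Allow listed before Disallow: Python's stdlib parser is first-match-wins,
# while Google applies the most specific (longest) matching rule.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-login.php
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/wp-admin/admin-ajax.php"))  # True
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))     # False
print(rp.can_fetch("*", "https://example.com/blog/hello-world/"))        # True
```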
The Robots.txt Generator includes a WordPress CMS template that sets these rules automatically.
Shopify and e-commerce
E-commerce sites commonly have URL parameters for sorting, filtering, and tracking that create thousands of near-duplicate pages. Blocking these keeps search engines focused on your actual product and category pages:
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /collections/*?sort_by=
Disallow: /collections/*?filter.

Sitemap: https://yourdomain.com/sitemap.xml
Shopify manages its own robots.txt file and does not allow direct editing. You can customize it using a robots.txt.liquid template in your theme. The Shopify template in the Robots.txt Generator gives you a solid starting point.
JavaScript frameworks (Next.js, React)
Modern JavaScript frameworks often generate routes for API endpoints, preview modes, and development utilities that should not appear in search results:
User-agent: *
Disallow: /api/
Disallow: /_next/
Disallow: /preview/

Sitemap: https://yourdomain.com/sitemap.xml
Blocking admin and sensitive areas
Any path that should not be publicly accessible should be disallowed in robots.txt as a first line of defence. Common targets include:
/admin/ — administration panels
/private/ — internal documents
/staging/ — staging environments on production domains
/tmp/ — temporary files
/cgi-bin/ — legacy server scripts
Remember: robots.txt is not a security barrier. A disallowed path can still be accessed by anyone who knows the URL. Always use proper authentication for genuinely sensitive content.
Blocking AI bots
Since 2023, most major AI companies have released crawlers that collect web content to train large language models. Publishers, news sites, and content creators have increasingly chosen to block these crawlers. Robots.txt is the standard mechanism for doing so.
Why block AI scrapers?
The case for blocking comes down to two concerns. First, your content may be used to train AI models without compensation or credit. Second, AI bots can generate significant crawl traffic on sites with large content libraries, consuming bandwidth without delivering referral traffic in return.
The case against blocking is that some AI products (like Perplexity) surface citations and send traffic. The right call depends on your content model and audience.
Known AI crawlers
To block the main AI crawlers, add individual rules for each one:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
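A quick way to confirm that per-agent blocks like these behave as intended is Python's built-in urllib.robotparser. This sketch uses a shortened two-bot version of the rules, with example.com as a placeholder:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# AI bots named in the file are blocked everywhere; unnamed bots are unaffected.
print(rp.can_fetch("GPTBot", "https://example.com/articles/my-post/"))     # False
print(rp.can_fetch("ClaudeBot", "https://example.com/articles/my-post/"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/my-post/"))  # True
```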
The Robots.txt Generator's AI Bot Controls section lets you select which crawlers to block with checkboxes — no manual editing required. It stays updated with new crawlers as they emerge.
Important caveats
Blocking AI bots in robots.txt only works with crawlers that respect the Robots Exclusion Protocol. Not all do — some less scrupulous scrapers ignore robots.txt entirely. If you need stronger protection, consider IP-based blocking or rate limiting at the server level.
Also, blocking a crawler from your site does not remove content that has already been collected. It only prevents future crawls.
Common mistakes to avoid
Blocking CSS and JavaScript files
An old SEO practice was to block /wp-content/ to prevent CSS and JS from being crawled. This is now harmful. Google uses CSS and JavaScript to render pages, and blocking these files prevents it from seeing your content as users do. Never block your theme files, plugin assets, or JavaScript bundles.
Typos in paths
Robots.txt paths are prefix matches. Disallow: /admin (without a trailing slash) blocks /admin/settings/, but it also blocks every other path that starts with /admin, such as /administrator/ or /admin-guide.html. Always use trailing slashes for directories, and double-check paths against your actual URL structure.
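Python's stdlib urllib.robotparser applies the same prefix semantics, so the pitfall is easy to demonstrate (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The missing trailing slash makes this a broad prefix match:
print(rp.can_fetch("*", "https://example.com/admin/settings/"))   # False (intended)
print(rp.can_fetch("*", "https://example.com/administrator/"))    # False (probably not intended)
print(rp.can_fetch("*", "https://example.com/admin-guide.html"))  # False (probably not intended)
```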
Missing Sitemap directive
Many robots.txt files omit the Sitemap: directive. While it is not strictly required, including it means every crawler that reads your robots.txt will automatically know where to find your sitemap — making it easier for search engines to index your content.
Blocking the entire site accidentally
Disallow: / in a wildcard block will stop all crawlers from accessing any page on your site. This is a common mistake during development that sometimes slips into production. Always check your live robots.txt after deploying.
Conflating robots.txt with noindex
Disallowing a page does not remove it from the index. Google can still index a URL if other sites link to it, even if it cannot crawl it. To prevent indexing, use noindex in the page's meta robots tag or as an HTTP response header. Robots.txt and noindex serve different purposes.
Using robots.txt as a security tool
Listing sensitive paths in robots.txt actually advertises them to anyone who reads the file — which is public. Do not rely on robots.txt to protect confidential directories. Use authentication.
Frequently asked questions
What is a robots.txt file?
A robots.txt file is a plain text file at the root of your website that tells crawlers which pages they can or cannot access. It follows the Robots Exclusion Protocol and is checked by most bots before they crawl anything on your site.
Does robots.txt hide pages from Google?
No. Disallowing a page stops Google from crawling it, but Google may still index the URL if other sites link to it. To prevent a page from appearing in search results entirely, add a noindex meta robots tag to the page itself.
Where do I upload my robots.txt file?
Upload it to the root of your domain so it is accessible at https://yourdomain.com/robots.txt. A robots.txt file in a subdirectory (like /blog/robots.txt) will not be recognised by crawlers.
How do I block AI bots like GPTBot and ClaudeBot?
Add a separate User-agent block for each AI bot with Disallow: /. The Robots.txt Generator includes an AI Bot Controls section that lets you block multiple AI crawlers with a single click.
How do I test my robots.txt file?
Visit your robots.txt URL directly in a browser to confirm it is live. For deeper testing, use Google Search Console's robots.txt tester (under Settings) to check whether specific URLs are blocked or allowed.
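For scripted spot checks of the URLs you care about, a small helper around Python's stdlib parser works well. The sample rules and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a single URL against the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

sample = """\
User-agent: *
Disallow: /checkout/
Disallow: /api/
"""

print(is_allowed(sample, "Googlebot", "https://example.com/products/shoes/"))  # True
print(is_allowed(sample, "Googlebot", "https://example.com/api/v1/orders"))    # False
```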
Is robots.txt case-sensitive?
The directives (User-agent, Disallow, Allow, Sitemap) are case-insensitive. The paths in Allow and Disallow rules, however, are matched case-sensitively, so /Admin/ and /admin/ are treated as different paths.
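Path case sensitivity is easy to see with Python's stdlib parser (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/"))  # False
print(rp.can_fetch("*", "https://example.com/Admin/"))  # True (treated as a different path)
```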
Can I have multiple User-agent rules?
Yes. You can have as many User-agent blocks as you need. Each block applies to the crawler(s) named in its User-agent lines. This allows you to set different rules for different bots — for example, allowing Googlebot full access while blocking AI crawlers completely.
Do all bots respect robots.txt?
Major search engines (Google, Bing, Yandex) and most reputable AI crawlers respect robots.txt. Malicious scrapers and some lesser-known bots do not. For protection against non-compliant bots, you need server-level controls like IP blocking or rate limiting.