How to Create a robots.txt File
A complete guide to robots.txt — what it does, how to write one correctly, how to block AI bots, and how to configure it for WordPress, Shopify, and other platforms.
What is robots.txt?
A robots.txt file is a plain text file you place at the root of your website that tells web crawlers — search engines, AI scrapers, and other automated bots — which pages they are allowed to access. It follows the Robots Exclusion Protocol, a standard that almost every major crawler respects.
When a bot visits your site for the first time, it checks https://yourdomain.com/robots.txt before crawling anything else. If your file says a path is off-limits, a compliant bot will skip it.
A robots.txt file is made up of one or more "records". Each record starts with a User-agent line that identifies which bot the rules apply to, followed by Allow and Disallow directives that specify which paths the bot can or cannot access.
Here is the simplest possible robots.txt — one that lets all crawlers access everything:
User-agent: *
Allow: /
And here is one that blocks all crawlers from the entire site:
User-agent: *
Disallow: /
In practice, most robots.txt files are more nuanced — allowing search engines while blocking specific paths, or preventing AI scrapers from accessing your content.
Why robots.txt matters for SEO
Robots.txt plays a role in three areas that directly affect how your site performs in search:
Crawl budget management
Search engines allocate a limited amount of crawl activity to each site — this is called your crawl budget. If Googlebot spends time crawling low-value pages (session IDs, faceted navigation, internal search results), it may not get to your important content. By blocking those pages in robots.txt, you direct crawl budget to pages that actually need to rank.
This matters most for large sites with thousands of pages. If your site has under a few hundred pages, crawl budget is rarely a concern.
Preventing duplicate content indexing
Duplicate content — the same page accessible via multiple URLs — can dilute your rankings. Common culprits include filter combinations on e-commerce sites, URL parameters like ?ref= or ?sort=, and print versions of pages. Blocking these from crawlers keeps your index clean.
Protecting sensitive directories
Areas like /admin/, /wp-admin/, /checkout/, and staging environments should not appear in search results. Disallowing them in robots.txt keeps crawlers out, though for truly sensitive content you should also require authentication — robots.txt is not a security measure.
How to create a robots.txt file
The fastest way is to use the Robots.txt Generator — configure your rules visually and download the file in seconds. Here is how:
Choose your crawler
Select which bot your rules apply to. Use * (wildcard) to target all crawlers, or pick a specific bot from the preset list — Googlebot, Bingbot, GPTBot, and many more. You can add multiple rule groups for different bots.
Set Allow and Disallow paths
Add the paths you want to block or allow. Use / to refer to the entire site, or specify paths like /admin/, /private/, or /wp-admin/. When an Allow rule and a Disallow rule both match a URL, compliant crawlers apply the most specific (longest) matching rule, and Google resolves ties in favour of Allow.
Tip: use a trailing slash (/admin/ not /admin) to ensure the rule applies to the directory and everything inside it.
Add your sitemap URL
Including your sitemap URL in robots.txt helps search engines discover it automatically. Enter your sitemap URL — typically https://yourdomain.com/sitemap.xml — and the generator will add the correct Sitemap: directive at the end of the file.
Generate, copy, and upload
Click Generate to produce your robots.txt content. Copy it to your clipboard or download the file, then upload it to the root of your website. The file must be accessible at https://yourdomain.com/robots.txt — not in a subdirectory.
Tip: visit https://yourdomain.com/robots.txt in your browser to confirm the file is live and looks correct.
Manual creation
You can also create a robots.txt file in any plain text editor. Save it as robots.txt (not robots.txt.txt) with UTF-8 encoding and no byte-order mark. The format is straightforward:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
Each blank line separates a new record. Comments start with #. The Sitemap: directive goes at the end and applies globally, not to a specific user-agent.
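If you generate the file as part of a build or deploy script, the same format can be written out programmatically. Here is a minimal Python sketch; the paths and sitemap URL are placeholders:

```python
from pathlib import Path

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

# Use "utf-8", not "utf-8-sig": the latter would prepend a byte-order mark,
# which can confuse crawlers when they parse the first line.
Path("robots.txt").write_text(rules, encoding="utf-8")
```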
Common use cases
WordPress sites
WordPress generates some paths that should generally be blocked from crawlers — the admin panel, login page, and dynamically generated URLs that create duplicate content. A standard WordPress robots.txt looks like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
The /wp-admin/admin-ajax.php allow rule is important — this endpoint is used by many WordPress themes and plugins for front-end functionality, so blocking it can break things for real users.
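You can sanity-check rules like these with Python's built-in urllib.robotparser before deploying. One caveat: the stdlib parser resolves conflicts first-match-wins rather than by Google's longest-match rule, so the Allow line is listed first in this sketch to get the same outcome (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Allow listed before Disallow: Python's stdlib parser is first-match-wins,
# while Google applies the most specific (longest) matching rule.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-login.php
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/wp-admin/admin-ajax.php"))  # True
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))     # False
print(rp.can_fetch("*", "https://example.com/blog/hello-world/"))        # True
```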
The Robots.txt Generator includes a WordPress CMS template that sets these rules automatically.
Shopify and e-commerce
E-commerce sites commonly have URL parameters for sorting, filtering, and tracking that create thousands of near-duplicate pages. Blocking these keeps search engines focused on your actual product and category pages:
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /collections/*?sort_by=
Disallow: /collections/*?filter.

Sitemap: https://yourdomain.com/sitemap.xml
Shopify manages its own robots.txt file and does not allow direct editing. You can customize it using a robots.txt.liquid template in your theme. The Shopify template in the Robots.txt Generator gives you a solid starting point.
JavaScript frameworks (Next.js, React)
Modern JavaScript frameworks often generate routes for API endpoints, preview modes, and development utilities that should not appear in search results:
User-agent: *
Disallow: /api/
Disallow: /_next/
Disallow: /preview/

Sitemap: https://yourdomain.com/sitemap.xml
Blocking admin and sensitive areas
Any path that should not be publicly accessible should be disallowed in robots.txt as a first line of defence. Common targets include:
/admin/ — administration panels
/private/ — internal documents
/staging/ — staging environments on production domains
/tmp/ — temporary files
/cgi-bin/ — legacy server scripts
Remember: robots.txt is not a security barrier. A disallowed path can still be accessed by anyone who knows the URL. Always use proper authentication for genuinely sensitive content.
Blocking AI bots
Since 2023, most major AI companies have released crawlers that collect web content to train large language models. Publishers, news sites, and content creators have increasingly chosen to block these crawlers. Robots.txt is the standard mechanism for doing so.
Why block AI scrapers?
The case for blocking comes down to two concerns. First, your content may be used to train AI models without compensation or credit. Second, AI bots can generate significant crawl traffic on sites with large content libraries, consuming bandwidth without delivering referral traffic in return.
The case against blocking is that some AI products (like Perplexity) surface citations and send traffic. The right call depends on your content model and audience.
Known AI crawlers
To block the main AI crawlers, add individual rules for each one:
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
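A quick way to confirm that per-agent blocks like these behave as intended is Python's built-in urllib.robotparser. This sketch uses a shortened two-bot version of the rules, with example.com as a placeholder:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# AI bots named in the file are blocked everywhere; unnamed bots are unaffected.
print(rp.can_fetch("GPTBot", "https://example.com/articles/my-post/"))     # False
print(rp.can_fetch("ClaudeBot", "https://example.com/articles/my-post/"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/my-post/"))  # True
```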
The Robots.txt Generator's AI Bot Controls section lets you select which crawlers to block with checkboxes — no manual editing required. It stays updated with new crawlers as they emerge.
Important caveats
Blocking AI bots in robots.txt only works with crawlers that respect the Robots Exclusion Protocol. Not all do — some less scrupulous scrapers ignore robots.txt entirely. If you need stronger protection, consider IP-based blocking or rate limiting at the server level.
Also, blocking a crawler from your site does not remove content that has already been collected. It only prevents future crawls.
Common mistakes to avoid
Blocking CSS and JavaScript files
An old SEO practice was to block /wp-content/ to prevent CSS and JS from being crawled. This is now harmful. Google uses CSS and JavaScript to render pages, and blocking these files prevents it from seeing your content as users do. Never block your theme files, plugin assets, or JavaScript bundles.
Typos in paths
Robots.txt paths are prefix matches. Disallow: /admin (without a trailing slash) blocks /admin/settings/, but it also blocks every other path that starts with /admin, such as /administrator/ or /admin-guide.html. Always use trailing slashes for directories, and double-check paths against your actual URL structure.
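Python's stdlib urllib.robotparser applies the same prefix semantics, so the pitfall is easy to demonstrate (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The missing trailing slash makes this a broad prefix match:
print(rp.can_fetch("*", "https://example.com/admin/settings/"))   # False (intended)
print(rp.can_fetch("*", "https://example.com/administrator/"))    # False (probably not intended)
print(rp.can_fetch("*", "https://example.com/admin-guide.html"))  # False (probably not intended)
```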
Missing Sitemap directive
Many robots.txt files omit the Sitemap: directive. While it is not strictly required, including it means every crawler that reads your robots.txt will automatically know where to find your sitemap — making it easier for search engines to index your content.
Blocking the entire site accidentally
Disallow: / in a wildcard block will stop all crawlers from accessing any page on your site. This is a common mistake during development that sometimes slips into production. Always check your live robots.txt after deploying.
Conflating robots.txt with noindex
Disallowing a page does not remove it from the index. Google can still index a URL if other sites link to it, even if it cannot crawl it. To prevent indexing, use noindex in the page's meta robots tag or as an HTTP response header. Robots.txt and noindex serve different purposes.
Using robots.txt as a security tool
Listing sensitive paths in robots.txt actually advertises them to anyone who reads the file — which is public. Do not rely on robots.txt to protect confidential directories. Use authentication.
Frequently asked questions
What is a robots.txt file?
A robots.txt file is a plain text file at the root of your website that tells crawlers which pages they can or cannot access. It follows the Robots Exclusion Protocol and is checked by most bots before they crawl anything on your site.
Does robots.txt hide pages from Google?
No. Disallowing a page stops Google from crawling it, but Google may still index the URL if other sites link to it. To prevent a page from appearing in search results entirely, add a noindex meta robots tag to the page itself.
Where do I upload my robots.txt file?
Upload it to the root of your domain so it is accessible at https://yourdomain.com/robots.txt. A robots.txt file in a subdirectory (like /blog/robots.txt) will not be recognised by crawlers.
How do I block AI bots like GPTBot and ClaudeBot?
Add a separate User-agent block for each AI bot with Disallow: /. The Robots.txt Generator includes an AI Bot Controls section that lets you block multiple AI crawlers with a single click.
How do I test my robots.txt file?
Visit your robots.txt URL directly in a browser to confirm it is live. For deeper testing, use Google Search Console's robots.txt tester (under Settings) to check whether specific URLs are blocked or allowed.
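For scripted spot checks of the URLs you care about, a small helper around Python's stdlib parser works well. The sample rules and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a single URL against the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

sample = """\
User-agent: *
Disallow: /checkout/
Disallow: /api/
"""

print(is_allowed(sample, "Googlebot", "https://example.com/products/shoes/"))  # True
print(is_allowed(sample, "Googlebot", "https://example.com/api/v1/orders"))    # False
```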
Is robots.txt case-sensitive?
The directives (User-agent, Disallow, Allow, Sitemap) are case-insensitive. The paths in Allow and Disallow rules, however, are matched case-sensitively, so /Admin/ and /admin/ are treated as different paths.
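Path case sensitivity is easy to see with Python's stdlib parser (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/"))  # False
print(rp.can_fetch("*", "https://example.com/Admin/"))  # True (treated as a different path)
```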
Can I have multiple User-agent rules?
Yes. You can have as many User-agent blocks as you need. Each block applies to the crawler(s) named in its User-agent lines. This allows you to set different rules for different bots — for example, allowing Googlebot full access while blocking AI crawlers completely.
Do all bots respect robots.txt?
Major search engines (Google, Bing, Yandex) and most reputable AI crawlers respect robots.txt. Malicious scrapers and some lesser-known bots do not. For protection against non-compliant bots, you need server-level controls like IP blocking or rate limiting.