The Robots.txt File: What It Is and Why It Matters


For decades, the humble robots.txt file served as the de facto regulator of search engine crawlers. Thanks to AI, that long-standing honor system may be over. This small text file is tasked with requesting that crawlers stay out of sensitive areas of a website, and, to their credit, the likes of Google, Bing, and other major search engines have honored those requests. Find out what a robots.txt file does, how to view your website’s file, and why brands should evaluate these files in a world where AI doesn’t follow existing norms.

What Is a Robots.txt File?

Search engines like Google and Bing use crawlers, or “bots,” to find, organize, and index URLs, ultimately serving those URLs on the search engine results page (SERP). Adding certain URLs to a robots.txt file with a “Disallow” directive helps prevent search engines from crawling those URLs – or, more accurately, requests that they not crawl those pages. (Keeping a page out of the crawl usually keeps it out of search results, though a disallowed URL can still be indexed if other sites link to it.)

So, a robots.txt file manages which URLs of a domain web crawlers can access. It can also specify which web crawlers are allowed to crawl the site, which XML sitemaps they should focus on, and how long they should wait between requests.
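Here’s a minimal sketch of what the file looks like. The paths and sitemap URL are placeholders; your own file will list your own URLs:

    # Apply these rules to every crawler
    User-agent: *
    # Request that crawlers stay out of this directory
    Disallow: /private/

    # Point crawlers to the XML sitemap
    Sitemap: https://www.example.com/sitemap.xml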

When Should You Use a Robots.txt File?

There are several main reasons to use a robots.txt file:

  • To manage crawl budget with Disallow directives.
  • To keep potentially sensitive pages private on your domain.
  • To control which crawlers access the site, and how they behave.
  • To direct crawlers to preferred XML sitemaps.
  • To slow crawlers that make repeated requests in a short period.

Let’s detail how each of these functions works.

Control Crawl Budget with Robots.txt

Not all pages on your website are valuable to users. Crawlers have a specific “budget,” or the number of URLs they will crawl over a particular period. You want them to spend that budget on your best content, like product pages, services pages, or top blog posts.

Duplicate content, URLs with query parameters, or old content you need to clean up all waste crawl budget, which is why some brands add those unimportant URLs to their robots.txt with a “Disallow” directive.
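For example, a file might disallow parameterized and outdated URLs like this. The paths are placeholders, and note that the “*” wildcard isn’t part of the original robots.txt standard, though major crawlers like Googlebot and Bingbot support it:

    User-agent: *
    # Block any URL containing a query string (e.g., /products?sort=price)
    Disallow: /*?
    # Block an outdated section slated for cleanup
    Disallow: /old-blog/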

Control Page Privacy with Robots.txt

Some pages on your site are potentially vulnerable to bad actors. It’s considered best practice for a robots.txt file to disallow bots from accessing administrative pages, log-in access URLs, or customer log-in screens. There’s a big asterisk here, though. By placing your sensitive URLs in the file, you’re making them visible, as we’ll see in a moment. Bad actors can find and attack these URLs, whether in your robots.txt or not. Placing them there, however, reduces the chances of them being indexed.
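As an example, a WordPress site’s file typically requests that bots skip the admin area. The paths below are WordPress defaults; other platforms use different ones:

    User-agent: *
    # Keep bots out of the admin and login area
    Disallow: /wp-admin/
    # Except this endpoint, which front-end features may call
    Allow: /wp-admin/admin-ajax.php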

Control Which Crawlers Access the Site

Some administrators choose to block certain web crawlers from spidering a website. This discourages disreputable crawlers from accessing the site and potentially scraping data without permission. Crawlers also consume a domain’s server bandwidth, which may increase hosting costs or negatively impact performance for real users.
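Blocking a crawler means naming its user-agent token and disallowing the whole site. The tokens below are the ones OpenAI and Perplexity publish for their bots; check each company’s current documentation before relying on them:

    # Block OpenAI's crawler from the entire site
    User-agent: GPTBot
    Disallow: /

    # Block Perplexity's crawler from the entire site
    User-agent: PerplexityBot
    Disallow: /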

Delay Site Crawls

Additionally, a Crawl-delay directive can be added to a robots.txt file to tell crawlers how many seconds to wait between requests. This is useful for protecting bandwidth during high-volume periods and reduces the risk of server crashes. It can also balance how often your site is crawled while preserving server resources.
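Here’s what that looks like. Keep in mind that support varies: Bing honors Crawl-delay, but Google ignores the directive entirely.

    # Ask Bing's crawler to wait 10 seconds between requests
    User-agent: Bingbot
    Crawl-delay: 10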

Direct Crawlers to Your XML Sitemaps

Adding a Sitemap line (or several) to a robots.txt file gives search engine crawlers a direct path to your XML sitemap, so the important URLs in that sitemap can be crawled and indexed frequently and consistently. Unlike the other features described above, which prevent crawlers from doing certain things, this feature tells crawlers to do what we want them to do!
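Each Sitemap line takes a full absolute URL, one sitemap per line, and can appear anywhere in the file (the URLs here are placeholders):

    Sitemap: https://www.example.com/sitemap.xml
    Sitemap: https://www.example.com/blog-sitemap.xml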

Robots.txt Best Practices: URLs to Disallow

Remember: Disallows are the lines in a robots.txt file that tell crawlers not to crawl certain URLs. We recommend adding a few types of pages to the Disallow lines in your robots.txt file but carefully consider these additions for yourself. If you’re unsure, we’d be happy to answer your questions. (Just don’t get Cody started on the history of robots.txt.)

In most cases, it’s an industry best practice to disallow the following page types in the file (a combined example follows the list):

  • Internal search results pages
  • URLs with query parameters
  • Admin login pages
  • Customer login pages
  • Duplicate content
  • Private content that is valuable on the site but only for qualified users, not the whole world wide web
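Putting it all together, a file covering that list might look something like this sketch. Every path here is a placeholder; substitute the URL patterns your own CMS actually generates:

    User-agent: *
    # Internal search results pages
    Disallow: /search/
    # URLs with query parameters
    Disallow: /*?
    # Admin and customer login pages
    Disallow: /admin/
    Disallow: /login/
    # Duplicate and members-only content
    Disallow: /duplicate-archive/
    Disallow: /members/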

Where to Find Robots.txt on Your Site

Most content management systems (CMS) automatically create robots.txt files and add many of the page types listed above. Tools like Yoast have a dedicated robots.txt generator that pulls in sensitive pages as they’re created or as web administrators submit URLs.

Remember that anyone can see your disallow list if they know where to find the robots.txt file for your domain. Locating it is pretty simple: just add “/robots.txt” to the root domain URL. Here are Apple’s, Oneupweb’s, and The Wall Street Journal’s:
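    https://www.apple.com/robots.txt
    https://www.oneupweb.com/robots.txt
    https://www.wsj.com/robots.txt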

Large websites like these block crawlers from all kinds of article types, email signup pages, and site searches – and they block very specific crawlers, namely crawlers from AI companies.

Depending on your CMS, you’ll be able to access the robots.txt file in a dedicated plug-in, like Yoast or Rank Math SEO. You can also find it in the domain’s root directory or, in some cases, by navigating from your dashboard to Settings > Reading and looking for a section covering search engine indexing. The exact verbiage will vary based on your CMS and other factors.

Robots.txt, AI, and the New Internet

Until now, you might have assumed that robots.txt files serve as a wall, actively preventing bots from accessing certain URLs.

They don’t. They simply request that crawlers stay away from those pages. That was the case for the long history of robots.txt (well, since it was first used in 1994), and most reputable bots respected those requests. Artificial intelligence companies like OpenAI (the maker of ChatGPT), Perplexity, and a dozen others have been accused of ignoring the robots.txt file, upending an unregulated but widely accepted norm among search companies: take robots.txt files seriously.

Things Are Getting Litigious

If you scoped The Wall Street Journal’s file, you saw that it specifically blocks a slew of artificial intelligence bots from crawling the site, disallowing access to the domain for bots run by Anthropic, Twitter (we’re not calling them X), and others.

Suffice it to say that these AI companies haven’t been respecting those requests. In October 2024, News Corp, which owns The Wall Street Journal and the New York Post, filed a lawsuit against Perplexity. The suit alleges copyright infringement, pointing to outputs that reproduce news articles, editorials, and opinion pieces nearly word for word. News Corp also alleges that Perplexity misattributes sources or links to different content than it used to generate a response. Other outlets, including Wired and Forbes, have also accused Perplexity of scraping content without permission.

Similar lawsuits have been filed against OpenAI, the maker of ChatGPT, and against smaller AI companies.

We Used to Be a Proper Internet

These lawsuits and the behavior that spurred them may signal the decline of long-held digital norms. For marketers, it’s time to take a more active approach to security, with password-protected pages, off-site customer accounts, and other ways to reduce the vulnerability of sensitive pages. It’s always been the case, but it’s never been more true: If it’s on the internet and someone wants to get into it, they probably can.

If that’s the case, do you need a robots.txt file? Yes! Robots.txt is still a valuable tool for effectively organizing what should be indexed and what shouldn’t.

Concerned? We Can Help You Get the Most Out of Your Robots.txt File

As part of a technical site audit or monthly management, we monitor the URLs included in the robots.txt file and recommend changes to preserve the crawl budget and improve overall results. You’ll work with a hand-picked team of marketing pros with the experience and resources to make things happen! Let’s get started; reach out or call (231) 922-9977 today.
