Ever thought about how search engines are able to pull up hundreds of results for a simple query? Or how, if you change that query even slightly, you are presented with entirely new results within seconds?
You can thank the spiders for that.
Don’t torch your laptop just yet!
Search engine spiders scan billions of websites page by page in order to gain an understanding of their contents. According to Bots vs. Browsers, a public database of bots/user agents, search engine spiders alone account for 31% of all website traffic. Based on modern search engine algorithms, web pages are then presented in search engine results depending on their relevancy, authority and credibility.
However, sometimes search engine spiders are less helpful than they are designed to be. This is where robots.txt files come in.
The Robots Exclusion Protocol
In 1994 Martijn Koster proposed the Robots Exclusion Protocol in response to the growing problem of robots, now commonly referred to as spiders or crawlers, visiting pages where they were not welcome. It is now a “de facto standard” for webmasters of all different types of websites. The Robots Exclusion Protocol, implemented through a file named robots.txt, is an open convention used to dictate what spiders can peek at or pull content from, as well as what sections of a site are off limits.
Robots.txt is a set of directives used to ask search engine crawlers not to pull content from certain pages or files. It is an uncomplicated text file that can be uploaded to the root directory of any website, for example via FTP, so that it is reachable at /robots.txt.
Usually, the contents of a robots.txt file look a little something like this:
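As a minimal sketch, the Python snippet below embeds a small, hypothetical robots.txt file and checks it the way a cooperative crawler might, using the standard library's urllib.robotparser module. The paths, the domain www.example.com and the crawler name ExampleBot are assumptions made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (illustrative paths only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent crawl url."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines rather than one big string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A URL under a disallowed path, and one that no rule mentions.
blocked = is_allowed(ROBOTS_TXT, "ExampleBot", "https://www.example.com/private/report.html")
open_page = is_allowed(ROBOTS_TXT, "ExampleBot", "https://www.example.com/blog/")
print(blocked, open_page)  # prints: False True
```

Each Disallow line names a path prefix, and the "User-agent: *" line applies those rules to every crawler that chooses to honor them.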
If a robots.txt file is in place, Google and other search engines will read it before crawling the rest of the site to learn which pages may be crawled and which should be skipped. Any path following “Disallow” will be skipped by cooperating spiders and will generally stay out of search results, provided it is not linked to from another crawlable web page. If your business’s website does not have a robots.txt file in place, search engine spiders will go on their merry way crawling and indexing the entire contents of your site, which can lead to a poor user experience when irrelevant content that does not align with your target audience’s search intentions shows up in results.
It is important to note that a robots.txt file only blocks honest, cooperative crawlers from scanning a page. It is therefore not a security measure, but an instructive one.
Why Would Anyone Want To Prevent Search Engine Crawlers From Seeing Their Page?
Google recommends in its guidelines that webmasters “use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
In human-speak, that means that robots.txt files are used to prevent crawlers from indexing an unannounced/incomplete site or burdening a web server by demanding too many of its resources. They’re also used to prevent crawlers from indexing pages that house strictly utilitarian code which is irrelevant to searchers as well as content that could appear to be duplicate (print versions of pages, for example).
For most small business websites, robots.txt files are simply used to prevent crawlers from seeing core files and directories (/feed/, /trackback/, /admin/, etc.).
The bottom line is that a robots.txt file helps a website follow SEO best practices when it contains files that add no value to a user’s overall experience.
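For a small site like that, the file is short enough to generate from a list of paths. The sketch below is only an illustration: the helper name build_robots_txt is hypothetical, and the three paths are simply the examples mentioned above.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build robots.txt text that disallows each given path for user_agent."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    # End with a newline so the result can be written straight to a file.
    return "\n".join(lines) + "\n"


# Example core paths from the text; a real site would list its own.
CORE_PATHS = ["/feed/", "/trackback/", "/admin/"]
robots_txt = build_robots_txt(CORE_PATHS)
print(robots_txt)
```

The resulting text would then be saved as robots.txt in the site's root directory.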
The Future of Robots.txt Files
Although no active efforts currently exist to further the development of the Robots Exclusion Protocol, major search engines continue to support it. For the foreseeable future, a robots.txt file remains one of the most firmly established elements of any ongoing, successful SEO campaign.
If you choose to implement a robots.txt file on your business’s website, make sure you implement it correctly. Keep an eye out for Jon Whitaker’s post about robots.txt implementation, coming soon.
“Robots.txt Specifications.” Google Developers. Accessed 8 April 2014.