What should (and should not) be in a robots.txt file?
What is robots.txt?
robots.txt is a file, accessible at the URL https://<domain>/robots.txt, that tells crawlers and bots which URLs they may access on the domain / website.
However, there is a lot of confusion among internet marketers about its value and power.
The format of the file is very simple. Instructions to allow or disallow paths are grouped per user agent. The user agent token is a shortened version of the User-Agent header, and each bot publishes its own token. Google, for example, has a top-level Googlebot but many others for images, ads, etc.
In addition, robots.txt supports a global Sitemap directive. It tells all crawlers where the sitemap file is; most commonly this is an XML file.
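For illustration, here is a minimal robots.txt showing per-user-agent groups and the global Sitemap directive (the domain and paths are placeholders):

# Rules for Google's main crawler
User-agent: Googlebot
Disallow: /checkout/

# Rules for Google's image crawler
User-agent: Googlebot-Image
Disallow: /media/private/

# Rules for everyone else
User-agent: *
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml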
Each crawler is free to interpret robots.txt as it sees fit, including whether to honor the directives at all.
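To see how a well-behaved client consults the file, here is a sketch using Python's standard library robots.txt parser (the domain, crawler name, and path are placeholders):

# Sketch: how a polite crawler checks robots.txt before fetching a URL.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the file

# A compliant crawler asks before every fetch; a bad actor simply skips this.
if parser.can_fetch("MyCrawler", "https://www.example.com/catalog/product_compare/"):
    print("robots.txt allows MyCrawler to fetch this page")
else:
    print("robots.txt disallows this page for MyCrawler")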
Should you use Disallow in robots.txt?
SEO experts and internet marketers would like to think that adding a Disallow for an existing URL will cause Google not to index the page.
That is not true; Google states this in many places in its documentation. In one place it puts up a warning:
“Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.
If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.”
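If keeping a page out of search results is the goal, the documented alternative is a noindex rule. For an HTML page, it goes in the head:

<meta name="robots" content="noindex">

For non-HTML resources such as PDFs, the same rule can be sent as an HTTP response header:

X-Robots-Tag: noindex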
We actually see internet marketers in the Magento space browse the internet for a suggested robots.txt and pick perhaps the longest one they can find, thinking that the more Disallow lines they have, the better Google will crawl.
Security issues with bad Disallow rules
We see commented sections saying “# Directories”, followed by a Disallow line for every internal directory of the application.
Little do these site owners know that they are exposing the internal directory structure of the application. And since robots.txt is available to everyone, including bad actors, they have just created a security issue.
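For example, a typical copy-pasted Magento robots.txt contains something like the following, handing any attacker a map of the installation:

# Directories
Disallow: /app/
Disallow: /downloader/
Disallow: /includes/
Disallow: /lib/
Disallow: /shell/
Disallow: /var/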
We also see some marketers list a huge set of crawlers they don’t want on their website. This is completely unnecessary in robots.txt: a crawler you don’t want is unlikely to honor the file anyway. Instead, tell your hosting provider. If you are with luroConnect, all our plans include “BOT Blockers”. Just let us know and we will match a real HTTP User Agent header and block it.
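As a sketch of what such a server-side block looks like, assuming an nginx front end (the bot names are illustrative, and your host's implementation will differ):

# Inside the nginx server block: reject requests whose User-Agent
# matches unwanted crawlers (names illustrative).
if ($http_user_agent ~* "(BadBot|GreedyCrawler)") {
    return 403;
}

Unlike robots.txt, this is enforced on every request and reveals nothing to the public.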
What would be the best way to structure a robots.txt?
There is no significance to Allow! Anything not explicitly disallowed can be crawled by default.
User-agent: *
Disallow: /<path visible to users that you don’t want crawlers to visit, such as product compare>

User-agent: bingbot
Crawl-delay: 1

Sitemap: <sitemapurl>
Note, though, that Google suggests using noindex rather than relying on Disallow for pages you want kept out of search results; a disallowed page cannot be crawled, so a noindex on it would never be seen.
Another interesting fact: Google documents that it caches robots.txt for up to 24 hours, essentially meaning that, in practice, cache-control headers have little effect on how quickly Google picks up changes. A browser, however, does cache robots.txt, so when checking the current robots.txt of a website as Google will see it, please disable your browser cache.
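One way to bypass the browser cache entirely is to fetch the file from a script; a minimal sketch in Python (the domain is a placeholder):

# Fetch robots.txt directly, with no browser cache in the way,
# to see the version crawlers will be served.
import urllib.request

with urllib.request.urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))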
So, why a note on robots.txt on a managed hosting blog?
We are a hosting platform that understands the importance of SEO, and we have acted on it.
We keep track of the robots.txt, sitemaps, and feeds of all our customers.
We also feature a very strong BOT Blocker: it can match any user agent string with a regular expression, identify Googlebot imposters and block them, and protect staging and dev sites with basic authentication passwords to ensure they don't get crawled.
If you are SEO knowledgeable and do not find our recommendation acceptable, we encourage you to put your comments on this blog. We will acknowledge and publish all updates we accept.