Technical SEO

What should (and should not) be in a robots.txt file?

What is robots.txt?

robots.txt is a file, accessible at the URL https://<domain>/robots.txt, that tells crawlers and BOTs which URLs they may access on the domain / website.

However, there is a lot of confusion among internet marketers about its value and power.

The format of the file is very simple. Instructions to allow or disallow paths are grouped per user agent. The user agent in the file is a shortened version of the User-Agent header, and each BOT publishes its own shortcode. Google, for example, has a top-level Googlebot but also many others for images, ads, etc.

In addition, robots.txt has a global Sitemap directive that tells all crawlers where the sitemap file is. Most commonly it is an XML file.

Each crawler is free to interpret robots.txt as it sees fit, including whether it honors the directives at all.
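As an illustration of how a compliant crawler reads these rules, here is a minimal sketch using Python’s standard-library robots.txt parser; the domain and paths below are placeholders, not recommendations.

from urllib import robotparser

# Point the parser at the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# A well-behaved crawler checks each URL against the group for its user agent shortcode.
print(rp.can_fetch("Googlebot", "https://www.example.com/catalog/product_compare/"))
print(rp.can_fetch("*", "https://www.example.com/"))

# Crawl-delay, if declared for that user agent, is also exposed.
print(rp.crawl_delay("bingbot"))

The parser only reports what the file says; whether a bot actually respects it is entirely up to the bot.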

Do you use Disallow in robots.txt?

SEO experts and internet marketers would like to think that adding a Disallow for an existing URL will cause Google not to index the page.

That is not true. Google states this in many places in its documentation; in one place it puts up a warning:

“Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.

If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.”

We actually see internet marketers in the Magento space browse the internet for a suggested robots.txt and pick perhaps the longest one they can find, thinking that the more Disallow lines they have, the better Google will crawl.

Security Issues with bad disallows

We see commented sections saying “#Directories”, followed by Disallow lines for the application’s internal directories.

Little do they know that this exposes the internal directory structure of the application. And since robots.txt is available to everyone, including bad actors, it creates a security issue.

We also see some marketers list a huge number of crawlers they don’t want on their website. This is completely unnecessary; tell your hosting provider instead. If you are with luroConnect, all our plans include “BOT Blockers”. Just let us know and we will match the real HTTP User-Agent header and block it.

What would be the best way to structure a robots.txt?

There is no significance to Allow! Anything you do not explicitly disallow is crawlable by default, so a minimal structure looks like this:

User-agent: *
Disallow: /<a user-visible path you don’t want crawled, e.g. product compare>

User-agent: bingbot
Crawl-delay: 1

Sitemap: <sitemapurl>

Note, though, that Google suggests using noindex on the pages you put in Disallow.
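A quick way to verify that a page actually carries noindex is to check both the X-Robots-Tag response header and the robots meta tag. The sketch below is only an illustration using Python’s standard library; the URL is a placeholder and the meta-tag check is deliberately rough.

import urllib.request

url = "https://www.example.com/some-page/"  # placeholder URL
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    # noindex can be served as an HTTP response header...
    x_robots = resp.headers.get("X-Robots-Tag", "")
    body = resp.read().decode("utf-8", errors="replace").lower()

# ...or as a robots meta tag inside the HTML.
print("noindex via X-Robots-Tag header:", "noindex" in x_robots.lower())
print("noindex via meta tag (rough check):", 'name="robots"' in body and "noindex" in body)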

Another interesting fact: Google documents that it caches robots.txt for 24 hours, essentially meaning that cache-control headers are ignored by Google. However, a browser does cache robots.txt, so when checking the current robots.txt of a website as Google will see it, please disable your browser cache.

So, why a note on robots.txt on a managed hosting blog?

We are a hosting platform that understands the importance of SEO and has acted on it.
We know and keep track of robots.txt, sitemaps, and feeds for all our customers.

We also feature a very strong BOT Blocker: regular-expression matching on any user agent string, identification and blocking of Googlebot imposters, and basic-authentication password protection on staging and dev sites to ensure they don't get crawled.

If you are SEO knowledgeable and do not find this recommendation acceptable, we encourage you to put your comments on this blog. We will acknowledge and publish all updates we accept.

404s when moving to Magento 2

When migrating platforms, for example from WooCommerce to Magento or from Magento 1 to Magento 2, it is imperative that all URLs are carried over, to ensure SEO authority is retained after the move and real users get a smooth Customer Experience (CX).

If it is not possible to maintain the same URLs, ensure redirects are put in place. This is especially true of a move from WooCommerce, and also true when the store is reorganized as part of the move. When migrating data, an automated custom URL redirect process is recommended.
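As an example of such an automated process, the sketch below reads a hypothetical redirects.csv of old and new paths collected during migration and emits nginx-style 301 rules; the file names and CSV format are assumptions for illustration, not part of any platform’s tooling.

import csv

# Hypothetical input: one "old_path,new_path" pair per line,
# e.g. /shop/blue-widget,/blue-widget.html
with open("redirects.csv", newline="") as src, open("redirects.conf", "w") as out:
    for old_path, new_path in csv.reader(src):
        # One permanent (301) redirect per migrated URL.
        out.write(f"location = {old_path} {{ return 301 {new_path}; }}\n")

print("Wrote redirects.conf; include it inside the site's nginx server block.")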

In spite of the best planning and intentions, it is likely that some URLs will be missed and redirects will not be set up.

Failing to do so has two negative effects:

  • BOTS, especially Googlebot, may visit the new site with an old URL and receive a 404. This can reduce your site's rating on Google.
  • Users who arrive via a bookmark or browser history may get a 404 and a bad CX, resulting in visitors bouncing and reduced brand value.

Google Analytics does not show 404s. Google Search Console may show them if the count is high. Neither is a reliable source for knowing what was missed.

Only the server knows for sure that it served a 404, and it will have logged it in the Apache or nginx access and/or error log files. On a new migration we recommend automating a 404 report from the log file, at least once a day.
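Here is a minimal sketch of such a report, assuming the combined log format that Apache and nginx use by default; the log path is a placeholder. Run it from cron once a day and send the output to whoever owns the redirects.

import re
from collections import Counter

# Combined log format: ... "METHOD /path HTTP/x.x" STATUS SIZE ...
line_re = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) ')

counts = Counter()
with open("/var/log/nginx/access.log") as log:  # placeholder path
    for line in log:
        m = line_re.search(line)
        if m and m.group("status") == "404":
            counts[m.group("path")] += 1

# Top 20 missing URLs: prime candidates for 301 redirects.
for path, hits in counts.most_common(20):
    print(f"{hits:6d}  {path}")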

If you use our luroConnect / Edge product, the dashboard can be used to set up a real-time alert for any 404 returned. This can then be actioned by developers or the agency to add a redirect.

Watch our webinar on performance and scaling in Magento

It’s free!

Using an analogy to vehicular traffic, we explain performance and scaling in Magento.
Key takeaways

  • Know how to compare hosting options
  • Importance of good code
  • How to scale
  • Tuning Magento

We can analyze your site for free

Schedule a call

Not happy with your website performance and want an expert to look at it?

  • We will analyze your site using public information.
  • We will ask you to give us a one-day web server log file.
  • We will try to identify what steps, if any, you should take to meet your site’s performance goals.