
How to keep your staging or development site out of the index

One of the most common technical SEO issues I come across is the inadvertent indexing of development servers, staging sites, test environments, or whatever other name you use for these pre-production areas.

There are a number of reasons this happens, ranging from people thinking no one would ever link to these areas to technical misunderstandings. These parts of the website are usually sensitive in nature, and having them in a search engine's index risks exposing planned campaigns, business intelligence, or private data.

How to tell if your dev server is being indexed

You can use Google search to determine if your staging site is being indexed. For instance, to locate a staging site, you might search Google for site:domain.com and look through the results or add operators like -inurl:www to remove any www.domain.com URLs. You can also use third-party tools like SimilarWeb or SEMrush to find the subdomains.
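As a rough illustration (domain.com standing in for your own domain, and "staging" being just one common subdomain name), queries along these lines can surface non-www or staging subdomains:

    site:domain.com -inurl:www
    site:domain.com inurl:staging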

There may be other sensitive areas that contain login portals or information not meant for public consumption. Beyond what various Google search operators can turn up (a practice also known as Google Dorking), websites tend to block these areas in their robots.txt files, telling you exactly where you shouldn't look. What could go wrong with telling people where to find the information you don't want them to see?
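As a hypothetical illustration (the paths here are made up), a robots.txt like this advertises exactly which directories the site owner would rather you not find:

    User-agent: *
    Disallow: /admin/
    Disallow: /staging/
    Disallow: /backups/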

There are many actions you can take to keep visitors and search engines off dev servers and other sensitive areas of the site. Here are the options:

Good: HTTP authentication

Anything you want to keep out of the index should sit behind server-side authentication. Requiring authentication for access is the preferred method of keeping out both users and search engines.
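As a minimal sketch, HTTP Basic authentication on nginx might look something like this (the hostname, realm text, and file path are placeholders; Apache offers the equivalent via AuthType Basic in an .htaccess file):

    # nginx: require a username/password for the whole staging host
    server {
        server_name staging.example.com;

        location / {
            auth_basic           "Staging - authorized users only";
            auth_basic_user_file /etc/nginx/.htpasswd;   # created with: htpasswd -c /etc/nginx/.htpasswd someuser
        }
    }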

Good: IP whitelisting

Allowing only known IP addresses — such as those belonging to your network, clients and so on — is another great step in securing your website and ensuring only those users who need to see the area of the website will see it.
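A minimal nginx sketch, using documentation IP ranges as placeholders for your own network and clients:

    # nginx: only allow requests from known addresses
    location / {
        allow 203.0.113.0/24;   # office or VPN range (placeholder)
        allow 198.51.100.7;     # a client's static IP (placeholder)
        deny  all;              # everyone else gets a 403
    }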

Maybe: Noindex in robots.txt

Noindex in robots.txt is not officially supported, but it may work to remove pages from the index. The problem I have with this method is that it still tells people where they shouldn’t look, and it may not work forever or with all search engines.

The reason I say this is a "maybe" is that it can work, and it can actually be combined with a disallow in robots.txt, unlike some other methods that don't work if you disallow crawling (which I will talk about later in this article).
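For what it's worth, the unofficial syntax people use mirrors the disallow directive; something like the following, with no guarantee that any search engine will honor the Noindex line:

    User-agent: *
    Disallow: /
    Noindex: /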

Maybe: Noindex tags

A noindex tag, either in the robots meta tag or as an X-Robots-Tag in the HTTP header, can help keep your pages out of the search results.
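A sketch of both forms. In the page's head:

    <meta name="robots" content="noindex">

Or as an HTTP response header added by the web server:

    X-Robots-Tag: noindex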

One issue I see with this is that it means more pages to be crawled by the search engines, which eats into your crawl budget. I typically see this tag used when there is also a disallow in the robots.txt file. If you’re telling Google not to crawl the page, then they can’t respect the noindex tag because they can’t see it.

Another common issue is that these tags may be applied on the staging site and then left on the page when it goes live, effectively removing that page from the index.

Maybe: Canonical

If you have a canonical set on your staging server that points to your main website, essentially all the signals should be consolidated correctly. There may be mismatches in content that could cause some issues, and as with noindex tags, Google will have to crawl additional pages. Webmasters also tend to add a disallow in the robots.txt file, so Google once again can’t crawl the page and can’t respect the canonical because they can’t see it.
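As a sketch (example.com and the path are placeholders), each staging page would point at its live counterpart:

    <link rel="canonical" href="https://www.example.com/some-page/" />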

You also risk these tags not changing when migrating from the staging server to live, which may cause the version you don't want shown to be treated as the canonical one.

Bad: Not doing anything

Not doing anything to prevent indexing of a staging site usually happens because someone assumes no one will ever link to it, so there's no need to do anything. I've also heard that Google will just "figure it out," but I wouldn't typically trust them with my duplicate content issues. Would you?

Bad: Disallow in robots.txt

This is probably the most common way people try to keep a staging site from being indexed. With the disallow directive in robots.txt, you're telling search engines not to crawl the page, but that doesn't keep them from indexing it. They know a page exists at that location and will still show it in the search results, even without knowing exactly what is there. They have hints, from links for instance, about the type of information on the page.
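The typical staging robots.txt looks like this, which blocks crawling of everything but does nothing to prevent indexing:

    User-agent: *
    Disallow: /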

When Google indexes a page that’s blocked from crawling, you’ll typically see the following message in search results: “A description for this result is not available because of this site’s robots.txt.”

If you recall from earlier, this directive will also prevent Google from seeing other tags on the page, such as noindex and canonical tags, because it prevents them from seeing anything on the page at all. You also risk not remembering to remove this disallow when taking a website live, which could prevent crawling upon launch.

What if you got something indexed by accident?

Crawling can take time depending on the importance of a URL (likely low in the case of a staging site). It may take months before a URL is re-crawled, so any block or issue may not be processed for quite a while.

If you got something indexed that shouldn’t be, your best bet is to submit a URL removal request in Google Search Console. This should remove it for around 90 days, giving you time to take corrective actions.

