How to protect your AEM instances from Google searches: Robots.txt

Statement: How to protect AEM instances from Google searches.


For example, a search engine query can list publicly accessible servers that have not removed the Geometrixx sample content. To avoid this:
-          First and foremost, as a best practice, all CQ5 author and publish servers should be placed behind a firewall and should not be publicly accessible.
-          Only your web server (dispatcher) should be in front of the firewall. If your author and publish servers are behind a firewall, Google has no way to index them.
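
One way to confirm that a server is not publicly accessible is to attempt a TCP connection from a machine outside the firewall. The sketch below is only an illustration: `is_port_open` is a hypothetical helper (not part of AEM), and the ports you check are an assumption about your setup (4502 appears later in this post as the local author port; 4503 is commonly used for publish).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from outside the firewall, a call such as `is_port_open("your-server-hostname", 4502)` should return False for every author and publish server. The host name here is a placeholder.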



If it is absolutely necessary for an author or publish server to be in front of a firewall, add a robots.txt file to the root directory /.
-          This file will prevent most search engines from displaying your server in search results.
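
To see what such a file tells a crawler, the rules can be evaluated with Python's standard `urllib.robotparser` module. This is a minimal sketch, assuming the two-line rule set used in this post; `is_allowed` is an illustrative helper and the URL in the usage note is only an example.

```python
from urllib.robotparser import RobotFileParser

# The rule set used in this post: applies to every crawler, disallows every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "Googlebot", "http://localhost:4502/content/geometrixx/en.html")` evaluates to False, because `Disallow: /` matches every path by prefix. Note that robots.txt is only a request: well-behaved crawlers honor it, but it does not restrict access to the server.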

Here are the steps for adding the file:

-          Navigate to CRXDE Lite at {server}/crx/de/ (make sure you’re logged in as admin)
-          Right-click on your root node, and go to Create … > Create File …

1.       Name the file robots.txt
2.       Place the following two lines in the file, and save it:
                  User-agent: *
                  Disallow: /
3.       Now grant the anonymous user read access to the file. To do this, navigate to the user admin section at {server}/useradmin (for example, localhost:4502/useradmin)
4.       Open the anonymous user, and click on the Permissions tab
5.       Grant read access to the robots.txt file, then click Save

-          Verify that the robots.txt file exists and is accessible by first logging out, then navigating to {server}/robots.txt (for example, localhost:4502/robots.txt)
-          If it’s there, search engines that honor robots.txt should no longer index your server
-          Repeat these actions for all author/publish servers that are publicly accessible.
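
The verification step can also be scripted, which helps when several servers need checking. The following is a rough sketch rather than a definitive check: `fetch_robots_txt` and `disallows_root` are hypothetical helper names, and the request is sent without credentials so that it behaves like the anonymous user.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(base_url: str, timeout: float = 5.0) -> str:
    """Fetch {base_url}/robots.txt without credentials and return its text."""
    url = base_url.rstrip("/") + "/robots.txt"
    # urlopen raises HTTPError on 4xx/5xx, e.g. if anonymous lacks read access.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def disallows_root(body: str) -> bool:
    """Return True if the text is a robots.txt that disallows "/" for a generic crawler."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    # A login page or other non-robots content yields no rules, so this returns False.
    return not parser.can_fetch("*", "/")
```

For a local instance, `disallows_root(fetch_robots_txt("http://localhost:4502"))` should return True once the steps above are complete. A False result or an HTTP error suggests the file is missing or not readable by the anonymous user.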

Robots.txt related findings

Finding ID
Total risk
Effort to Fix
Enable robots.txt on production author and publish servers