OVERVIEW
In a default WordPress installation, a robots.txt file does not exist in your web site’s root directory. WordPress dynamically generates robots.txt content for your web site when it receives a request for the robots.txt file. If you want to modify the contents of robots.txt, the usual advice given is to create a robots.txt file in the root directory of your web site. This article will show you how to modify the dynamically generated contents without using a plugin or uploading a static robots.txt file.
BACKGROUND ON ROBOTS.TXT IN WORDPRESS
Why would you want to modify the dynamically generated robots.txt content instead of uploading a real robots.txt file to your web site?
- SEO plugins might dynamically generate robots.txt content to optimize how search engines index your web site.
- WordPress changes the content of robots.txt based on the Settings >> Reading >> Search engine visibility setting. For example…
  - Search engine visibility enabled:
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
  - Search engine visibility disabled:
    User-agent: *
    Disallow: /
Creating a robots.txt file in your web site’s root directory will override all dynamically generated content.
Doing it “The WordPress Way” by hooking into the robots.txt content generation allows you to make modifications while preserving other dynamically generated content you may want to keep.
HOW TO MODIFY ROBOTS.TXT WITH PHP
In the code sample below, a closure is hooked into the WordPress ‘robots_txt’ filter. The priority parameter of the WordPress Filter API function add_filter() is set to 99 to help ensure that this closure executes after WordPress and other plugins have made their modifications (lower numbers execute sooner; the default value is 10). Add this code to the functions.php file of your WordPress theme.
/**
 * Restore functionality lost in WP5.3+
 * Remove "Allow: /wp-admin/admin-ajax.php".
 * Add "Disallow: /wp-login.php".
 * Add Disallow for some media file types.
 * Add a "Sitemap:" link.
 */
add_filter( 'robots_txt', function ( $output, $public ) {
	/**
	 * Is site not public?
	 */
	if ( '0' == $public ) {
		/**
		 * If "Search engine visibility" is disabled,
		 * strongly tell all robots to go away.
		 */
		$output = "User-agent: *\nDisallow: /\nDisallow: /*\nDisallow: /*?\n";
	} else {
		/**
		 * Remove line that allows robots to access AJAX interface.
		 */
		$ajax_path = parse_url( admin_url( 'admin-ajax.php' ), PHP_URL_PATH );
		$robots    = preg_replace( '/Allow: ' . preg_quote( $ajax_path, '/' ) . '\n/', '', $output );
		/**
		 * If no error occurred, replace $output with modified value.
		 */
		if ( null !== $robots ) {
			$output = $robots;
		}
		/**
		 * Add a new disallow for the login page.
		 * parse_url() keeps only the path portion of the login URL.
		 */
		$output .= 'Disallow: ' . parse_url( site_url( 'wp-login.php', 'login' ), PHP_URL_PATH ) . "\n";
		/**
		 * Disallow some file types.
		 */
		foreach ( [ 'jpeg', 'jpg', 'gif', 'png', 'mp4', 'webm', 'woff', 'woff2', 'ttf', 'eot' ] as $ext ) {
			$output .= "Disallow: *.{$ext}$\n";
		}
		/**
		 * Calculate and add a "Sitemap:" link if it doesn't exist.
		 * Modify as needed.
		 */
		if ( false === stripos( $output, 'Sitemap: ' ) ) {
			$output .= 'Sitemap: ' . home_url( 'sitemap_index.xml' ) . "\n";
		}
	}
	return $output;
}, 99, 2 ); // Priority 99, Number of Arguments 2.
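To see what the filter above produces without a live WordPress install, the same transformations can be walked through on WordPress’ default robots.txt body in plain PHP. The paths and the sitemap URL are hard-coded here purely for illustration; the real code derives them with admin_url(), site_url(), and home_url().

```php
<?php
// The robots.txt body WordPress generates by default.
$output = "User-agent: *\n"
        . "Disallow: /wp-admin/\n"
        . "Allow: /wp-admin/admin-ajax.php\n";

// Step 1: strip the "Allow:" line for admin-ajax.php.
$ajax_path = '/wp-admin/admin-ajax.php'; // illustrative; use admin_url() in WordPress
$robots    = preg_replace( '/Allow: ' . preg_quote( $ajax_path, '/' ) . '\n/', '', $output );
if ( null !== $robots ) {
	$output = $robots;
}

// Step 2: disallow the login page.
$output .= "Disallow: /wp-login.php\n";

// Step 3: disallow some media file types (Google-style wildcards).
foreach ( [ 'jpg', 'png' ] as $ext ) {
	$output .= "Disallow: *.{$ext}$\n";
}

// Step 4: append a "Sitemap:" link if none is present.
if ( false === stripos( $output, 'Sitemap: ' ) ) {
	$output .= "Sitemap: https://example.com/sitemap_index.xml\n";
}

// Result: User-agent, Disallow: /wp-admin/, Disallow: /wp-login.php,
// the two media-type Disallows, and the Sitemap line; the Allow line is gone.
echo $output;
```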
NOTES
The draft standard for robots.txt files was never accepted as a standard. The draft expired on June 4, 1997, and no work is being done to standardize it. Most commercial web indexing companies consider the expired draft a de facto standard. But do not use robots.txt files with the expectation that anyone or anything will honor or understand any of its directives.
Some search engine companies have added their own extensions to the expired draft standard. For example, Google has added support for a * wildcard character, a regular expression style $ end-of-line character, and a Sitemap location directive.
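For example, rules using those extensions might look like the following (illustrative only; the URL is a placeholder, and crawlers that do not implement the extensions may ignore or misread these lines):

```
User-agent: Googlebot
Disallow: /*?
Disallow: /*.pdf$
Sitemap: https://example.com/sitemap_index.xml
```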
As of WordPress version 5.3.0, WordPress generates the following robots.txt content regardless of what the “Search Engine Visibility” option is set to:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The code that used to set the robots.txt content to “Disallow: /” when “Search Engine Visibility” is turned off has been removed from the do_robots( ) function. Instead, if Search Engine Visibility is disabled, WordPress uses the wp_no_robots( ) function to add <meta name=’robots’ content=’noindex,follow’ /> to the header of all web pages. This happens only for client requests that return HTML content.
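For context, here is a simplified sketch (not the actual core source) of what do_robots( ) has done since 5.3: it always emits the same body and leaves any visibility handling to ‘robots_txt’ filters, which receive the blog_public option as their second argument. The stub functions below stand in for the WordPress APIs so the sketch runs on its own.

```php
<?php
// Minimal stand-ins for the WordPress filter API (assumptions for this sketch).
$filters = [];
function add_filter_stub( $name, $cb ) {
	global $filters;
	$filters[ $name ][] = $cb;
}
function apply_filters_stub( $name, $value, ...$args ) {
	global $filters;
	foreach ( $filters[ $name ] ?? [] as $cb ) {
		$value = $cb( $value, ...$args );
	}
	return $value;
}

// Sketch of do_robots() in WP 5.3+: the body no longer depends on the
// "Search Engine Visibility" option; any filters decide what to change.
function do_robots_sketch( $blog_public ) {
	$output  = "User-agent: *\n";
	$output .= "Disallow: /wp-admin/\n";
	$output .= "Allow: /wp-admin/admin-ajax.php\n";
	return apply_filters_stub( 'robots_txt', $output, $blog_public );
}

// A filter like the ones in this article can restore the old behavior.
add_filter_stub( 'robots_txt', function ( $output, $public ) {
	return ( '0' == $public ) ? "User-agent: *\nDisallow: /\n" : $output;
} );

echo do_robots_sketch( '0' ); // prints "User-agent: *" then "Disallow: /"
```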
Since it is possible to change the default WordPress URLs, it is safer to use WordPress functions such as site_url() and admin_url(), as shown in the code, than to hard-code paths.
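A quick illustration of the difference (example.com and the /blog subdirectory are hypothetical): site_url( ‘wp-login.php’ ) reflects where WordPress actually lives, and parse_url() reduces that URL to the path a robots.txt rule needs.

```php
<?php
// Hypothetical return values of site_url( 'wp-login.php' ) for a
// root install and a subdirectory install.
$root_install   = 'https://example.com/wp-login.php';
$subdir_install = 'https://example.com/blog/wp-login.php';

// parse_url() with PHP_URL_PATH keeps only the path component,
// which is what a robots.txt "Disallow:" rule expects.
echo parse_url( $root_install, PHP_URL_PATH ), "\n";   // /wp-login.php
echo parse_url( $subdir_install, PHP_URL_PATH ), "\n"; // /blog/wp-login.php
```

A hard-coded “Disallow: /wp-login.php” would silently miss the subdirectory install.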
Meta tags are not added to AJAX or REST API output, two interfaces that search engines access with increasing regularity. For AJAX and REST API responses, WordPress instead sends an HTTP response header of “X-Robots-Tag: noindex, nofollow” to try to prevent robots from indexing this content.
Use the following code to replace the lost functionality and add the X-Robots-Tag header if the web site is not public. This code also treats web sites with WP_DEBUG set to true as non-public…
/**
 * Restore functionality lost in WP5.3+
 * Add Disallow for some media file types.
 * Remove "Allow: /wp-admin/admin-ajax.php".
 * Add "Disallow: /wp-login.php".
 * Add "Disallow: /wp-admin/admin-ajax.php".
 * Add a "Sitemap:" link.
 * Treat WP_DEBUG==true as a non-public web site.
 */
add_filter( 'robots_txt', function ( $output, $public ) {
	/**
	 * Is site not public, or in debug mode?
	 */
	if ( '0' == $public || ( defined( 'WP_DEBUG' ) && true == WP_DEBUG ) ) {
		/**
		 * If "Search engine visibility" is disabled,
		 * strongly tell all robots to go away.
		 */
		$output = "User-agent: *\nDisallow: /\nDisallow: /*\nDisallow: /*?\n";
	} else {
		/**
		 * Disallow some file types.
		 */
		foreach ( [ 'jpeg', 'jpg', 'gif', 'png', 'mp4', 'webm', 'woff', 'woff2', 'ttf', 'eot' ] as $ext ) {
			$output .= "Disallow: *.{$ext}$\n";
		}
		/**
		 * Get login and AJAX endpoint paths.
		 */
		$login_path = parse_url( site_url( 'wp-login.php', 'login' ), PHP_URL_PATH );
		$ajax_path  = parse_url( admin_url( 'admin-ajax.php' ), PHP_URL_PATH );
		/**
		 * Remove line that allows robots to access AJAX interface.
		 */
		$robots = preg_replace( '/Allow: ' . preg_quote( $ajax_path, '/' ) . '\n/', '', $output );
		/**
		 * If no error occurred, replace $output with modified value.
		 */
		if ( null !== $robots ) {
			$output = $robots;
		}
		/**
		 * Add new disallows.
		 */
		$output .= "Disallow: {$login_path}\n";
		$output .= "Disallow: {$ajax_path}\n";
		/**
		 * Calculate and add a "Sitemap:" link if it doesn't exist.
		 * Modify as needed.
		 */
		if ( false === stripos( $output, 'Sitemap: ' ) ) {
			$output .= 'Sitemap: ' . home_url( 'sitemap_index.xml' ) . "\n";
		}
	}
	return $output;
}, 99, 2 ); // Priority 99, Number of Arguments 2.
/**
 * Send "X-Robots-Tag: noindex, nofollow" header if not a public web site.
 * If WP_DEBUG is true, treat the web site as if it is non-public.
 */
add_action( 'send_headers', function () {
	if ( '0' == get_option( 'blog_public' ) || ( defined( 'WP_DEBUG' ) && true == WP_DEBUG ) ) {
		/**
		 * Tell robots not to index or follow.
		 * Set the header replace parameter to true.
		 */
		header( 'X-Robots-Tag: noindex, nofollow', true );
	}
}, 99 ); // Try to execute last with priority set to 99.
WEB SERVER LEVEL NO-INDEX COMMANDS
The code samples above only help for requests whose output WordPress generates. They will not send the X-Robots-Tag header on requests for files you uploaded, such as images, PDF files, et cetera.
To set the X-Robots-Tag to “noindex, nofollow” for all responses on an Apache web server, you can add the following to the top of the .htaccess file in the root directory of your web site…
# Tell robots to go away
Header set X-Robots-Tag "noindex, nofollow"
You can also set the X-Robots-Tag only on specific files or URIs. For example, add the following directives to the top of the .htaccess file in your web site’s root directory. This will tell compliant bots not to index the WordPress AJAX endpoint…
# Tell robots not to index /wp-admin/admin-ajax.php
<Files "admin-ajax.php">
Header set X-Robots-Tag "noindex, nofollow"
</Files>
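A <FilesMatch> block can likewise cover several file extensions with one regular expression, mirroring the media-type disallows in the PHP examples above (a sketch; adjust the extension list as needed, and note that the Header directive requires mod_headers to be enabled):

```
# Tell robots not to index common media files
<FilesMatch "\.(jpe?g|gif|png|mp4|webm|woff2?|ttf|eot)$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```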
To get the same effect with an nginx web server, use an add_header directive in the server block of your nginx configuration.
add_header X-Robots-Tag "noindex, nofollow";
For Microsoft IIS web servers, use a <customHeaders> section inside <system.webServer> in your web.config file.
<httpProtocol>
	<customHeaders>
		<add name="X-Robots-Tag" value="noindex, nofollow" />
	</customHeaders>
</httpProtocol>