How to modify the WordPress generated robots.txt file with PHP code

OVERVIEW

In a default WordPress installation, no robots.txt file exists in your web site’s root directory. Instead, WordPress dynamically generates robots.txt content each time it receives a request for the file. The advice most commonly given for customizing robots.txt is to create a static robots.txt file in the root directory of your web site. This article will show you how to modify WordPress’ dynamically generated robots.txt content without using a plugin or uploading a static robots.txt file.

BACKGROUND

Why would you want to modify the dynamically generated robots.txt content instead of uploading a real robots.txt file to your web site?

  • SEO plugins might dynamically generate robots.txt content to optimize how search engines index your web site.
  • WordPress changes the content of robots.txt based on the Settings >> Reading >> Search engine visibility setting. For example…
    • Search engine visibility enabled:
      User-agent: *
      Disallow: /wp-admin/
      Allow: /wp-admin/admin-ajax.php
    • Search engine visibility disabled:
      User-agent: *
      Disallow: /

    Creating a static robots.txt file in your web site’s root directory overrides all of this dynamically generated content.

Doing it “The WordPress Way”, by hooking into the robots.txt content generation, allows you to make modifications while preserving any other dynamically generated content you want to keep.


THE SOLUTION

In the code sample below, a closure function is hooked into the WordPress ‘robots_txt’ filter. The priority parameter for the WordPress Filter API function add_filter( ) is set to 99 to help ensure that this function executes after WordPress and other plugins have made their modifications (lower numbers execute sooner; the default value is 10). Add this code to the functions.php file of your WordPress theme.

/**
 * Add Disallow for some file types.
 * Add "Disallow: /wp-login.php\n".
 * Remove "Allow: /wp-admin/admin-ajax.php\n".
 * Calculate and add a "Sitemap:" link.
 */
add_filter( 'robots_txt', function( $output, $public ) {
	/**
	 * Only modify the output if "Search engine visibility"
	 * is enabled (i.e. the web site is public).
	 */
	if ( '0' !== $public ) {
		/**
		 * Disallow some file types
		 */
		foreach ( [ 'jpeg', 'jpg', 'gif', 'png', 'mp4', 'webm', 'woff', 'woff2', 'ttf', 'eot' ] as $ext ) {
			$output .= "Disallow: /*.{$ext}$\n";
		}

		/**
		 * Get site path.
		 */
		$site_url = parse_url( site_url() );
		$path	  = ( ! empty( $site_url[ 'path' ] ) ) ? $site_url[ 'path' ] : '';

		/**
		 * Add new disallow.
		 */
		$output .= "Disallow: $path/wp-login.php\n";

		/**
		 * Remove line that allows robots to access AJAX interface.
		 */
		$robots = preg_replace( '/Allow: [^\0\s]*\/wp-admin\/admin-ajax\.php\n/', '', $output );

		/**
		 * If no error occurred, replace $output with modified value.
		 */
		if ( null !== $robots ) {
			$output = $robots;
		}

		/**
		 * Calculate and add a "Sitemap:" link.
		 * Modify as needed.
		 */
		$output .= "Sitemap: {$site_url[ 'scheme' ]}://{$site_url[ 'host' ]}/sitemap_index.xml\n";
	}

	return $output;

}, 99, 2 );  // Priority 99, Number of Arguments 2.

NOTES

The original draft standard for robots.txt files expired on June 4, 1997, without ever being accepted as a standard, and for decades no standardization work was done. (The Robots Exclusion Protocol was eventually published as RFC 9309 in September 2022, but compliance remains voluntary.) Commercial web indexing companies treat the format as a de-facto standard, but they are not obligated to honor it. Do not use robots.txt files with the expectation that any of their directives will be honored or understood by anyone or anything.

Some search engine companies have added their own extensions to the expired draft standard. For example, Google has added support for a * wildcard character, the regular expression style $ end-of-line character, and a Sitemap location directive.
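For example, a robots.txt file that relies on these extensions might look like the following sketch (example.com is a placeholder host):

```text
User-agent: *
Disallow: /*.pdf$
Sitemap: https://example.com/sitemap_index.xml
```

A crawler that implements only the original draft would treat “/*.pdf$” as a literal path prefix rather than a pattern, so verify the behavior per search engine before relying on it.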

As of WordPress version 5.3.0, the robots.txt content generated by WordPress is always the following, regardless of what the “Search engine visibility” option is set to:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The code that used to set the robots.txt content to “Disallow: /” when “Search engine visibility” was disabled has been removed from the do_robots( ) function. Instead, when Search engine visibility is disabled, WordPress adds <meta name='robots' content='noindex,follow' /> to the header of every web page through the wp_no_robots( ) function. This happens only when a client request returns HTML content.

Meta tags are not added to AJAX or REST API output, two interfaces that search engines are accessing with increasing regularity. For AJAX and REST API responses, WordPress instead sends an HTTP response header of “X-Robots-Tag: noindex, nofollow” to try to prevent robots from indexing this content.

Use the following code to replace the lost functionality and to add the X-Robots-Tag header when the web site is not public. It also treats a web site with WP_DEBUG set to true as non-public…

/**
 * Restore functionality lost in WP5.3+
 * Add Disallow for some file types.
 * Add "Disallow: /wp-login.php\n".
 * Remove "Allow: /wp-admin/admin-ajax.php\n".
 * Calculate and add a "Sitemap:" link.
 * Treat WP_DEBUG==true as a non-public web site.
 */
add_filter( 'robots_txt', function( $output, $public ) {
	/**
	 * If "Search engine visibility" is disabled
	 * (or WP_DEBUG is enabled),
	 * strongly tell all robots to go away.
	 */
	if ( '0' === $public || (
			defined( 'WP_DEBUG' ) && true === WP_DEBUG ) ) {
		$output = "User-agent: *\nDisallow: /\nDisallow: /*\nDisallow: /*?\n";
	} else {
		/**
		 * Disallow some file types
		 */
		foreach ( [ 'jpeg', 'jpg', 'gif', 'png', 'mp4', 'webm', 'woff', 'woff2', 'ttf', 'eot' ] as $ext ) {
			$output .= "Disallow: /*.{$ext}$\n";
		}

		/**
		 * Get site path.
		 */
		$site_url = parse_url( site_url() );
		$path	  = ( ! empty( $site_url[ 'path' ] ) ) ? $site_url[ 'path' ] : '';

		/**
		 * Add new disallow.
		 */
		$output .= "Disallow: $path/wp-login.php\n";

		/**
		 * Remove line that allows robots to access AJAX interface.
		 */
		$robots = preg_replace( '/Allow: [^\0\s]*\/wp-admin\/admin-ajax\.php\n/', '', $output );

		/**
		 * If no error occurred, replace $output with modified value.
		 */
		if ( null !== $robots ) {
			$output = $robots;
		}
		/**
		 * Calculate and add a "Sitemap:" link.
		 * Modify as needed.
		 */
		$output .= "Sitemap: {$site_url[ 'scheme' ]}://{$site_url[ 'host' ]}/sitemap_index.xml\n";
	}

	return $output;

}, 99, 2 );  // Priority 99, Number of Arguments 2.

/**
 * Send "X-Robots-Tag: noindex, nofollow" header if not a public web site. 
 * If WP_DEBUG is true, treat web site as if it is non-public.
 */
add_action( 'send_headers', function() {
	if ( '0' === get_option( 'blog_public' ) || (
			defined( 'WP_DEBUG' ) && true === WP_DEBUG ) ) {
		/**
		 * Tell robots not to index or follow
		 * Set header replace parameter to true
		 */
		header( 'X-Robots-Tag: noindex, nofollow', true );
	}
}, 99 );  // Try to execute last with priority set to 99

To make sure the X-Robots-Tag is set to “noindex, nofollow” for all responses, on an Apache web server (with mod_headers enabled), you can add the following to the .htaccess file in the root directory of your web site…

# Tell robots to go away
Header set X-Robots-Tag "noindex, nofollow"
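If your web server is nginx instead of Apache (nginx does not read .htaccess files), the equivalent is to add a header directive inside the appropriate server block of your nginx configuration; a minimal sketch, assuming the standard ngx_http_headers_module is available:

```nginx
# Tell robots to go away.
# The "always" flag also adds the header to error responses.
add_header X-Robots-Tag "noindex, nofollow" always;
```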