Excluding non-HTML files from Google

Google has announced a new way of excluding non-HTML files from its search results (27 July 2007). We already have robots.txt, where we can exclude files, and meta robots tags in HTML code, where we can tell Google to keep a page out of its index. But what about non-HTML documents? They have no HTML head section in which to put a meta robots tag.

Google's extension to the Robots Exclusion Protocol lets you send robots instructions in the HTTP headers used to serve a file.

These HTTP header directives can be sent with PDF, video, Word, Excel, XML and Flash files, and with many other non-HTML file types.

Google's announcement explains: "We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file." Here are some illustrative examples:

* Don't display a cache link or snippet for this item in the Google search results:
X-Robots-Tag: noarchive, nosnippet

* Don't include this document in the Google search results:
X-Robots-Tag: noindex

* Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:
X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT
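
The header values listed above can also be assembled in code rather than typed by hand. The following is only a minimal sketch: buildRobotsHeaders() is a hypothetical helper name, not part of PHP or of any Google API, and it simply mirrors the date layout used in the unavailable_after example above.

```php
<?php
/**
 * Build X-Robots-Tag header lines (hypothetical helper for illustration).
 *
 * @param string[]               $directives       e.g. ['noarchive', 'nosnippet']
 * @param DateTimeInterface|null $unavailableAfter optional expiry moment
 * @return string[] header lines, each suitable for passing to header()
 */
function buildRobotsHeaders(array $directives, ?DateTimeInterface $unavailableAfter = null): array
{
    $lines = [];
    if ($directives !== []) {
        // Plain directives share one comma-separated header line.
        $lines[] = 'X-Robots-Tag: ' . implode(', ', $directives);
    }
    if ($unavailableAfter !== null) {
        // gmdate() formats the timestamp in GMT regardless of the server timezone.
        $date = gmdate('j M Y H:i:s', $unavailableAfter->getTimestamp());
        $lines[] = 'X-Robots-Tag: unavailable_after: ' . $date . ' GMT';
    }
    return $lines;
}

// Usage inside a page that serves a file:
//   foreach (buildRobotsHeaders(['noarchive', 'nosnippet']) as $line) {
//       header($line, false); // false = add the header instead of replacing an earlier one
//   }
```

Passing false as the second argument to header() matters when more than one X-Robots-Tag line is sent, because by default a later header with the same name replaces the earlier one.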

Sometimes you may not want to reveal the existence of certain files by listing them in robots.txt, which anyone can read. You may want the files to stay accessible on your website, for example so that you can email a link to selected people. It can also be easier to add robots exclusions on a file-by-file basis than to maintain an ever-expanding robots.txt file. If you add the HTTP header when you serve the file, then even if someone does happen to link to it, it will be excluded from the search results.
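
One way to manage file-by-file exclusions is a private lookup table that a download script consults before sending the file. The sketch below assumes such a script exists; the file names, the ROBOTS_RULES constant and robotsHeaderFor() are all made up for illustration. Unlike robots.txt, this table lives in server-side code, so visitors cannot read it.

```php
<?php
// Hypothetical per-file rules: file name => X-Robots-Tag directives.
const ROBOTS_RULES = [
    'price-list.pdf'   => ['noindex'],
    'draft-report.doc' => ['noarchive', 'nosnippet'],
];

/**
 * Return the X-Robots-Tag header line for a file, or null when no rule applies.
 */
function robotsHeaderFor(string $requestedName, array $rules = ROBOTS_RULES): ?string
{
    $name = basename($requestedName); // ignore any directory part of the request
    if (!isset($rules[$name])) {
        return null; // no exclusion configured for this file
    }
    return 'X-Robots-Tag: ' . implode(', ', $rules[$name]);
}

// Usage inside the download script:
//   $line = robotsHeaderFor($filename);
//   if ($line !== null) {
//       header($line);
//   }
```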

The following PHP snippet sends the X-Robots-Tag header and serves the file to the visitor under a different filename from the one it has on disk.

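// $filename is the name on disk; $different_filename is the name the visitor sees
header('X-Robots-Tag: noarchive, nosnippet'); // the Google robots instructions
header('Content-Type: application/msword'); // or application/zip, application/pdf, etc.
header('Content-Disposition: attachment; filename="' . $different_filename . '"');
readfile('/path/to/files/' . basename($filename)); // basename() strips any "../" from the name
exit; // stop here so nothing else is appended to the download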

See examples of HTTP headers on the Understanding HTTP Headers page.
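
To confirm that a file really is served with the header, you can inspect the response headers yourself. The sketch below is one possible check, not an official tool: extractRobotsDirectives() is a hypothetical helper, and it expects the list of raw header lines that PHP's get_headers() returns by default.

```php
<?php
/**
 * Pick the X-Robots-Tag values out of a list of raw response header lines.
 *
 * @param string[] $headerLines e.g. the array returned by get_headers($url)
 * @return string[] one entry per X-Robots-Tag header, value only
 */
function extractRobotsDirectives(array $headerLines): array
{
    $prefix = 'X-Robots-Tag:';
    $found = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, $prefix) === 0) {
            $found[] = trim(substr($line, strlen($prefix)));
        }
    }
    return $found;
}

// Usage (the URL is a placeholder):
//   $lines = get_headers('http://www.example.com/files/report.doc');
//   if ($lines !== false) {
//       print_r(extractRobotsDirectives($lines));
//   }
```

The values are returned whole rather than split on commas, since a date in an unavailable_after directive could itself contain a comma.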

Google PR Conservation and Robots exclusion

Adding a robots.txt or HTTP header exclusion will not conserve Google PageRank (PR). To conserve PR, the links themselves need to carry the rel=nofollow attribute. Otherwise, every linked file still gets PR assigned to it; it just may not be indexed because of the robots exclusions applied to it.
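
If your download links are generated by PHP, the rel=nofollow attribute described above can be added at the point where the link is written out. This is a minimal sketch; nofollowLink() is a hypothetical helper name and the path in the usage comment is a placeholder.

```php
<?php
/**
 * Render an HTML link that carries rel="nofollow".
 */
function nofollowLink(string $url, string $text): string
{
    return sprintf(
        '<a href="%s" rel="nofollow">%s</a>',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),  // escape &, quotes and angle brackets in the URL
        htmlspecialchars($text, ENT_QUOTES, 'UTF-8')  // escape the visible link text
    );
}

// Usage:
//   echo nofollowLink('/files/report.doc', 'Download the report');
```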