ApacheCon: LinkRot

Sander van Zoest started off by describing three common causes of link rot:

  • Redesign/reorganize your website
  • Switch dynamic page language (for example, from JSP to PHP)
  • Typos (user hand-edits URL and makes a mistake)

Consequences? Link rot can be distilled down to one thing: 404 == bad user experience.

van Zoest spoke about ways of detecting and discovering link rot in an automated manner, and about Apache features you can use to avoid the problem: the Redirect directive, the mod_rewrite module, and a PHP or CGI page set as the ErrorDocument for 404 that tries to dynamically redirect the old URL to its new location.
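
The ErrorDocument 404 idea can be sketched as a small lookup that the handler runs before giving up. This is only a rough illustration, written in Python rather than the PHP or CGI mentioned in the talk; the table of moved pages, the .jsp-to-.php rule, and the function name resolve_moved_url are all hypothetical and not taken from the slides.

```python
from typing import Optional

# Hypothetical table of pages moved during a redesign: old path -> new path.
MOVED = {
    "/products/list.jsp": "/catalog/",
    "/about_us.html": "/about/",
}

# Hypothetical extension change after switching the dynamic page language.
OLD_EXT, NEW_EXT = ".jsp", ".php"


def resolve_moved_url(path: str) -> Optional[str]:
    """Return the new location for a path that produced a 404, or None."""
    # 1. Exact match in the table of known moves.
    if path in MOVED:
        return MOVED[path]
    # 2. Same path, but with the new extension.
    if path.endswith(OLD_EXT):
        return path[: -len(OLD_EXT)] + NEW_EXT
    # 3. Nothing known about this path; the caller should serve a real 404.
    return None
```

A handler built on this would answer with a 301 and a Location header when the function returns a path, and fall back to an ordinary 404 page when it returns None. Apache passes the originally requested URL to a local ErrorDocument script in REDIRECT_-prefixed environment variables such as REDIRECT_URL.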

The HTTP Content-Location header (not to be confused with the HTTP Location header) can be used to specify the permanent archive location of the current content. Useful for time-sensitive information, but user agents don’t really take advantage of this metadata.
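
As a rough illustration of the Content-Location idea, the sketch below builds response headers for a page that always shows the latest content while pointing at a dated archive copy. The /news/YYYY/MM/DD URL scheme and the function name headers_for_latest are assumptions made for this example, not something shown in the talk.

```python
from datetime import date
from typing import Dict


def headers_for_latest(today: date) -> Dict[str, str]:
    """Headers for a hypothetical 'latest news' page.

    The page itself lives at a stable URL; Content-Location names the
    dated archive URL where this particular content will stay.
    """
    return {
        "Content-Type": "text/html; charset=utf-8",
        # Assumed archive layout: /news/<year>/<month>/<day>
        "Content-Location": f"/news/{today:%Y/%m/%d}",
    }
```

Called with date(2003, 11, 18), this yields a Content-Location of /news/2003/11/18.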

van Zoest spent a few slides on things to avoid in URLs. For example, query strings (the key=value pairs after the question mark) make your pages less indexable by search engines, and you can often use Path Info instead. You can also keep extensions such as .php out of URLs using techniques like Options +MultiViews, DefaultType, and ForceType.
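
To make the Path Info alternative concrete: for a request like /article/2003/11/linkrot handled by a script at /article, the server hands the script /2003/11/linkrot as PATH_INFO, and the script maps the segments onto the parameters it would otherwise have read from a query string. The sketch below assumes that URL layout and those parameter names; both are hypothetical, as is the helper name parse_path_info.

```python
from typing import Dict, Sequence
from urllib.parse import unquote


def parse_path_info(path_info: str, keys: Sequence[str]) -> Dict[str, str]:
    """Map positional PATH_INFO segments onto named parameters.

    Empty segments (leading, trailing, or doubled slashes) are ignored,
    and segments beyond the number of keys are dropped.
    """
    segments = [unquote(s) for s in path_info.split("/") if s]
    return dict(zip(keys, segments))
```

With keys ("year", "month", "slug"), the path /2003/11/linkrot carries the same information as ?year=2003&month=11&slug=linkrot, without the query string.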

In the future, Apache 2.0 could provide a map_to_storage hook which should help to make the URL-to-file system mapping less tightly coupled.