Theo and George Schlossnagle gave a two-hour talk covering a hodge-podge of topics for scaling large websites. I'll split my notes into two blog posts.
Low-hanging fruit: Apache 1.3 optimizations
First, George pointed out one of the easiest tricks to optimize a large website: turn KeepAlive Off. No surprise here; Yahoo! has been doing this for a long time. It's very resource-intensive to keep Apache children around for each of your clients, and heavy-traffic sites can't afford to do this (even if it makes the client experience marginally faster).
[Editorial comment: proponents of KeepAlive often point out that the overhead of establishing 13 TCP connections to fetch an HTML page plus a dozen images really sucks when you could simply reuse a single TCP connection. However, on really large sites the images are often served from a completely different host (e.g. images.amazon.com or us.yimg.com), so running KeepAlive on the dynamic HTML machine is pointless. Users usually spend several seconds reading a page before clicking on another link, so the kept-alive connection would just sit there tying up an Apache child.]
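Turning keep-alives off is a one-line change in httpd.conf:

    # httpd.conf
    KeepAlive Off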
Next, he pointed out that you should set SendBufferSize to your maximum page size (something like 40K, or however large your HTML pages tend to be). That way each page effectively goes out in a single write() call, so your web app (running dynamic stuff like PHP or mod_perl) never blocks waiting for the client to consume the data.
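For example, if your generated pages top out around 40 KB (the directive takes bytes):

    # httpd.conf -- 40 KB send buffer; adjust to your typical page size
    SendBufferSize 40960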
Use gzip compression as a transfer encoding. You spend more CPU cycles but save on bandwidth. For a large website, bandwidth is far more expensive than buying more CPU power.
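On Apache 1.3 the usual way to do this was mod_gzip (mod_deflate plays the same role on Apache 2.x). A minimal sketch, assuming you only want to compress textual responses:

    # httpd.conf -- mod_gzip (Apache 1.3); adjust the include rules to taste
    mod_gzip_on            Yes
    mod_gzip_item_include  mime  ^text/.*
    mod_gzip_item_include  file  \.html$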
Don’t use Apache 1.3 for static content: use Apache 2.0, thttpd, or tux. In other words, use the right tool for the job.
George went into more detail about setting up reverse proxies (also known as HTTP accelerators) in front of your dynamic content servers so that slow clients don't tie up heavyweight application processes. He discussed a couple of different approaches using mod_proxy in conjunction with mod_rewrite and mod_backhand.
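The simplest flavor is a stripped-down Apache in front that proxies every request through to the application server. A minimal sketch using mod_rewrite's proxy flag (the backend hostname and port here are made up):

    # httpd.conf on the front-end proxy
    RewriteEngine On
    RewriteRule   ^/(.*)$  http://app-backend.internal:8080/$1  [P,L]
    ProxyPassReverse  /    http://app-backend.internal:8080/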
Application caching
George made a compelling argument that commercial caching appliances can never do as good a job of caching your data as you can do yourself with tools like Squid. He pointed out the tradeoffs between black-box caching products (which don't require any changes to your application) and application-integrated caching (highly efficient, but requires rewriting your app).
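If you go the do-it-yourself route, the classic setup is Squid running as an httpd accelerator in front of your application servers. A minimal sketch using the Squid 2.x-era directives (hostname and port are made up; later Squid versions configure acceleration differently):

    # squid.conf (Squid 2.x accelerator mode)
    http_port 80
    httpd_accel_host app-backend.internal
    httpd_accel_port 8080
    httpd_accel_single_host on
    httpd_accel_uses_host_header off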
Application-integrated caching can use a convenient shared storage system (like an NFS-mounted disk), or it can write to more efficient local storage on each host and use some sort of messaging system to tell your server pool when caches need to be dirtied or invalidated. This is not terribly difficult to do in PHP or Perl using Spread and the XML-RPC hooks (we saw 4 or 5 slides on how to implement all of it).
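As a rough illustration of the local-storage flavor, here is a minimal PHP sketch. The cache directory, key names, and values are all hypothetical, and the Spread/XML-RPC broadcast that tells the rest of the pool to invalidate is only stubbed out as a comment:

    <?php
    // Minimal local-disk cache sketch (hypothetical paths and key names).
    $cache_dir = '/var/cache/myapp';   // assumes this directory exists and is writable

    function cache_get($key, $max_age = 300) {
        global $cache_dir;
        $file = $cache_dir . '/' . md5($key);
        // Serve from the local cache if the entry exists and is fresh enough.
        if (file_exists($file) && (time() - filemtime($file)) < $max_age) {
            return unserialize(file_get_contents($file));
        }
        return false;
    }

    function cache_set($key, $value) {
        global $cache_dir;
        $file = $cache_dir . '/' . md5($key);
        // Write to a temp file and rename so readers never see partial data.
        $tmp = $file . '.tmp.' . getmypid();
        file_put_contents($tmp, serialize($value));
        rename($tmp, $file);
    }

    function cache_invalidate($key) {
        global $cache_dir;
        @unlink($cache_dir . '/' . md5($key));
        // In the scheme from the talk you would also broadcast this key to the
        // rest of the pool (e.g. over Spread or XML-RPC) so every host drops
        // its own local copy.
    }

    // Typical usage: fall back to the expensive work only on a cache miss.
    if (($stories = cache_get('front_page_stories')) === false) {
        $stories = array('story 1', 'story 2');   // stand-in for the expensive DB query
        cache_set('front_page_stories', $stories);
    }
    ?>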