For (literally) years, I've been pulling monthly web server logs from my server with a simple script that I run by hand. I have been using the scp -C option, which is supposed to compress the data before sending it across the wire. I'm sure it does, but it doesn't seem very effective. Have a look at these timings:
$ time scp -C occam.hewgill.net:logs/hewgill.com/access_log access_log
    4882.68 real        29.09 user        20.99 sys
$ time ssh occam.hewgill.net 'bzip2 -c logs/hewgill.com/access_log' >/dev/null
     859.47 real         0.50 user         0.22 sys
$ time ssh occam.hewgill.net 'gzip -c logs/hewgill.com/access_log' >/dev/null
     108.90 real         0.67 user         0.33 sys
These numbers are all counterintuitive. I would have expected scp -C to perform better: the man page claims it uses the gzip algorithm, so why is it so much slower here? I would also have expected bzip2 to be faster overall than gzip, since there would be less data to transfer; in reality, the server couldn't compress the data as fast as the network could carry it, so the transfer became CPU-bound rather than bandwidth-bound.
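The gzip timing suggests the faster approach: compress the stream on the server and decompress it locally, something like `ssh occam.hewgill.net 'gzip -c logs/hewgill.com/access_log' | gzip -dc > access_log`. Since I can't show the remote half here, this is a minimal local stand-in for that pipeline (the file name is made up for illustration):

```shell
# Local stand-in for the remote pipeline: gzip compresses a stream on one
# side of the pipe, gzip -dc restores it byte-for-byte on the other. Over a
# network, the compressing half of the pipe would run under ssh on the server.
seq 1 10000 > sample_log                       # stand-in for access_log
gzip -c sample_log | gzip -dc > restored_log   # compress, stream, decompress
cmp -s sample_log restored_log && echo "round-trip identical"
```

The same shape works with bzip2 (`bzip2 -c` on the server, `bzip2 -dc` locally); only the compressor changes.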
This month's access_log file is just over 250 MB. The gzip-compressed version is 6.2 MB; the bzip2-compressed version is 3.7 MB. It doesn't pay to wait nearly an order of magnitude longer for less than a factor-of-two improvement in size.
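Plugging in the figures above (sizes in MB, real times in seconds) makes the trade-off concrete:

```shell
# Quick arithmetic on the numbers above: overall compression ratios, plus
# the bzip2-vs-gzip trade-off in size and time.
awk 'BEGIN {
  printf "gzip ratio:      %.0fx\n", 250 / 6.2        # log shrinks ~40x
  printf "bzip2 ratio:     %.0fx\n", 250 / 3.7        # log shrinks ~68x
  printf "size advantage:  %.2fx\n", 6.2 / 3.7        # bzip2 beats gzip by <2x
  printf "time penalty:    %.1fx\n", 859.47 / 108.90  # but takes ~8x longer
}'
```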
This exercise has just reinforced the point that it pays to actually measure performance in different situations, instead of going with what "feels" right. Premature optimization is the root of all evil [Hoare].