Greg Hewgill (ghewgill) wrote,

premature optimization

For (literally) years, I've been pulling monthly web server logs from my server with a simple script that I run by hand. I have been using the scp -C option, which is supposed to compress the data before sending it across the wire. I'm sure it does so, but it sure doesn't seem very effective. Have a look at these timings:

$ time scp -C server:logs/access_log .
     4882.68 real        29.09 user        20.99 sys
$ time ssh server 'bzip2 -c logs/access_log' >/dev/null
      859.47 real         0.50 user         0.22 sys
$ time ssh server 'gzip -c logs/access_log' >/dev/null
      108.90 real         0.67 user         0.33 sys

These numbers are all counterintuitive. I would have expected the scp -C option to perform better (the man page claims it uses the gzip algorithm, so why does it perform so badly in this case?). I would also have expected bzip2 to be faster overall than gzip, because there would be less data to transfer; in reality, the server couldn't compress the data as fast as the network could carry it.
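The compress-on-the-remote-end pattern used above can be checked end to end without a remote host by substituting a local pipe for the ssh transport. A minimal sketch, assuming only that gzip and gunzip are on the PATH; the generated file is a stand-in for the real access_log:

```shell
# Generate a stand-in for access_log (the real file is not reproduced here).
printf 'GET /index.html HTTP/1.1 200\n%.0s' $(seq 1 1000) > access_log.orig

# Stand-in for: ssh server 'gzip -c logs/access_log' | gunzip > access_log
# A local pipe replaces the ssh transport; the compress/decompress round
# trip is otherwise identical.
gzip -c access_log.orig | gunzip > access_log.copy

# Verify the round trip is lossless.
cmp access_log.orig access_log.copy && echo "round trip OK"
```

The same pipe shape works over ssh: compression happens on the server, only compressed bytes cross the wire, and decompression happens locally.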

This month's access_log file is just over 250 MB. The gzip-compressed version is 6.2 MB; the bzip2-compressed version is 3.7 MB. It doesn't pay to wait nearly an order of magnitude longer for a result that is not even twice as small.
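The size/speed trade-off above is easy to reproduce on any large, repetitive text file. A minimal sketch, assuming gzip and bzip2 are installed; a generated sample stands in for the original access_log, so the exact ratios will differ:

```shell
# Build a repetitive sample "log" (web logs compress well for the same reason).
yes '127.0.0.1 - - "GET /index.html HTTP/1.1" 200 1234' | head -n 100000 > sample_log

# Compress with each tool; -c writes to stdout so the original is kept.
time gzip  -c sample_log > sample_log.gz
time bzip2 -c sample_log > sample_log.bz2

# Compare sizes: bzip2 generally wins on size, gzip on speed.
wc -c sample_log sample_log.gz sample_log.bz2
```

Running both on the actual file before committing to a monthly routine is the whole point: the better ratio only pays off if the CPU can keep the pipe full.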

This exercise has just reinforced the point that it pays to actually measure performance in different situations, instead of going with what "feels" right. Premature optimization is the root of all evil [Hoare].
