It seems that spammers have introduced a new twist on an old tactic. When Paul Graham's "A Plan for Spam" article introduced Bayesian filtering to the antispam world, spammers were quick to react to the new threat. Since their spam was now being scored on its full content (and not just by naive keyword matching), they started including snippets of legitimate text along with their spam messages. Because this legitimate text wasn't part of their marketing campaign, it was typically displayed in an impossibly small font or in invisible (i.e., white-on-white) colors.
I recall seeing text pulled from such works as Moby Dick, Ulysses, and various Shakespeare plays. It didn't matter what the text was, as long as it didn't look very much like spam. As far as I can tell, there are at least two goals involved here:
- With a lot of non-spam text included, the message looks somewhat more like legitimate mail, so it may sneak through a slightly higher percentage of spam filters.
- Bayesian filters learn patterns from the messages you receive and mark as spam. When you mark a message as spam, each word in the entire message essentially gets a count in the "spam" column. Including a lot of non-spam text means that many non-spam words end up with higher counts in the "spam" column. This has the longer-term effect of making the Bayesian filter data less trustworthy, because the filter may start to mark legitimate messages as spam. If this happens a lot, users may turn off the Bayesian part of the filter.
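
Both goals can be seen in a toy example. The sketch below is not Paul Graham's exact formula, and it is not what SpamAssassin does; it is a deliberately simplified naive-Bayes-style scorer with add-one smoothing and equal priors. The class name `TinyBayes`, its methods, and the sample messages are all made up for illustration.

```python
import math
from collections import Counter


class TinyBayes:
    """Hypothetical, simplified word-count filter (not Graham's exact method)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each distinct word in the message gets one count in the label's column.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def word_spam_prob(self, word):
        # Add-one smoothed estimate of P(spam | word), assuming equal priors.
        s = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
        h = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
        return s / (s + h)

    def spam_score(self, text):
        # Combine the per-word probabilities by summing their log-odds.
        log_odds = 0.0
        for word in set(text.lower().split()):
            p = self.word_spam_prob(word)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches now", "spam")
f.train("lunch with the team tomorrow", "ham")
f.train("notes from the team meeting", "ham")

# Goal 1: padding the spam with innocent words lowers its score.
plain = "buy cheap pills"
padded = plain + " lunch team meeting notes"
plain_score = f.spam_score(plain)
padded_score = f.spam_score(padded)

# Goal 2: once the padded message is marked as spam, the innocent words
# pick up spam counts, and a legitimate message scores higher than before.
legit = "team meeting notes"
legit_before = f.spam_score(legit)
f.train(padded, "spam")
legit_after = f.spam_score(legit)

print(f"plain spam:  {plain_score:.3f}")
print(f"padded spam: {padded_score:.3f}")
print(f"legit mail before/after poisoning: {legit_before:.3f} / {legit_after:.3f}")
```

With only four training messages the effect is exaggerated, but the direction is the point: the padded message scores lower than the plain one, and the legitimate message scores higher after the padded spam has been trained on.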
Recently, several people who keep journals (cetan, leroy_brown242, Amy) have received messages from other Internet users wondering why some of their journal text was included in a spam message. Obviously, the journal authors don't have anything to do with the sending of the spam. It seems that the spammers are now scraping text off the Internet instead of using text from the classics.
Perhaps this approach is intended to more closely match the kind of text that people normally receive in email. Because the text is written by today's Internet users and not 19th-century authors, the vocabulary will be better suited to confusing spam filters.
This new technique is surprising and annoying to those users whose text is used in spam. Most recipients of the spam will either not see the message at all, or not see the small or obscured text, or just ignore it. The few who do look at the whole message and google for key words or phrases to find the original author's journal seem skilled enough at that point not to accuse the user whose journal text was used.
Fortunately, Bayesian filtering techniques are just one weapon in the fight against spam. With blacklists, SPF, virus scanners, and the battery of tests provided by SpamAssassin, I now get, on average, about 5 spam messages in my inbox per day. Since my mail server receives about 1000 spam messages per day, that's a miss rate of about 0.5% for my spam filters.