Sun Dec 17 20:01:07 CET 2006

PHPBB and spam counter-measures

Narelle and I are running a PHPBB forum for her English in Toulouse Web site. PHPBB is a fantastic piece of work. Unfortunately, it's a dangerous world out there for a forum, and we keep getting bogus users who register to spam the forum and spread pills, porn and desolation.

Our first counter-measure was to move to admin-approved subscription: new users must be vetted by admins. However, the illusion of security was short-lived: we get about 10 spam registrations a day. Sorting through that might be fine for a very large site with dozens of admins, but it isn't for us: we have about 50 users, and really don't want to spend more time than absolutely necessary administering the forum. So I started looking at more sophisticated anti-spam solutions.

This is where the diversity of uses and requirements for PHPBB appears. Some people run very large forums with thousands of users, and want guests to be able to leave comments easily, to keep their community as open as possible. Others, like us, run tiny forums, don't mind making new users wait a day for their account, but really don't want to spend time vetting new users.

So while looking through anti-spam discussions on the PHPBB forum (about PHPBB -- their forum is run using their own software), I was fairly surprised to see a group trying to find an all-round solution that would work for everyone. To them, the solution must be the same for everyone. They seem to be under the impression that more sophisticated CAPTCHAs are what is needed.

Meanwhile, others are advocating higher-level questioning of new registrants to check that they really are human (the problem really is to tell robots that subscribe everywhere they can from genuine humans). Such solutions include Kitten Auth, which has you pick kittens out of a bunch of pictures; HumanAuth, which is basically the same but more mature; eXtreme anti-spam, which presents pictures and asks you what they are; and some others which ask questions ("what is this forum about?", "who is the president of France?", "spell 'cat' in reverse", this sort of stuff).

The well-intentioned CAPTCHA crowd seems to poo-poo all these ideas, for the following reasons:

  • Administrators won't customise the questions, therefore:
  • the robots can just be told all the right answers to the right questions (and simple image filtering will remove the watermark used in HumanAuth)
  • there is absolutely no way to tell a human from a robot that has been taught enough

Let me tell you about digital signal processing (and keep in mind I am no specialist -- I saw all of this at school, and it's all using 10-year-old technology and well-known principles). Digital signal processing ('DSP') is the art of taking signals and putting them through mathematical filters (normally implemented by computers) to transform or improve them. A big use for this speciality is separating signal from noise: when you receive a radio signal from a satellite, or maybe your ISP's signal over your ADSL connection, noise is added to it by the whole environment. DSP gives you tools to pick out the signal and remove the noise.

Ten years ago, when I was in engineering school, I had a course about DSP where we worked on filters that would find the signal in a level of noise several orders of magnitude above that of the signal (I remember the number of 24 dB, but those are vague memories). What I do remember is watching the signal before and after the filter. Before the filter, we would see nothing but noise. After the filter, we would see the signal, beautiful and perfectly clean.
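
To give an idea of what this looks like in practice, here is a minimal Python sketch. It is not the filter from that course, which I can't reconstruct; it is a plain correlation against a known reference sine, which is about the simplest way to pull a signal out of heavy noise. Everything in it is an assumption made for illustration: the function names noisy_sine and correlate_amplitude, the number of samples, the frequencies, and the -24 dB signal-to-noise ratio borrowed from my vague memory above.

```python
import math
import random


def noisy_sine(n_samples, cycles, amplitude, snr_db, rng):
    """Return a sine wave buried in Gaussian noise at the given SNR (in dB)."""
    signal_power = amplitude ** 2 / 2.0            # mean power of a sine
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)                 # noise standard deviation
    step = 2.0 * math.pi * cycles / n_samples
    return [amplitude * math.sin(step * n) + rng.gauss(0.0, sigma)
            for n in range(n_samples)]


def correlate_amplitude(samples, cycles):
    """Estimate the amplitude of a sine making `cycles` periods in the record.

    Multiplying by the reference sine and averaging cancels out the noise,
    which is uncorrelated with the reference, and keeps the matching signal.
    """
    n_samples = len(samples)
    step = 2.0 * math.pi * cycles / n_samples
    acc = sum(x * math.sin(step * n) for n, x in enumerate(samples))
    return 2.0 * acc / n_samples


rng = random.Random(2006)                          # fixed seed, repeatable run
samples = noisy_sine(200_000, cycles=50, amplitude=1.0, snr_db=-24.0, rng=rng)

on_target = correlate_amplitude(samples, cycles=50)    # the frequency we sent
off_target = correlate_amplitude(samples, cycles=73)   # a frequency we did not

print(f"amplitude at the right frequency: {on_target:.2f}")
print(f"amplitude at a wrong frequency:   {off_target:.2f}")
```

At -24 dB the noise has a standard deviation around eleven times the amplitude of the sine, so a plot of the raw samples shows nothing recognisable. The first estimate should nevertheless land near the true amplitude of 1, and the second near 0. The catch is that the filter has to know what it is looking for, here the frequency of the sine.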

What's this got to do with our problem? Everything, really. A CAPTCHA, fundamentally, is a way of mixing up letters in noise, in the hope that the human brain will recognise the letters while the robots won't. But the experience I relate above shows that computers are infinitely better at extracting signal from noise than humans are. The rest of the problems (overlapping letters, for example) are slowly being solved by neural networks.

This is why I think the basic principle of CAPTCHA is flawed: it's trying to identify humans by giving them a task at which computers are orders of magnitude better than they are. It's a hopeless fight.

Right, I have more to say on the subject, but I have divorce papers hanging in front of me, so I better go. BBL.

I think where the CAPTCHA crowd is right is that, if administrators won't go out of their way to customise their authentication system, then all the authentication algorithms and material must be ready to use in the standard PHPBB install, which in turn means that most authentication systems are doomed (it'd be easy to teach a robot however many dozen pictures to recognise). Indeed, I think picture or question-based authentication can only work if every forum has its own, different questions.

Where I think they go wrong, however, is in thinking that administrators won't customise. We used to not have any user verification at all until we got our first round of spam, and I can promise you we suddenly started to look into solutions after that. One just needs to look at the number of anti-spam MODs for PHPBB to understand there is a lot of interest. If adding questions or pictures is made easy, I'm personally convinced that administrators will, in fact, spend the time required to customise their board's authentication system, as most of us really would rather spend 15 minutes doing that once than delete bogus users till the end of time. eXtreme, in that respect, goes the right way, although it's a bit messy to install at the moment.
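
To show how little there is to it, here is a rough Python sketch of a per-forum question check. It is not how eXtreme or any other PHPBB MOD works, and a real one would be written in PHP; the names QUESTIONS, normalise, pick_question and check_answer are all made up for the illustration, and the questions and accepted answers are only the kind of thing an admin might type in.

```python
import random

# Hypothetical list edited by the forum's admin. The point is that every
# forum writes its own, so there is no shared answer sheet to teach a robot.
QUESTIONS = [
    ("What is this forum about? (one word)", {"english", "toulouse"}),
    ("Spell 'cat' in reverse.", {"tac"}),
]


def normalise(answer):
    """Lower-case the answer and collapse stray whitespace before comparing."""
    return " ".join(answer.lower().split())


def pick_question(rng):
    """Pick the index of the question to show on the registration form."""
    return rng.randrange(len(QUESTIONS))


def check_answer(index, answer):
    """Return True if the answer is one of those accepted for the question."""
    accepted = QUESTIONS[index][1]
    return normalise(answer) in accepted


rng = random.Random()
index = pick_question(rng)
print(QUESTIONS[index][0])                 # text shown to the new registrant
print(check_answer(1, "  TAC "))           # an answer to the second question
```

The index would have to be remembered on the server between showing the form and checking the answer, which this sketch leaves out. Adding a question is one line in the list, which is about the level of effort I think admins will accept.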