Wednesday, August 8, 2012

The mathematics of profiling

Back in May, Sam Harris debated the efficiency of profiling Muslims at airports with security expert Bruce Schneier.  I have some very snippy things to say about that, but wouldn't it be more fun if I left the snippy comments to your imagination, and I talked about math instead?

During the debate, a particular paper got some attention: "Strong profiling is not mathematically optimal for discovering rare malfeasors".  This was probably not the best paper to highlight, because it's a pure math paper; it's bound to consider a situation that is too idealized.  Then again, Schneier cited lots of things that went into the nitty gritty, but I didn't read those because they didn't catch my interest.  Go figure.

A summary of the paper

The authors consider a scenario where there is one malfeasor among many, and the government needs to search people one by one until they find the malfeasor.  But the malfeasor is not equally likely to be anyone, there are different prior probabilities for different people.  The question is, in what order should the government search in order to catch the malfeasor as quickly as possible?

The solution for an "authoritarian" government is obvious.  Search the most likely people, and then move on downwards until you catch them.  However, the authors posit a "democratic" government with additional constraints.  Basically, the government has no memory of who it has already searched.  Thus each time they conduct a search, they must sample people randomly, though with weighted probabilities.

The question is, how should the sampling probabilities be weighted?  The "strong profiling" strategy weights the sampling probability by exactly at least the prior probabilities.  But it turns out this is not the best strategy.

The math is not too difficult, and can be found in the paper.*  The best strategy is to weigh the sampling probabilities according to the square root of the prior probabilities.  That means, if someone is four times as likely to be a malfeasor, a democratic government should be twice as likely to search them.

*If readers find the paper hard to follow, I will break down the math in the "Authoritarian vs Democratic strategies" upon request.

That was the primary conclusion of the paper, but for completeness I should mention a few others.  They show that the conclusion is identical in the case where there are multiple malfeasors.  They also consider the malfeasor has a chance to escape detection even when the correct person is searched.  In this case, the sampling probabilities should also be inversely proportional to the square-root of the probability of detection success for any given person.

Why it's inapplicable

While the paper was interesting, I honestly do not think it applies to the debate between Harris and Schneier.  The reason the "strong profiling" doesn't work in the paper is because it leads to redundant searches.  This might apply to the case of a smuggler who flies a lot and has repeated opportunities to get caught.  But in the case of an airplane hijacker, they only need to bring equipment onto the plane once, and it is only during that one time they can be caught.

Therefore, the proper unit is not the person, but the airplane ticket.  Some airplane tickets involve malfeasance, others do not.  And some are more likely to involve malfeasance than others.  But there is no danger of redundant searching, because obviously each person only has to go through airport security once per flight.  Therefore, the search for an airplane hijacker is more like the case of the "authoritarian" government.  The best strategy is to search all the most likely airplane tickets.

I have another finer point to make about the math.  The number they try to optimize is the mean number of searches before a success.  However, I think it makes more sense to try to optimize the probability of success given that you can make N searches.  This may adjust the results slightly, but I'm not sure in which direction.  (Hmmm... perhaps I should solve this problem and report the solution.)

On the other side, the paper is also missing a lot of complications that you would obviously need to account for to make a real-world judgment.  Schneier discusses many of these.  For instance, there is an additional cost associated with profiling, because you have to train people.  Presumably this higher cost means that you can make fewer searches overall.  And then there's the fact that the terrorists are intelligent beings and can game the system.  And then there will be errors in the assessment of prior probabilities of malfeasance (errors which can be exploited).

Anyway, I'm convinced this problem is complicated enough that neither this paper, nor intuition applies.

(ETA: I wrote two more related posts)

2 comments:

Anonymous said...

You know, pretty much this post is what I picture every time someone on the Freethought network talks about 'skeptical social justice.' It's a nice idea, but to do it properly generally gives an answer of 'I don't know, but the maths is really interesting!' Also, if a journalist were reporting on what you're saying here, what's the betting it'd be "Mathematical proof! Profiling is necessary!"?

I reckon the ability of terrorists to react to the system is a massively important factor here- I have no idea what the statistics are, but I've heard terrorists tend to specifically choose non-profiled groups, and the history of contraband smuggling is one of constantly using women, old people, children, affluent tourists or other groups who are currently beyond suspicion. Any obvious profiling is going to shift probabilities against itself.

miller said...

You know, pretty much this post is what I picture every time someone on the Freethought network talks about 'skeptical social justice.'

You flatter me. I'm just talking about math, can't you see? No social justice anywhere. ;)

Also, if a journalist were reporting on what you're saying here, what's the betting it'd be "Mathematical proof! Profiling is necessary!"?

That's more or less how Schneier and Harris were using it. They were both arguing that it supported their own conclusions. Both of them were wrong.

I'm willing to credit this to honest error. The paper applies to certain security situations, just not the one under discussion. I don't think this subtracts from Schneier's other valid points, because he didn't refer to redundant searching at any other point.