Posted on Nov 25, 2019
When talking about Oddity’s violence recognition system, we are often asked what the accuracy of our algorithm is. This seems like an easy enough question, but while answering it, we quickly run into trouble. To explain why, we need to look into a concept known within statistics as the Base Rate Fallacy.
In general, the Base Rate Fallacy concerns a psychological effect that clouds people’s judgement when they are presented with certain statistics. The fallacy describes how people often focus on a specific case rather than the larger picture. To make this more concrete, we will look at the most common type of this fallacy: the False Positive Paradox. The False Positive Paradox occurs when the incidence of a specific case within a large body of data is particularly low. In violence recognition, this paradox is omnipresent, because violence, especially on video surveillance footage, is extremely rare. This is especially true from the perspective of a computer system.
“Extremely” might even be an understatement. Let’s look at the incidence (i.e. how often something happens) of violence from a computer system’s point of view to see why. To detect violence, the computer studies the frames it receives from a security camera and produces a confidence level between 0% and 100%. The higher the confidence, the more certain our algorithm is that violence is present. Let’s assume that the computer system does not look at each frame individually, but combines them into batches of 16 frames, and that a confidence level is produced for each batch. Most security cameras produce 30 frames per second, so each batch covers around half a second of video footage. When the confidence level rises above a certain threshold, the system assumes there must be violence and sends out an alert.
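The batching and thresholding described above can be sketched in a few lines of Python. This is only an illustration and not Oddity’s actual implementation: the threshold of 0.9, the function names and the confidence values are all made up for the example.

```python
FRAMES_PER_BATCH = 16
FRAMES_PER_SECOND = 30

# Hypothetical alert threshold; the real value is not stated in this post.
ALERT_THRESHOLD = 0.9


def batch_duration_seconds(frames_per_batch=FRAMES_PER_BATCH, fps=FRAMES_PER_SECOND):
    """Length of video (in seconds) covered by a single batch."""
    return frames_per_batch / fps


def alerts(confidences, threshold=ALERT_THRESHOLD):
    """Return the indices of the batches whose confidence exceeds the threshold."""
    return [i for i, confidence in enumerate(confidences) if confidence > threshold]


# Made-up confidence levels for six consecutive batches.
example_confidences = [0.02, 0.10, 0.95, 0.97, 0.40, 0.01]

print(round(batch_duration_seconds(), 3))  # 0.533
print(alerts(example_confidences))         # [2, 3]
```

Only the third and fourth batch rise above the threshold here, so those are the only ones that would trigger an alert.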
Now, let’s take a look at how often violence occurs within view of the cameras. I will focus on the results of a recent pilot that took place on a busy street with plenty of night clubs. Oddity’s algorithm was tested there on a small stretch covered by two cameras. Even though the location is a so-called “high-risk area”, only 3 violent incidents occurred over a period of 3 months (which is a good thing!). This data was confirmed by local law enforcement. We can quickly calculate that the incidence rate of violence is about 0.5 incidents per camera per month, which translates to around 1.5 seconds of violent footage per camera per month. Now, we can calculate the incidence rate in terms of the number of “batches” that contain violence, the way the computer sees it:
    # 1.5 seconds of violent video footage per month
    # translates to around 3 violent "batches"
    no_violent_batches = 3

    # the total number of batches processed each month (assuming 30 days)
    no_batches = (30 * 24 * 60 * 60) * (16 / 30)

    # the incidence rate of violence according to the computer
    violence_incidence_rate = no_violent_batches / no_batches

    # Output
    # >> violence_incidence_rate = 0.000002170138889  (0.0002%)
The computer needs to process more than a million batches for each camera, every month. Of those batches, only 3 contain violence. The incidence rate of violence as far as the computer is concerned is 0.0002%.
Already this number seems oddly low, as it doesn’t match our perception of the world. A lot of people have been victims of street violence, and even though the computer estimates the size of this problem to be small, in reality it is very big. A single violent incident can have large consequences: substantial damage to property, judicial costs, heavy use of law enforcement resources and, most importantly, substantial emotional and psychological damage to those involved. As such, our brains, rightfully so, consider violence a problem nonetheless.
The same kind of problem often arises in service-level agreements (SLAs). Imagine your government were to set up a new nation-wide hospital system that would handle core hospital services. They sign a deal with a company to arrange system maintenance. The SLA says the company can guarantee 99.9% uptime! Sounds great, but in practice, it is not. 0.1% downtime still comes down to about 9 hours each year. Imagine that every year, there is one day on which all hospitals are shut down between 9 in the morning and 6 in the evening. Now that 99.9% uptime does not seem so great.
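The downtime figure is easy to check. A minimal sketch, assuming a 365-day year; the function name is ours and purely illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime):
    """Hours per year a service may be down under a given uptime guarantee."""
    return (1 - uptime) * HOURS_PER_YEAR


print(round(downtime_hours_per_year(0.999), 2))  # 8.76
```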
We have already touched on the basic premise of the Base Rate Fallacy and the False Positive Paradox, but they give rise to even more spectacular apparent contradictions between statistical outcomes and reality. Let’s imagine a scenario where a cure is found for a rare, but very infectious and dangerous disease. The disease has just started spreading and only 0.1% of the population is infected. The cure is highly effective, but can have serious side effects for healthy people. Because of the risk of the disease spreading, the Dutch government wants to quickly deliver the cure to the people that are sick. Luckily, a researcher has developed a test that is 99% accurate at determining whether someone has the disease. Everybody is ordered to take the test and is given the medicine if the test result is positive.
Maybe you already noticed the problem: large population, low incidence. With some simple calculations, we can show the true extent of the horrors the Dutch government has just brought upon its citizens. The number of people that have been cured by the medicine can be calculated as follows:
    accuracy = 0.99        # 99%
    incidence = 0.001      # 0.1%
    population = 17000000  # 17 million people live in the Netherlands

    no_people_cured = accuracy * incidence * population

    # Output
    # >> no_people_cured = 16830
That’s great! Now, how many people have taken the cure even though they weren’t actually sick?
    no_healthy_people_given_medicine = (1 - accuracy) * (1 - incidence) * population

    # Output
    # >> no_healthy_people_given_medicine = 169830
Whoops. The test falsely marks a healthy person as sick only 1% of the time, but the number of people that are healthy is much larger than the number of people that are sick. The test might have helped cure almost 17 thousand people, but it also created almost 170 thousand new patients due to the side effects.
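Another way to look at the same numbers is to ask what share of the people who tested positive were actually sick. The sketch below reuses the figures from the example and, like the example, assumes that “99% accurate” applies equally to sick and to healthy people; the variable names are ours.

```python
accuracy = 0.99
incidence = 0.001
population = 17_000_000

# Sick people who test positive, and healthy people who test positive.
true_positives = accuracy * incidence * population
false_positives = (1 - accuracy) * (1 - incidence) * population

# Share of all positive test results that belong to someone who is really sick.
share_actually_sick = true_positives / (true_positives + false_positives)

print(f"{share_actually_sick:.1%}")  # 9.0%
```

So with a test that is “99% accurate”, roughly 9 out of 10 people who receive the medicine did not need it.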
The previous example shows just how misleading the term accuracy can be. Let’s come back to the case of Oddity’s algorithm. During the latest pilot, our algorithm performed really well. It detected 100% of fights (false negative rate = 0%) and produced about 2 false positives per camera per week (false positive rate = 0.00018%). Taking into account the confidence level and using the Area Under the Curve of the Receiver Operating Characteristic (ROC AUC) metric, this produces an accuracy of 99.9999%. This number is ridiculously high. It gets weirder: imagine we removed all the smart algorithms and made the system simply say “no violence” all the time. In that case, the false prediction rate would equal the incidence rate, which is 0.0002%. A naive accuracy measure would therefore come out at 99.9998%, even though the algorithm does not do anything!
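To see how a do-nothing detector earns such a score, here is a small sketch. The batch counts are hypothetical round numbers (2 violent batches in 1,000,000) chosen only to match the 0.0002% incidence rate; they are not pilot data.

```python
# Hypothetical round numbers matching an incidence rate of 0.0002%.
n_batches = 1_000_000
n_violent = 2

labels = [True] * n_violent + [False] * (n_batches - n_violent)

# The "naysayer": predicts "no violence" for every single batch.
predictions = [False] * n_batches

correct = sum(p == y for p, y in zip(predictions, labels))
detected = sum(p and y for p, y in zip(predictions, labels))

naive_accuracy = correct / n_batches
detection_rate = detected / n_violent

print(f"{naive_accuracy:.4%}")  # 99.9998%
print(f"{detection_rate:.0%}")  # 0%
```

The accuracy looks superb, while the share of violent batches that is actually detected is zero.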
For the stupid naysayer algorithm, the more sophisticated ROC AUC metric does not produce such a flattering number: a detector that gives the same answer for every batch cannot rank violent batches above non-violent ones, so its ROC AUC is stuck at chance level (0.5). This makes the ROC AUC metric somewhat more reasonable, though it still fails in the case of the False Positive Paradox.
Technically, the 99.9999% accuracy is correct. But it is in no way representative of how it would feel to someone receiving alerts from our system. In a small camera surveillance setup consisting of 100 cameras, an average of 29 false alerts are received each day. This is much better than our competition, but not the result you would have in mind when you have been told the system is 99.9999% accurate. We continually work to decrease the false positive rate and we are confident that we can get it close to zero eventually. Either way, our system still significantly reduces the workload. Previously, human camera observers had to watch 100 video feeds, 24 hours a day, 7 days a week. With only 29 activations per day, observers now only need to watch around 15 minutes of footage per day (30 seconds per activation) instead of 144,000 minutes to achieve the same result.
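The workload figures follow directly from the pilot numbers. A quick sketch of the arithmetic, with variable names of our own choosing:

```python
cameras = 100
false_alerts_per_camera_per_week = 2
seconds_per_activation = 30

false_alerts_per_day = cameras * false_alerts_per_camera_per_week / 7
review_minutes_per_day = false_alerts_per_day * seconds_per_activation / 60

# Footage produced by all cameras in one day, in minutes.
footage_minutes_per_day = cameras * 24 * 60

print(round(false_alerts_per_day))       # 29
print(round(review_minutes_per_day, 1))  # 14.3
print(footage_minutes_per_day)           # 144000
```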
We can calculate a saner accuracy figure by correcting for the low base rate in our validation set. In an experimental setting, we determine the accuracy of Oddity’s algorithm by validating it against a separate dataset. This dataset is more balanced (30%/70%) and doesn’t suffer from the base rate fallacy. The accuracy on our validation set is near 92%. Though this is a more meaningful number than the former accuracy figure, it is still important to realize that, in principle, these kinds of figures are only relevant in the context in which they were calculated.
Hopefully, this post has helped you understand how the term accuracy can be grossly misleading, or at the very least irrelevant, in real-world cases. Whenever you hear someone boast about accuracy figures that seem too good to be true, you need to ask two questions:
1. How large is the body of data the figure was calculated on?
2. How often does the case of interest actually occur in that data?
If the answer to question 1 is big and the answer to question 2 is small, you have got yourself a Base Rate Fallacy. That does not mean that the figure is incorrect, but it does mean that your brain is most likely interpreting it more positively than it deserves. Simply asking what fraction of all positives are true positives (the precision) could clear some things up. Alternatively, ask what the accuracy is on a more balanced dataset.
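As an illustration, we can ask that first question of our own pilot. The sketch below is a rough estimate under two assumptions that are ours: the three months are taken as 13 weeks, and each of the 3 detected incidents counts as a single alert.

```python
cameras = 2
weeks = 13  # assumption: three months taken as roughly 13 weeks
false_alerts_per_camera_per_week = 2

true_alerts = 3  # assumption: one alert for each of the 3 detected incidents
false_alerts = false_alerts_per_camera_per_week * cameras * weeks

# Fraction of all alerts that pointed at real violence.
precision = true_alerts / (true_alerts + false_alerts)

print(false_alerts)          # 52
print(f"{precision:.1%}")    # 5.5%
```

Under these assumptions, roughly 1 in 18 alerts pointed at real violence, which is a far more intuitive figure than 99.9999%.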