For our first entry this year, we are going to explore the fascinating topic of probability. But, before we do so, let’s first indulge in a thought experiment. Imagine we have a list of random measurements; for example, the populations of the top 100,000 cities. If we were to plot a histogram of the first (leftmost) digit of each number in this list, what do you think the graph would look like?
Logically we would assume that with a data set this large, there would be an almost even distribution across all digits with a small, statistically insignificant variance in representation. Therefore, each digit would account for around 11 percent of the numbers, right? WRONG. Instead, if you actually plotted the histogram, you would get a chart like this:
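You can even reproduce a rough version of that chart yourself. The city-population list isn’t included here, so as an illustrative stand-in (an assumption on our part) this sketch tallies the first digits of the powers of 2, a classic example of a sequence that follows this skewed distribution:

```python
from collections import Counter

# Tally the leading (leftmost) digit of 2^1 through 2^1000.
# Powers of 2 are a well-known stand-in for "naturally occurring" data.
digits = [int(str(2 ** n)[0]) for n in range(1, 1001)]
counts = Counter(digits)

# Print a crude text histogram: one '#' per percentage point.
for d in range(1, 10):
    share = counts[d] / len(digits)
    print(f"{d}: {share:5.1%}  {'#' * round(share * 100)}")
```

The bar for digit 1 comes out roughly three times as long as an even split would predict, while the bar for 9 is far shorter.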
This means that, in reality, the probability that the first digit is a 1 is about 30%, while the chance that the first digit is a 9 is less than 5%. This phenomenon occurs across virtually all naturally occurring data sets, regardless of units. Even if we were to change the units from meters to feet or feet to miles, the distribution would nevertheless remain the same. This is called Benford’s Law, named after physicist Frank Benford, who noticed, while looking through books of logarithm tables, that some pages had more wear on them and thus were used more frequently than others. He realized that there must be some relation between the frequency of numbers and log functions. After analyzing large amounts and many types of data, he finally arrived at the following formula:
P(d) = log10(1 + 1/d), for d = 1, 2, …, 9

This gives the probability that the digit d appears as the leading digit in such a data set and, because of Benford’s work in discovering and publishing it, this relationship is now known as Benford’s Law.
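As a quick check, the formula’s nine probabilities can be computed directly; they range from about 30% for a leading 1 down to under 5% for a leading 9, matching the histogram above. A minimal sketch:

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for each leading digit d in 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"P({d}) = {p:.1%}")

# The nine cases cover every possible leading digit, so they sum to 1.
print(f"total = {sum(benford.values()):.4f}")
```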
Ok, this sounds cool, but how do you explain this phenomenon?
Imagine a tree. Let’s assign the value of 1 unit to the height of the tree. Now, we are going to stretch the tree. For the tree to reach the value of 2 units, it has to double in height (grow by 100%). But for the tree to grow from 2 units to 3 units, it only has to grow another 50%. This pattern continues for every subsequent unit: each step requires a proportionally smaller amount of growth. If we were to plot a logarithmic scale and then color-code the appearance of the digits, it would look something like this:
As we can see on this scale, the distance between consecutive digit marks shrinks as we approach the next decade. For example, at the end of each decade, when the leading digit rolls over from a 9 to a 1, the value of the number only needs to increase by about 11%. Since the probability of each leading digit corresponds to the proportion of the scale that its color occupies, we begin to see why the distribution is skewed. If we were to calculate the exact values of these proportions, they would line up exactly with Benford’s Law, its formula, and the histogram above.
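That last claim is easy to verify numerically: on a base-10 log scale, the numbers whose leading digit is d occupy the interval from log10(d) to log10(d + 1) within each decade, and the width of that interval is exactly the log10(1 + 1/d) of Benford’s formula:

```python
import math

# Compare the log-scale interval width for each leading digit d
# with Benford's formula; they are algebraically identical, since
# log10(d + 1) - log10(d) = log10((d + 1) / d) = log10(1 + 1/d).
for d in range(1, 10):
    width = math.log10(d + 1) - math.log10(d)
    formula = math.log10(1 + 1 / d)
    print(f"digit {d}: interval width = {width:.4f}, formula = {formula:.4f}")
```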
But, before we go further, we must first understand the limitations of this principle. Benford’s Law does not apply to very small data sets, and it will not hold if the possible values are artificially capped at some maximum. For example, if a bank wants to analyze the leading digits of all deposits under $5,000, Benford’s model would likely not match their data set.
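To see that limitation concretely, here is a small sketch (treating the deposits as uniformly distributed is purely an assumption for illustration): amounts capped below $5,000 over-represent leading digits 1 through 4, because 1,000–4,999 fills most of the range while 5,000–9,999 is cut off entirely at the thousands scale.

```python
import math
import random
from collections import Counter

random.seed(0)  # make the illustration reproducible

# Hypothetical capped data: 10,000 deposit amounts under $5,000.
deposits = [random.uniform(1, 5000) for _ in range(10_000)]
counts = Counter(int(str(int(x))[0]) for x in deposits)

# Compare the observed leading-digit shares with Benford's prediction.
for d in range(1, 10):
    observed = counts[d] / len(deposits)
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:5.1%} vs Benford {expected:5.1%}")
```

Digits 1 through 4 each land near 20% instead of Benford’s falling curve, so a naive Benford check would flag this perfectly honest data set.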
Ok, makes sense, but why does this matter?
Benford’s Law has many real-world applications and, in fact, is admissible in court as a means of detecting fraud. Criminals who fabricate lists of numbers typically aren’t aware of Benford’s Law and almost never account for this statistical phenomenon. Many people have been caught in financial dishonesty by applying this formula, and it remains a significant component of the fraud-detection industry.
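A real fraud examiner would use dedicated tooling, but the basic screening idea can be sketched as a chi-square goodness-of-fit test against Benford’s expected frequencies. Everything below — the helper names, the two sample data sets, and the 5% critical value — is our own illustrative assumption, not an actual forensic procedure:

```python
import math

# Benford's expected probabilities for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def leading_digit(x):
    """First significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits vs Benford
    (a hypothetical screening sketch, not a forensic tool)."""
    n = len(values)
    observed = [0] * 9
    for v in values:
        observed[leading_digit(v) - 1] += 1
    return sum((observed[i] - n * BENFORD[i]) ** 2 / (n * BENFORD[i])
               for i in range(9))

# Critical value for chi-square with 8 degrees of freedom at the 5% level.
CRITICAL = 15.507

# Benford-like data (powers of 2) should pass the screen; fabricated
# round-number amounts clustered around $5,000 should fail badly.
honest = [2.0 ** n for n in range(1, 201)]
faked = [5000 + i for i in range(200)]
print(f"honest: {benford_chi_square(honest):.2f}")
print(f"faked:  {benford_chi_square(faked):.2f}")
```

A statistic above the critical value means the leading-digit distribution is too far from Benford’s curve to be explained by chance, which is the cue for a closer look at the books.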
Thank you for reading! We hope you enjoyed this article and gained a better understanding of the fascinating intricacies and useful applications of probability in particular and math in general. If you have any questions regarding this article, please feel free to reach out to us at firstname.lastname@example.org at any time and we’d be happy to help.