Tom Taverner introduced me to Benford's Law as we were eating lunch together at a statistical computing conference: If you look at the first digits of the numbers in many naturally occurring datasets, a startling 30 percent of them are ones. "Pah!" I said. "That defies intuition! Why would one digit occur any more than another? I'd expect each digit to occur with about equal frequency--1/9 of the time. Why isn't the probability 11 percent? Eh?" A hip, bespectacled biostatistician nearby joined the fray with the help of his iPhone. He was also skeptical, but he looked at the distribution of leading digits in a fancy gene database he was analyzing, and indeed 1 occurred about 30 percent of the time, with 2, 3, 4, and so on occurring with decreasing frequency. You prove me wrong so good, statistics!
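If you don't have a gene database on your phone, here is a minimal Python sketch of the same kind of check. The dataset is my own stand-in, not anything from that lunch: the first 1,000 powers of 2, which are known to follow the pattern closely. The helper name leading_digit is made up for illustration.

```python
from collections import Counter

def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])

# Illustrative data only: 2**1 through 2**1000 (an assumption, not the gene database).
counts = Counter(leading_digit(2 ** k) for k in range(1, 1001))
total = sum(counts.values())

# Print the share of numbers starting with each digit 1..9.
for digit in range(1, 10):
    print(f"{digit}: {counts[digit] / total:.1%}")
```

Ones should come out at roughly 30 percent, nowhere near the 11 percent I was so sure about.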
Benford's Law, in general, states that the probability of the first digit being d in base b is:
P(d) = log_b(d + 1) - log_b(d) = log_b(1 + 1/d), for d = 1, ..., b - 1
In base 10, this turns out to give about a 30 percent chance for starting with 1, 18 percent for starting with 2, and so on. It even has wide applicability outside of entertaining lunch conversations--including fraud detection and computer disk space allocation. Several clever folks in the R world have recently used Benford to assess whether data is actually naturally occurring: Drew Conway decided there was not strong evidence of numerical tampering with the WikiLeaks Afghanistan War Logs, and Diego Valle discussed problems in homicide reporting by the Mexican government. Rattle, a graphical interface for R, has a function to overlay plots of leading digits in base 10 for different subsets of the data, to evaluate where funny business may be occurring in a dataset; also, as I found out from his comment below, Kevin Wright has posted some R distribution and plot functions for Benford's law on the R wiki.
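To check the 30 and 18 percent figures yourself, here is a small Python sketch. It is not one of the R functions mentioned above. It uses the standard statement of the law, P(d) = log_b(1 + 1/d), and the function name benford_probability is my own invention.

```python
import math

def benford_probability(d: int, base: int = 10) -> float:
    """Probability that the leading digit is d in the given base: log_base(1 + 1/d)."""
    if not 1 <= d < base:
        raise ValueError("d must satisfy 1 <= d < base")
    return math.log(1 + 1 / d, base)

# Base-10 probabilities for leading digits 1 through 9.
probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"{d}: {p:.1%}")

# The probabilities sum to 1 (up to floating-point rounding),
# because the sum telescopes to log10(10).
print(sum(probs.values()))
```

The base argument is there because the law is stated for any base b, not just base 10.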
Does Benford's law seem unintuitive? Well, maybe it's because kids these days are just too darn LAZY to look at a good book of logarithms like we did in the good old days! (They're also too lazy to walk to school barefoot over barbed wire uphill both ways in the snow.) Logarithmic tables are where you look up the first several digits of a number to see what the logarithm of that number is--then you can simply add a whole number to that log to account for greater powers of ten, or add the logs of two numbers to get the log of their product; then you can look up that result in a table that will convert it back to familiar numbers. It's a lot easier to add numbers than to multiply them by hand, so this was a huge time-saver.
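For the technology-dependent, here is the log-table trick as a rough Python sketch, with math.log10 standing in for the printed table. The numbers 2340 and 0.0187 and the name multiply_via_logs are arbitrary choices of mine.

```python
import math

def multiply_via_logs(x: float, y: float) -> float:
    """Multiply two positive numbers by adding their base-10 logarithms."""
    log_sum = math.log10(x) + math.log10(y)  # the "add the logs" step
    return 10 ** log_sum                     # the "convert back" (antilog) step

# A table only needs entries for the leading digits: log10(2340) is the
# table value for 2.34 plus 3, because 2340 = 2.34 * 10**3.
mantissa = math.log10(2.34)
log_2340 = 3 + mantissa

product = multiply_via_logs(2340.0, 0.0187)
print(round(product, 3))
```

Because every lookup starts from the leading digits, the pages you thumb most often are the ones for the leading digits you meet most often.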
Not only does using logarithm books build character, but, in the words of 19th-century science fiction author and astronomer Simon Newcomb, you can discover important scientific laws: "That the ten digits do not occur with equal frequency must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones." In other words, people have simply been looking up more numbers that begin with the smaller digits, so those pages get dirty faster. Now you know what you've been missing, you technology-dependent slacker!
Newcomb goes on in his "Note on the Frequency of Use of the Different Digits in Natural Numbers" to state the law precisely. It was later independently discovered, popularized, and demonstrated in the 1930s on all sorts of different kinds of data by a bright Schenectadian, Frank Benford. As an illustration of the similar paths that human minds can follow, Benford was also inspired by logarithm books, citing in his "The Law of Anomalous Numbers" how "the logarithms of the low numbers 1 and 2 are apt to be more stained and frayed by use than those of the higher numbers 8 and 9."
So next time you spill coffee on or drool all over a library book, don't feel guilty--your detritus may just inspire scientists of future generations.
Coming up next: My functions for Benford analysis in R, and using them to look at baby names, MA property values, dinosaur bone lengths, traffic data, and Congressional lobbyists!