Finding a Needle in a Needle Stack
9/18/2012 6:24 AM
Finding a needle in a haystack is an oft-used metaphor for something that’s hard to find. Because of
concerns about its signal-to-noise ratio, big data analytics is sometimes compared to finding a
golden needle in a haystack of data. In other words, you have to dig through a whole lot of hay (i.e.,
massive amounts of data) before you find a golden needle (i.e., a data-driven insight).
This is also why a lot of people talk about the downside of data sampling, especially now that recent
technological advancements enable you to analyze the whole haystack: even a statistically valid
haystack sample may not contain (Jedi mind tricks aside) the needles you are looking for.
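To put a rough number on the sampling concern, here is a minimal sketch that computes the chance that a simple random sample, drawn without replacement, contains none of the needles. The function name and the figures (one million records, five needles, a 1% sample) are hypothetical and chosen only for illustration.

```python
def prob_sample_misses_all(total_items: int, needles: int, sample_size: int) -> float:
    """Probability that a simple random sample (without replacement)
    contains none of the needles hidden among total_items."""
    if sample_size > total_items - needles:
        # The sample is too large to avoid every needle.
        return 0.0
    prob = 1.0
    # Each needle in turn must land outside the sample:
    # C(N - k, n) / C(N, n) = product over j < k of (N - n - j) / (N - j)
    for j in range(needles):
        prob *= (total_items - sample_size - j) / (total_items - j)
    return prob


# Hypothetical haystack: 1,000,000 records, 5 needles, 1% sample.
p_miss = prob_sample_misses_all(1_000_000, 5, 10_000)
print(f"Chance the sample contains no needles: {p_miss:.3f}")
```

Under these assumed numbers the sample misses every needle about 95% of the time, which is the intuition behind preferring the whole haystack when the needles are rare.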
One thing that makes literally finding a needle in a haystack a little easier is how different a
needle is from a hay fiber, at least as long as the quality of the hay is good, since poor-quality hay is
coarse and thus needle-like. Telling a needle from a poor-quality hay fiber is much like trying to
discern whether a statistical outlier represents a business insight or a data quality issue.
Although data quality is a big concern for big data, noise is sometimes over-identified with poor-quality data. In their book Made to Stick: Why Some Ideas Survive and Others Die,
Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless. If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”
Therefore, noise can also be high-quality data that’s not relevant to your current analytical goals.
It’s not always easy to differentiate hay fibers from needles (i.e., noise from signal), or differentiate a needle from a golden needle (i.e., high-quality, but analytically useless, data from a relevant data insight). Although big data requires you to exercise better data management, having high-quality data still leaves you with an analytical challenge that’s comparable to finding a needle in a needle stack.