Big Data Blog

Big Data Blog

Finding a Needle in a Needle Stack

Sep 18

Written by:
9/18/2012 6:24 AM  RssIcon

Finding a needle in a haystack is an oft-used metaphor for something that’s hard to find. Because of concerns about its signal-to-noise ratio, big data analytics is sometimes compared to finding a golden needle in a haystack of data. In other words, you have to dig through a whole lot of hay (i.e., massive amounts of data) before you find a golden needle (i.e., data-driven business insight).

This is also why a lot of people talk about the downside of data sampling, since even a statistically-valid haystack sample may not contain (Jedi mind tricks aside) the needles you are looking for, especially with recent technological advancements enabling you to analyze the whole haystack.

One thing that makes literally finding a needle in a haystack a little easier is how different a needle is from a hay fiber — as long as the quality of the hay is good, since poor-quality hay is too coarse, and thus needle-like. The similarity of a needle and a poor-quality hay fiber is similar to how it’s difficult to discern if a statistical outlier represents a business insight or a data quality issue.

Although data quality is a big concern for big data, noise is sometimes over-identified with poor-quality data. In their book Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless. If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”

Therefore, noise can also be high-quality data that’s not relevant to your current analytical goals.

It’s not always easy to differentiate hay fibers from needles (i.e., noise from signal), or differentiate a needle from a golden needle (i.e., high-quality, but analytically useless, data from a relevant data insight). Although big data requires you to exercise better data management, having high-quality data still leaves you with an analytical challenge that’s comparable to finding a needle in a needle stack.

Categories:
Location: Blogs Parent Separator Jim Harris

Search Big Data Blogs

Tags

Big Data (126)
Analytics (66)
Pervasive (50)
DataRush (33)
Hadoop (31)
Industry trends (22)
predictive analytics (20)
Scalability (20)
Multicore (15)
Data Mining (12)
Parallelism (10)
Java (9)
Jim Harris (9)
KNIME (9)
Cloud (8)
Cyber Security (8)
MapReduce (8)
big data analytics (7)
Data Volumes (7)
Data Warehouse (7)
RushAnalytics (7)
Volumes (7)
Actian (6)
Algorithms (6)
Cost-effective (6)
David Loshin (6)
Decision Support (6)
Julie Hunt (6)
RushAnalyzer (6)
analytics tools (5)
Dataflow (5)
machine learning (5)
Data Science (4)
Forrester (4)
Google (4)
Green IT (4)
Healthcare (4)
Phil Simon (4)
YARN (4)
analytics processes (3)
Big Data Science (3)
BigQuery (3)
Bloor (3)
data centers (3)
data integration (3)
Data Preparation (3)
data tools (3)
data-driven (3)
DataMatcher (3)
machine generated data (3)
Malstone B (3)
Mike Hoskins (3)
Opera Solutions (3)
Retail Analytics (3)
Security (3)
Smart Grid (3)
software (3)
Solutions (3)
telecommunications (3)
transportation analytics (3)
Age of Data (2)
analytics accuracy (2)
architecture (2)
Austin (2)
Bloor Research (2)
Business Intelligence (2)
data management (2)
Data Rush (2)
David Inbar (2)
David Norris (2)
fraud (2)
fraud detection (2)
Gartner (2)
GigaOM (2)
Hadoop Summit (2)
IntegrationWorld (2)
intelligent transportation systems (2)
internet of things (2)
McKinsey (2)
meetup (2)
ParAccel (2)
Pervasive DataRush (2)
Rexer Analytics (2)
smart meters (2)
#FollowFriday (1)
a (1)
Amazon (1)
analytics workflow (1)
Application Development (1)
automation (1)
Benchmarks (1)
best practices (1)
Cloud Analytics Summit (1)
cloud computing (1)
Cloudera (1)
contests (1)
cost (1)
cyber security issues (1)
data flow architecture (1)
Data Integrator - Hadoop Edition (1)
data quality (1)
data visualization (1)
digital marketing (1)
Door64 (1)
easy big data analytics (1)
Ericson (1)
Esri (1)
Facebook (1)
Fuzzy Matching (1)
Goverment (1)
Hadoop User Group (1)
Hadoop World (1)
hardware (1)
HBase (1)
HDFS (1)
industrial internet (1)
Jazoon (1)
Jim Falgout (1)
MalStoneB (1)
Mansour Raad (1)
Neil Raden (1)
Netflix (1)
NetFlow (1)
operational intelligence (1)
Paige Roberts (1)
para (1)
PIG (1)
pilot projects (1)
Predictive Analytics World (1)
psychohistory (1)
Public Sector (1)
Redshift (1)
Robin Bloor (1)
ROI (1)
Rosaria Silipo (1)
RushAccelerator (1)
RushLoader (1)
Sampling (1)
Signal and Noise (1)
SmartDataCollective (1)
spatial analytics (1)
speed (1)
sports (1)
Stephen Swoyer (1)
Steve Shine (1)
Strata (1)
SXSW (1)
Telecom Analytics (1)
Telecommunications Industry Association (1)
TIA (1)
Transportation (1)
TurboRush (1)
VectorWise (1)
Zementis (1)

Latest Posts

Actian Big Data & Analytics Blog has MOVED!
Big Data Phrenology
Big Data, Simpson's Paradox and Sufficient Tools
Data Science and the Art of Data Visualization

Big Data Blog Archives

Archive
<August 2014>
SunMonTueWedThuFriSat
272829303112
3456789
10111213141516
17181920212223
24252627282930
31123456
Monthly
Go

Accelerating Big Data 2.0™