Big Data Blog

Hadoop Implementers, Take Some Advice From My Grandmother

Jan 18

1/18/2013 5:41 AM

My grandmother grew up during the Great Depression of the 1930s. She would wash and reuse plastic wrap, aluminum foil, plastic containers and glass jars, something I found odd as a kid. No one else I knew did that back then. When I would ask her about it, she’d say, “Waste not, want not.” She grew up with the habit of wasting nothing, and lived with the benefits of those frugal habits the rest of her life.

Now, I have a special bin to put that sort of thing in, so that someone else can wash it, melt it, and form it into new products to be reused. What was second nature to my grandmother, we are just beginning to relearn. Waste is expensive. As a society, we can’t afford it.

But for me, and for society, bad habits are difficult to break. It took hard lessons from overwhelmed landfills, energy shortages, and declining air and water quality for things to change. We had to feel the deep cost of waste, and see the benefits of recycling and reusing, before we realized as a society that we had to do something differently. On a personal level, it took money coming out of my pocket to change my habits, despite my personal beliefs. My city has a limit on how much I can throw away without paying extra, but no limit on how much I can recycle. That did it.

Big data analytics platforms are still very much in the early days as an industry. Data centers using Hadoop to distribute compute loads across multiple commodity servers in a cluster configuration are still a relatively rare occurrence in most businesses. But they’re becoming less so as data volumes and analytics demands grow, and the need for inexpensive compute power becomes more pervasive.

Recycle Hadoop Power

In January, the birth of a new year, we think about how to make the new year better than the last. Now is the birth of a new technology revolution. Now is the time to resolve to learn from the costs associated with data centers already in production. Current data centers waste over 80% of compute power, waiting for peak loads and using single-threaded serial software designed for the machines of the last millennium. That’s the kind of bad habit we simply can’t afford if the new big data technology strategies are going to become widespread. That much waste is not only unforgivable in the microcosm of a company’s IT budget, but unsustainable in the macrocosm of the world economy.

We don’t have that much energy to waste.

It may be idealistic of me, but I would love to see more efficient compute methods implemented BEFORE this new technology gets entrenched in hundreds of thousands of businesses. It’s far easier to establish good habits from the beginning than it is to break bad old ones. And, in order to break entrenched habits, we first have to pay a high cost. Let’s save a lot of money, energy, and time, and get off on the right foot at the beginning.

The Hadoop distributed computing concept is inherently parallel and, therefore, should be friendly to better utilization models. But parallel programming beyond basic data parallelism, the embarrassingly parallel level, requires different habits. MapReduce is already heading us in the wrong direction. Most Hadoop data centers aren’t doing any better on utilization than traditional data centers. There’s still a tremendous amount of energy and compute power going to waste.

YARN gives us the option to use other compute models in Hadoop clusters: better, more efficient compute models, if we can create them. It’s up to us in the software industry to get our heads out of the sand. If we don’t, then maybe the open source community will come up with programming models that take off, so we can build new distributed analytics data centers from the ground up with the right habits.

Use every type of parallelism possible. Use all available hardware, not just one core per node, and balance the work evenly so that even peak analytics loads don’t require a bunch of extra hardware. Use it all, not 20%.
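
To make "use all the cores and balance the work" a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how Hadoop, YARN, or any particular product does it. The names balance, process_bin, and run_on_all_cores are hypothetical, task cost is approximated by partition size, and the balancing is a simple greedy "largest task to the least-loaded worker" heuristic, which is good but not guaranteed optimal.

```python
import heapq
import os
from concurrent.futures import ProcessPoolExecutor


def balance(costs, workers=None):
    """Greedy longest-first assignment: each task goes to the least-loaded worker.

    `costs` holds estimated task costs (hypothetical units, e.g. row counts).
    Returns one list of task indices per worker.
    """
    workers = workers or os.cpu_count() or 1
    # Heap of (current load, worker id); the lightest worker is always on top.
    heap = [(0, w) for w in range(workers)]
    bins = [[] for _ in range(workers)]
    # Place the most expensive tasks first so the small ones can fill the gaps.
    for i in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, w = heapq.heappop(heap)
        bins[w].append(i)
        heapq.heappush(heap, (load + costs[i], w))
    return bins


def process_bin(rows):
    """Stand-in for real per-partition work: just sum the row values."""
    return sum(rows)


def run_on_all_cores(partitions):
    """Run one balanced group of partitions per core and combine the results.

    On platforms that start worker processes with "spawn" (Windows, macOS),
    call this from under an `if __name__ == "__main__":` guard.
    """
    bins = balance([len(p) for p in partitions])
    groups = [[x for i in b for x in partitions[i]] for b in bins]
    with ProcessPoolExecutor(max_workers=len(groups)) as pool:
        return sum(pool.map(process_bin, groups))
```

With five tasks costing 8, 7, 6, 5, and 4 units and two workers, this hands out 17 units to one worker and 13 to the other. Dealing the tasks out round-robin in their original order would give 18 and 12. The point is the habit, not this particular heuristic: size the worker pool to the hardware you actually have, and spread the load so no core sits idle waiting on another.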

Waste not, want not.

Related Posts: Green IT: Your Data Center is Killing Our Planet
