Appliance-Scale Performance on Hadoop

Million-dollar performance for less than $100K. Impossible, right?

Not impossible. Pervasive.

With Pervasive RushAnalytics you can access data; check, cleanse, transform, and persist it; and then analyze it, all in one workflow with one product. Then you can execute that workflow on virtually any hardware environment, including Hadoop clusters, without redesigning it.
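RushAnalytics workflows themselves are built visually in KNIME rather than written as code (see the note on MapReduce later on). Purely to make the stages concrete, here is a minimal, generic Python sketch of the same access, check, cleanse, transform, persist, and analyze shape. The file and column names are hypothetical, and this is not the RushAnalytics API.

```python
# A generic sketch of an access -> check -> cleanse -> transform -> persist -> analyze
# pipeline, using pandas. File and column names are hypothetical; this is NOT the
# RushAnalytics API, whose workflows are configured visually in KNIME rather than coded.
import pandas as pd

# Access: read raw event records (hypothetical input file).
raw = pd.read_csv("events_raw.csv")

# Check: keep only rows that pass basic validity rules.
valid = raw.dropna(subset=["timestamp", "bytes"])
valid = valid[valid["bytes"] >= 0]

# Cleanse and transform: normalize types and derive fields.
valid["timestamp"] = pd.to_datetime(valid["timestamp"], errors="coerce")
valid = valid.dropna(subset=["timestamp"])
valid["megabytes"] = valid["bytes"] / 1_000_000

# Persist: write the cleaned data out (a real workflow might target HDFS or HBase).
valid.to_csv("events_clean.csv", index=False)

# Analyze: a simple per-day traffic aggregate.
summary = valid.groupby(valid["timestamp"].dt.date)["megabytes"].sum()
print(summary.head())
```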

How long do workflows like that take to execute?

Performance varies wildly, of course, based on your data, your hardware, your cluster or server configuration, and a hundred other factors. But here are some numbers to get you into the right ballpark. On less than $100K of industry-standard hardware in a 3-node cluster with HBase:

  • Ingest (Flume), parse, persist (HBase), check, transform, and analyze NetFlow data for network optimization and cybersecurity at 3 million events/sec.
  • Do a simple query on log files for operational intelligence at >40 million recs/sec. (No, that’s not a typo.)
  • Do an ETL load of 18-billion-row TPC-H lineitem data at 3 TB/hour, then run a full table scan and aggregate Query 1 for business intelligence at >30 million recs/sec.
  • Analyze MalStone B weblog data at 4 TB/hour, or >10 million recs/sec (see the quick arithmetic sketch after this list).
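
A quick sanity check on how TB/hour translates into records per second, as a minimal Python sketch. The ~100-byte average record size is an assumption for illustration only; actual record sizes for NetFlow, log, and weblog data vary.

```python
# Back-of-the-envelope arithmetic behind the throughput figures above.
# ASSUMPTION: the ~100-byte average record size is illustrative only; real
# NetFlow, log, and weblog records vary in size.

TB_BYTES = 10**12
SECONDS_PER_HOUR = 3600

def recs_per_sec(tb_per_hour: float, avg_record_bytes: int = 100) -> float:
    """Convert a TB/hour rate into records/sec for a given average record size."""
    return tb_per_hour * TB_BYTES / SECONDS_PER_HOUR / avg_record_bytes

# 4 TB/hour at ~100 bytes/record is roughly 11 million recs/sec, consistent
# with the ">10 million recs/sec" MalStone B figure.
print(f"4 TB/hour ~= {recs_per_sec(4):,.0f} recs/sec")

# A full scan of 18 billion rows at >30 million recs/sec finishes in about 10 minutes.
rows, rate = 18_000_000_000, 30_000_000
print(f"18B-row scan at 30M recs/sec ~= {rows / rate / 60:.0f} minutes")
```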

This is the level of performance that customers of DataRush-powered solutions like RushAnalytics take for granted, while the rest of the world says, "That's impossible." What could you accomplish with power like this at your disposal?

Download Free Trial of Pervasive RushAnalytics

Machine Learning

We ran a direct comparison, training three machine learning algorithms written in three different technologies on the same Hadoop cluster, using identical data:

  • R - Execution time 3 hrs 15 mins
  • Mahout - Execution time 17 mins
  • DataRush - Execution time 5 mins

Download Free Trial of Pervasive RushAnalytics

Pervasive DataRush vs. Apache Pig testing:

  • Used TPC-H data
  • Generated a 1 TB data set in HDFS
  • Ran several standard “queries” coded in DataRush and Pig (a sketch of this style of query follows this list)
  • Run times in seconds (smaller is better)
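
To make those standard queries concrete: TPC-H Query 1, the scan-and-aggregate query mentioned earlier, groups the lineitem table by return flag and line status and computes sums, averages, and row counts over every qualifying row. The minimal pandas sketch below shows only the shape of that workload; it is not the DataRush or Pig code that was actually timed, and the lineitem.csv input is a hypothetical export of the table.

```python
# TPC-H Query 1 ("pricing summary report") expressed with pandas.
# Illustrative only: this is not the DataRush or Pig implementation that was timed.
# ASSUMPTION: lineitem.csv is a hypothetical CSV export of the TPC-H lineitem table
# with standard column names (l_quantity, l_extendedprice, l_discount, l_tax,
# l_returnflag, l_linestatus, l_shipdate, ...).
import pandas as pd

df = pd.read_csv("lineitem.csv", parse_dates=["l_shipdate"])

# Q1 predicate: l_shipdate <= date '1998-12-01' - interval '90' day
df = df[df["l_shipdate"] <= pd.Timestamp("1998-12-01") - pd.Timedelta(days=90)]

# Derived measures used by the aggregates.
df["disc_price"] = df["l_extendedprice"] * (1 - df["l_discount"])
df["charge"] = df["disc_price"] * (1 + df["l_tax"])

# Full-table scan, then group and aggregate by return flag and line status.
report = (
    df.groupby(["l_returnflag", "l_linestatus"])
      .agg(sum_qty=("l_quantity", "sum"),
           sum_base_price=("l_extendedprice", "sum"),
           sum_disc_price=("disc_price", "sum"),
           sum_charge=("charge", "sum"),
           avg_qty=("l_quantity", "mean"),
           avg_price=("l_extendedprice", "mean"),
           avg_disc=("l_discount", "mean"),
           count_order=("l_quantity", "count"))
      .sort_index()
)
print(report)
```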

Cluster Configuration:

  • 5 worker nodes
  • 2 x Intel Xeon E5-2650 (8-core)
  • 64 GB RAM
  • 24 x 1 TB SATA drives, 7200 RPM
[Chart: DataRush vs. Pig run times in seconds]

Download Free Trial of Pervasive RushAnalytics

Nice Numbers, But What Do They Mean?

These performance numbers aren't just about raw clock speed. An order-of-magnitude jump in execution speed takes you to whole new levels of what is possible. Time saved training an algorithm plus time saved executing it adds up: you can run more test iterations and train on much larger data sets. That means higher predictive accuracy. And you can refresh and retrain algorithms as often as needed to maintain and improve accuracy over time.

It also means you get the answers you need 10X or more faster than before. So you can ask more questions, including questions that previously took too long to be worth asking. And you can get the answers while they still matter.

Speed AND accuracy. On reasonably priced industry standard hardware.

Not impossible. Pervasive.

Download Free Trial of Pervasive RushAnalytics

If Pervasive can deliver that level of terascale processing on a small, inexpensive cluster, what could you accomplish on the hardware in your own data center that you previously wrote off as impossible?

Thinking about the future? Would you prefer to:

Spend a million dollars on big iron and get terabytes/hour processing speed …

Or spend less than $100K on commodity hardware, use Pervasive, and get terabytes/hour processing speed?

Keep in mind that data volumes keep getting bigger. Scaling up that appliance will cost another million dollars, plus months of redesigning everything you’ve built on it from scratch. Adding another node to your Hadoop/Pervasive cluster takes about $10K in hardware and no redesign of your analytics; Pervasive automatically takes advantage of the additional compute power.

Impossible?

Not impossible. Pervasive.

Download Free Trial of Pervasive RushAnalytics

If you thought you’d have to hire an army of MapReduce programmers to get this level of performance, think again. Pervasive workflows are designed in the KNIME visual point, click, and configure interface. No MapReduce coding required.


“That scale of processing speed is necessary for smart grid and other internet-of-things projects. The next challenge in the internet of things is that you’ll need the ability to look at data on the fly and determine if it’s valid data. Should it be part of an automated decision? That’s lacking today and, to a large extent, isn’t being thought about for the future.”

Joseph A. di Paolantonio
VP and Principal Analyst
Constellation Research

 


“Hadoop has opened people’s eyes to the true power of industry standard hardware.”

Mike Hoskins
CTO & General Manager
Pervasive Big Data & Analytics
Pervasive Software

Accelerating Big Data 2.0™