Pervasive DataRush Performance Testing Results Wow HUG
Written by:
3/1/2013 6:07 AM
Pervasive Big Data & Analytics Chief Technologist Jim Falgout recently had an opportunity to speak to the Bay Area Hadoop User Group (HUG), along with Mukund Madhugiri and Baljit Deot of Yahoo! and Hari Shreedharan of Cloudera. Jim discussed the major barriers to effective Hadoop deployments in the enterprise – complexity and the steep learning curve of MapReduce.
He detailed how Pervasive Big Data & Analytics addresses these issues with a visual workbench, integrated with Apache Hadoop, that lets data scientists and analysts build and execute complex big data workflows with minimal training and no MapReduce knowledge. A long-time evangelist of the dataflow approach to big data, which is built into the Pervasive DataRush framework, Jim discussed the key concepts behind it (libraries of pre-built operators, directed graphs, pipeline parallelism and a “shared-nothing” architecture) and outlined the specific enterprise benefits of Pervasive DataRush, Pervasive RushAnalytics and our accelerator for KNIME.
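For readers new to the dataflow idea, the concepts Jim listed can be illustrated with a small, generic sketch. This is not the DataRush API; it is a plain Python approximation, and every name in it (`source`, `filter_op`, `sum_op`, `run_pipeline`, the `qty` and `price` fields) is hypothetical. Each operator runs in its own thread, owns its own state, and talks to its neighbors only through queues, so downstream operators start working while upstream ones are still producing rows.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the row stream


def source(rows, out_q):
    """Reader operator: pushes rows downstream one at a time."""
    for row in rows:
        out_q.put(row)
    out_q.put(_DONE)


def filter_op(predicate, in_q, out_q):
    """Filter operator: forwards only the rows that satisfy the predicate."""
    while True:
        row = in_q.get()
        if row is _DONE:
            out_q.put(_DONE)
            return
        if predicate(row):
            out_q.put(row)


def sum_op(field, in_q, result):
    """Aggregate operator: sums one field and appends the total to result."""
    total = 0.0
    while True:
        row = in_q.get()
        if row is _DONE:
            result.append(total)
            return
        total += row[field]


def run_pipeline(rows):
    """Wire source -> filter -> sum as a three-node directed graph and run it."""
    # Small bounded queues are the graph's edges; they also apply backpressure.
    q1 = queue.Queue(maxsize=4)
    q2 = queue.Queue(maxsize=4)
    result = []
    operators = [
        threading.Thread(target=source, args=(rows, q1)),
        threading.Thread(target=filter_op, args=(lambda r: r["qty"] < 24, q1, q2)),
        threading.Thread(target=sum_op, args=("price", q2, result)),
    ]
    for op in operators:
        op.start()
    for op in operators:
        op.join()
    return result[0]


if __name__ == "__main__":
    sample = [
        {"qty": 10, "price": 5.0},
        {"qty": 30, "price": 7.0},
        {"qty": 5, "price": 2.5},
    ]
    print(run_pipeline(sample))
```

A production dataflow engine would add data partitioning, typed record flows and distribution across cluster nodes; the sketch only shows how operators joined by a directed graph can overlap their work without sharing state.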
TPC-H Performance Testing: Pervasive DataRush vs. Apache Pig
A highlight of Jim’s talk was the results of TPC-H performance testing*, in which Pervasive DataRush outperformed comparable Apache Pig scripts. For a number of HUG participants the results were eye-opening. They may be for our blog readers, too:

[Chart: TPC-H query run times in seconds, Pervasive DataRush vs. Apache Pig]
If you’d like to learn more about the testing, please contact Pervasive Big Data & Analytics.
*Additional details on Pervasive DataRush vs. Apache Pig testing:
- Used TPC-H data
- Generated a 1TB data set in HDFS
- Ran several “queries” coded in both DataRush and Pig
- Run times are reported in seconds (smaller is better)
Cluster Configuration:
- 5 worker nodes
- 2 x Intel E5-2650 (8-core)
- 64GB RAM
- 24 x 1TB SATA drives, 7200 rpm
Resources of Interest
YouTube Video:
Jim’s February 2013 Bay Area HUG presentation
Slideshare:
“A Visual Workbench for Big Data Analytics on Hadoop”
Jim’s article on dataflow in Dr. Dobb’s:
“Dataflow Programming: Handling Huge Data Loads Without Adding Complexity”
