Analytics Engine For Parallel Data Processing: Actian DataRush

What is Actian DataRush?

Actian DataRush is a patented application framework and analytics engine for high-speed parallel data processing.

What is it for? Some of our customers have used DataRush for risk analysis applications, fraud detection, healthcare claims management, cybersecurity, network optimization, telecom call detail record analysis for optimizing customer service, and we're doing some pioneering work with utility companies on smart grid optimization.

Why is it better than other options? Most software designed today, even modern analytics software, is single-threaded, designed to do its work serially, not in parallel on multiple cores. Yet, even the basic laptop I’m typing this on has two cores. Inexpensive industry standard servers, such as the ones often used in Hadoop clusters, have between four and sixteen cores each, yet most software is designed to use only one.

This means that even in an economical Hadoop cluster, 80% of hardware spend gets wasted because of inefficient software.

Actian DataRush is the solution to that problem.

The reason most software still isn’t designed to take advantage of multi-core hardware is simple: writing multi-threaded applications is difficult and time-consuming. It requires a high level of programming skill, and the result is very specific to the machine configuration.

Example: A highly skilled multi-threaded application developer guru writes a data processing application for your 2-core laptop. 

You then upgrade to a 4-core desktop machine, expecting that it will run twice as fast, right?  Wrong. It would run at the same speed, assuming the new cores were roughly equal in processing speed to the old ones. It would use only 50% of the available compute power. Why?

That highly skilled developer wrote the application to split all work two ways. To use all four cores and run twice as fast, that expensive, highly skilled programmer would have to rewrite the application.

Great, now it's twice as fast. 

You just got a shiny new 8-core server! 

Sigh. Back to the drawing board.

What about Hadoop?

The same principle holds to an even larger extent in distributed data processing in a cluster of computers. Not only does your guru developer need to understand the complexities of writing multi-threaded applications, but also the additional complexities of distributed computing. And since the main commercially viable distributed computing model available is Hadoop, that programmer had better be really good at MapReduce.

Just like other parallel development, MapReduce workflows have to be designed for a particular cluster configuration. When the configuration of the cluster changes, such as adding more computers, the applications written in MapReduce must be re-written.

Just to make the programmer's life more interesting, not all computers in a cluster are necessarily identical. What if your cluster has ten older machines, with two cores each, twenty 4-core workhorses, and five brand new 16-core servers? You want to make a MapReduce programmer laugh (or possibly cry)? Tell him he has to write code for a cluster in that configuration.

Read more about Hadoop and Actian.

1. Auto-detect available cores at runtime.

Actian DataRush is a programming framework and data processing engine that detects the available cores, threads, CPUs, etc. in any environment and adjusts the data processing workflow accordingly at runtime.

That last bit is important. Since the analytics workload isn't divided up until execution time, the application can be built one time, and executed on any hardware. In each hardware configuration, DataRush detects the available hardware, divides the work accordingly, and then uses every bit of power available.
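To make the idea concrete, here is a minimal sketch in plain Java of dividing work at execution time rather than at coding time. It is not DataRush code and does not use the DataRush API; the class and method names (RuntimeSplitSketch, parallelSum) are hypothetical, and the only assumption is the standard JDK call Runtime.getRuntime().availableProcessors().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of DataRush.
final class RuntimeSplitSketch {

    // Sums the array using `workers` threads, one contiguous slice per thread.
    static long parallelSum(long[] values, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (values.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(values.length, w * chunk);
                final int to = Math.min(values.length, from + chunk);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += values[i];
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The worker count is read when the program runs, not fixed in the source.
        int cores = Runtime.getRuntime().availableProcessors();
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println("cores=" + cores + " sum=" + parallelSum(data, cores));
    }
}
```

The same compiled class splits the work two ways on a 2-core laptop and eight ways on an 8-core server, because the split count is a runtime value rather than a constant in the code.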

Add more or better hardware, get more speed. No re-programming necessary.

2. Let the framework handle the complexity of multi-threaded programming, so the programmer can focus on the program.

Much of the low level, tedious complexity of multi-core programming is handled by the DataRush framework. The developer can focus on the processing steps that are needed to accomplish the business goal, and the framework handles how that work is divided. This makes it far easier for developers to use than frameworks such as MapReduce, vastly reducing initial development time.

3. Detect available CPUs in every node in a cluster.

Actian DataRush detects the compute power available not just on individual machines, but on clusters of machines. Applications built on DataRush can detect at runtime the number of available cores, CPUs, threads, etc. on each machine in a cluster and divide the data analytics workload appropriately.
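As an illustration of what "divide appropriately" can mean on a mixed cluster, the sketch below hands out data partitions in proportion to each node's core count. This is an assumed, simplified policy written in plain Java, not the scheduling logic DataRush actually uses; the names ClusterShareSketch and partitionsPerNode are hypothetical, and the core counts are passed in by hand where a real system would gather them from each node.

```java
// Hypothetical illustration only; not part of DataRush.
final class ClusterShareSketch {

    // Returns how many of `partitions` each node receives, proportional to its cores.
    static int[] partitionsPerNode(int[] coresPerNode, int partitions) {
        int totalCores = 0;
        for (int cores : coresPerNode) {
            totalCores += cores;
        }
        if (totalCores <= 0) {
            throw new IllegalArgumentException("cluster has no cores");
        }
        int[] share = new int[coresPerNode.length];
        int assigned = 0;
        for (int i = 0; i < share.length; i++) {
            // Proportional share, rounded down.
            share[i] = (int) ((long) partitions * coresPerNode[i] / totalCores);
            assigned += share[i];
        }
        // Hand out the partitions lost to rounding, one per node in order.
        for (int i = 0; assigned < partitions; i = (i + 1) % share.length) {
            share[i]++;
            assigned++;
        }
        return share;
    }

    public static void main(String[] args) {
        // The mixed cluster described earlier: ten 2-core, twenty 4-core, five 16-core machines.
        int[] cores = new int[35];
        for (int i = 0; i < 35; i++) {
            cores[i] = i < 10 ? 2 : (i < 30 ? 4 : 16);
        }
        int[] share = partitionsPerNode(cores, 180); // 180 cores in total
        System.out.println(share[0] + " " + share[10] + " " + share[30]);
    }
}
```

With 180 partitions on that 180-core cluster, each machine receives exactly as many partitions as it has cores, so the old 2-core machines are not overloaded and the 16-core servers are not left idle.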

While most data centers achieve 15% hardware usage at best, Actian DataRush clusters routinely experience 60-70% usage, with the capability to go as high as 90%. DataRush resource usage is often deliberately capped at 70% to leave headroom for other applications. Since other applications rarely use more than 10-15% of the available hardware power, this resource sharing makes optimum use of any hardware. This provides game-changing processing speeds on modest, inexpensive hardware, as well as huge savings in energy costs and carbon footprint.

4. Use every type of parallelism possible to get the best processing speed.

Horizontal Parallelism - DataRush, like MapReduce, makes use of horizontal partitioning, also known as data parallelism, or embarrassingly parallel processing.

[Figure: Horizontal partitioning (data parallelism)]
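A minimal sketch of data parallelism in plain Java, for readers who want to see the shape of it: the same step is applied to every record, and the runtime splits the records across worker threads. It uses standard JDK parallel streams, not the DataRush API, and the names DataParallelSketch and normalizeAll are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of DataRush.
final class DataParallelSketch {

    // Applies one identical step (trim and upper-case) to every record.
    // The stream runtime partitions the list across threads and keeps the order.
    static List<String> normalizeAll(List<String> names) {
        return names.parallelStream()
                .map(name -> name.trim().toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(normalizeAll(Arrays.asList(" ann ", "bob", "  cy")));
    }
}
```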

Vertical Parallelism - Unlike most MapReduce implementations and most other analytics frameworks (some High Performance Computing models excepted), DataRush also uses vertical partitioning, also known as task parallelism.

[Figure: Vertical partitioning (task parallelism)]
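Task parallelism can be sketched in plain Java as two different operations running at the same time over the same data. The example below uses the standard JDK CompletableFuture class, not the DataRush API; the names TaskParallelSketch and sumAndMax are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; not part of DataRush.
final class TaskParallelSketch {

    // Runs two independent tasks concurrently and returns {sum, max}.
    static long[] sumAndMax(long[] values) {
        CompletableFuture<Long> sum = CompletableFuture.supplyAsync(() -> {
            long s = 0;
            for (long v : values) {
                s += v;
            }
            return s;
        });
        CompletableFuture<Long> max = CompletableFuture.supplyAsync(() -> {
            long m = Long.MIN_VALUE;
            for (long v : values) {
                m = Math.max(m, v);
            }
            return m;
        });
        // join() waits for each task; neither task depends on the other.
        return new long[] { sum.join(), max.join() };
    }

    public static void main(String[] args) {
        long[] result = sumAndMax(new long[] { 3, 9, 4 });
        System.out.println("sum=" + result[0] + " max=" + result[1]);
    }
}
```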

Pipeline Parallelism - DataRush also uses a dataflow paradigm that takes advantage of pipeline parallelism, another technique normally seen only in High Performance Computing models.

[Figure: Pipeline parallelism]
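In a pipeline, a downstream step starts working on the first records while the upstream step is still producing the rest. The plain Java sketch below shows two stages connected by a bounded queue. It is an assumed, minimal form of the technique built on the standard JDK BlockingQueue, not the DataRush dataflow engine; the names PipelineSketch and sumOfLengths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not part of DataRush.
final class PipelineSketch {

    private static final int END = Integer.MIN_VALUE; // sentinel marking end of stream

    // Stage 1 (its own thread) measures each line; stage 2 (the caller) adds up
    // the results as they arrive, so both stages run at the same time.
    static long sumOfLengths(List<String> lines) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) {
                    queue.put(line.length());
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        long total = 0;
        try {
            for (int len = queue.take(); len != END; len = queue.take()) {
                total += len;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfLengths(Arrays.asList("ab", "cde", "")));
    }
}
```

The bounded queue also keeps memory use flat: the first stage blocks when it gets too far ahead of the second.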

Not every project can use every type of parallelism, but DataRush always uses a dataflow architecture, and automatically divides the workload as many ways as possible in the available environment.

Add it all together

This means that applications developed on DataRush can be built once, and deployed anywhere. A DataRush developer can write and test an application on a 2-core laptop, and it will run at a near-linear speed increase on a 384-core super server or a 300-node cluster. This has given us the fastest, most efficient and economical analytics data processing engine on the planet. Fastest to develop on, fastest to deploy, and fastest to process.

See some Actian DataRush Performance Metrics

Request Free Trial of Actian DataRush

We built it. What did we use it for?

Pervasive first put DataRush to work in our own software, particularly our data quality related software. Data profiling, fuzzy matching and de-duplication can become a massive drain on the compute power of any hardware, even when data volumes are not huge. As data volumes grow, or the number and complexity of rules you need to check grows, that challenge becomes exponentially more difficult. DataRush has solved that problem for us nicely.

Pervasive has had the fastest Data Profiler and Data Matcher in the world for years. How do we know that? Because, no matter what hardware you put it on, it will go as fast as that hardware can possibly support, because it's built on Pervasive DataRush. Certainly, someone could build a faster one, but it would only be faster in one particular hardware configuration, and when the hardware shifted, it wouldn't be able to shift with it. DataRush applications are future-proof.

Read more about Pervasive Data Quality Tools

Read more about High Speed Data Quality and Data Preparation for Analytics

Also, we decided that data analysis workflow creation and testing, crazy idea here, ought to be accessible to data analysts, not just programmers. So, we hooked up with an outstanding open source data mining platform, KNIME, that had an easy to use and sensible user interface, and created Pervasive RushAnalytics™.

Read more about Pervasive RushAnalytics

So, that's how we use DataRush. How can you use it?

Some of our customers have used DataRush for risk analysis applications, fraud detection, healthcare claims management, cybersecurity, network optimization, telecom call detail record analysis for optimizing customer service, and we're doing some pioneering work with utility companies on smart grid optimization.

What could you use the world's fastest, easiest to use, most cost-effective and most future-proof analytics engine for?

Analytics datasets have been growing exponentially for years. Data analysts have struggled to find ways to get business value from those datasets with software not really designed to handle them. Now, even the compromises and long-standing workarounds are starting to bog down in the modern explosion of bigger and more complex data.

Processing challenges:

  • Performance bottlenecks negatively impact business analytics agility - so answers come too slowly to be valuable
  • Settling for using sample data to reduce analysis time - so answers may be inaccurate or missing key insights
  • Disappointed performance expectations - so you get used to getting answers that are stale and late
  • Forced to consider complex development projects - so you have to invest in relatively inexpensive clusters that require expensive, hard-to-find MapReduce developers, or consider outrageously expensive appliances

DataRush dramatically reduces the time and money necessary to do complex analysis on large datasets. That's the whole point, really. Higher data throughput per dollar spent.

Example: A global bank used DataRush to cut the processing time of its risk management solution from 15+ hours to about 20 minutes, roughly a 45-fold speedup, while reducing the hardware required by 50%.

With that power, you can run analytics on entire huge datasets, instead of a tiny sample. Use more variables and attributes in analytics to provide deeper insights, broader operational intelligence, and more reliable results. Run analyses more often and get answers right when you need them.

What all this really means is better decision-making, faster responses to market changes, fraud and cybersecurity breach prevention, optimization of processes, and timely competitive strategies, without the corresponding high cost.

DataRush runs on a variety of operating systems and hardware platforms, including Windows and Linux: virtually any machine with a JVM.

Boosts Performance of Apache Hadoop

  • Drop DataRush into existing Hadoop MapReduce jobs and boost execution speed by using the multicore hardware in a cluster more efficiently than native Hadoop does.
  • Use DataRush in place of MapReduce for an even higher performance boost in many cases.
  • Expand the range of algorithms and applications that can be run using Hadoop by leveraging DataRush’s flexible dataflow model.
  • Achieve high-performance reading and writing to Hadoop’s distributed file system.
  • Access Hadoop data even faster with the DataRush HBase operator.

For more information on how we work with Hadoop, see our What is Hadoop? page.

Provides two core libraries: 

  • Actian DataRush Core Data Preparation Library
  • Actian DataRush Core Analytics Library

While these libraries of operators are often sufficient to build your analytics application, the Actian DataRush Java SDK allows the developer to build custom operators. In fact, the Actian DataRush Core Libraries are built using this same, powerful SDK.

A JavaScript scripting API with code completion is the easiest, quickest path to fast custom operators. In addition, DataRush supports development in any JVM-based language: Java, Jython, JRuby, Scala, Groovy, etc.

For Data Preparation

  • Full array of data preparation operators including standard data processing functionality such as: sort, join, aggregation (data grouping), and transformations.
  • The means to stage data to disk in a very efficient format that supports parallel writing and reading. This is useful for staging data between phases of execution and can be a useful way of communicating large data between software components.
  • A full data profiling library of operators including the means to create a complex set of metrics to execute against input data.
  • A full array of data quality profiling metrics and fuzzy matching operators for de-duplication.
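For readers unfamiliar with fuzzy matching, the sketch below shows one common way to flag candidate duplicates: Levenshtein edit distance with a small threshold. This page does not say which matching algorithms the DataRush operators use, so treat this as a generic illustration in plain Java; the names FuzzyMatchSketch, editDistance and likelyDuplicate, and the threshold parameter, are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of DataRush.
final class FuzzyMatchSketch {

    // Levenshtein distance: the minimum number of single-character inserts,
    // deletes, or substitutions needed to turn `a` into `b`.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Two values are candidate duplicates when, after trimming and lower-casing,
    // they differ by at most `maxEdits` edits.
    static boolean likelyDuplicate(String a, String b, int maxEdits) {
        String left = a.trim().toLowerCase(Locale.ROOT);
        String right = b.trim().toLowerCase(Locale.ROOT);
        return editDistance(left, right) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(likelyDuplicate("Jon Smith", "John Smith ", 1));
    }
}
```

Comparing every record against every other record grows quadratically with the number of records, which is why this kind of work becomes a drain on compute power even at moderate data volumes.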

Data Access

It all starts with reading in the data and ends with writing it out. Actian DataRush has readers and writers for multiple data sources and formats, including SQL databases, NoSQL databases such as HBase, and flat and delimited files in any file system, including the Hadoop Distributed File System (HDFS) and the Amazon file system. For reuse of statistical and data mining models, Actian DataRush also reads and writes PMML (Predictive Model Markup Language).

For Analytics

  • Core set of parallelized data mining algorithms built on the Actian DataRush engine.
  • Algorithms are data scalable and built to work with any size of data, from a few thousand rows to many billions or more. There is no requirement to load all data into memory, so there's no need for expensive hardware with huge memory capacity.
  • Classification algorithms for predicting class of data: Decision Trees, Naive Bayes, KNN, SVM.
  • Clustering algorithms for basic analysis such as customer segmentation: K-Means.
  • Unsupervised learning algorithms for finding unknown patterns in data: ARM (association rule mining), Neural Networks.
  • Trending algorithms for understanding and predicting future growth: Linear, Logistic, Polynomial, and Multi-variable Regression.
  • Feature Selection algorithms for discovering strong correlations: Principal Component Analysis (PCA).
  • Exchange models with SAS, SPSS, and other tools via PMML import and export support.
  • Include R or Weka code within any DataRush execution flow.
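The point above about not needing to load all data into memory can be illustrated with a streaming calculation: the plain Java sketch below computes a mean by reading one value at a time and keeping only a running count and running mean. It is a generic example of the idea, not a DataRush algorithm; the names StreamingMeanSketch and mean are hypothetical, and the input is assumed to hold one number per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of DataRush.
final class StreamingMeanSketch {

    // Reads one number per line and updates the mean incrementally,
    // so memory use stays constant no matter how many rows arrive.
    static double mean(Reader source) {
        long count = 0;
        double mean = 0.0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String value = line.trim();
                if (value.isEmpty()) {
                    continue; // skip blank lines
                }
                count++;
                mean += (Double.parseDouble(value) - mean) / count;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count == 0 ? Double.NaN : mean;
    }

    public static void main(String[] args) {
        System.out.println(mean(new StringReader("1\n2\n3\n6\n")));
    }
}
```

Swapping the StringReader for a file reader gives the same result on a file far larger than available memory, because no more than one line is held at a time.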

Simplifies and Shortens Development Cycle

  • JavaScript interface with code completion for easy application development
  • Also supports any other JVM-based language - Java, Jython, JRuby, Scala, etc.
  • Handles complex low-level multi-threaded programming aspects automatically
  • Frees developers to focus on higher level work

Write Once, Deploy Anywhere

  • DataRush detects available compute power at runtime and distributes work accordingly
  • Move from desktop to server to Hadoop cluster without modifying your workflow
  • As you add more nodes to the cluster, or upgrade nodes, DataRush auto-scales with no need to re-code

Comprehensive data preparation

  • Data Quality operators measure data integrity, discard bad data up front
  • Fuzzy Matching identifies candidate records to merge or purge
  • Reads data from many sources/formats: SQL databases, HDFS, HBase, Amazon file system, local file systems ...
  • Comprehensive data prep functions include joining, aggregation, sorting, filtering, statistics, time series ...

Descriptive and Predictive Analytics Including Machine Learning

  • Build, test and deploy descriptive and predictive models
  • Includes basic clustering and association, as well as classifiers, regression and more
  • All analytics algorithms optimized for high speed distributed analysis
  • Reads and writes industry-standard PMML for predictive model import/export

Works with existing toolsets

  • Can call out to R and Weka functions, read SAS dataset files and more

Both data prep and analytics run natively on Hadoop

  • Not just a connector to HDFS like many applications that advertise Hadoop support
  • Actian directly leverages Hadoop’s parallel distributed processing model

No cost to get started: license for development and test is free

“Our unique value is in giving our customers the power of scalable, reusable Big Data analytics at the speed of business. Pervasive DataRush’s high-performance, seamlessly scalable capabilities, whether on a single server or a Hadoop cluster, helps us deliver on that value proposition.”

–Laks Srinivasan
Co-Chief Operating Officer
Opera Solutions

 


 

Accelerating Big Data 2.0™