Data Preparation for Data Mining:
ParAccel Hadoop ETL/DQ

Wait, You Can't Analyze That Data Yet!

Before you can start analyzing data, you have to join and sort your sources, profile your data, find and fix problems and duplicates, calculate averages, and perform a hundred other little transformations that turn raw data into something you can get real value from.

Preparing big data for analysis takes 60% to 80% of the total time allotted for the project.

Do any of these common frustrations sound familiar?

  • Initial data preparation setup takes weeks or months before you can even begin digging into the data.
  • When you need to add a dataset, select a different set of columns, or make any other change to the data, you have to wait for IT to make that change.
  • Miscommunications between your team and IT leave you with the wrong data, and then you have to wait again.

What if you could run quick tests, realize you didn’t have quite the right data, tweak the data preparation, and run the test again, in minutes?

What if a single platform could do all of that, without coding?

Actian's ParAccel Hadoop ETL/DQ can help you:

  • Vastly reduce time spent preparing data.
  • Audit all data, not just samples. (Don’t miss unusual problems that can crash the system.)
  • Speed deployment with automatic scaling.
  • Reduce cost and complexity of deployment by using inexpensive industry standard servers.
  • Improve green efficiency and save energy dollars by making optimum use of available hardware.
  • Easily integrate with existing analytics: R, SAS, SPSS, etc.
  • Rapidly feed prepared data into high performance analytics databases: Actian Vectorwise, ParAccel Database, Actian Versant, etc.

ParAccel Hadoop ETL/DQ supports access to all standard databases, anything with a JDBC connector, flat files, and delimited files, as well as standard visualization and predictive model exchange files such as PMML and GEXF, so you can use your visualization tools of choice. Hadoop ETL/DQ can access data as fast as the source system can possibly feed it in, even if you're talking millions of records per second, and it can read and write any combination of sources simultaneously. 

ParAccel Hadoop ETL/DQ also reads and writes Hadoop (HDFS and HBase) data. But we're not just talking about a connector to Hadoop, which everyone seems to have these days. We're talking about the ability to read HDFS and HBase where they live, on a cluster, in a distributed, high-speed, parallel fashion, so that reading, writing and transforming massive amounts of Hadoop data can be done faster than you might think possible.

We have performance metrics for processing Hadoop on a tiny, inexpensive cluster that would blow your mind. Give Actian software more powerful hardware to run on, and Hadoop ETL/DQ accesses the data that much faster with near linear automatic scaling.

Data access has never been more efficient.

Before analysis, do you need to check aspects of your data against business rules? Check mins, maxes, averages and such?  Actian's ParAccel Hadoop ETL/DQ takes the limits off the number of metrics you can check without bogging down your hardware. 
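To make the idea concrete, here is a minimal sketch in plain Python of profiling a column and checking it against business rules. It is a generic illustration, not the Hadoop ETL/DQ interface: the function names, the sample claim amounts, and the three rules are all hypothetical.

```python
from statistics import mean

def profile_column(values):
    """Return count, missing count, min, max, and mean, ignoring None values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
    }

def check_rules(profile, rules):
    """Return the names of the rules that the profile violates."""
    return [name for name, rule in rules.items() if not rule(profile)]

# Hypothetical claim amounts and business rules, for illustration only.
claim_amounts = [120.0, 75.5, None, 310.0, 42.5]
rules = {
    "no_negative_amounts": lambda p: p["min"] >= 0,
    "max_under_250": lambda p: p["max"] <= 250,
    "no_missing_values": lambda p: p["missing"] == 0,
}

profile = profile_column(claim_amounts)
violations = check_rules(profile, rules)
print(profile)
print(violations)
```

Each rule is just a function of the profile, so adding another metric means adding another entry to the rules table rather than another pass over the data.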

Need to find matches in data with similar names, addresses, ID numbers?  The DataMatcher operators let you build a high speed solution for matching inaccurate, inconsistent and duplicate data.  
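As a rough illustration of what matching inconsistent records involves, the sketch below uses Python's standard difflib module to score the similarity of normalized strings. This is not the DataMatcher operators or their algorithm: the helper names, the sample records, and the 0.85 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_likely_duplicates(records, threshold=0.85):
    """Return (index_a, index_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

# Hypothetical records, for illustration only.
records = [
    "John Smith, 12 Oak Street",
    "Jon Smith, 12 Oak St.",
    "Maria Garcia, 48 Elm Avenue",
    "john smith 12 oak street",
]

for i, j, score in find_likely_duplicates(records):
    print(records[i], "<->", records[j], score)
```

Comparing every pair is quadratic in the number of records, which is exactly why this kind of matching benefits from a parallel engine once the data gets large.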

ParAccel Hadoop ETL/DQ also lets you fill in missing values, sort, aggregate, and perform any other transformations needed to improve data quality and prepare the data you need, whether it's in a data store or streaming by on the fly. Whether you're dealing with big data or a normal analytics data set that might laughably be called "small," you can crunch through it fast, so you can move on to the real work: data mining and analysis.
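The three transformations named above can be sketched in a few lines of plain Python. This is only an illustration of the steps themselves, not of how Hadoop ETL/DQ implements them: the sample rows, the field names, and the choice to fill gaps with the mean are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input rows, for illustration only.
rows = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": None},
    {"region": "east", "amount": 300.0},
    {"region": "west", "amount": 50.0},
]

def fill_missing(rows, field):
    """Replace None in `field` with the mean of the values that are present."""
    present = [r[field] for r in rows if r[field] is not None]
    default = mean(present)
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]

def total_by(rows, key, field):
    """Sum `field` for each distinct value of `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[field]
    return dict(totals)

filled = fill_missing(rows, "amount")                               # fill in missing values
ordered = sorted(filled, key=lambda r: (r["region"], r["amount"]))  # sort
totals = total_by(ordered, "region", "amount")                      # aggregate
print(totals)
```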

ParAccel Hadoop ETL/DQ integrates directly with many existing data mining and analytics toolsets. Over 1000 operators of data mining and preparation functionality are available in KNIME. KNIME nodes can be mixed with Actian nodes directly in the interface, even in the same workflow. And, naturally, Hadoop ETL/DQ has built-in native connectivity to Actian's own extremely high speed analytics databases: Vectorwise and ParAccel. 

ParAccel Hadoop ETL/DQ also uses the industry standard Predictive Model Markup Language (PMML) as either an input or output format, providing easy interfacing with tools such as SAS and SPSS.

Working with R for statistical computing? Hadoop ETL/DQ can do the heavy lifting on big data preparation and flow the output straight to your R code, vastly reducing overall execution time. You can also run snippets of R code as just another step of the Actian ParAccel workflow, open R views or even learn models within R. 

ParAccel Hadoop ETL/DQ is built on the patented ParAccel DataFlow Engine to ensure scalability and future-proof your data preparation workflows. The DataFlow Engine automatically detects and utilizes all cores and nodes available at runtime, up to a settable limit. This means that execution of your Hadoop ETL/DQ workflow moves seamlessly from desktop to server to cluster, without the need to modify code or re-design. A simple settings change in the interface points at the new hardware location, and off it goes. (Read more about ParAccel DataFlow Engine)

For example, a data preparation workflow written on a 4-core desktop will automatically scale to take full advantage of the additional resources when executed on a 16-core server or a cluster of 100 8-core machines. Every bit of current hardware will be used to its fullest capacity, nothing wasted. Organizations can simply add more compute resources to keep up with growing data volumes over time.
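The general idea of sizing the work to whatever hardware is present, up to a limit, can be sketched on a single machine with Python's standard library. This is not the DataFlow Engine and says nothing about how it schedules work across nodes: the function names and the sum-of-squares workload are hypothetical, and a thread pool is used only to keep the example short (CPU-bound Python code would normally use a process pool instead).

```python
import os
from concurrent.futures import ThreadPoolExecutor

def worker_count(limit=None):
    """Use every core the runtime reports, capped at an optional limit."""
    available = os.cpu_count() or 1
    return available if limit is None else max(1, min(available, limit))

def partition(items, parts):
    """Split items into `parts` roughly equal chunks."""
    return [items[i::parts] for i in range(parts)]

def parallel_sum_of_squares(items, limit=None):
    """Same code on any machine; only the number of workers changes."""
    workers = worker_count(limit)
    chunks = partition(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(x * x for x in chunk), chunks)
        return sum(partials)

print(worker_count(), "workers detected")
print(parallel_sum_of_squares(list(range(10)), limit=4))
```

The calling code never mentions a core count, so the result is the same whether it runs on 4 cores or 16; only the degree of parallelism differs.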

ParAccel Hadoop ETL/DQ runs independently of Hadoop on any operating system with a JVM: Windows, Mac, Linux and various flavors of UNIX.

ParAccel Hadoop ETL/DQ runs natively on all major Hadoop distributions:

  • Apache
  • Cloudera
  • Hortonworks
  • IBM BigInsights

For more information, see our Hadoop Solutions page. 

Do you need a data preparation operator that Actian didn't build?

ParAccel Hadoop ETL/DQ includes a full set of data access and preparation operators that, because they are built directly with the ParAccel DataFlow Engine API, are fully optimized for running on multi-core and distributed systems, providing automatic scaling and extreme levels of performance.

If you need a parallel optimized dataflow operator that is not included in ParAccel Hadoop ETL/DQ or ParAccel Hadoop Analytics, you can use the DataFlow Engine API to develop custom operators in any JVM language, such as Java, JRuby, Groovy, Jython or Scala. Your custom operators will have the same automatic scaling and extreme performance, thanks to the patented ParAccel DataFlow Engine framework, and the framework will handle the complex multi-threading aspects, making development far faster and easier.

Also, keep in mind that standard KNIME operators can be mixed with the Actian ParAccel parallel optimized operators in ParAccel Hadoop ETL/DQ to give a tremendous breadth of functionality. 

Of course, if you need data mining or predictive analytics operators, then consider ParAccel Hadoop Analytics. It may already have the functionality you're looking for.



A healthcare company took 48 days using SAS to do data preparation for data mining on insurance claims to check for fraud. If the process failed, it had to be re-started, taking another 48 days.

With Actian software, they did the same thing in less than one day.

What could you accomplish if your data was prepared for data mining and analysis 48 times faster?


Accelerating Big Data 2.0™