What is Hadoop?

What's all the hoopla about Hadoop?

Hadoop is in the industry news a great deal these days. It's fascinating technology to the geeks of the world, but what does it do for a business? How does that technology help you solve business problems like targeted marketing, cybersecurity, medical claims fraud detection, network optimization, or smart grid energy usage optimization? How can it advance life sciences or pharmacology?

What is Hadoop exactly? Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. But what does that really mean? How is Hadoop used? In what situations is Hadoop needed? How does it work?

And where does Actian fit in? What advantage is there to using Actian DataRush with Hadoop, as opposed to Hadoop alone? Is Actian a Hadoop replacement? Does it execute natively on Hadoop? What the heck does "execute natively on Hadoop" even mean? 

That's a lot of questions. Let's see if we can shed a little light on the answers.

Apache Hadoop Matters Because ...

Historically, large data sets for business analytics and scientific advancement were processed on expensive specialized servers that many companies couldn’t afford, and that were difficult to scale up as data volumes rapidly grew. Hadoop is a different strategy for processing massive amounts of data in small amounts of time. What makes it different?

It's Affordable - Hadoop is a software framework that supports dividing data processing across multiple networked computers, aka distributed processing. These groups of computers are called clusters, and generally consist of inexpensive industry-standard machines, not expensive high-performance supercomputers or appliances. Hadoop itself is open source, minimizing software license fees.

It's Resilient - The basic concept behind Hadoop is that all processing and data storage should be spread equally across the available computers in a cluster. If one computer fails, it does no harm because the data is stored redundantly on more than one system and the processing also happens in more than one location. This makes Hadoop clusters very resilient to failure. 

It's Scalable - As data volumes grow, compute and storage capacity can be added inexpensively by simply adding more standard servers (called nodes) to the cluster. Hadoop clusters scale out almost without limit, so that as businesses and their data processing needs grow, the processing power grows right along with them in small, affordable increments.

Affordable, scalable, resilient data processing power is what makes Hadoop so exciting. The uses that power can be put to are wide and varied, but business and scientific analysis of massive amounts of data is the obvious sweet spot.

Most businesses do not need Hadoop or Actian DataRush.

As data volumes grow, and mining a wider variety of sources for data becomes more commonplace as a business practice, this could change. 

Right now, Hadoop is needed when data analysis needs become massively larger than traditional hardware and software are designed to handle. If you have traditional data warehouse-based or Excel spreadsheet-based analytics in your business, and those are working fine for you, or even if they are almost working fine for you, you don't need Hadoop. (You may need to improve your data warehouse if it's only almost working.)

When those traditional systems are overwhelmed by the data volumes you want to analyze for business value, then you may need a Hadoop style of processing platform. 

Signs you could use a Hadoop style cluster computing solution:

  • You have a serious need to improve a process, meet a goal, or solve a problem that analyzing huge amounts of data might be able to accomplish, and you have access to that data.
  • You have a massive backlog of data that your current analytics infrastructure can't handle.
  • You're being regularly forced to sample a tiny subset of available data for insight.
  • You have a new, untapped data source that you see value in, but that your current system can't process.

One other situation where a Hadoop style solution may be useful is when you want to pull data from a massive dataset; cleanse, sort, de-dupe, and aggregate the essential bits; and place those into a data warehouse or cube for further analysis. We refer to this as data preparation for analysis.

Under these conditions, you may need to consider a cluster computing option. If you find yourself in one of these situations, then you need to act quickly, and plan for the future. Data volumes can grow very rapidly, turning a minor problem into a major one in a relatively short time. 

If you're not sure, contact us for a Big Data Analytics Assessment.

We would be happy to have a look at what you're up against and give you some advice.

Data Storage - Hadoop consists of several components. The first core component is a method of storing data distributed across multiple computers: the Hadoop Distributed File System (HDFS). Another Hadoop data storage method is HBase, a distributed NoSQL database.
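
To make that concrete, here is a minimal sketch of how a Java program writes a file into HDFS using the standard org.apache.hadoop.fs client API. The NameNode address and file path below are hypothetical placeholders, not a prescription:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster's NameNode (hypothetical address).
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS transparently replicates its blocks
            // across several DataNodes in the cluster.
            Path path = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Confirm the file exists and report its size.
            System.out.println("exists: " + fs.exists(path)
                    + ", bytes: " + fs.getFileStatus(path).getLen());
        }
    }

That block replication is what gives the cluster the resilience described earlier: losing one machine loses no data.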

Data Processing - MapReduce is a programming framework and data processing engine that splits data processing jobs across multiple machines. 

When people say Hadoop, they usually mean - MapReduce and HDFS are the core elements, and what people usually think of. Hadoop has many other components for various tasks. For example, Hive and Pig are higher-level languages for querying and transforming data stored in Hadoop, and some people consider them part of Hadoop. When people say Hadoop, they may also mean some combination of the other components in the greater Hadoop ecosystem, such as Sqoop, Mahout, and ZooKeeper, each of which serves a specific purpose.

Another way to look at it - Hadoop is sort of like an operating system that can only run one kind of application: applications written to the MapReduce framework. MapReduce is ideal for processing web pages for search engines, since it was developed for exactly that purpose. It also works well for many other parallel processing tasks.
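
For a feel of what MapReduce code looks like, here is the canonical word count job, sketched in Java against the standard org.apache.hadoop.mapreduce API (input and output paths are supplied on the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: each node tokenizes its slice of the input in parallel.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        // Reduce phase: all counts for the same word arrive at one reducer.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map tasks run wherever the input blocks happen to live, and the shuffle routes matching keys to a single reducer. Every MapReduce job, however sophisticated, follows this same map-shuffle-reduce pattern.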

Hadoop Disadvantages

Inefficiency - MapReduce may not be the most efficient processing method in all cases. The performance of a cluster can vary by orders of magnitude depending on the data processing framework and how it is implemented.

Implementation Difficulty - MapReduce is a very difficult programming framework to master. Developers with MapReduce skills are in high demand and hard to find.

Alternatives

The newest version of Hadoop supports a next-generation MapReduce concept called YARN (Yet Another Resource Negotiator). YARN separates the operating system aspect of Hadoop (resource management and scheduling) from the application aspect (MapReduce itself). This will make it much easier for MapReduce to be replaced by other application frameworks that are more efficient for specific types of distributed data processing tasks.
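
In practice, that separation shows up as configuration rather than code. As a hedged sketch using the standard Hadoop 2.x property names (the ResourceManager host below is a hypothetical placeholder), the same word count job above can be submitted to a YARN-managed cluster without rewriting it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitOnYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Submit through YARN's ResourceManager instead of the classic
            // JobTracker; YARN negotiates cluster resources for the job.
            conf.set("mapreduce.framework.name", "yarn");
            conf.set("yarn.resourcemanager.address", "resourcemanager:8032"); // hypothetical host
            Job job = Job.getInstance(conf, "word count on YARN");
            // ... mapper, reducer, and path setup identical to the example above ...
        }
    }

Because the application framework is now just another client of YARN, alternative engines can plug into the same cluster in the same way.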

Actian DataRush is an example of a programming framework and analytics engine that can be used in place of MapReduce to improve data processing speed and efficiency on clusters, as well as simplify and shorten the development process. Because DataRush was developed at about the same time as Hadoop to solve similar problems, it is the first such framework on the market, and it is already being used in many production distributed applications.

DataRush can be implemented by any developer familiar with any JVM language - Java, Jython, Scala, JRuby, Groovy, etc. It even has a simple JavaScript interface.

DataRush can be used in addition to MapReduce to boost processing speed in existing MapReduce applications. When we say DataRush "runs natively on Hadoop," that is what we mean.

When people mention DataRush as a "Hadoop replacement," they mean that it can be used instead of MapReduce to get an even more impressive processing speed boost and shorter development time for many distributed applications.

Read more about Actian DataRush

See some Actian DataRush Performance Metrics

Download a free trial of Actian DataRush

To make designing analytics and data preparation workflows on Hadoop more accessible to data analysts and data scientists, Actian partnered with KNIME, an award-winning open source data mining tool with an elegant, easy-to-use interface. KNIME already comes with over a thousand data analysis and preparation operators, but in general they are, like most analytics software, single-threaded and designed to work serially.

Actian developed a version of the DataRush engine as a KNIME extension, and worked with KNIME to enhance their API so that more of their operators could take advantage of a dataflow style of architecture. The resulting product, Actian RushAccelerator for KNIME, allows existing KNIME users to take advantage of some of the speed boost of the DataRush engine.

Read more about Actian RushAccelerator for KNIME

However, many useful data access, preparation, cleansing, and analysis algorithms are not designed to run in a highly parallel distributed architecture such as Hadoop. Actian made DataRush distributed analytics operators available in the KNIME point, click, configure style of interface. This means that entire end-to-end analytics workflows can be designed without coding, and still executed at stunningly high speeds on any hardware, including Hadoop clusters, without re-doing the work. We call this analytics design and execution platform Actian RushAnalytics.

Read more about Actian RushAnalytics

Download a Free Trial of Actian RushAnalytics

Accelerating Big Data 2.0™