What Is Big Data?

Dispelling Confusion and Finding Solutions

What is big data? Big data is a general term for data sets that are too large, flowing too fast, or in too varied formats for standard data processing software and hardware to process within reasonable time frames at a reasonable cost. Obviously, everyone’s definition of “reasonable” is a little different, so big data is largely a subjective term.

Why does it matter? People say that data is the new oil. Big data is a massive, largely untapped resource lying in wait for businesses to extract the value from it and cash in. This metaphor has some truth to it. Using new technologies, companies are analyzing data sets that they never could before, and processing volumes of data that once were beyond them. From that data, businesses are extracting intelligence about their customers, their supply chain, their hardware operations, and all of the aspects of their business that affect both the top and bottom line. In addition, scientists are using this data to make new breakthroughs. The trouble faced by companies that want to get in on the action is that big data, by definition, is data that is difficult to analyze.

Unstructured data - Analyzing unstructured data in massive quantities, such as social media data for sentiment analysis, is a good example of a big data analytics challenge. Many people treat this example as the whole problem. It isn't.

Machine data - Machine-generated data, such as network data, sensor data, and smart meter data, is just as valid an example. It is a data set with tremendous potential value that is a huge challenge to analyze, due both to its sheer volume and to the speed at which it is generated.

Scientific data - Analysis of specialized formats, such as medical imaging data or genomic data for the life sciences, is another face of the big data analytics challenge.

Transactional data - Even standard transactional data can become a challenge to process at today's mega-companies, which run analyses such as market basket analysis on data sets that grow by millions of transactions every hour.

Some more specific examples:

  • Call detail records for reducing call drop rates in telecom companies
  • Netflow data for network usage optimization and cybersecurity in data centers
  • Product ratings for targeted marketing campaigns and recommender engines
  • Retail transaction and product data for market basket analysis and product categorization
  • Financial record data for loan risk analysis
  • Insurance claims data for fraud detection

There is an amazing variety of data in the business world today, and an equally staggering number of uses that data can be applied to.
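
To make one of those uses concrete: market basket analysis looks for products that customers frequently buy together. Below is a minimal sketch in Python (an illustration, not Actian code) that counts how often pairs of products appear in the same transaction, the first step toward association rules and recommender engines. The tiny transaction list is invented for the example.

    # Minimal market basket sketch: count product pairs bought together.
    # The transactions below are made up for illustration.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"bread", "milk", "eggs"},
        {"bread", "milk"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The pairs most frequently bought together drive cross-sell recommendations.
    for pair, count in pair_counts.most_common(3):
        print(pair, count)

The big data challenge is not the arithmetic; it is running exactly this kind of counting over millions of transactions every hour, which is where parallel, distributed processing comes in.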

Some industries that benefit from big data analysis:

Healthcare - analyze healthcare records to find potential infectious disease trends and help improve clinical outcomes in patient care.

Telecom - analyze call detail records for call drop optimization and customer churn analysis.

Retail - analyze regional purchase records over time for market basket analysis, and product descriptions for categorization.

Insurance - analyze claims data to catch fraud.

Financial - analyze financial records to spot trends and evaluate risk.

Life Science - analyze DNA data to map genomes. 

Pharmaceutical - analyze research study data to evaluate the efficacy of drugs and catch potential side effects.

Manufacturing - analyze supply chain data to optimize processes and ensure supplies on hand when needed.

Utilities/Smart Grids - analyze smart meter data to minimize outages and improve efficiency.

Government - analyze Netflow data for cybersecurity.

Uses of big data analysis:

  • Fraud detection/prevention - Example: A claims processing service took 26 days to process 250 million claims. With the Actian DataRush engine’s ability to scale across all available cores, it did the same job in less than a day, dramatically expanding its ability to detect fraud or claims mismanagement.
  • Cybersecurity - Example: A government cybersecurity analytics team needed to process machine data at extreme scale on reasonably priced commodity hardware, in this case a 40-node Hadoop cluster. Actian processed 1.2 billion rows in 12 seconds.
  • Network optimization - Example: Netflow data streams are essential for network performance monitoring and problem diagnostics, but existing technology chokes on their volume and velocity. Actian captured, transformed into HBase, and analyzed more than 1 million Netflow events per second on a sustained basis, using just a 3-node cluster. The company can now get precise usage measurements, catch network breaches or a downed machine, and spot cybersecurity attacks such as invalid requests, page redirects, and SQL injection attacks.
  • Risk Analysis - Example: A global bank cut its risk management processing time from more than 15 hours to about 20 minutes while reducing the required hardware by 50%.
  • Energy usage optimization
  • Research and Discovery
  • Customer Churn Analysis
  • Market Basket Analysis
  • Call Detail Record Analysis

Before any data set can be analyzed, it needs to be ready: complete, accurate, and with all the required pieces joined, sorted, grouped, and de-duplicated.

That may not sound like a big deal, but according to several research studies, data preparation takes up to 80% of the work and time involved in any analytics project.

There are many reasons for that: lack of proper access, the complexity of business rules for validation, and so on. But the main reason data preparation takes so long is the same reason other data processing tasks take far longer than they should: most data preparation is done serially and single-threaded, not parallelized and distributed for optimum data processing speed.
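
To make the contrast concrete, here is a minimal sketch of parallelized data preparation in Python. The file name, column names, and chunk size are hypothetical; the point is simply that cleaning work can be fanned out across all available cores and then merged, rather than handled one row at a time on a single thread.

    # Minimal sketch: parallel data preparation across all available cores.
    # "claims.csv" and its columns (claim_id, member_id, amount) are hypothetical.
    import csv
    from multiprocessing import Pool

    def clean_chunk(rows):
        """Validate and normalize one chunk of raw rows."""
        cleaned = []
        for row in rows:
            try:
                cleaned.append((row["claim_id"].strip(),
                                row["member_id"].strip(),
                                float(row["amount"])))
            except (KeyError, ValueError):
                pass  # drop rows that fail basic validation
        return cleaned

    def chunks(rows, size):
        """Yield successive fixed-size batches of rows."""
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    if __name__ == "__main__":
        with open("claims.csv", newline="") as f:
            with Pool() as pool:  # one worker process per available core
                results = pool.map(clean_chunk, chunks(csv.DictReader(f), 10000))
        # Merge, de-duplicate on claim_id, and sort -- the "ready for analysis" step.
        deduped = {}
        for chunk in results:
            for row in chunk:
                deduped[row[0]] = row
        prepared = sorted(deduped.values())
        print(len(prepared), "claims ready for analysis")

Frameworks such as Hadoop, and engines such as DataRush, apply the same idea at much larger scale: across the nodes of a whole cluster rather than the cores of a single machine.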

If data prep on your analytics projects is taking longer than ten minutes, we think that's too long.

Data analysts should be spending their valuable time analyzing data, not getting the data ready to be analyzed. 

Example: A claims processing service took 26 days to prepare 250 million claims for analysis. With the Actian DataRush engine’s ability to scale across all available cores, it did the same job in less than a day, dramatically expanding its ability to detect fraud or claims mismanagement.

Read more about Data Preparation for Analysis 

So how does big data analysis differ from the data analysis businesses have always done? Short answer - there really is no difference.

Data analysis has always involved mining data sets, finding patterns and anomalies, and gleaning the important trends. Data analysts have always had to deal with the limitations of analysis technologies when working with larger data sets. A lot of standard practices, such as sampling and aggregation, have been developed to deal with the common problem of data sets that are too large to be analyzed in a reasonable time frame at a reasonable cost. (Sound familiar?)
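
As a minimal illustration of those two coping strategies, the Python sketch below (using the pandas library; the file and column names are hypothetical) analyzes a random sample instead of every row, and rolls detail rows up into per-store aggregates before analysis.

    # Minimal sketch of two classic workarounds for over-large data sets:
    # sampling and aggregation. "transactions.csv" and its columns are hypothetical.
    import pandas as pd

    df = pd.read_csv("transactions.csv")

    # Sampling: work with a 1% random sample instead of the full data set.
    sample = df.sample(frac=0.01, random_state=42)

    # Aggregation: roll transaction rows up to per-store summaries before analysis.
    per_store = df.groupby("store")["amount"].agg(["count", "sum", "mean"])

    print(sample.describe())
    print(per_store.head())

Both techniques trade detail for tractability, which is exactly the compromise that newer big data technologies aim to make less necessary.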

The truth is that in many cases, standard analytics practices work fine with the data sets most businesses need analyzed. 

So, what's changed? Why the big emphasis on big data analytics in the industry? 

  1. New sources of valuable data are available now that can't be tackled with standard analytics technology
  2. Even the data sets that analysts are accustomed to working with are growing exponentially

The first means that a lot of valuable data simply couldn't be mined until the technology caught up. The second means more and more compromises, and slower and slower answers to business questions in a cutthroat economy where seconds can make the difference.

Solution: New technologies (such as Hadoop, Pervasive DataRush and Pervasive RushAnalytics) have been developed to economically analyze data in massive volumes at extremely high speeds.

These technologies are essential when dealing with some of the new data sets, such as machine-generated data, that dwarf old-school transactional data sets. And even with less extreme data, the same technologies can deliver a speed boost that radically cuts the time to get answers to essential questions. They can also make compromises like sampling and aggregation less necessary, and make useful analyses such as anomaly detection more viable.

Example of data that couldn't be analyzed before: Netflow data streams are essential for network performance monitoring and problem diagnostics, but existing technology chokes on their volume and velocity. Pervasive captured, transformed into HBase, and analyzed more than 1 million Netflow events per second on a sustained basis, using just a 3-node cluster. The company can now get precise usage measurements, catch network breaches or a downed machine, and spot cybersecurity attacks such as invalid requests, page redirects, and SQL injection attacks.

Example of accelerating an existing analysis: A global bank cut its risk management processing time from more than 15 hours to about 20 minutes while reducing the required hardware by 50%.

Read more about Predictive Analytics

Historically, large data sets for business analytics and scientific advancement were processed on expensive, specialized servers that many companies couldn't afford, and that were difficult to scale up as data volumes continued to grow rapidly. Hadoop changed that.

It's Affordable - Hadoop is a software framework that supports dividing data processing across multiple networked computers, also known as distributed processing. These groups of computers are called clusters, and they generally consist of inexpensive, industry-standard machines, not expensive high-performance supercomputers or appliances. Hadoop itself is open source, minimizing software license fees.

It's Resilient - The basic concept behind Hadoop is that processing and data storage are spread across the available computers in a cluster. If one computer fails, little is lost, because the data is stored redundantly on more than one machine and the processing can be rerun elsewhere. This makes Hadoop clusters very resilient to failure.

It's Scalable - As data volumes grow, compute and storage capacity can be added inexpensively by simply adding more standard servers (called nodes) to the cluster. This makes Hadoop clusters highly scalable: as businesses and their data processing needs grow, the processing power grows right along with them in small, affordable increments.
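
For a feel of how that division of work looks in practice, here is a minimal Hadoop Streaming sketch in Python (an illustration, not part of Hadoop itself): a mapper and a reducer read lines on standard input and write tab-separated key/value pairs on standard output, while Hadoop handles distributing the work across the nodes of the cluster. The comma-separated log format is assumed for the example.

    # mapper.py -- emit a count of 1 for the first field (e.g., an event type)
    # of each comma-separated input line. The log format is assumed.
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if fields and fields[0]:
            print(fields[0] + "\t1")

    # reducer.py -- sum the counts for each key. Hadoop delivers the mapper
    # output to the reducer grouped and sorted by key.
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print(current_key + "\t" + str(total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(current_key + "\t" + str(total))

The pair would typically be launched with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming-*.jar -input <input dir> -output <output dir> -mapper mapper.py -reducer reducer.py. The same mapper and reducer code runs unchanged whether the cluster has three nodes or three hundred, which is the scalability point made above.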

Affordable, scalable, resilient data processing power is what makes Hadoop so exciting. The uses that power can be put to are wide and varied, but business and scientific analysis of massive amounts of data is the obvious sweet spot.

Read more about Hadoop and Actian.

Most businesses do not need Hadoop, Actian DataRush, or any other big data-related technology at this time.

Experienced Actian representatives can assess your needs and help you dig into the answers to some key questions, starting with the most important one: Would a big data business analytics project benefit your company?

If the answer is yes, then:

  • Which problem or goal should you tackle first to get the most value in the shortest time?
  • What hardware and software stack would you need?
  • How long would such a project take?
  • How much ROI could you expect?

If you’d like to get a handle on the costs, benefits and requirements of a big data business analytics project at your company, let us know, and we’ll contact you to help you find the most beneficial path forward.

Initial consultations are always free. If desired, an Actian representative can come to your business, see firsthand what you're dealing with, and make recommendations for a flat fee plus travel expenses.

Read more about an Actian Business Analysis Need Assessment.

Accelerating Big Data 2.0™