Big Data Technology Series – Part 6

In the last few installments of the Big Data Technology Series, we looked at the evolution of database management, business intelligence and analytics systems, and statistical processing software. In this installment, we look at the modern advanced analytical platform for big data analytics, which represents a confluence of these three evolutionary threads in the story of data management and analytics platforms. We will examine the core capabilities of such platforms and the major vendors in the marketplace.

The graphic below provides a view of the core capabilities of a modern advanced analytical platform.  There is a wide range of analytical platforms in the marketplace, with each platform excelling in a specific aspect of big data analytics.  This picture provides a logical universe of the capabilities found across these platforms.  In other words, there isn’t one single platform that provides all the core capabilities described below in their entirety.

Big Data Analytics Platform - Capabilities

  1. Hardware – Hardware includes the data processing and storage components of the analytical platform stack, providing management and redundancy of data storage.  As we saw in Part 2 of the big data technology series, the database management platform and associated hardware have continued to evolve ever since the first databases appeared on the market in the 1950s.  Database hardware was once a proprietary component of the stack providing considerable value; however, it is increasingly becoming a commodity.  Hardware innovation includes advances in storage such as solid state devices and massively parallel node configurations connected by high speed networks.  Modern analytic platforms provide the flexibility to use configurations such as these, as well as configurations of commodity x86 machines for managing lakes of massive unstructured raw datasets.
  2. Analytic Database – This is the software layer that provides the logic behind managing the storage of datasets across the node cluster, handling such aspects as partitioning, replication, and optimal storage schemes (such as row or column).  Analytical applications run most efficiently with certain storage and partitioning schemes (such as columnar data storage), and modern analytical platforms provide capabilities to configure and set up these data storage schemes.  Memory-based analytic databases such as SAP HANA have added one more dimension to this – one that dictates how and when data should be processed in-memory and when it should be written to disk.  Advances in database management systems have given the modern analytic platform a range of tools and techniques at its disposal to manage all data types (structured, semi-structured or unstructured) and all processing needs (data discovery, raw data processing, etc.).
  3. Execution Framework – The execution framework is a software layer that provides query processing, code generation capabilities and runtimes for code execution.  Advanced analytical applications frequently have complex query routines, and a framework that can efficiently parse and process these queries is critical to the analytic platform.  Furthermore, modern analytical platforms provide capabilities to structure advanced analytical processing through the use of higher-level programming languages such as Java and R.  The execution framework provides the logic to convert such higher-level processing instructions into optimized queries that are then submitted to the underlying analytical database management system.  Advances in analytical platforms, as we saw in Part 3 of the big data technology series, have enabled these capabilities in the modern day analytic platform.
  4. Data Access and Adaptors – Modern analytic platforms provide prebuilt, custom developed and DIY connectors to a range of data sources such as traditional data warehouses, relational databases, Hadoop environments and streaming platforms.  Such connectors provide bi-directional data integration between these data repositories and the analytic data storage.  These connectors thus provide data visibility to the analytic platform no matter where and how the data is stored.
  5. Modeling Toolkit – The modeling toolkit provides design-time functionality to develop and test code for running advanced analytics and statistical processing routines using higher-level languages such as Java, Python and R (a minimal sketch of this idea follows the list).  This represents the third and final thread in our story of the evolution of big data analytic platforms – the evolution, rise and ultimately the convergence of statistical processing software into the logical big data analytic platform.  The toolkit provides not only a range of pre-built and independent third party libraries of statistical processing routines, but also a framework that can be used and extended as needed to run custom statistical processing algorithms.
  6. Administration – Like any database management or traditional warehousing platform, the modern analytics platform provides strong administration and control capabilities to fine tune and manage the working of the platform.  The rise of horizontal scaling using commodity machines has put increased importance on being able to efficiently administer and manage large clusters of such data processing machines.  Modern analytic platforms provide intuitive capabilities to finely control data partitioning schemes, clustering methods, backup and restore, etc.
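
To make the modeling toolkit idea concrete, below is a minimal, hypothetical sketch of a statistical routine written in a higher-level language (Python) that aggregates data in an analytic store and then applies a simple statistical step to the results. SQLite stands in for an analytic database, and the table and column names are illustrative assumptions, not any particular vendor’s API.

```python
# Illustrative sketch only: SQLite stands in for a DB-API-compliant analytic database;
# the "sales" table and its columns are hypothetical.
import sqlite3
import statistics

def average_order_value_by_region(conn):
    """Aggregate in the database, then run a simple statistical step client-side."""
    cur = conn.execute(
        "SELECT region, AVG(order_value) AS avg_value FROM sales GROUP BY region"
    )
    rows = cur.fetchall()
    averages = [avg for _, avg in rows]
    return {
        "by_region": dict(rows),
        # Spread of the regional averages -- a trivial "statistical processing" routine.
        "stddev_across_regions": statistics.pstdev(averages) if averages else 0.0,
    }

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, order_value REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 120.0), ("east", 80.0), ("west", 200.0), ("west", 160.0)],
    )
    print(average_order_value_by_region(conn))
```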

There is a range of players in the market for big data analytics platforms, as depicted in the graphic below.

Big Data Analytics Platform - Market

There are roughly three categories of such product vendors:

  1. Type 1 (Traditional Data Warehousing Vendors) – This category includes vendors such as IBM, SAP and Oracle that have traditionally done very well in the BI/data warehousing space.  These solutions have excelled in providing traditional analytic capabilities for mostly structured datasets.  These vendors are rapidly extending their product capabilities to provide advanced analytical capabilities for big data sets, either organically or through acquisitions and joint ventures with niche vendors specializing in advanced big data analytics.
  2. Type 2 (SQL on Hadoop) – This category includes vendors that are providing solutions to extend traditional Hadoop environments to deliver big data analytics in a real time, ad hoc manner using SQL.  Traditional Hadoop is well suited for large scale batch analytics; however, the MapReduce architecture does not extend easily to real time ad hoc analytics.  Some products in this space do away with the MapReduce architecture completely in order to overcome these limitations.
  3. Type 3 (Independent Players) – This category includes vendors that have come up with proprietary schemes and architectures to provide real time, ad hoc analytic platforms.  Some, such as 1010data and Infobright, have existed for some time, while others, such as Google, are newcomers providing new ways to deliver analytic capabilities (e.g. Google offers a web based service for running advanced analytics).

Below is a detailed description of the offerings from some of the major vendors in these three categories.

Type 1 (Traditional Data Warehousing Vendors) – Traditional vendors of enterprise data warehousing platforms and data warehousing appliances that have acquired and/or developed capabilities and solutions for large scale data warehousing and data analytics.

Teradata

  • Teradata’s Aster database is a hybrid row and column data store that forms the foundation of its next generation data discovery and data analytic capability; the data management platform can be delivered as a service, on commodity hardware or an appliance form factor
  • Teradata Enterprise Data Warehouse is its data warehousing solution; the EDW is marketed as the platform for dimensional analysis of structured data and standard data warehousing functions as part of its Unified Data Architecture, Teradata’s vision for an integrated platform for big data management
  • Teradata delivers the Hortonworks Hadoop distribution as part of its Unified Data Architecture vision; the Aster database supports native MapReduce based processing for bi-directional integration with the Hadoop environment; SQL-H provides a SQL based interface for higher level analysis of Hadoop based data
  • The Aster data discovery platform provides capabilities for advanced statistical and data mining through pre-packaged function libraries, a development environment for custom analytic functions and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently Teradata does not have a known solution for event stream processing (it has announced it may partner with independent vendors of event stream processors)

Pivotal

  • Pivotal is an independent big data entity spun off from EMC after its acquisitions of VMware and Greenplum; Pivotal’s data analytics platform is powered by the Greenplum database, a hybrid row and column, massively parallel data processing platform; the data management platform can be delivered as a service, on commodity hardware or in an appliance form factor
  • Pivotal also offers an in-memory data management platform, GemFire, and a distributed SQL database platform, SQLFire; Pivotal does not currently have a known solution for regular data warehousing
  • Greenplum Hadoop Distribution is a Greenplum supported version of Apache Hadoop; Greenplum database supports native MapReduce based processing for bi-directional integration with the Hadoop environment; Greenplum HAWQ provides SQL based interface for higher level analysis of Hadoop based data
  • Through partnerships with analytics vendors such as SAS and Alpine Data Labs, Greenplum platform provides capabilities for advanced statistical and data mining through pre-packaged function libraries, a development environment for custom analytic functions and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently Pivotal does not have a known solution for event stream processing

IBM

  • IBM’s big data management platform is powered by Netezza, a massively parallel data storage and distributed data processing appliance; Netezza enables data warehousing and fast analysis of mostly structured large scale data
  • IBM’s PureData System for Analytics provides the foundation for big data analytics; IBM PureData System for Analytics is a data warehouse appliance; IBM Netezza Analytics is an advanced analytics framework incorporating a software development kit for analytic model development, third party analytic libraries, and integrations with analytic solutions such as SAS, SPSS, etc. in support of in-database analytics
  • IBM PureData System for Operational Analytics focuses on analytics for operational workloads (as opposed to regular data warehousing workloads that are more long term and strategic in nature)
  • IBM Big Data Platform Accelerators provide analytic solution accelerators, i.e. pre-built examples and toolkits for areas such as video analytics and sentiment analytics, that enable users to jumpstart their analytic development efforts
  • IBM provides a licensed and supported version of the Apache Hadoop distribution as part of its InfoSphere BigInsights platform; BigInsights provides Jaql, a query and scripting language for unstructured data in Hadoop
  • IBM does not currently have a known solution for in-memory data management (like SAP HANA or Pivotal GemFire)
  • IBM provides InfoSphere Streams for data stream computing in big data environments

Oracle

  • Oracle’s big data management platform is supported by the Oracle database that provides columnar compression and distributed database management for analytic functions
  • Oracle offers a range of appliances for big data warehousing and big data analysis; Oracle Exadata is an appliance for data warehousing based on the Oracle database and Sun hardware
  • The Oracle Big Data Appliance is a packaged software and hardware platform for managing unstructured data processing; it provides a NoSQL database, the Cloudera Hadoop platform and associated management utilities, and connectors that enable integration of the data warehousing environment with the Hadoop environment
  • Advanced analytics are provided by Oracle R Enterprise, which provides a database execution environment for R programs, and Oracle Data Mining, which provides data mining functions callable from SQL and executable within the Oracle data appliance
  • Oracle Exalytics also provides an in-memory database appliance for analytical applications, similar to SAP HANA
  • Oracle Event Processing and Oracle Exalogic provide capabilities for event stream processing

Type 2 (SQL on Hadoop) – Independent (i.e. not traditional data warehousing vendors) solution providers that offer big data warehousing and analytics platforms and products architected using a proprietary design, delivered as a software solution, managed service or cloud offering (although some offer appliances as well), and focused on a specific market niche.

Hadapt

  • Hadapt enables an analytical framework for structured and unstructured data on top of Hadoop by providing a SQL based abstraction for HDFS, Mahout, and other Hadoop technologies
  • Hadapt also integrates with third party analytic libraries and provides a development kit to enable the development of custom analytic functions
  • Hadapt encourages deployment on configurations of commodity hardware (as opposed to proprietary appliances and platforms encouraged by Type 1 appliance vendors)

CitusData

  • An analytic database built on PostgreSQL that offers SQL querying capabilities
  • Also offers SQL querying capabilities for data in Hadoop clusters
  • Offers a software solution that can run on commodity hardware

Other Type 2

  • A number of vendors are providing tools that enable SQL processing on top of Hadoop so as to enable higher level analytics and processing by business analysts (who may not have the ability or time to code complex MapReduce functions)
  • Hive is a data warehousing solution for Hadoop based data that provides a SQL-like query language (a short query sketch follows this list)
  • Greenplum HAWQ, Aster Data SQL-H and Cloudera Impala all aim to achieve higher performance of standard SQL on Hadoop by trying to rectify shortcomings and limitations of Hadoop MR and Hive
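
As an illustration of the SQL-on-Hadoop access pattern described above, here is a minimal sketch that submits a SQL query to a Hive endpoint from Python. It assumes the PyHive client library and a reachable HiveServer2 host; the host name, credentials, and the page_views table are hypothetical and do not come from any vendor discussed here.

```python
# Sketch only: assumes `pip install pyhive` and a running HiveServer2 endpoint.
# The host, username, and the page_views table are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cur = conn.cursor()

# A typical analyst-style aggregate expressed in SQL instead of hand-written MapReduce.
cur.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM page_views "
    "GROUP BY country "
    "ORDER BY views DESC "
    "LIMIT 10"
)

for country, views in cur.fetchall():
    print(country, views)

cur.close()
conn.close()
```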

Type 3 (Independent Players) – Independent (i.e. not traditional data warehousing vendors) solution providers that offer big data analytics platforms and products architected using a proprietary, non-Hadoop design for big data analysis, delivered as a software solution on commodity hardware configurations, as a managed service or as a cloud offering; this category also includes niche players.

1010data

  • Its proprietary database is a columnar, massively parallel data management system with an advanced, dynamic in-memory capability for mostly structured data analytics
  • Delivers the solution as a cloud offering and in a hosted environment
  • Capabilities to perform granular statistical and predictive analytic routines that can be extended using 1010data’s proprietary language and interface
  • Started in the financial services space, and is now expanding to manufacturing and retail

ParAccel

  • Software solution for columnar, compressed, massively parallel relational data management that is capable of all-in-memory processing (provides connectors to major traditional data warehousing platforms, operational systems, and Hadoop)
  • Supports on-premise and cloud based deployment; on-premise deployment is supported on select commodity hardware configurations
  • Provides advanced in-database analytic solutions and libraries for a range of common and industry specific use cases through partnership with Numerix and Fuzzy Logix (vendors of analytic solutions)

Infobright

  • Offers a columnar, highly compressed data management solution (integrates with Hadoop)
  • Niche focus on analytics for machine generated data
  • Delivered as a software solution and as an appliance

LexisNexis

  • Provides Roxie, an analytic database and data warehousing solution, and a development and execution environment based on a proprietary query language, ECL
  • Provides pre-built analytics products and solutions for government, financial services and insurance, as well as third party analytic packages
  • Software solution delivered on certified hardware configurations (managed service and cloud offerings are on the way)
  • Focused on providing analytics related to fraud and other risk management applications

Google Cloud Platform

  • As part of its cloud computing platform, Google has released BigQuery, a real time analytics service for big data that is based on Dremel, which is a scalable, interactive ad-hoc query system for analysis of large datasets
  • Other projects modeled after Dremel include Drill, an open source Apache project led by MapR for interactive ad hoc querying and analysis of big data sets as part of its Hadoop distribution
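
As an illustration of consuming such a web based analytics service, below is a minimal sketch that runs a query against BigQuery using the google-cloud-bigquery Python client. It assumes the client library is installed and Google Cloud application default credentials are configured; the public Shakespeare sample dataset is used purely as an example.

```python
# Sketch only: assumes `pip install google-cloud-bigquery` and that application
# default credentials are configured for a project with BigQuery enabled.
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL against a public sample dataset; swap in your own tables as needed.
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.word, row.total)
```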

Real World Rapid Experimentation with Big Data

Decision-making, the art and science of making not just good but timely business decisions, has become critical in a competitive environment that is increasingly “time-based”: competitive advantage accrues to those players that make good decisions fast. As George Stalk and Thomas M. Hout wrote in their 1990 book, Competing Against Time, “Time-based competitors create more information and share it more spontaneously… The cycles of creating information, then acting and acting again, are the heart of business, and time-based companies push hard so that everything they do…will be geared toward collapsing these cycles… [T]he competitor who acts on information faster is in the best position to win.” One of the key reasons big data is increasingly garnering the attention of managers is that it enables capabilities to rapidly run real-world experiments and thus collapse these information cycles. Indeed, “data driven decision making” is the new mantra making the rounds in management circles and academia.

Process, Process

Decision-making happens at all levels within the organization: strategic, operational and tactical. There are many methods that can be used to aid in the decision-making process. For example, a company can conduct surveys or run focus groups to understand if a particular product or product feature would be valuable to consumers. Controlled experiments, in which real world tests are run across control and experimental groups, offer the most insight and information that can help in decision-making. Running such experiments, however, is costly and time-consuming. One of the key hurdles organizations face with traditional processes for collecting data and testing ideas is the length of time they take to execute. By the time results from these traditional processes come in, the competition is already onto the next big idea. Furthermore, such processes are not scalable: it is simply not feasible to marshal organizational resources to execute multiple controlled experiments in parallel.

Data Scales, Process Does Not

This is now changing, thanks to increased instrumentation, which allows organizations to collect data as it is generated, and advanced analytics, which allows them to make predictions based on data analysis through advanced simulations and models. Indeed, big data allows organizations to “bring the science into management” by enabling them to test the hypotheses behind decisions and to understand whether the impact of those decisions is statistically significant. What is different with big data-enabled experimentation is the speed and scalability: technology is increasingly enabling organizations to cheaply conduct large scale experiments, sometimes several of them in parallel. Companies such as Google and Amazon conduct hundreds of such experiments every day, using techniques such as A/B testing, to test new features on their websites.
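
To make the A/B testing idea concrete, here is a minimal sketch of how the significance of an experiment’s result might be checked with a two-proportion z-test. The visitor and conversion counts are made up, and this test is just one common choice among many.

```python
# Sketch: is the conversion lift in an A/B test statistically significant?
# Uses a two-sided two-proportion z-test; the counts below are invented.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z-statistic and two-sided p-value for the difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control: 1,000 visitors, 100 conversions. Variant: 1,000 visitors, 123 conversions.
z, p = two_proportion_z_test(100, 1000, 123, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p < 0.05 suggests the lift is unlikely to be chance
```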

Let a Thousand Experiments Bloom

While the Internet companies are the trailblazers and have employed experiments at a much larger scale, they are not alone. Retailers, pharma companies, and even casino businesses have successfully used the power of data to understand the implications of their actions and better inform decision-making. McDonald’s, as an example, has deployed in-store technology to track customer interactions, foot traffic and other data points to understand the impact of restaurant design and menu changes. Pharma companies are beginning to use advanced predictive models and simulations to understand the efficacy of new compounds in the R&D stage, thus reducing the need to conduct lengthy and expensive clinical trials. Retailers have successfully leveraged in-store shopper behavior data for merchandising and pricing optimization. Under CEO Gary Loveman, an MIT-trained economist and former Harvard Business School professor, Caesars Entertainment has successfully leveraged analytics to guide business strategy, operations and innovation.

The use of data to compress information cycles is turning up in some unexpected places as well. Popular crowdfunding sites, such as Indiegogo and Kickstarter, are doubling as sophisticated market research platforms (Crowdfunding Isn’t Just for the Little Guys). When Marvell Technology Group, a semiconductor producer, wanted to test an idea for a software and hardware toolkit product, it turned to Indiegogo. Marvell used Indiegogo to launch a small fundraising drive for its product idea, not to raise money, but to gain valuable customer insight. Running this little experiment not only allowed Marvell to get results much faster than a focus group could provide, but also allowed its executives to gauge serious market interest from people willing to put their money behind the idea.

So, You Want to Be a Data-Driven Decision Maker?

All this is well and good; however, a number of supporting capabilities are required if organizations are to successfully leverage data in their decision-making. Perhaps the biggest hurdle is culture: many organizations still make decisions based on gut, not data. Processes need to be adapted as well to enable and implement the insights generated from rapid experimentation. Agile methods, DevOps, and continuous integration are but a few components of the overall process architecture required to fully realize the potential of data-driven rapid experimentation.