Big Data Technology Series – Part 4

In the last installment of the Big Data Technology Series, we looked at the second thread in the story of big data technology evolution: the origins, evolution and adoption of systems/solutions for managerial analysis and decision-making.  In this installment, we will look at the third and last thread in the story: the origins, evolution and adoption of systems/solutions for statistical processing.  Statistical processing solutions have been evolving independently since the 1950s to support analyses and applications in the social sciences and agribusiness.  Recently, however, such solutions are increasingly being applied commercially in tandem with traditional business intelligence and decision-making solutions, especially in the context of large unstructured datasets.  This post is an attempt to understand the key evolutionary points in the history of statistical computing, with the overarching goal of better understanding today’s big data technology trends and technology landscape.

The graphic below summarizes the key points of our discussion.

Statistical Packages History

The use of computers for statistical analysis began in the 1950s, when the invention of FORTRAN made it possible for mathematicians and statisticians to leverage the power of computers.  Statisticians appreciated this new-found opportunity to run analyses on computers; however, most programs were developed in a labor-intensive, heavily customized, one-off fashion.  In the 1960s, the scientific and research community began using languages such as FORTRAN and ALGOL to build higher-level statistical computing libraries and modules.  This work resulted in the emergence of the following popular statistical packages in the 1960s:

  • Statistical Package for the Social Sciences (SPSS) for social science research
  • Biomedical Package (BMD) for medical and clinical data analysis
  • Statistical Analysis System (SAS) for agricultural research

These packages rapidly caught on with the rest of the scientific and research community.  The increasing adoption prompted the authors of these packages to incorporate companies to support commercial development of their creations; SAS and SPSS were thus born as companies in the mid-1970s.  These statistical processing solutions were developed and adopted widely in academia as well as in industries such as pharmaceuticals.  The rapid adoption of software packages for statistical processing gave rise to the “statistical computing” industry in the 1970s, and various societies, conferences and journals focusing on statistical computing emerged during that time.

Statistical processing packages expanded and developed greatly through the 1970s; however, they were still difficult to use and limited in their application due to their batch-oriented nature.  Efforts were undertaken in the 1970s to provide a more interactive and easier-to-use programming paradigm for statistical analysis.  These efforts gave rise to the S programming language, developed at Bell Labs, which provided a more interactive alternative to traditional FORTRAN-based statistical subroutines.  The emergence of personal computing and sophisticated graphical functionality in the 1980s further enabled real-time, interactive statistical processing.  Statistical package vendors such as SAS and SPSS extended their product suites to provide this interactive functionality; for example, SAS came out with its JMP suite of software in the 1980s.

Another major related development in the 1980s was the emergence of expert systems and other artificial intelligence (AI) techniques.  AI had been in development for some time, and in the 1980s it received much hype as a set of new techniques to solve problems and create new opportunities.  Machine learning, a field of AI, developed rapidly in the 1980s as a way to predict outcomes based on prior datasets that a computer could analyze and “learn from”.  The application of such machine learning techniques to stored data gave rise to the new disciplines of “knowledge discovery in databases” (KDD) and, ultimately, “data mining”.

AI did not live up to its hype going into the 1990s and experienced much criticism and a drawdown in funding.  However, some AI/machine learning techniques, such as decision trees and neural networks, found useful application.  These techniques were developed and productized by several database and data mining product vendors in the 1990s, and data mining solutions started appearing in the marketplace alongside traditional business intelligence and data warehousing solutions.
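To make the idea of “learning from” prior data concrete, here is a minimal, purely illustrative sketch in Python using the scikit-learn library (a modern tool, not one of the 1990s vendor products, and with made-up data): a decision tree is fit on a handful of historical records and then used to predict the outcome for a new record.

    # Illustrative only: a decision tree "learns" from prior records and
    # predicts an outcome for an unseen record. The data is made up.
    from sklearn.tree import DecisionTreeClassifier

    # Prior dataset: [age, annual_income_in_thousands] for past customers
    X_train = [[25, 40], [42, 85], [35, 60], [23, 30], [51, 120], [33, 55]]
    # Known outcomes to learn from: 1 = purchased, 0 = did not purchase
    y_train = [0, 1, 1, 0, 1, 0]

    # Fit the decision tree on the historical data
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X_train, y_train)

    # Predict the outcome for a new, previously unseen customer
    print(model.predict([[40, 90]]))

Conceptually, this is the same pattern that the data mining products of the 1990s packaged up behind graphical tooling and database integration.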

The open-source movement of the 1990s, together with the rapid advancement of the Web, also left its mark on the world of statistical computing.  The R programming language, an open-source framework for statistical analysis modeled after the S programming language, emerged in the 1990s and has become wildly successful since, giving rise to a plethora of open-source projects for R-based data analysis.  The increasingly large and unstructured datasets that started emerging in the 1990s and 2000s prompted the rise of natural language processing and text analytics.  The modern analytic platforms that emerged in the 2000s incorporated these developments as well as newer machine learning and data classification techniques such as support vector machines.

Statistical processing platforms and solutions continue to evolve today.  As computers have become cheaper and more powerful, several product vendors have adapted once-niche statistical processing techniques and tools to increasingly varied and large datasets.  Through open-source libraries, development environments and powerful execution engines running across massively parallel databases, the modern analytic platform melds traditional data analysis with statistical computing tools and techniques, as sketched below.  We will witness more of this convergence and integration as these analytic platforms and supporting technologies continue to evolve.
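As a rough, hedged illustration of that melding (the database file, table and column names here are hypothetical), the Python sketch below runs a traditional SQL aggregation against a relational database and then hands the result to a statistical routine for model fitting.  Real analytic platforms push much of this work into parallel execution engines, but the basic pattern is similar.

    # Illustrative sketch: SQL-based data analysis feeding a statistical
    # model fit. Database, table and column names are hypothetical.
    import sqlite3

    import numpy as np
    import pandas as pd

    conn = sqlite3.connect("sales.db")  # hypothetical local database

    # Traditional data analysis step: aggregate transactions with SQL
    df = pd.read_sql_query(
        """SELECT region, SUM(amount) AS revenue, COUNT(*) AS orders
           FROM transactions
           GROUP BY region""",
        conn,
    )

    # Statistical computing step: fit a simple linear model relating
    # order volume to revenue across regions
    slope, intercept = np.polyfit(df["orders"], df["revenue"], deg=1)
    print(f"estimated revenue per order: {slope:.2f}")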

Having examined in detail the three threads in the story of big data technology, we are now in a position to better understand the current trends and makeup of modern analytic platforms.  In the next installment of the Big Data Technology Series, we will shift gears and focus on current trends in the big data analytics marketplace and the core capabilities of a typical big data analytic platform.