Big Data Technology Series – Part 5

In the last three installments of the big data technology series, we looked at the historical evolution and key developments in databases, BI/DW platforms and statistical computing software.  That discussion gives us a good foundation for understanding some of the key trends, developments, solution landscapes and architectures taking shape today in response to big data challenges.  This installment focuses on outlining the major trends in the data management and analytics platform space.

As we saw in Part 2, innovations and recent advances have greatly changed database management technology platforms, with the emergence of data stores for unstructured data and distributed large-scale data management architectures.  Part 3 focused on how traditional BI/DW platforms and appliances have emerged as a critical component of a corporation’s enterprise architecture, supporting management decision-making and reporting needs.  Finally, Part 4 discussed how statistical computing platforms have evolved on their own to support the advanced analytics and data mining needs of the business.  Technical developments in these three areas are becoming increasingly intertwined, with those in one area affecting and reinforcing those in another.  The falling cost of hardware, the increasing sophistication of software, and the rise of big data sets are driving new paradigms and new thinking about the technologies and architectures for how data should be managed and analyzed.  This new thinking is challenging and extending the way things have traditionally been done.

The graphic below describes some of the key trends in the data management and analytics tools landscape.

Big Data Analytics Platform Trends

Enterprise data management architecture is changing and evolving in various ways due to the emergence of big data processing and supporting tools.  There are, however, a few key takeaways about big data architectures:

1) Open Scale-out Shared Nothing Infrastructure

As the demands for data storage and processing grew with the advent of the modern-day Internet, vertical scaling was initially used to meet the higher storage requirements.  In vertical scaling, resources such as processing power or disk are added to a single machine to match higher processing requirements.  New architectures, such as database clustering, in which data is spread out among a cluster of servers, were then adopted.  MPP appliances provided scalability to process massive data sets across a cluster of high-end proprietary servers.  Hardware improvements over the past few decades, however, brought down the price/performance ratio of x86 servers to the point where companies started using these machines to store and process data for day-to-day operations.  The use of cheap x86 machines for data processing was pioneered by new-age information companies such as Google and Amazon to store and manage their massive data sets.  Modern-day scale-out architectures leverage x86 servers in open, standard configurations using industry-standard networking and communication protocols.  In fact, many modern-day data analytics platforms are essentially software platforms certified to run on a cluster of commodity servers with a given configuration.
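The scale-out, shared-nothing idea can be sketched in a few lines.  This is a toy illustration (not any specific product): records are hash-partitioned across a hypothetical cluster of commodity nodes, each node owns its shard exclusively, and a query fans out to all nodes and merges the partial results.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster of commodity x86 servers


def node_for(key: str) -> int:
    # A deterministic hash routes each record key to exactly one node.
    return zlib.crc32(key.encode()) % NUM_NODES


# Shared nothing: each node holds only its own shard of the data.
shards = defaultdict(list)
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[node_for(user_id)].append(user_id)

# A full query fans out to every node and merges the partial results;
# adding nodes grows storage and compute capacity together.
all_users = sorted(u for shard in shards.values() for u in shard)
print(all_users)
```

Because no node shares disk or memory with another, capacity grows roughly linearly by adding machines, which is what makes commodity hardware attractive for this architecture.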

2) Tailored Data Management Architecture

The hugely successful relational model forms the basis of the majority of enterprise data computing environments today.  In spite of the variety of use cases the relational model has served, it has its shortcomings.  Database innovation in recent years has focused on tools and techniques to store unstructured data using non-relational approaches, and a raft of database management tools for such data has emerged in the past decade.  Alternative forms of data storage are increasingly being used, e.g. columnar databases that store data indexed by columns rather than rows.  Similarly, a number of innovative data storage solutions, such as SSD-based storage, have come to market.  These innovations have created a plethora of data management system options, each optimized to handle a specific set of use cases and applications.  Enterprise data management architectures are moving from “one size fits all” relational database systems to a “tailored” combination of relational/non-relational, row-oriented/column-oriented, disk-based/memory-based and other solutions, as guided by data workloads’ characteristics and processing needs.
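The row-versus-column trade-off mentioned above can be made concrete with a toy example: the same three-row table held in both layouts.  An analytic aggregate only needs to touch one contiguous column in the columnar layout, while a point lookup of a full record favors the row layout.

```python
# The same three-row table in a row-oriented and a column-oriented layout.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]
columns = {
    "id":     [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales":  [100, 250, 175],
}

# Analytic scan: a column store reads one contiguous column, not every row.
total_sales = sum(columns["sales"])
print(total_sales)  # 525

# Point lookup of a full record: the row layout keeps it together.
second_order = rows[1]
print(second_order["region"])  # west
```

This is why workload characteristics, scans over a few columns versus retrieval of whole records, drive the choice between the two storage models.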

3) Logical Enterprise Data Warehouse

Traditional BI and DW platforms have been successful at delivering decision support and reporting capabilities on structured data to answer pre-defined questions.  Advanced analytics solutions have traditionally been delivered using proprietary software and high-end hardware platforms, and relational databases have typically been used to manage transactional data.  This picture is slowly evolving due to falling hardware costs and the rise of big data needs, and the consequent emergence of unstructured data processing solutions and new big data analytics platforms.  Unstructured data stores such as document stores are slowly making their way into the enterprise to manage unstructured data needs.  The new analytics platforms provide a powerful suite of tools and libraries based on open source technologies to run advanced analytics, supported by a processing layer and query optimizer that leverage scale-out distributed architectures.  The enterprise data architecture is thus slowly evolving and increasing in complexity as companies leverage myriad data storage and processing options to manage their data needs.  In response to these developments, Gartner coined the concept of the “logical data warehouse”: an architecture in which the concept and scope of the traditional warehouse are expanded to include the new data processing tools and technologies, all abstracted behind a data virtualization layer.
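The virtualization layer at the heart of the logical data warehouse can be sketched as a simple facade.  The class and method names below are invented for illustration: two heterogeneous stores (one relational-style, one document-style) sit behind a single query interface, so callers see one logical warehouse rather than N systems.

```python
class RelationalStore:
    """Structured, schema-on-write store (stand-in for an RDBMS)."""
    def __init__(self, rows):
        self.rows = rows

    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]


class DocumentStore:
    """Semi-structured, schema-on-read store (stand-in for a document DB)."""
    def __init__(self, docs):
        self.docs = docs

    def query(self, predicate):
        return [d for d in self.docs if predicate(d)]


class LogicalWarehouse:
    """Virtualization layer: one query interface over many backing stores."""
    def __init__(self, *stores):
        self.stores = stores

    def query(self, predicate):
        results = []
        for store in self.stores:  # fan the query out to every store
            results.extend(store.query(predicate))
        return results


orders = RelationalStore([{"customer": "acme", "total": 120}])
reviews = DocumentStore([{"customer": "acme", "stars": 4}])
ldw = LogicalWarehouse(orders, reviews)

# One logical query returns matching records from both physical stores.
acme_records = ldw.query(lambda rec: rec.get("customer") == "acme")
print(len(acme_records))  # 2
```

A real virtualization layer also handles query translation, pushdown and optimization per store; the sketch only captures the abstraction boundary.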

The database and analytic platform market continues to evolve, and successful enterprise data architecture patterns to manage the big data needs are just emerging.  In the next installment of the big data series, we will look at some of the key capabilities of a big data analytics platform and some major players in the market.


Lab as a Service

The Wall Street Journal this week featured an article on Silicon Valley startups that are employing software and robotics to bring to market new models for managing discovery and pre-clinical research (Research Labs Jump to the Cloud).  It was interesting to read about companies such as Emerald Therapeutics that offer cloud-based services providing end-to-end, precise design and execution of common pre-clinical experiments, analyses and assays (Emerald recently closed a Series B funding round with Peter Thiel’s Founders Fund).  Investor interest in companies such as Emerald indicates the serious promise new technologies hold for disrupting even staid industries and functions otherwise thought to be impervious to technological advances.


Discovery and pre-clinical research is the phase of the drug development process that precedes clinical trials.  Pre-clinical research is concerned with understanding the feasibility, toxicology and side effects of drug molecules, with the ultimate goal of building a feasibility and safety profile of potential molecules for further development in the clinical testing phase, which typically involves conducting experiments on human subjects.  The pre-clinical phase involves running finely controlled, detailed experiments, both in vitro (in which specific cells or tissues in test tubes and petri dishes are used to study the effects of drug molecules) and in vivo (in which experiments are conducted on entire living organisms).  As such, these experiments involve a lot of iterative testing with routine setup, execution and analysis of results.  Outfits such as Emerald hope to offer outsourced services that automate these repetitive tasks, improving turnaround time for pre-clinical experiments as well as researcher productivity by providing higher-order services such as data analysis and reporting.

The potential of such “lab as a service” offerings is promising due to the confluence of three technology trends in the life sciences and pharma industries: robotics, lab automation software and data analytics.  In the lab, machines such as shakers have been in use for many years; where robots come in is that they can take on increasingly complex and precise tasks traditionally performed by lab technicians, such as sample preparation and liquid handling.  Sophisticated robotic systems can now provide end-to-end automation for a complete procedure, such as performing and analyzing a polymerase chain reaction.  Thanks to the falling cost of hardware and sensors, the rise of technologies such as 3D manufacturing, and smart software, robots have become a much more central piece of lab automation systems.  The second key trend is the increasing sophistication of lab software, which has been used to automate lab management processes such as specimen setup, data collection and analysis.  For example, solutions such as Electronic Lab Notebooks give researchers and technicians an easy way to capture handwritten notes and analyses in digital form.  These systems have traditionally been developed in a stand-alone fashion, and it is only now that efforts are being made to integrate them to enable end-to-end processing.  Increasing automation does not just produce productivity benefits; it provides the ability to precisely capture and analyze data on the variables that go into an experiment and the outputs it produces.  Sophisticated analyses of data thus collected can provide insight into how variables affect the results and reproducibility of experiments.  Coupled with advanced simulation and predictive technologies, this can greatly inform the planning of subsequent iterations, cutting down time to completion of the research phase.

Such lab-as-a-service offerings have the potential to democratize access to expensive lab resources; anyone with a credit card and an Internet connection will be able to source such resources to conduct experiments and get results.  In a world struggling to tame the scourge of ever-evolving diseases and infections, this would be a welcome development.