Big Data Technology Series – Part 3

In the last installment of the Big Data Technology Series, we looked at the first thread in the story of big data technology evolution: the history of database management systems. In this installment, we will look at the second thread: the origins, evolution and adoption of systems and solutions for managerial analysis and decision-making. The sophisticated business intelligence and analytic platforms of today are the product of many years of interplay between theory (primarily management thinking on how best to leverage data and information for business effectiveness) and practice (in particular, developments in data and information management and delivery solutions, and their common uses and adoption). Complex as this story is, this post attempts to identify the key evolutionary points, with the overarching goal of better understanding today’s big data technology trends and landscape.

See the graphic below that summarizes the key points in our discussion.


The use of computers for managerial analysis and decision-making traces its roots back to the 1950s and 1960s, when the theory of management information systems (MIS) came into being in business management academic and practitioner circles as a body of thinking on how best to leverage computers for corporate data management and use. In the view of MIS thinkers, the power of computers in managing and processing data, already evident in the corporate world by that time, was not being fully utilized. They envisioned an environment that elevated the use of computers from clerical, administrative data processing to a corporate-wide, integrated, interactive “bucket of facts” that managers could dip into for analysis and decision-making. Foundational to this capability was a “data base” or “data hub”, a place where all corporate-wide data could be stored in an integrated manner.

A lot of this thinking, however, predated the required technology; technology in the 1960s was not sophisticated enough to deliver the clean data integration and interactive delivery that the MIS thinkers had envisioned. The MIS systems of those days were mainframe-based, batch-oriented report generators and querying modules that provided structured periodic reports to managers. As we saw in the previous installment of the Big Data Technology Series, there was file management software in use in the “data processing” department (that is what IT used to be called back then), but those solutions were meant to ease the burden of data analysts and technical staff, and were not oriented toward business managers. Even simple report generation out of the transactional and accounting systems required heavy involvement of the technical staff. The so-called MIS solutions fizzled in the marketplace.

Technology evolved in the 1960s with the arrival of more powerful and cheaper minicomputers, and researchers with access to these computers began implementing solutions for decision support and modelling. Concurrently, MIS theory evolved and expanded to cover human-computer interaction and managerial decision-making, specifically how technology could play a role in it. The confluence of these developments was behind the rise of the concept of Decision Support Systems (DSS) in the late 1960s and early 1970s. A DSS was defined as a computer information system that provided support for unstructured or semi-structured managerial decision-making. Several prominent thinkers and researchers conceptualized and implemented early model-based personal DSS solutions for such applications as product, pricing, and portfolio management during the early 1970s (see the Wikipedia page on DSS for a taxonomy of DSS types).

By the early 1980s, a number of researchers and companies had developed interactive information systems that used data and models to help managers analyze semi-structured problems. It was recognized that DSS could be designed to support decision-makers at any level in an organization across operations, financial management and strategic decision-making. DSS theory and thinking expanded in the 1980s, giving rise to such concepts and supporting solutions as Group DSS for organizational decision-making, and Executive Information/Support Systems for a bird’s-eye organizational view for executive management. Advances in artificial intelligence led to the development of expert systems and knowledge-based DSS solutions. The rise of personal computers, databases and spreadsheet software in the 1980s supported the development of these DSS solutions. Spreadsheets in particular began to be used for data reporting, basic descriptive analytics, and understanding trends in data.

The 1980s also marked the beginning of the emergence of structured methodologies for information and data design and engineering. James Martin, the influential author of a series of tomes on Information Engineering, wanted to bring the discipline and rigor of traditional engineering to the design of data and information delivery systems. This was also the time that computer-aided software engineering (CASE) methods and fourth-generation language (4GL) concepts were developed. All these developments reflected the increasing importance of data analysis in the commercial environment and the need to manage the increasing complexity in a structured and logical manner.

The early 1980s also marked the emergence of a new class of DSS solutions, the data-driven DSS, which allowed users to quickly look at data, analyse data in a given database, or look across a series of databases. The emergence of data-driven DSS was propelled by theory around executive information systems (EIS), which were envisioned as solutions that provide data on critical success factors to top management, and by technology in the form of relational databases that provided capabilities for data storage and manipulation. The concept of EIS and the technical foundation supporting it went through a number of developments in the 1980s and early 1990s, as new methodologies to measure the state of the organization emerged (e.g. the Balanced Scorecard), and as the technical underpinnings were formalized in new mechanisms (e.g. multi-dimensional data structures and online analytical processing (OLAP)). Data-driven DSS became an important market in their own right, and were loosely rechristened “business intelligence (BI)” solutions in the late 1980s and early 1990s.
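To make the idea of a multi-dimensional data structure a little more concrete, here is a minimal sketch of an OLAP-style "roll-up" in Python. It is only an illustration: the fact table, the dimension names and the roll_up helper are all hypothetical, and real OLAP engines use far more elaborate storage and pre-aggregation than this.

```python
from collections import defaultdict

# Hypothetical fact table: each row is (region, product, quarter, sales).
FACTS = [
    ("East", "Widgets", "Q1", 120),
    ("East", "Widgets", "Q2", 150),
    ("East", "Gadgets", "Q1", 80),
    ("West", "Widgets", "Q1", 200),
    ("West", "Gadgets", "Q2", 60),
]
# Names of the dimensions, in the same order as the row fields above.
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum the sales measure, grouping by the dimensions listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last field is the sales measure
    return dict(totals)


# The same facts viewed at two different levels of detail.
by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```

The point of the sketch is that the manager picks the dimensions of interest and the system does the aggregation, rather than a programmer writing a new report for every question.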

The development of the BI market in the 1990s was driven to a large extent by the relational database vendors, who had grown and achieved strong market presence after the “database wars” in the early part of the decade. As the BI market developed, there was a need for infrastructure to manage enterprise-wide data in an integrated and cleansed manner, which gave rise to a market for “data warehouse” solutions, again a market that came to be dominated to a good extent by traditional database vendors. The 1990s also marked the rise of web-based analytics, and a shift in the complexity, scope and frequency of the traditional BI applications of alerting, reporting, and dashboarding. As business operations increased in pace and complexity, BI solutions were required to integrate an ever-increasing variety of data sets on a more real-time basis. Towards the end of the decade, BI moved from the realm of aiding decision-making to actually driving change in the organization through “enterprise performance management” (EPM), which centered on driving change in organizational processes, management and accountability structures through the capture, analysis and interpretation of data and information across all levels of the organization. Complex event processing and other solutions under the “operational intelligence” umbrella were adopted for real-time analysis and for identifying and defining broader event patterns in long-running business processes.

In the 2000s, with the rise of unstructured data, the availability of cheap networked computers, and technology advances in networking and storage, “raw data processing platforms” such as Hadoop came to be commonly adopted. Improving price-performance ratios of hardware made new predictive analytics techniques feasible and gave rise to “analytic platforms” (such as AsterData), which provided a hardware and software infrastructure for performing complex descriptive and predictive analytics in a “pre-packaged appliance” form factor. As we will see in the next installment, such analytic platforms are increasingly converging with statistical and machine learning processing solutions to provide a full end-to-end capability for advanced analytics.

In summary, the nature of adoption of BI/decision-support solutions has evolved along three dimensions: 1) from reactive to prescriptive: the batch data reporting of the 1960s was meant to provide a rear-view mirror of what happened, as opposed to today when BI solutions are used to provide recommendations on future actions, 2) from strategic long-range to operational short-term: from using reports and analysis to understand long term trends to using solutions to provide real-time feedback, and 3) from internal function-oriented to external integrated: from using siloed views of organizational information to driving views that are integrated across functions and organizations.  The market for managerial data analysis and decision-making solutions continues to evolve as the cost of technology falls further, and as the business needs around management of data become more varied and complex.

In the next installment of the Big Data Technology Series, we will examine the third thread in the story of big data technology evolution: the origins and history of packages and solutions for statistical processing and data mining.









The Platformization of Robotics


It was a treat to read the latest special report in The Economist on advances in robotics (Immigrants From the Future, March 29 2014). Robotics is one of the technologies that suffer from “high visibility of its promises and near-invisibility of its successes”. When people think about robots, they invariably picture something that is incomplete and bug-prone. Yet, as the report discusses, robots are slowly but surely making their way into businesses and households. The potential of robotic technology is highlighted by a recent string of acquisitions of robotics companies by Google, and by Amazon’s announcement in December 2013 of its intention to use robotic drones for household package delivery. Drones are already being used extensively by the US military for ISR (Intelligence, Surveillance, and Reconnaissance) operations.

Robotics is getting industrialized: robots are evolving from quaint one-off creations into standardized products of industrial technology. This big push into robotics will be driven to a great extent by the falling cost of the sensors, processors and other hardware that go into making a robot. What is going to accelerate the process, however, is the platformization of robotic technology, which will form the foundation of cheap mass production in the future. There are many evolutionary parallels here between robotics and other modern technologies. For example, the modern-day digital platform enabled by such technologies as cloud computing, open source software, and social networks has evolved to give the proliferating technology startups a quick and easy way to bring a variety of solutions to market through plug-and-play, Lego-like assembly of basic technology blocks. Until recently, robot development was a cottage industry, demanding expertise in a number of fields such as artificial intelligence, sensor and engineering technology, and electronics. Increasingly, however, robots can be designed, assembled and tested in a standardized and automated manner.

New developments across the entire cycle of robotic development, including design, prototyping and operation, will enable this push into platformization. Robots are essentially assemblies of various hardware and electronic modules controlled and coordinated by software. Standardized robotic design will increasingly be aided by such developments as the emergence of the Robot Operating System (ROS), which provides a uniform way to enable this software-based control and coordination. The Open Source Robotics Foundation, a not-for-profit that manages ROS, provides a forum for open source collaboration and development that will further drive the standardization and adoption of such building blocks in robotic design. Prototyping and testing are an important step in robot development, since this is where the rubber meets the road. Increasingly, teams are using sophisticated simulation software to predict the actual performance of their designs, often circumventing entirely the need to build a prototype. Where there is such a need, teams have used 3D manufacturing techniques to quickly manufacture and assemble robotic parts, which has greatly improved the lead times and cost of building and testing robotic assemblies. Finally, the operation of robots will increasingly be standardized and automated. The “Internet of Things” and cloud-based technologies will increasingly enable collaboration and the externalization of functionality, leading to leaner and simpler operation: robots will collaborate with and “learn” from each other and from other connected devices, and will be able to tap into the vast trove of online data and knowledge for all aspects of their functioning, such as object recognition and decision-making. Indeed, “cloud robotics” is an emerging field that envisions this convergence between robotics and cloud computing technologies.
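The idea of hardware modules "controlled and coordinated by software" can be illustrated with a toy publish/subscribe message bus, the general pattern that ROS builds on with its topics. The sketch below is plain Python and is not the ROS API; the MessageBus class, the topic names and the 0.5 m threshold are assumptions made up for this example.

```python
from collections import defaultdict


class MessageBus:
    """Toy topic-based publish/subscribe bus (illustrative, not ROS)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a function to be called for every message on `topic`.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of `topic`, in order.
        for callback in self._subscribers[topic]:
            callback(message)


bus = MessageBus()
motor_commands = []  # stands in for a motor driver module


def obstacle_avoider(distance_m):
    # Hypothetical rule: stop when an obstacle is closer than 0.5 m.
    speed = 0.0 if distance_m < 0.5 else 1.0
    bus.publish("motor/speed", speed)


# Wire the modules together; none of them knows about the others directly.
bus.subscribe("sensor/range", obstacle_avoider)
bus.subscribe("motor/speed", motor_commands.append)

# Simulated range-sensor readings, in metres.
for reading in (2.0, 1.2, 0.3):
    bus.publish("sensor/range", reading)

print(motor_commands)
```

Because the sensor, the decision logic and the motor only share topic names, any one of them can be swapped for a different module without touching the rest, which is the kind of standardization the paragraph above describes.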

Robotics is a fascinating field in that it offers unique insights into the human psyche and consciousness. We are still far from seeing a real-life C-3PO among us. The developments in the field thus far, however, seem promising, and robots will continue to delight and surprise us in new ways for years to come.

Big Data Technology Series – Part 2

In the first introductory installment of the Big Data Technology Series, we looked at some of the drivers behind recent innovation in the database management and analytic platform market. We also looked at a high-level scheme to classify the plethora of technologies and solutions currently in the market based on four distinct environments. Before we start getting into the details of this technology landscape, it will be worthwhile to take a step back and understand a bit of history. As Winston Churchill once said, “The farther you can look into the past, the farther you can see into the future”. Once we understand history, we will shift our focus to understanding some of the characteristics, needs and current technology trends in big data management in a bit more detail. This will then set us up nicely to understand core capabilities and modules in a typical big data platform, and key vendor offerings in this space.

The story of big data technology evolution has three major threads in it. First, database management technology has continued to evolve and improve over time, both from a hardware perspective as well as from a logical modeling perspective.  A number of innovations in the big data landscape are innovations in database management platforms, so it is important to understand how databases originated in the first place and how they have evolved. The second thread is related to development of business intelligence and analytic platforms, which have their origins in the concept of decision support systems of the past.  The concept of decision support systems and executive information systems originated in management circles, and technology to fully realize the vision originated only in the 1990s with the rise of the modern-day data warehousing and business intelligence platforms.  Such platforms form an integral part of the big data landscape and have continued to develop, offering analytic capabilities ranging from strategic long-range descriptive style to operational and real-time predictive and prescriptive style.  The third thread is related to technologies and packages for statistical processing, which originated in such fields as agricultural research and social sciences, and have slowly made their way into commercial business applications.  Statistical processing as applied to big data sets is becoming increasingly feasible due to falling cost of technology, and forms an important trend in the big data evolution story.  These three threads are increasingly getting intertwined, at least from a logical perspective if not from a physical one.  We will examine how each one of these threads has evolved over time, starting with the database management platform (see the graphic below).

Database Platform History

In the 1960s, mainframe computer programs were required to manage batch transactional data processing. Report generators and file processing software were the “database management systems” of the early 1960s, handling batch-mode file manipulation and reading tasks offloaded from the expensive mainframe computers. Data in those days was stored as flat files on slow magnetic tape systems that provided serial data access. Report generators and file processing software routinized data manipulation and other tasks; however, application programs still had intimate knowledge of how data was structured at the physical level. Because of this strong coupling between programming logic and data design, it was extremely cumbersome to change and extend programs. Each program was written to manage its own data, and data sharing across programs or modules was very limited. As the data management needs of the programs increased, it became increasingly cumbersome to manage and administer the data subsystem. The emergence of random-access magnetic disk drives further complicated the management and administration of data. This multiplying complexity led to the emergence of more sophisticated data subsystems such as General Electric’s IDS and IBM’s IMS around the mid-1960s. The concept of a “database management system” had not been defined yet, although solutions such as IDS and IMS were early embodiments of the concept.

The concept of the database management system (DBMS) was outlined by CODASYL (an industry consortium) in the late 1960s. The concept envisioned extending the existing file management and data subsystems into an integrated platform that provided a single corporate information store, a “data base”, in support of Management Information System (MIS) capabilities for online integrated data creation, manipulation and reporting. CODASYL’s conceptual outline of the DBMS was a major influence on the evolution of the independent DBMS software industry in the subsequent decade. That industry developed as a result of IBM’s introduction of System/360, which standardized the operating system software for IBM product lines, and IBM’s decision to unbundle software from its hardware, both of which happened in the mid-to-late 1960s.

CODASYL’s vision of the DBMS was based on the so-called “navigational model”, in which data elements in the data structure were linked with each other as part of a linked list. As the independent software industry developed through the 1970s, several DBMS solutions based on CODASYL specifications arrived on the market. CODASYL’s model had its advantages, but it was extremely complex to develop and manage. These drawbacks, and specifically the lack of an effective way to search for data elements, prompted E. F. Codd, an IBM researcher, to develop an alternative data model called the “relational model” in the early 1970s. The relational model represented data in a very different way, and paved the way for higher-level abstraction in data manipulation and querying. While the CODASYL DBMS market was in full bloom in the 1970s, a number of upstart solutions based on the relational model, such as IBM’s System R and Ingres, started appearing in parallel.
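The contrast between the two models is easier to see side by side. The following is a rough sketch, not a reconstruction of any CODASYL product: the navigational half uses hypothetical Customer and Order classes joined by explicit links, and the relational half uses Python's built-in sqlite3 module with a made-up orders table.

```python
import sqlite3


# Navigational style: records carry explicit links, and the program must
# know the link structure and walk it record by record.
class Order:
    def __init__(self, order_id, amount, next_order=None):
        self.order_id = order_id
        self.amount = amount
        self.next_order = next_order  # pointer to the next order in the chain


class Customer:
    def __init__(self, name, first_order=None):
        self.name = name
        self.first_order = first_order  # entry point into the order chain


def total_navigational(customer):
    """Follow the chain of orders and add up the amounts."""
    total = 0
    order = customer.first_order
    while order is not None:
        total += order.amount
        order = order.next_order
    return total


acme = Customer("Acme", Order(1, 100, Order(2, 250)))

# Relational style: data sits in a table, and the program states what it
# wants; the database decides how to find it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Acme", 100), (2, "Acme", 250), (3, "Zenith", 75)],
)


def total_relational(connection, customer_name):
    """Ask for the sum declaratively instead of walking links."""
    row = connection.execute(
        "SELECT SUM(amount) FROM orders WHERE customer = ?", (customer_name,)
    ).fetchone()
    return row[0]


print(total_navigational(acme), total_relational(conn, "Acme"))
```

Both functions return the same total, but only the navigational one breaks if the way orders are chained together changes. That independence from physical layout is the higher-level abstraction the relational model made possible.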

The relational model quickly became popular, and a number of independent relational database vendors appeared on the market in the mid-to-late 1970s, notable among them the future Oracle Corporation. Throughout the 1970s and 1980s, the relational database model was successively refined through the development of query languages, database indexing techniques, storage management, and query optimization and execution management. As the relational model was developed, the market witnessed major commercialization of relational database management systems (RDBMS). The RDBMS solutions, due to their flexibility and ease of use, started challenging the market dominance of the navigational DBMS solutions, and became serious contenders as the market developed and solutions matured by the mid-to-late 1980s. The 1980s also saw the rise of the PC and personal databases, for which the relational model was perfectly suited.

The 1980s was an important period in the history of the database management system for another reason: it marked the beginning of the “performance gap”, a sizeable spread between the speed and processing capacity of the CPU and that of database storage, mainly disk. Moore’s law has driven the exponential performance increase of processor chips for decades, giving rise to ever more powerful CPUs; the magnetic disk drive, however, has lagged in performance, so much so that disk is still several orders of magnitude slower than a modern-day CPU (see graphic below).

Performance Gap

This performance gap started becoming prominent beginning in the 1980s, and it had a significant impact on the design and architecture of database management platforms in the 1980s and 1990s.  To bridge the performance gap and to maximize the use of expensive CPU resources, database management systems had to develop intricate management and optimization techniques around caching, disk/memory management and data movement.  As powerful computers became cheap in the 1990s, distributed computing across commodity machines started becoming popular.  The Internet too entered the business mainstream in the 1990s, resulting in increased data needs of businesses, all of which imposed very high performance requirements on RDBMS platforms. All this gave rise to various distributed data management architectures in RDBMS in the 1990s, such as vertical database scaling as part of SMP designs, database clustering, etc. that provided ways to enhance RDBMS performance.  The 1990s also witnessed the rise of alternative data modeling techniques such as the object-oriented data model, gaining some acceptance but not enough adoption to be able to challenge the hegemony of the relational model.
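One of the techniques mentioned above, caching, can be sketched in a few lines. The example below is a simplified least-recently-used (LRU) page cache; the BufferCache class and the slow_disk_read stand-in are hypothetical, and real database buffer managers are considerably more sophisticated.

```python
from collections import OrderedDict


class BufferCache:
    """Keep recently used pages in memory so repeat reads skip the disk."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.pages = OrderedDict()  # page_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        data = self.read_from_disk(page_id)  # the slow path
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used
        return data


disk_reads = []  # records every trip to the (simulated) disk


def slow_disk_read(page_id):
    # Stand-in for a real disk access.
    disk_reads.append(page_id)
    return f"page-{page_id}"


cache = BufferCache(capacity=2, read_from_disk=slow_disk_read)
for page in (1, 2, 1, 3, 2):
    cache.get(page)

print(cache.hits, cache.misses, disk_reads)
```

Every hit is a disk access avoided, which is why so much database engineering effort in this period went into deciding what to keep in memory and what to evict.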

Increasing complexity led to increasing headaches and cost in the implementation and management of the modern database management system. Vendors responded by innovating in product development and configuration, introducing database appliances that allowed customers to buy pre-configured solutions requiring minimal setup. The 2000s witnessed the emergence of new players in the information-based economy, notably Google and Amazon. The data management needs of these new data-driven companies led to the development of the so-called “post-relational database models” such as key-value stores and document-oriented databases. New data management architectures involving horizontal scaling (massively parallel processing, or MPP) and supporting innovations in the logical data management layer (in the form of Google’s MapReduce) were invented. The falling cost of hardware allowed database vendors to bring to market a range of new database appliances, such as those using in-memory computing and flash storage. Finally, cloud computing technologies enabled vendors to bring to market new offerings around Database as a Service.
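For readers who have not seen the MapReduce idea in action, here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) applied to a word count. The function names are my own, and the sketch leaves out everything that makes the real thing hard, such as distributing work across machines and handling failures.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # Map every document, shuffle the pairs by key, then reduce each group.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data big ideas", "data platforms"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be spread across many commodity machines, which is what makes the model a good fit for horizontal scaling.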

The database management platform continues to evolve as the cost of technology falls further, and as the business needs around management of data become more varied and complex. We have not yet seen an alternative to the hugely successful relational model; increasingly, however, the relational model will be complemented with alternative data management models as guided by specific needs and opportunities. The database management technology market has evolved from a “one size fits all” state to one with an “assorted mix” of tools and techniques that are best of breed and fit for purpose. We will surely continue to witness yet more interesting developments in this important market.

In the next installment of the Big Data Technology Series, we will examine the second thread in the story of big data technology evolution: business intelligence and analytic platforms that provide information analysis and information delivery capabilities. We will start with the origins of decision support systems, and understand how technology has evolved to provide such capabilities as alerting, dashboarding and analytical reporting.


The Real Buzz Behind Bitcoin

Bitcoin is in the news again. The cryptocurrency, after making a splash in 2012, has of late earned the ire of investors and governments. The digital currency has been one of the worst performing assets year to date, per a recent article published on LinkedIn. A popular Bitcoin trading exchange based in Japan abruptly went bust earlier this year, taking with it millions in investor money. Investors, as some like to say, have been “bitconned”.

While Bitcoin the currency has run into a raft of regulatory, fiscal and technical challenges, the enthusiasm around the potential of Bitcoin the platform remains unabated. The Bitcoin platform that underpins the digital currency is essentially an automated, distributed, self-policing platform for managing ownership (see this series of YouTube videos on Bitcoin by Campbell Harvey, Professor of Finance at Duke’s Fuqua School of Business). Essentially, Bitcoin’s platform provides foundational infrastructural services (such as encryption, non-repudiation, reconciliation etc.) for managing transactions related to ownership of digital assets in a distributed and automated manner through the use of public key cryptography and distributed computing. It offers a distributed consensus-driven model, so no central coordinating authority or overseer is required to validate and track transactions. And it is automated, managed transparently by a distributed network of machines coordinating the work with each other. If you are still confused as to what Bitcoin is and why people are so gung-ho about it, I don’t blame you. The concept of Bitcoin is nifty, with wide-ranging and complex implications. Perhaps an analogy will help.

The emergence and evolution of the Bitcoin platform is not unlike that of the modern-day Internet. The early days of computing were characterized by a prevalence of giant computing machines (the mainframes and the minicomputers) that packed all the computing power and resources for networking and storage. Networking in those days was closed and proprietary, with each vendor managing its own stack of hardware and related software. Getting computers to connect and talk to each other required a lot of intermediation, manual setup and ongoing management, due to a centralized approach to computing and a lack of clear and common communication standards. The evolution and ultimate emergence of the Internet changed all that. The evolution of the TCP/IP protocols and the standardization of other networking mechanisms provided a common platform for standardized communication. The Internet took the complexity out of the task of creating complex distributed systems by providing foundational infrastructural services and guarantees. The Internet thus allowed us to transition from a centralized “command and control” computing paradigm to one involving secure distributed computing.

Like the Internet, the Bitcoin platform attempts to provide infrastructural services and guarantees for the task of managing ownership transactions in an automated and distributed manner. The institutions, contracts and arrangements needed today to manage ownership of such assets as, say, stocks or bonds (think of custodians and clearing agencies) are reminiscent of the mainframe-based computing and networking era in our analogy. The Bitcoin platform aims to change this picture fundamentally, from the ground up. Essentially, it aims to provide an Internet-like distributed platform for managing digital asset ownership. As the Bitcoin platform and its protocols evolve and go through their growing pains, we will increasingly have a stable and solid platform for managing ownership of digital assets in an automated and decentralized manner. Just as the Internet allowed the market to focus on creating higher-order value-added stacks, products and services (think of the World Wide Web, or email protocols such as SMTP), the Bitcoin platform has the potential to provide the infrastructure not just for digital currency, but for anything that can be digitized as an asset. Imagine the range of possibilities that a Bitcoin-like platform can enable in the “Internet of Things”, in which all things physical will have digital identities interconnected in a networked world.
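To give a flavour of how ownership can be tracked without a central record keeper, here is a heavily simplified sketch of a hash-chained ledger of ownership transfers. It is an illustration only: it uses a single in-memory list rather than a distributed network, it omits the digital signatures and consensus mechanism that Bitcoin relies on, and the function names and the "bond-42" asset are made up.

```python
import hashlib
import json


def block_hash(block):
    """Fingerprint a block by hashing its contents with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_transfer(chain, asset, sender, recipient):
    """Add a transfer that is cryptographically tied to the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append(
        {"asset": asset, "from": sender, "to": recipient, "prev_hash": prev_hash}
    )


def is_valid(chain):
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


def current_owner(chain, asset):
    """The owner is whoever received the asset in its most recent transfer."""
    owner = None
    for block in chain:
        if block["asset"] == asset:
            owner = block["to"]
    return owner


ledger = []
append_transfer(ledger, "bond-42", "issuer", "alice")
append_transfer(ledger, "bond-42", "alice", "bob")
print(current_owner(ledger, "bond-42"), is_valid(ledger))
```

Because each block embeds the hash of the one before it, quietly rewriting an old transfer changes that block's hash and breaks the chain, which anyone holding a copy can detect. In the real platform, many machines hold copies and agree on the valid chain, which is what removes the need for a custodian or clearing agency.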

The Bitcoin platform has already unleashed a wave of innovation, as demonstrated for example by the rise of other virtual currencies such as Colored Coins, which provide an abstraction layer to encode information about ownership of real-world assets such as property, stocks, or bonds. The Bitcoin platform is still in its infancy, and there are a number of technical kinks related to security and scalability that need to be worked out. Regardless, as a platform technology, Bitcoin is a potentially major disruptive force that, like the Internet, can have far-reaching consequences for entire industry structures and value chains.