Building in a Continuous Integration Environment

Continuous Integration (CI), which spans the practices, processes and tools that drive continuous development and integration of software code, is a key building block of an organization’s DevOps methodology. An important CI component is the software build process. The build process has traditionally been a second-class citizen of the software development world, relegated to the background as organizations spend limited resources on customer-facing and project management functions. Software development and delivery is inherently fragile, and one of its most fragile parts is the build process, because development managers have traditionally lacked clear visibility into and control of it. Too often, software builds break easily, are difficult to change, and are resource-intensive to troubleshoot. With the increasing pace of business change and higher delivery pressures, however, every link in the software development and delivery chain will need to be streamlined, and one of the key areas organizations will need to focus on as part of their CI journey is the software build process.

Building is the process of compiling raw software code, assembling and linking the various program components, loading external/third-party libraries, testing to ensure that the build has executed successfully, and packaging the code into a deployable package. While this may seem simple and straightforward, building in a big program is complex enough that an overall build architecture usually needs to be defined first, along with a dedicated team and infrastructure for ongoing management. Building usually happens at multiple levels for different consumers: development builds focused on single-component development and testing, integration builds across multiple components, system builds for all the system components, and release builds focused on building customer releases. Build architecture deals with making specific choices around what to build, how often to build, and how to build it. Inefficient builds that take too long to finish, challenges with isolating shipped product bugs and replicating them in the development environment, or challenges integrating multiple code streams into release builds efficiently are all symptoms of an inefficient build architecture. Architecting the build process starts with identifying what to build in a software configuration management system. Code branching and merging, dependency management, management of environment configuration and property files, and versioning of third-party libraries are all critical components of software configuration management that are closely tied to identifying what software components are needed in a build. “How often to build” involves defining the build schedules for development, integration, system and release builds, depending upon a number of factors such as the number of development work streams, the frequency of releases required, and the capacity of the project team, to name a few. The build schedule clearly identifies how builds are promoted through the development pipeline, through successive stages of development, testing, quality assurance and deployment. Last but not least, having appropriate build process tooling and infrastructure allows building to be declaratively controlled, automated, and efficiently executed. Build script automation, build parallelization, and tracking and reporting of build metrics are all usually managed by a modern build management platform.
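To ground the terminology, here is a minimal sketch of what a declaratively controlled, automated build script might look like. The stage names and the Maven commands are placeholder assumptions; substitute whatever compiler, test runner and packager a given project actually uses.

```python
"""Minimal sketch of a scripted build: compile, test, package.

The mvn commands below are placeholders for illustration; swap in the
tools your project actually uses.
"""
import subprocess
import sys

# Each stage is a named command; order matters, and any failure stops the build.
BUILD_STAGES = [
    ("compile", ["mvn", "-q", "compile"]),   # compile and link sources
    ("unit-test", ["mvn", "-q", "test"]),    # fail fast on broken tests
    ("package", ["mvn", "-q", "package"]),   # produce the deployable artifact
]

def run_build():
    for name, cmd in BUILD_STAGES:
        print(f"[build] stage: {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A broken stage breaks the build; report and stop immediately.
            print(f"[build] FAILED at stage '{name}'", file=sys.stderr)
            return False
    print("[build] SUCCESS: deployable package produced")
    return True

if __name__ == "__main__":
    sys.exit(0 if run_build() else 1)
```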

With so many moving parts to the build process, it is easy to see where and how things can go haywire. While there may be multiple factors contributing to inefficient and slow software development and delivery, ineffective building is almost always one of them. Traditional build management suffers from a number of usual-suspect issues, and most of them are process issues, not tool issues. One of the most common approaches to building has been the big-bang style of executing integration builds, where independent development work streams bring their codebases together as part of an infrequent big-bang event. Compounding the problem is the tendency of development teams to throw code over the wall to the build team, which is then painstakingly tasked with assembling sources, data, and other inputs to begin the building process. The big-bang integration brings a big bang of integration issues and broken builds, a.k.a. “integration hell” in industry parlance. Cultural and management issues play into this as well: build teams are not empowered to exercise and implement build discipline with development teams, which in turn do not always act on feedback from the build team on broken builds in a timely manner, leading to lengthened build cycle times. Lack of exhaustive pre-commit testing in the development phase, either because development teams are not incentivized to unit test or because effective unit testing harnesses are not available in the first place, leads to “bad commits” and downstream integration issues, putting pressure on the build team. Many build issues can be traced to plain poor build process management. For example, complex source code branching and componentization schemes complicate build tasks and make the build process error-prone. Management of dependencies and build configuration frequently lacks sophistication, which leads to challenges in implementing incremental builds, and in turn to build issues. Inadequate build automation and infrastructure can lead to a host of issues as well. For example, manual environment setup and data management complicate build tasks, making building error-prone and lengthening cycle times. Build infrastructure oftentimes is not adequate to execute complex from-scratch builds, which can take hours to complete, further lengthening build cycle times.

Build management as part of CI aims to get around these challenges to streamline and turbocharge the build process, and ultimately improve the development process overall. And it begins by fundamentally changing the traditional development and delivery mindset. Whereas the usual approach involves “software craftsmen” working independently to create perfect, fully functional modules that are then integrated over time, building in a CI environment espouses a much more agile approach in which team members come together to develop base product functionality as quickly as possible, incrementally building on it to deliver the full product over time. Cultural change to drive close integration between development, testing/QA and build teams is key: in a continuous building environment, the development team works hand in hand with the build and testing teams, and a “buildmeister” has the authority to direct development teams to ensure successful build outcomes. Successful continuous building starts with overhauling the development effort, enacting practices such as test-driven development and other precepts of Extreme Programming, such as feedback loops that encourage the development team to take ownership of ensuring successful testing and building. Development teams check in and build often, sometimes several times a day. And they follow strict practices around check-ins, bug fixing and addressing broken builds. A continuous building environment is characterized by the presence of a “build pipeline” – a conceptual structure which holds a series of builds, spanning the life cycle of the build development process, beginning with a developer’s private build all the way to a fully tested release build, each ready to be pulled by any team as needed. To enable this, a CI server is used to automate and manage the build management process. The CI server is a daemon process that continually monitors the source code repository for updates and automatically processes builds to keep the build pipeline going. The CI server allows builds to be pulled out of the pipeline by individual teams and advanced through the pipeline as build promotion steps are successfully completed. With each promotion step, the build becomes more robust and complete, and thus moves closer to being shipped to the customer.
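A toy sketch of the CI-server idea described above, a daemon that polls the repository and pushes successful builds through the pipeline, might look like the following. The git commands are standard, but the pipeline stage names and the build_and_test hook (which calls a hypothetical build.py like the earlier sketch) are illustrative placeholders.

```python
"""Toy CI-server loop: poll the repo, build on change, promote through stages.

Assumes a local git checkout; the stages and build script are illustrative.
"""
import subprocess
import time

PIPELINE = ["commit-build", "integration-build", "system-build", "release-build"]

def head_commit(repo_path):
    # Ask git for the current tip of the monitored branch.
    out = subprocess.run(["git", "-C", repo_path, "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def build_and_test(repo_path):
    # Placeholder: run the project's build script (e.g., the earlier sketch).
    return subprocess.run(["python", "build.py"], cwd=repo_path).returncode == 0

def run_ci(repo_path, poll_seconds=60):
    last_built = None
    while True:
        commit = head_commit(repo_path)
        if commit != last_built:
            print(f"[ci] new commit {commit[:8]}, starting pipeline")
            for stage in PIPELINE:
                if not build_and_test(repo_path):
                    print(f"[ci] {stage} broke; notify the team and stop promotion")
                    break
                print(f"[ci] {stage} passed; promoting build")
            last_built = commit
        time.sleep(poll_seconds)
```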

To achieve success with continuous building, practitioners recommend a set of best practices across build management, processes and tools. A few key ones relate to software configuration and environment management: specifically, that development be managed from one “global” source code branch (or with minimal branching) and that all build components, including property and configuration files, be managed in a versioning system. Then there are process-related best practices, which call for development teams to follow pre-commit testing and to act on broken-build feedback on a priority basis. Automation is a key aspect of continuous building best practices as well: a CI server that manages builds and the build pipeline is a key component of the build automation infrastructure and is central to implementing continuous building.

Build management is a key part of CI, and one where a number of issues with traditional software development and delivery methodologies lie. Achieving full CI, however, requires change to other key parts as well, for example testing and deployment management. In future DevOps-related posts, we will look at these other aspects of CI.

A New Kid on the Blockchain

Fintech, the application of information technology to the world of finance, is the topic of discussion in The Economist’s latest special report on banking (Special Report on International Banking). Bitcoin was featured in one of the articles, but this time the focus is not on bitcoin the currency per se, but on the blockchain, Bitcoin’s underlying protocol that enables distributed ledger management using cryptography and powerful computers spread across the world’s data centers.


The blockchain, since its invention by Satoshi Nakamoto (the obscure inventor behind Bitcoin and its protocol), has taken the world of fintech by storm. The blockchain is being touted as the next big thing, not unlike the Internet and its underlying communication protocols, with the potential to revolutionize everything from money transfer to real estate transactions and the internet of things. Blockchain, as a concept, is being bastardized to serve multiple applications, including communication, agreements, asset transfers, record tracking, etc. Numerous startups are cropping up to provide value-added services on top of the original Bitcoin blockchain, such as CoinSpark, an Israeli startup that has devised a technology to add information and metadata to the blockchain, one application of which is providing “notary” services for agreements and documents recorded on the blockchain. There are other outfits, however, that are fundamentally trying to re-architect the original blockchain to make it better or to make it work for specific purposes.
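To make the distributed-ledger idea a little more concrete, here is a toy sketch of the hash-chaining that makes a blockchain tamper-evident. It is not Bitcoin’s actual protocol: there is no proof-of-work, no consensus and no peer-to-peer network, just the core idea that each block commits to the hash of the previous one.

```python
"""Toy hash-chained ledger: altering any recorded entry invalidates the chain.
Conceptual sketch only -- omits proof-of-work, consensus, and networking.
"""
import hashlib
import json
import time

def make_block(prev_hash, records):
    block = {"timestamp": time.time(), "prev_hash": prev_hash, "records": records}
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify_chain(chain):
    for prev, curr in zip(chain, chain[1:]):
        body = {k: v for k, v in curr.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if curr["prev_hash"] != prev["hash"]:
            return False                      # link to the previous block broken
        if hashlib.sha256(payload).hexdigest() != curr["hash"]:
            return False                      # block contents were altered
    return True

genesis = make_block(prev_hash="0" * 64, records=["genesis"])
chain = [genesis, make_block(genesis["hash"], ["alice pays bob 5"])]
print(verify_chain(chain))                    # True
chain[1]["records"] = ["alice pays bob 500"]  # tamper with the ledger
print(verify_chain(chain))                    # False -- the stored hash no longer matches
```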

Colored Coins, for instance, enables the storage and transaction of “smart property” on top of the blockchain. Smart property is property whose ownership is controlled via the blockchain using “smart contracts,” which are contracts enforced by computer algorithms that can automatically execute the stipulations of an agreement once predetermined conditions are activated. Examples of smart property could include stocks, bonds, houses, cars, boats, and commodities. By harnessing blockchain technology as both a ledger and trading instrument, the Colored Coins protocol functions as a distributed asset management platform, facilitating issuance across different asset categories by individuals as well as businesses. This could have a significant impact on the global economy as the technology permits property ownership to be transferred in a safe, quick, and transparent manner without an intermediary. Visionaries see many other exciting opportunities too, including linking telecommunications with blockchain technology. This could, for example, provide car-leasing companies the ability to automatically deactivate the digital keys needed to operate a leased vehicle if a loan payment is missed.
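The idea of stipulations being executed automatically once predetermined conditions are met can be sketched in a few lines of ordinary code. The lease terms, amounts and key-deactivation logic below are hypothetical, and a real smart contract would be recorded and enforced on a blockchain platform rather than on a single machine; this is only a conceptual sketch of the car-leasing example above.

```python
"""Toy 'smart contract' sketch: code that enforces an agreement's terms.
Illustrative only; a real contract of this kind would live on a blockchain.
"""
from datetime import date

class LeaseContract:
    def __init__(self, monthly_payment, due_day=1, grace_days=10):
        self.monthly_payment = monthly_payment
        self.due_day = due_day
        self.grace_days = grace_days
        self.payments = []            # dates on which payments were received
        self.key_active = True        # the state this contract controls

    def record_payment(self, paid_on, amount):
        if amount >= self.monthly_payment:
            self.payments.append(paid_on)

    def enforce(self, today):
        # Stipulation: if no payment has arrived this month within the grace
        # period, automatically deactivate the digital car key.
        paid_this_month = any(p.year == today.year and p.month == today.month
                              for p in self.payments)
        if not paid_this_month and today.day > self.due_day + self.grace_days:
            self.key_active = False
        return self.key_active

lease = LeaseContract(monthly_payment=400)
print(lease.enforce(date(2015, 5, 20)))   # False: payment missed, key deactivated
```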

Ethereum is another outfit that has created technology to develop blockchain-based smart contracts. Ethereum—an open-source development project that provides a platform for developers to create and publish next-generation distributed applications—uses blockchain technology to facilitate the trading of binding smart contracts that can act as a substitute for conventional business documents. The technology allows the contracts to be traced and used to confirm business deals without the need to turn to the legal system. Then there are outfits such as Ripple Labs that are devising their own blockchain-like protocols to facilitate quick and secure money transfer.

Other blockchain innovation involves combining blockchain technology with conventional technologies. IBM and Samsung are developing a blockchain-powered backbone for Internet of Things products called ADEPT that combines three protocols – BitTorrent (file sharing), Ethereum (smart contracts) and TeleHash (peer-to-peer messaging). ADEPT is a blockchain-powered secure communication and transaction protocol for devices. When a washing machine, for example, is bought by a consumer, ADEPT will allow the washing machine to be automatically registered in the home-based network of things, not just sending messages to and receiving messages from other registered devices, but also automatically initiating and fulfilling transactions on its own for, say, replenishing the washing powder by placing an order with the local grocery store.

These innovations are at the leading edge of blockchain technology, and it will be several years before their use becomes widespread, if it does at all. In the meantime, more mundane applications of the blockchain have great potential to flourish. Future fintech entrepreneurs should not discount the blockchain as the ground for their creative pursuits. All that is needed is a “killer app” that niftily applies the concept to solve a present-day problem. Just as Marc Andreessen’s Netscape Navigator unleashed a wave of innovation in the history of the Internet, so too will a blockchain startup unleash one in the world of distributed ledgers and asset registers.

The DevOps Movement

The DevOps movement has been resurgent in the past few years as companies look to improve their delivery capabilities to meet rapidly shifting market needs and business priorities. Many have been preaching that companies should become not just robust and agile, but in fact “anti-fragile,” with the ability to expect failures and adapt to them. The likes of Google, Amazon and Netflix embody this agile and anti-fragile philosophy, and traditional business houses facing increasingly uncertain and competitive markets want to borrow a chapter from their books and become agile and anti-fragile as well; DevOps is high on their list as a means to achieve that.


DevOps is a loose constellation of philosophies, approaches, work practices, technologies and tactics to enable anti-fragility in the development and delivery of software and business systems. In the DevOps world, traditional software development and delivery, with its craft and cottage-industry approaches, is turned on its head. Software development is fraught with inherent risks and challenges, which DevOps confronts and embraces. The concept seems exciting, a lot of companies are talking about it, some claim to do it, but nobody really understands how to do it!

Much of the available literature on DevOps talks about everything being continuous in the DevOps world: Continuous Integration, Continuous Delivery and Continuous Feedback. Not only does this literature fail to address how the concept translates into reality, it also takes an overly simplistic view of the change involved: use Chef to automate your deployment, or use the Jenkins continuous integration server to do “continuous integration”. To be fair, the concept of DevOps is still evolving. However, much can be done to educate the common folk on the conceptual underpinnings of DevOps before jumping to the more mundane and mechanistic aspects.

DevOps is much more a methodology, process and cultural change than anything else. The concept borrows heavily from existing manufacturing methodologies and practices such as Lean and Kanban and extends existing thinking around lean software development to the enterprise. Whereas the traditional software development approach is based on a “push” model, DevOps focuses on building a continuous delivery pipeline in which things are “pulled” actively by different teams as required to keep the pipeline going at all times. It takes agile development and delivery methodologies such as Scrum and XP and extends them into operations, so as to enable not just agile development but agile delivery as well. And it attempts to transform the frequently cantankerous relationship between the traditionally separated groups of development and operations into a synergistic, mutually supportive one. Even within the development sphere, DevOps aims to bring the various players, including development, testing & QA, and build management, together by encouraging teams to take on responsibilities beyond their immediate role (e.g., development taking on more of testing) and empowering traditionally relegated roles to positions of influence (e.g., a build manager taking developers to task for fixing broken builds).

We are still in the early days of the DevOps movement, and until we witness real-life references and case studies of how DevOps has been implemented end-to-end, learning about DevOps will be a bit of an academic exercise. Having said that, some literature does come close to actually articulating what it means to put into practice such concepts as Continuous Delivery and Continuous Integration. To the curious, I would recommend the Martin Fowler Signature Series of books on the two topics. Although agonizingly technical, the two books do a good job of getting down to brass tacks. My future posts on DevOps will be an attempt to synthesize some of the teachings from those books into management summaries.

Big Data Technology Series – Part 7

As we saw in the second installment of the Big Data Series (Big Data Technology Series – Part 2), the database management system market continues to evolve with the falling cost of hardware, the rising need to process massive distributed data sets, and the emergence of cloud-based service models. It used to be that relational database management systems were the be-all and end-all of database management architectures. The limitations of the relational model in handling internet-scale data and computing requirements gave rise to NoSQL and other non-relational database management systems, which are now being used to handle specialized cases where the relational model fails. Database management architectures have thus evolved from a “one size fits all” state to an “assorted mix” of tools and techniques that are best of breed and fit for purpose. Given the plethora of database management tools and technologies, how does one begin to create such a “fit for purpose” architecture? What key trade-offs does a database architect need to make while selecting the tools to manage data? To refresh what we discussed in the first introductory installment of the Big Data Series, database management systems fall in the “Operational Environment” (see the graphic below). We will delve a bit deeper into this operational environment in this post.

The Big Data Operational Environment

When selecting an appropriate database management system in an operational distributed data environment, several dimensions come into play. Data consistency obviously is one key dimension (and one at which the relational model excels), but in a distributed environment, other dimensions such as availability and partition tolerance become key. Described below is a list of key dimensions, grouped in three buckets, that are critical when evaluating a database management architecture. These dimensions need to be traded off based on specific requirements to arrive at a solution that is fit for purpose. For example, relational databases provide good consistency and performance for OLTP-like workloads, but may not be well suited to handle multi-join queries that span multiple entities and nodes (i.e., high data processing complexity and scope).

Dimensions
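One concrete way these trade-offs show up in practice is the quorum tuning used by Dynamo-style key-value stores: with N replicas, choosing read and write quorum sizes such that R + W > N favors consistency, while smaller quorums favor availability and latency. The toy sketch below illustrates the rule only; it is not the implementation of any particular product.

```python
"""Toy sketch of quorum-tuned replication in a Dynamo-style store.
With N replicas, R + W > N guarantees a read quorum overlaps the latest
write quorum (consistency); smaller R and W trade that for availability.
"""
class QuorumStore:
    def __init__(self, n_replicas=3, write_quorum=2, read_quorum=2):
        self.n, self.w, self.r = n_replicas, write_quorum, read_quorum
        # Each replica holds {key: (version, value)}.
        self.replicas = [dict() for _ in range(self.n)]
        self.version = 0

    def put(self, key, value):
        # A write succeeds once W replicas (here: the first W) acknowledge it.
        self.version += 1
        for replica in self.replicas[:self.w]:
            replica[key] = (self.version, value)

    def get(self, key):
        # Read from R replicas (here: the last R) and return the freshest copy.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1] if answers else None

store = QuorumStore(n_replicas=3, write_quorum=2, read_quorum=2)   # R + W > N
store.put("account:42", "balance=100")
print(store.get("account:42"))   # the read quorum always overlaps the write

stale = QuorumStore(n_replicas=3, write_quorum=1, read_quorum=1)   # R + W <= N
stale.put("account:42", "balance=100")
print(stale.get("account:42"))   # None: the lone read replica never saw the write
```

Dynamo-style stores such as Cassandra and Riak expose exactly this kind of per-request consistency tuning, which is why a single product can sit at different points along these dimensions depending on configuration.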

In addition to traditional RDBMS database clusters and appliances, there are now several classes of database management products in a database architect’s arsenal: NewSQL databases, Document Stores, and Column Stores, to name a few. How these different solutions compare and contrast can best be seen through the lens of the aforementioned dimensions. Described below are the following classes of database management systems, seen through the lens of this framework: 1) NewSQL Databases, 2) Key-Value Stores, 3) Document Stores, 4) Column Family Stores, and 5) Graph Databases.

NewSQL Databases

Key-Value Stores

Document Stores

Column Family Stores

Graph Databases

Econinformatics?

For most business executives, the term “economics” conjures images of either the simplistic supply-demand graphs they may have come across in Economics 101, or theoreticians devising arcane macroeconomic models to study the impact of interest rates and money supply on national economies. Although businesses such as banks and financial institutions have maintained armies of economists on their payrolls, the economist’s stature and standing even in such institutions has been relatively modest, limited to providing advice on general market trends and developments, as opposed to actionable recommendations directly impacting the business bottom line. Even the most successful business executive would be stumped when faced with the question of how exactly economics is applied to improving their day-to-day business. All this may now be changing, thanks to a more front-and-center role of economics in new-age businesses that now routinely employ economists to sift through all kinds of data to fine-tune their product offerings, pricing and other business strategies. Textbook economists of yore are descending from their ivory towers and taking on a new role, one that is increasingly being shaped by the availability of new analytic tools and raw market data.


Economists, especially the macro kind, are a dispraised bunch, with a large part of the criticism stemming from their inability to predict major economic events (economists famously failed to anticipate the 2008 market crash). For this and other reasons (not least the Lucas Critique), macroeconomic modeling focused on building large-scale econometric models has been losing its allure for some time. Microeconomic modeling, enabled by powerful data-driven micro-econometric models focused on individual entities, has been transforming and expanding over the past few decades. The ever-expanding use of sophisticated micro-models on large data sets has caused some to see this as laying the foundation for “real-time econometrics”. Econometrics, the interdisciplinary study of empirical economics combining economics, statistics and computer science, has continually evolved over the past several decades thanks to advances in computing and statistics, and is yet again ready for disruption – this time due to the availability of massive data sets and easy-to-procure computing power to run econometric analyses. The likes of Google, Yahoo and Facebook are already applying advanced micro-econometric models to understand the causal statistics surrounding advertising, displays and impressions and their impact on key business variables such as clicks and searches. Applied econometrics is but one feather in the modern economist’s cap: economists are also at the forefront of the sharing economy and “market design”.
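For a flavor of what data-driven micro-modeling looks like in code, the tiny regression below estimates the marginal effect of ad impressions on clicks from synthetic data. The numbers are invented purely for illustration, and real work of this kind relies on far more careful causal designs than a naive OLS fit.

```python
"""Minimal sketch of a micro-econometric regression: clicks on impressions.
The data are synthetic and the model deliberately naive; real causal work
would use experiments or quasi-experimental designs, not a raw OLS fit.
"""
import numpy as np

rng = np.random.default_rng(0)
impressions = rng.uniform(100, 10_000, size=500)           # ads shown per user
clicks = 0.02 * impressions + rng.normal(0, 5, size=500)   # true effect: 0.02

# Ordinary least squares: clicks = a + b * impressions
X = np.column_stack([np.ones_like(impressions), impressions])
(a, b), *_ = np.linalg.lstsq(X, clicks, rcond=None)
print(f"estimated clicks per additional impression: {b:.4f}")  # close to 0.02
```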

A celebrated area of economic modeling and research that has found successful application in business is “market design” and “matching theory”, pioneered by Nobel prize-winning economists Al Roth and Lloyd Shapley. Market design and matching theory is concerned with optimizing the pairing of providers and consumers in a marketplace based on “fit” that is driven by dimensions going beyond just price. Al Roth successfully applied game-theory-based market design and matching algorithms to improve a number of marketplaces, including the placement of New York City’s high school students, the matching of medical students with residency programs, and kidney donation programs. The fundamentals of matching theory are being widely applied by economists today: many modern online markets and sharing platforms such as eBay and Lyft are in the business of matching suppliers/providers and consumers, and economists employed by these outfits have successfully applied those fundamentals to improving their businesses, increasingly with the aid of multi-dimensional data that is available in real time. Other marketplaces, including LinkedIn (workers and employers) and Accretive Health (doctors and patients), have applied similar learnings to improve their matching quality and effectiveness. Airbnb economists analyzed data to try to figure out why certain hosts were more successful than others in sharing their space with guests, and successfully applied their learnings to help struggling hosts and also to better balance supply and demand in many of Airbnb’s markets (their analysis pointed out that successful hosts shared high-quality pictures of their homes, which led Airbnb to offer a complimentary photography service to its hosts).
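The deferred-acceptance algorithm at the heart of Roth and Shapley’s work is short enough to sketch. The students and residency programs below are hypothetical stand-ins for whichever two sides of a market are being matched.

```python
"""Gale-Shapley deferred acceptance: a stable one-to-one matching.
Proposers (e.g., students) propose in order of preference; receivers
(e.g., residency programs) tentatively hold their best offer so far.
"""
def deferred_acceptance(proposer_prefs, receiver_prefs):
    # rank[r][p] = how receiver r ranks proposer p (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    free = list(proposer_prefs)                 # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    match = {}                                  # receiver -> proposer

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]   # p's best receiver not yet tried
        next_choice[p] += 1
        if r not in match:
            match[r] = p                        # r tentatively accepts
        elif rank[r][p] < rank[r][match[r]]:
            free.append(match[r])               # r trades up, releasing its old match
            match[r] = p
        else:
            free.append(p)                      # rejected; p will try its next choice

    return {p: r for r, p in match.items()}

students = {"ann": ["city", "state"], "bob": ["city", "state"]}
programs = {"city": ["bob", "ann"], "state": ["ann", "bob"]}
print(deferred_acceptance(students, programs))  # {'bob': 'city', 'ann': 'state'}
```

The resulting matching is stable: no student and program pair would both prefer each other over the partners they ended up with, which is the property that makes the mechanism workable in real marketplaces.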

Beyond market design, economics research is changing in a number of areas thanks to the availability of large data sets and analytic tools, as Liran Einav and Jonathan Levin of Stanford University outline in “The Data Revolution and Economic Analysis“. One such area is the measurement of the state of the economy and economic activity, and the generation of economic statistics to inform policy making. The issue with macroeconomic measurement is that the raw data produced by the official statistical agencies comes with a lag and is subject to revision. Gross domestic product (GDP), for example, is a quarterly series that is published with a two-month lag and revised over the next four years. Contrast this with the ability to collect real-time economic data, as is being done by the Billion Prices Project, which collects vast amounts of retail transaction data in near real time to develop a retail price inflation index. What’s more, new data sets may allow economists to shine a light on places of economic activity that have been dark heretofore. Small businesses’ contribution to the national economic output, for example, is routinely underestimated because of the exclusion of certain businesses. Companies such as Intuit, which does business with many small outfits, now have payroll transaction data that can potentially be analyzed to gauge the economic contribution of such small businesses. Moody’s Analytics has partnered with ADP, the payroll software and services vendor, to enhance official private sector employment statistics based on ADP’s payroll data.
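The mechanics of turning a stream of scraped prices into an inflation index are straightforward; below is a toy fixed-basket (Laspeyres-style) computation on invented prices, not the Billion Prices Project’s actual methodology.

```python
"""Toy daily price index from scraped prices (fixed-basket, Laspeyres-style).
Prices and basket quantities are invented for illustration only.
"""
basket = {"milk": 10, "bread": 6, "gasoline": 40}   # fixed base-period quantities

# Daily average prices per item, e.g. as scraped from online retailers.
daily_prices = {
    "2015-05-01": {"milk": 0.80, "bread": 2.10, "gasoline": 2.55},
    "2015-05-02": {"milk": 0.82, "bread": 2.10, "gasoline": 2.61},
}

def index(prices, base_prices, basket):
    # Cost of the fixed basket today relative to its cost in the base period.
    cost = sum(qty * prices[item] for item, qty in basket.items())
    base_cost = sum(qty * base_prices[item] for item, qty in basket.items())
    return 100 * cost / base_cost

base = daily_prices["2015-05-01"]
for day, prices in sorted(daily_prices.items()):
    print(day, round(index(prices, base, basket), 2))   # 100.0, then 102.12
```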

Conservatives and the old guard may downplay the role of data in applied economics, reveling in their grand macroeconomic models and theories. To be fair, empirical modeling would be lost without theory. However, data’s “invisible hand” in shaping today’s online markets and business models is perceptible, if not openly visible. Economists of all stripes would be well advised to pay attention to the increasing role of data in their field. Next time you see an economist, ask them to go take a course on Machine Learning in the Computer Science department, to pass on Google Chief Economist Hal Varian’s counsel – it will be time well spent.

The Promise of Geospatial and Satellite Data

Stories of traders vying to compete and finding new ways to beat the market and eke out profits are nothing new. But what has been interesting of late is the lengths to which trading houses are now able to take the competition, thanks to real-time data analytics supporting trading indicators and signals, purveyors of which include companies like Genscape and Orbital Insight. Orbital Insight, a startup specializing in analytics and real-time intelligence solutions, was featured in a recent Wall Street Journal writeup (Startups Mine Market-Moving Data From Fields, Parking Lots, WSJ, Nov 20, 2014). Genscape, a more established player, is another outfit that employs sophisticated surveillance and data-crunching technology to supply traders with nonpublic information about topics including oil supplies, electric-power production, retail traffic, and crop yields. Genscape and Orbital are but a couple of players in a broad developing market of “situational intelligence” solutions that provide the infrastructure and the intelligence for rapid real-time data-driven decision-making. These two companies, however, are particularly interesting because they provide a view into the promise of geospatial and satellite imagery data and how it can be exploited to disrupt traditional operational and tactical decision-making processes.


Geospatial data is simply data about things and related events indexed in three-dimensional geographic space on earth (with temporal data collected too for events taking place across time). Geospatial data sources are of two types: GPS data gathered through satellites and ground-based navigation systems, and remote sensing data that involves specialized devices collecting data and transmitting it in digital form (sensors, radars and drones fall in this type). Geospatial data is of interest to private corporations and public entities alike. When triangulated with traditional data sources, personal data, and social media feeds, it can provide valuable insight into real-time sales and logistics activities, enabling real-time optimization. On the public side, geospatial data can provide valuable information for detecting and tracking epidemics, following the migration of refugees in a conflict zone, or gathering intelligence of geopolitical significance. These are but a handful of the use cases such data makes possible.

Once the preserve of secretive governments and intelligence agencies worldwide, geospatial and satellite imagery data is slowly but surely entering the commercial and public domains, spawning an entire industry ranging from outfits that build and manage the satellite and sensor infrastructure, to manufacturers and suppliers of the parts and components that make up the satellites, and not least entities such as Orbital Insight that add value to the raw data by providing real-time actionable information to businesses. Orbital Insight, for example, leverages sophisticated machine learning algorithms and analysis against huge volumes of satellite imagery made available by DigitalGlobe’s Geospatial Big Data™ platform, allowing accurate, verifiable information to be extracted. Outfits such as DigitalGlobe, Planet Labs, and Blackbridge Geomatics are examples of companies that are making investments to launch and manage the satellite and sensor infrastructure to collect detailed real-time geospatial data. Google, not to be left behind in the space race, jumped into the market with its acquisition of Skybox Imaging earlier this year. Skybox intends to build a constellation of twenty-four satellites that will collect anything and everything across the globe. What’s more, Skybox, unlike other companies such as DigitalGlobe, intends to make all the data it collects through its satellite constellation available for public and commercial use. But even companies such as Skybox are not blazing the trail in the satellite business – there are numerous other start-ups vying to put into orbit low-cost and disposable nano satellites that will be much smaller and cheaper to launch and manage. These developments are only going to create and open up an even wider range of applications for private and public use than has been possible heretofore.
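As a toy illustration of the kind of signal extraction involved (counting cars in a parking lot or tanks at an oil depot, say), the sketch below thresholds a synthetic image and counts the bright blobs. The array stands in for real satellite imagery, and production systems such as Orbital Insight’s rely on trained computer-vision models rather than a simple threshold.

```python
"""Toy 'count the bright objects' routine of the sort used to turn imagery
into a signal. The image is a synthetic numpy array standing in for a real
scene; real pipelines use trained computer-vision models, not a threshold.
"""
import numpy as np
from scipy import ndimage

image = np.zeros((50, 50))
image[5:8, 5:8] = 1.0       # three bright blobs standing in for parked cars
image[20:23, 30:33] = 0.9
image[40:43, 10:13] = 0.8

bright = image > 0.5                      # threshold reflective objects
labels, count = ndimage.label(bright)     # group adjacent pixels into objects
print(f"objects detected: {count}")       # 3
```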

These are still very early days for the commercial application of geospatial and satellite imagery data, and exciting developments are still ahead of us. For one, the number and kinds of data sources that such applications may need to handle in the future will be exponentially higher: imagine a fleet of satellites, aerial drones, quadcopters and ground-based sensors all providing various kinds of data that could potentially be collated and flanged together. So too will we need new algorithms and ways of storing and manipulating streaming data at mind-boggling scales, all of which may require a level of thinking beyond what we currently have.

 

What is a “Container” Anyway?

After reading a recent announcement by Google about enhancing Google App Engine to provide support for Docker, a popular containerization technology, I started wondering exactly what containers were and why they were taking the world of computing by storm.  The hallowed Container is being touted as the next king of the virtualization world, getting ready to displace the mighty virtual machine (VM).

Containers are basically lightweight VMs that provide VM-like functionality with some constraints and conditions, but do so in a manner that is more efficient than a VM. A traditional VM is supported by a hypervisor, which is basically a software layer that abstracts the full underlying physical hardware, a key reason why traditional VM infrastructure is resource intensive. The result is that it is possible to run only so many VMs at a given time on a hardware platform with finite resources (such as CPU and memory). Where containers prove to be clever is that they are supported by a container engine that abstracts only bits and pieces of the underlying platform, relying on the operating system for the other required functionality. This is a double-edged sword, however. The simplicity and lightweight nature of containers does indeed mean that many more containers (and therefore application workloads) can be run on a piece of hardware than traditional VMs. However, since there is tight coupling between the container and the native operating system, all containers running on a given piece of hardware are forced to share the same operating system, with the result that containers do not allow operating system multiplicity, i.e., it is not possible to run application workloads on different operating systems on the same physical platform.
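A quick way to see the shared-kernel point for yourself, assuming a Linux host with Docker installed, is to ask a container for its kernel version: it reports the host’s kernel, because there is no guest operating system underneath it. The snippet below sketches that check.

```python
"""Sketch: show that a container shares the host kernel.
Assumes a Linux host with Docker installed and the 'alpine' image available.
"""
import platform
import subprocess

host_kernel = platform.release()
container_kernel = subprocess.run(
    ["docker", "run", "--rm", "alpine", "uname", "-r"],
    capture_output=True, text=True, check=True).stdout.strip()

# Unlike a VM with its own guest OS, the container reports the host's kernel.
print("host kernel:     ", host_kernel)
print("container kernel:", container_kernel)
print("same kernel:", host_kernel == container_kernel)
```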

Containers in computing are not unlike the containers of the physical shipping and transportation world. Before containers arrived on the scene, transportation and trade of goods was handled manually, with longshoremen moving break-bulk cargo in and out of a ship’s hold, playing a game of maritime Tetris. The manual nature of loading and packing severely limited the efficiency and speed of transportation and trade. This situation of yore is akin to developing, deploying and managing applications in a purely physical environment, where application code had to be painstakingly tested for functionality and performance on a raft of major platforms and their operating system variants. To be clear, trade and transportation was not all manual before standard shipping containers came on the scene: containerization had begun in bits and pieces as early as the 1800s. This containerization was, however, heavily fragmented, with each trading company devising its own practices, standards, and supporting infrastructure for loading, unloading and securing the containers. This containerization enabled them to achieve efficiencies for the specific types of cargo and the specific modes (rail, road, or sea) they dealt with. This efficiency however came at a cost: the cost of having to maintain and manage their own infrastructure, practices and standards for loading/unloading, packing/unpacking, and transporting. This is not unlike the situation we have today with VMs, where although there is the flexibility of running workloads on different operating systems, there is the overhead of managing all the supporting infrastructure to enable that flexibility.

With the emergence of containers (the computing kind), we are moving into the realm of standardization, as the container of the physical world did when Malcolm McLean, the American trucking magnate, devised his revolutionary intermodal shipping container, which standardized the industry on a common set of standards, practices and intermodal transfer and handling of goods. Companies such as Docker are trying to do what Malcolm McLean did: standardize and streamline the way workloads can be encapsulated, ported and run on a set of commonly used operating systems. Malcolm McLean’s intermodal container transformed the shipping industry. It remains to be seen whether the likes of Docker can replicate success on that scale in the computing world.

Data Platforms Set Off a Cambrian Explosion

The Jan. 18, 2014, edition of The Economist featured “A Cambrian Moment,” a special report on tech start-ups. The report discusses how the digital marketplace is experiencing a Cambrian explosion of products and services brought to market by an ever-increasing plethora of new start-ups. Precipitating this explosion is the modern “digital platform” – an assembly of basic building blocks of open-source software, cloud computing, and social networks that is revolutionizing the IT industry. This digital platform has enabled technology start-ups to rapidly design and bring to market a raft of new products and services.


A similar evolution is taking form in the world of data. Annabelle Gawer, a researcher at Imperial College Business School, has argued that “platforms” are a common feature of highly evolved complex systems, whether economic or biological. The emergence of such platforms is the ultimate result of evolving exogenous conditions that force a recombination and rearrangement of the building blocks of such systems. In the data world, this rearrangement is taking place thanks to the falling cost of information processing, the standardization of data formats, and the maturation of large-scale connectivity protocols. The emergence of such data platforms has significant implications for a range of industries, not the least of which are data-intensive industries like healthcare and financial services. Like the digital platform, the data platform will give rise to an explosion of data-enabled services and products.

Early evidence already can be seen in the pharmaceutical and drug manufacturing industries. Drug manufacturers need thousands of patients with certain disease states for late-stage clinical trials of drugs under development. This patient recruitment process can take years. The Wall Street Journal reported that to speed up the recruitment process, drug manufacturers have turned to entities such as Blue Chip Marketing Worldwide, a drug-industry contractor that provides data-enabled solutions based on consumer data sets obtained from Experian, a data broker and provider of consumer data. Blue Chip uses sophisticated big data analyses and algorithms to identify individuals with potential disease states. Blue Chip’s services have already enabled a drug manufacturer to cut patient recruitment time from years to months.

Trading is another industry where early indications of such movements can be seen. Also reported in The Wall Street Journal, in their never-ending quest to gather market-moving information quicker, traders have turned to outfits such as Genscape, a player in the growing industry that employs sophisticated surveillance and data-crunching technology to supply traders with nonpublic information about topics including oil supplies, electric-power production, retail traffic, and crop yield. Founded by two former power traders, Genscape crunches vast amounts of sensor data, video camera feeds, and satellite imagery to draw patterns and make predictions on potential movements in supply and demand of commodities such as oil and electricity.

We are in the early days of this evolutionary process. The evolution of the data platform will most likely mirror the evolution of the digital platform, although it is expected to proceed at a faster pace. As competing technologies and solutions evolve and mature, we will see consolidation and emergence of just a few data platforms that will benefit from the tremendous horizontal economies of scale. Information industry incumbents such as Google will be in the pole position to be leaders in the data platform space. For example, Google already has a massive base of web and social interaction data thanks to Google Search, Android, and Google Maps, and it is making aggressive moves to expand into the “Internet of things,” the next frontier of big data.

Concurrently, we will witness a broad emergence of a long tail of small vendors of industry-oriented data products and services. Blue Chip and Genscape are first movers in a nascent market. As the economics of harvesting and using data become more attractive, an increasing number of industry players will want to leverage third-party data services, which in turn will create opportunities for data entrepreneurs. The Cambrian explosion will then be complete and the start-up garden of the data world will be in full bloom.

 

Google’s Lessons Applied to The Consulting Profession

I recently read a review of the newly released book “How Google Works” by Eric Schmidt and Jonathan Rosenberg (Don’t Be Modest). This is an interesting book by the architects of Google on what makes Google so successful, and how other companies can emulate its success factors and experience the growth it has over the past fifteen years. The authors describe three critical factors that are central to Google’s success: 1) Thinking Big, 2) Failing Fast and 3) Leveraging Data. As I was reading the review of the authors’ treatment of the subject, I could not help but think about the applicability of these success factors to the management consulting profession. When I come across successful management consulting professionals in my day-to-day professional life, I see them possessing all of these success factors.

First and foremost, “Thinking Big”, or in the parlance of Silicon Valley, the “moonshot”. Google’s management is not happy with employees striving for incremental improvements; they want employees to think up ideas that will deliver 10x gains. And so it is with management consulting professionals, or at least the good ones. Consultants are paid big bucks not to provide advice that merely enhances existing strategies, but to think up big ideas that change the nature of the game itself. Truly successful consultants however do not stop there: they are actually able to convince their clients to launch such game-changing strategies, and subsequently also help them execute those strategies. “Thinking Big” is applicable to the more mundane and normal as well. If the client has asked for a cost reduction strategy that delivers 10% cost improvement, there is no reason not to strive to overachieve and over-deliver for a 15% reduction if possible. And of course, “Thinking Big” in a consulting context means having the big picture. Many a time, consultants get caught up in the weeds and the nitty-gritty detail of the data analyses, and fail to extend their findings to truly altering the big picture. Thinking Big sets a high bar to over-achieve and over-deliver, and what can be better than that for someone in a services industry such as consulting?

Second, “Failing Fast”, which refers to the cultural aspect of encouraging people to rapidly experiment to test the viability of their ideas, adapt and evolve the ideas based on the results, or move on if those ideas are not workable. “Iteration is the most important part of the strategy”, as the authors of the book advise. Google is well known for its ability to push out hundreds of incremental product updates every day as part of this iterative strategy. This agile, iterative approach is in Google’s DNA. And the significance of this piece of advice can be seen if one considers how Agile-based approaches have been successfully applied to managing complex projects and delivering ground-breaking new products. Iteration is at the heart of Agile. An agile, iterative approach to problem solving is an absolutely necessary tool in any good consultant’s toolbox. This is particularly important in a strategy and advisory setting, where the information at hand is often incomplete and the problem to be solved is ambiguous. Even the sharpest consultants will fail if they cannot rapidly adapt their approach as conditions change and new information emerges during the course of the project.

Finally, “Leveraging Data”, which refers to the primacy of data over the HiPPO (the highest-paid person’s opinion), groupthink, collective experience, intuition or gut. Google conducts hundreds of data-driven experiments to gauge how a particular product feature or enhancement would be received by the marketplace. Such data-driven decision-making is at the core of a number of successful companies in other industries. This is exactly the essence of the fact-based, hypothesis-driven approach that all good management consultants swear by. All good scientific approaches to solving management problems start with defining hypotheses, and then collecting and analyzing data to test those hypotheses. Good consultants may start with their favorite hypotheses, but will be the first ones to throw them out the window if the data show those hypotheses to be untenable, no matter how much effort has been spent in defining them and no matter how near and dear they are to them. Data, no matter how insignificant a role it plays, lends a certain credibility to strategic analysis that plain intuition and “experience” cannot. Distinctive consulting skill comes from the ability to frame up the data analysis in a way that is effective and efficient, as well as the ability to interpret the results in a way that is convincing and yet simple.

Providing the foundation for these critical factors is Google’s culture of employee empowerment and enfranchisement. Google hires the best and the brightest, but goes to great lengths to ensure that employees are happy and engaged. And culture perhaps is the greatest factor that distinguishes great consulting firms from the mediocre ones. A culture that encourages healthy debate and intellectual challenge to authority, no matter how low down the totem pole you are, and one that even goes beyond that to make it a mandatory requirement, an “obligation to dissent”. A culture that is fair and ruthlessly meritocratic, yet one that is fun. But most importantly, a culture that patiently cares about the development of the company’s most important asset by providing strong mentorship, support and professional guidance at the right place and at the right time.

Big Data Technology Series – Part 6

In the last few installments of the Big Data Technology Series, we looked at the evolution of database management, business intelligence and analytics systems, and statistical processing software. In this installment, we will look at the modern advanced analytical platform for big data analytics, which represents a confluence of the three evolutionary threads in the story of data management and analytics platforms. We will look at the core capabilities of such platforms, and the major vendors in the marketplace.

The graphic below provides a view of the core capabilities of a modern advanced analytical platform. There is a wide range of analytical platforms in the marketplace, with each platform excelling and specializing in a specific aspect of big data analytics capabilities. The graphic presents a logical universe of the capabilities found in these platforms. In other words, there isn’t one single platform that provides all the core capabilities described below in their entirety.

Big Data Analytics Platform - Capabilities

  1. Hardware – Hardware includes the data processing and storage components of the analytical platform stack, providing management and redundancy of data storage. As we saw in Part 2 of the big data technology series, the database management platform and associated hardware have continued to evolve ever since the first databases appeared on the market in the 1950s. Database hardware was once a proprietary component of the stack providing considerable value; however, it is increasingly becoming a commodity. Hardware innovation includes innovations in storage such as solid state devices and massively parallel node configurations connected by high-speed networks. Modern analytic platforms provide the flexibility to use configurations such as these, as well as configurations of commodity x86 machines for managing lakes of massive unstructured raw datasets.
  2. Analytic Database – This is the software layer that provides the logic behind managing the storage of datasets across the node cluster, managing such aspects as partitioning, replication, and optimal storage schemes (such as row or column). Analytical applications run most efficiently with certain storage and partitioning schemes (such as columnar data storage), and modern analytical platforms provide capabilities to configure and set up these data storage schemes. Memory-based analytic databases such as SAP HANA have added one more dimension to this – one that dictates how and when data should be processed in-memory and when it should be written to disk. Advances in database management systems have enabled the modern analytic platform to have at its disposal a range of tools and techniques to manage all data types (structured, semi-structured or unstructured) and all processing needs (data discovery, raw data processing, etc.).
  3. Execution Framework – The execution framework is a software layer that provides query processing, code generation capabilities and runtimes for code execution. Advanced analytical applications frequently have complex query routines, and a framework that can efficiently parse and process the queries is critical to the analytic platform. Furthermore, modern analytical platforms provide capabilities to structure advanced analytical processing through the use of higher-level programming languages such as Java and R. The execution framework provides the logic to convert such higher-level processing instructions into optimized query bits that are then submitted to the underlying analytical database management system. Advances in analytical platforms, as we saw in Part 3 of the big data technology series, have enabled these capabilities in the modern-day analytic platform.
  4. Data Access and Adaptors – Modern analytic platforms provide prebuilt, custom-developed and DIY connectors to a range of data sources such as traditional data warehouses, relational databases, Hadoop environments and streaming platforms. Such connectors provide bi-directional data integration between these data repositories and the analytic data store, giving the analytic platform visibility into the data no matter where and how it is stored.
  5. Modeling Toolkit – The modeling toolkit provides design-time functionality to develop and test code for running advanced analytics and statistical processing routines using higher-level languages such as Java, Python and R. This represents the third and final thread in our story of the evolution of big data analytic platforms – the evolution, rise and ultimately the convergence of statistical processing software into the logical big data analytic platform. The toolkit provides not only a range of pre-built and independent third-party libraries of routines for statistical processing, but also a framework that can be used and extended as needed to run custom statistical processing algorithms (a toy sketch of this push-the-computation-to-the-data idea follows the list below).
  6. Administration – Like any database management or traditional warehousing platform, the modern analytics platform provides strong administration and control capabilities to fine-tune and manage the workings of the platform. The rise of horizontal scaling using commodity machines has placed increased importance on being able to efficiently administer and manage large clusters of such data processing machines. Modern analytic platforms provide intuitive capabilities to finely control data partitioning schemes, clustering methods, backup and restore, etc.
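As a toy illustration of the in-database analytics idea referenced in item 5 above (push the heavy aggregation into the database engine and run the statistical routine on the compact result), here is a minimal sketch. sqlite3 stands in here for a real analytic database, and the table and columns are invented.

```python
"""Toy sketch of in-database analytics: aggregate inside the database,
then run a statistical routine on the (much smaller) result set.
sqlite3 stands in for a real analytic database; table/columns are invented.
"""
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 120.0), ("east", 80.0), ("west", 200.0), ("west", 150.0)])

# Heavy lifting (scan + aggregation) happens inside the database engine ...
rows = conn.execute(
    "SELECT region, AVG(amount), COUNT(*) FROM sales GROUP BY region").fetchall()

# ... and only the compact aggregate reaches the modeling layer.
averages = [avg for _, avg, _ in rows]
print(rows)                                    # e.g. [('east', 100.0, 2), ('west', 175.0, 2)]
print("spread across regions:", statistics.pstdev(averages))
```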

There are a range of players in the market for big data analytics platforms as depicted by the graphic below.

Big Data Analytics Platform - Market

There are roughly three categories of such product vendors:

  1. Type 1 (Traditional Data warehousing Vendors) – This category includes vendors such as IBM, SAP and Oracle that have traditionally done very well in the BI/data warehousing space.  These solutions have excelled in providing traditional analytic capabilities for mostly structured datasets. These vendors are rapidly extending their product capabilities to provide advanced analytical capabilities for big data sets, either indigenously or through acquisitions and joint ventures with niche vendors specializing in advanced big data analytics.
  2. Type 2 (SQL on Hadoop) – This category includes vendors that provide solutions to extend traditional Hadoop environments to deliver big data analytics in a real-time, ad hoc manner using SQL. Traditional Hadoop is well suited for large-scale batch analytics; however, the MapReduce architecture is not easily extensible to real-time, ad hoc analytics. Some products in this space do away with the MapReduce architecture completely in order to overcome these limitations.
  3. Type 3 (Independent Players) – This category includes vendors that have come up with proprietary schemes and architectures to provide real-time, ad hoc analytic platforms. Some, such as 1010data and Infobright, have existed for some time, while others, such as Google, are newcomers providing new ways to deliver analytic capabilities (e.g., Google offers a web-based service for running advanced analytics).

Below is a detailed description of the offerings from some of the major vendors in these three categories.

Type 1 (Traditional Data Warehousing Vendors) – Traditional vendors of enterprise data warehousing platforms and data warehousing appliances that have acquired and/or developed capabilities and solutions for large-scale data warehousing and data analytics.

Teradata

  • Teradata’s Aster database is a hybrid row and column data store that forms the foundation of its next-generation data discovery and data analytic capability; the data management platform can be delivered as a service, on commodity hardware, or in an appliance form factor
  • Teradata Enterprise Data Warehouse is its data warehousing solution; the EDW is marketed as the platform for dimensional analysis of structured data and standard data warehousing functions as part of its Unified Data Architecture, Teradata’s vision for an integrated platform for big data management
  • Teradata delivers the Hortonworks Hadoop distribution as part of its Unified Data Architecture vision; the Aster database supports native MapReduce-based processing for bi-directional integration with the Hadoop environment; SQL-H provides a SQL-based interface for higher-level analysis of Hadoop-based data
  • The Aster data discovery platform provides capabilities for advanced statistical analysis and data mining through pre-packaged function libraries, a development environment for custom analytic functions, and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently Teradata does not have a known solution for event stream processing (it has announced it may enter into partnerships with independent vendors of event stream processors)

Pivotal

  • Pivotal is an independent big data entity spun off from EMC after its acquisition of VMware and Greenplum; Pivotal’s data analytics platform is powered by the Greenplum database, a hybrid row and column, massively parallel data processing platform; the data management platform can be delivered as a service, on commodity hardware, or in an appliance form factor
  • Pivotal also offers an in-memory data management platform, GemFire, and a distributed SQL database platform, SQLFire ; Pivotal does not currently have a known solution for regular data warehousing
  • Greenplum Hadoop Distribution is a Greenplum supported version of Apache Hadoop; Greenplum database supports native MapReduce based processing for bi-directional integration with the Hadoop environment; Greenplum HAWQ provides SQL based interface for higher level analysis of Hadoop based data
  • Through partnerships with analytics vendors such as SAS and Alpine Data Labs, Greenplum platform provides capabilities for advanced statistical and data mining through pre-packaged function libraries, a development environment for custom analytic functions and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently Pivotal does not have known solution for event stream processing

IBM

  • IBM’s big data management platform is powered by Netezza, a massively parallel data storage and distributed data processing appliance; Netezza enables data warehousing and fast analysis of mostly structured, large-scale data
  • IBM’s PureData System for Analytics provides the foundation for big data analytics; IBM PureData System for Analytics is a data warehouse appliance; IBM Netezza Analytics is an advanced analytics framework incorporating a software development kit for analytic model development, third-party analytic libraries, and integrations with analytic solutions such as SAS and SPSS in support of in-database analytics
  • IBM PureData System for Operational Analytics focuses on analytics for operational workloads (as opposed to regular data warehousing workloads, which are more long-term and strategic in nature)
  • IBM Big Data Platform Accelerators provide analytic solution accelerators, i.e. pre-built examples and toolkits for areas such as video analytics and sentiment analytics, that enable users to jump-start their analytic development efforts
  • IBM provides a licensed and supported version of Apache Hadoop distribution as part of its InfoSphere BigInsights platform; BigInsights provides Jaql, a query and scripting language for unstructured data in Hadoop
  • IBM does not currently have a known solution for in-memory data management (like SAP HANA or Pivotal GemFire)
  • IBM provides InfoSphere Streams for data stream computing in big data environments

Oracle

  • Oracle’s big data management platform is supported by the Oracle database, which provides columnar compression and distributed database management for analytic functions
  • Oracle offers a range of appliances for big data warehousing and big data analysis; Oracle Exadata is a data warehousing appliance based on the Oracle database and Sun hardware
  • The Oracle Big Data Appliance is a packaged software and hardware platform for managing unstructured data processing; it provides a NoSQL database, the Cloudera Hadoop platform and associated management utilities, and connectors that enable integration of the data warehousing environment with the Hadoop environment
  • Advanced analytics are provided by Oracle R Enterprise, which provides a database execution environment for R programs, and Oracle Data Mining, which provides data mining functions callable from SQL and executable within the Oracle data appliance (a brief sketch of calling such in-database functions appears after this list)
  • Oracle Exalytics provides an in-memory database appliance for analytical applications, similar to SAP HANA
  • Oracle Event Processing and Oracle Exalogic provide capabilities for event stream processing
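The Oracle Data Mining bullet above notes that mining functions are callable from SQL. The minimal sketch below shows what that in-database pattern can look like from Python, assuming a previously trained model named churn_model, a customers table, and connection details that are all hypothetical; it uses the cx_Oracle client and Oracle's PREDICTION/PREDICTION_PROBABILITY SQL functions, and is an illustration rather than Oracle's documented quick-start.

```python
# Minimal sketch of in-database scoring via Oracle Data Mining's SQL interface.
# Assumptions (not from the source text): a trained model "churn_model",
# a "customers" table, and connection details; adjust to your environment.
import cx_Oracle

conn = cx_Oracle.connect(user="analyst", password="secret", dsn="dbhost/orcl")
cursor = conn.cursor()

# PREDICTION() runs the scoring inside the Oracle database engine, so the
# data never leaves the warehouse -- the essence of in-database analytics.
cursor.execute("""
    SELECT customer_id,
           PREDICTION(churn_model USING *)             AS predicted_class,
           PREDICTION_PROBABILITY(churn_model USING *) AS confidence
    FROM   customers
""")

for customer_id, predicted_class, confidence in cursor:
    print(customer_id, predicted_class, round(confidence, 3))

cursor.close()
conn.close()
```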

Type 2 (SQL on Hadoop) – Independent (i.e. not traditional data warehousing) solution providers whose big data warehousing and analytics platforms and products extend Hadoop with SQL capabilities, often architected using proprietary designs, delivered as a software solution, managed service, or cloud offering (although some offer appliances as well), and typically focused on a specific market niche.

Hadapt

  • Hadapt enables an analytical framework for structured and unstructured data on top of Hadoop by providing a SQL-based abstraction over HDFS, Mahout, and other Hadoop technologies
  • Hadapt also integrates with third-party analytic libraries and provides a development kit to enable development of custom analytic functions
  • Hadapt encourages deployment on commodity hardware configurations (as opposed to the proprietary appliances and platforms favored by Type 1 vendors)

CitusData

  • An analytic database built on PostgreSQL that offers SQL querying capabilities (see the sketch following this list)
  • Also offers SQL querying capabilities for data in Hadoop clusters
  • Offers a software solution that can run on commodity hardware
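Because CitusData builds on PostgreSQL, standard PostgreSQL clients and plain SQL can be used against it. The minimal sketch below uses the psycopg2 client with a hypothetical coordinator host, database, and events table, none of which appear in the source text.

```python
# Minimal sketch: querying a CitusData (PostgreSQL-based) cluster with an
# ordinary PostgreSQL client. Connection details and "events" are hypothetical.
import psycopg2

conn = psycopg2.connect(host="citus-coordinator.example.com",
                        dbname="analytics", user="analyst", password="secret")
cursor = conn.cursor()

# An ordinary SQL aggregation; the engine distributes it across worker nodes.
cursor.execute("""
    SELECT event_type, COUNT(*) AS occurrences
    FROM   events
    GROUP  BY event_type
    ORDER  BY occurrences DESC
""")

for event_type, occurrences in cursor.fetchall():
    print(event_type, occurrences)

cursor.close()
conn.close()
```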

Other Type 2

  • A number of vendors provide tools that enable SQL processing on top of Hadoop so as to enable higher-level analytics and processing by business analysts (who may not have the ability or time to code complex MapReduce functions)
  • Hive is a data warehousing solution for Hadoop-based data that provides a SQL-like query language (see the sketch after this list)
  • Greenplum HAWQ, Aster Data SQL-H and Cloudera Impala all aim to deliver higher-performance standard SQL on Hadoop by addressing the shortcomings and limitations of Hadoop MapReduce and Hive
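As a concrete illustration of the SQL-like access Hive provides, the sketch below runs a HiveQL aggregation from Python. The PyHive client, the host and port, and the web_logs table are assumptions made for illustration and are not mentioned in the source material.

```python
# Minimal sketch of the SQL-like access Hive provides over Hadoop data, using
# the PyHive client (one common option; not named in the source text).
# The host, port, and "web_logs" table are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; Hive compiles it into MapReduce (or newer
# execution engines) behind the scenes, which is what makes Hadoop data
# reachable for analysts who do not write MapReduce jobs.
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM   web_logs
    GROUP  BY status_code
    ORDER  BY hits DESC
    LIMIT  10
""")

for status_code, hits in cursor.fetchall():
    print(status_code, hits)

conn.close()
```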

Type 3 (Independent Players) – Independent (i.e. not traditional data warehousing) solution providers that offer big data analytics platforms and products architected using a proprietary, non-Hadoop design for big data analysis, delivered as a software solution on commodity hardware configurations, a managed service, or a cloud offering; this category also includes niche players.

1010data

  • Its proprietary database is a columnar, massively parallel data management system with an advanced, dynamic in-memory capability for analytics on mostly structured data
  • Delivers the solution as a cloud-based and hosted offering
  • Provides capabilities to perform granular statistical and predictive analytic routines that can be extended using 1010data’s proprietary language and interface
  • Started in the financial services space and is now expanding into manufacturing and retail

ParAccel

  • Software solution for columnar, compressed, massively parallel relational data management that is capable of all-in-memory processing (provides connectors to major traditional data warehousing platforms, operational systems, and Hadoop)
  • Supports on-premises and cloud-based deployment; on-premises deployment is supported on select commodity hardware configurations
  • Provides advanced in-database analytic solutions and libraries for a range of common and industry-specific use cases through partnerships with Numerix and Fuzzy Logix (vendors of analytic solutions)

Infobright

  • Offers a columnar, highly compressed data management solution (integrates with Hadoop)
  • Niche focus on analytics for machine generated data
  • Delivered as a software solution and as an appliance

LexisNexis

  • Provides Roxie, an analytic database and data warehousing solution, and a development and execution environment based on ECL, a proprietary query language
  • Provides pre-built analytic products and solutions for government, financial services, and insurance, as well as third-party analytic packages
  • Software solution delivered on certified hardware configurations (managed service and cloud offerings are on the way)
  • Focused on providing analytics related to fraud and other risk management applications

Google Cloud Platform

  • As part of its cloud computing platform, Google has released BigQuery, a real-time big data analytics service based on Dremel, a scalable, interactive ad hoc query system for analysis of large datasets (see the sketch after this list)
  • Other projects modeled after Dremel include Drill, an open-source Apache project led by MapR for interactive ad hoc querying and analysis of big data sets as part of its Hadoop distribution
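For a sense of what BigQuery's interactive SQL looks like in practice, here is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample datasets. The client library is a present-day assumption (it post-dates BigQuery's original launch), and project and credential configuration are taken from the environment.

```python
# Minimal sketch of an ad hoc BigQuery query from Python using the
# google-cloud-bigquery client; illustrative rather than canonical.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project/credentials from the environment

# BigQuery exposes Dremel-style interactive SQL over very large tables;
# this runs against a Google-hosted public sample dataset.
query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.corpus, row.total_words)
```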
