Token Economics 101

Background

Much has been written about how modern internet-based digital platforms such as Google, Facebook, and eBay, while creating market value and consumer choice, are prone to rent seeking due to monopolistic winner-take-all effects (There is No Single Solution To Making Internet More Decentralized). Besides causing economic distortions, platforms are prime targets for cyber attackers, as the Equifax and Sony security breaches have demonstrated. Further, although such platforms have created consumer surplus, they have also produced consumer deficits of their own by taking away privacy and control over one’s data. This view of the platforms is part of a larger story of the internet’s balkanization: independent domains emerging from platforms that do not interoperate, proprietary networks resulting from regulations that target net neutrality, and geographic islands created by governments restricting the free flow of information. Platform companies, which have built their businesses atop the internet, get blamed the most for this failure. The real failure, however, is the market’s failure to create effective structures to govern and to incentivize appropriate development of the internet and the applications sitting on top of it. This malaise is not limited to internet-based markets; it extends to all of today’s intermediated, centralized market systems. Rather serendipitously, the world may have discovered an alternative way to better manage the creation and development of efficient market systems: cryptocurrency or token economies. By aligning the incentives and needs of suppliers, consumers, and service providers, token economies create value for all participants in an efficient manner. Indeed, token economies can bring efficiencies not just to internet-based solutions, but to any market where goods and services are exchanged.

“Token Economy” As In Applied Psychology?

Token markets are markets whose workings are based on a cryptocurrency or token, such as Bitcoin. On one hand, such markets are guided by an explicitly defined crypto-economic system or token economy; on the other hand, they also evolve dynamically based upon participants’ interactions. Originating in applied psychology, the term “token economy” refers to a system of incentives that reinforces and builds desirable behaviors; essentially, a token economy implements the theory of incentives used to explain the origins of motivation. Token economies are a useful tool in behavioral economics, where they have been applied both in the lab and in the real world to study human decision making in economic contexts. In crypto markets, a token economy is implemented using digital assets as tokens and a framework of rules as the market protocol, implemented in code using cryptography. Token economics in this sense refers to the study, design, and implementation of economic systems based on cryptocurrencies. Supporting the token economies of cryptocurrency markets is distributed ledger technology (DLT), such as Blockchain. Tokenization is the process of converting some asset into a token unit that is recorded on the DLT: anything of economic value, such as real estate, commodities, or currency, can be tokenized, giving rise to a variety of different token economies.

What is Different?

By enabling costless verification and reducing the cost of networking, token economies streamline the operational functioning of markets and open them up for broad innovation (Some Simple Economics of the Blockchain). Markets are formed when buyers and sellers come together to exchange goods and services. Effective functioning of a market depends upon effective verification of transactions between sellers and buyers, a function that market intermediaries take on as market complexity increases. The intermediary structure, however, brings its own set of issues: intermediaries can misuse information disclosed to them by buyers and sellers; conflicts of interest, agency problems, and moral hazard may still persist; intermediaries can misuse their market power (as has happened with some tech giants recently); and their presence increases the overall cost structure of the market. Further, hurdles to developing trust between parties limit the extent to which business relationships develop, thus limiting innovation: it is costly and time-consuming to develop and enforce contracts to guide the exchange of property rights. Through the use of DLT, token economies provide costless verification (and thus mitigate the need for intermediaries) and reduce the cost of networking (and thus provide a way to efficiently exchange property rights). But what is fundamentally different about token economies is that they create a framework in which participants mutualize their interests and have strong incentives to continually improve the functioning of the economy.

Flywheels, Loops, and Knock On Effects

A key part of token economy design is “mechanism design”: the design of a system of incentives to encourage fruitful development of the token economy’s infrastructure as well as the overlying applications and services. Mechanism design addresses how tokens are used to pay participants for managing the DLT network, service usage, profit sharing, governance, and so on. An optimally designed token economy creates the appropriate mix of incentives to ensure a high-performing market infrastructure, valuable services, and smooth running of the platform. As overlay protocols, applications, and APIs are built on top of the base DLT (The Blockchain Application Stack), mechanism design ensures value is distributed equitably not just within a layer but across layers. An appropriately designed token system unleashes powerful feedback loops that perpetuate desirable behaviors and actions in the marketplace – indeed, mechanism design can either make the token economy the El Dorado of all crypto economies, or be the death knell of the token before it has had a fighting chance. Depending upon the specific market it is trying to make, each token economy will have a unique token design and its own analysis of how feedback loops are expected to take hold, but there are some common fundamentals to how incentives work in a token economy.

A new token system often comes to market through an ICO (initial coin offering), which allows the founding team to distribute tokens to raise capital for funding the enterprise. Much of the initial pull for the token depends upon the team’s vision and the soundness of the token economy. As the market’s perceived value of the token increases, participants, customers, and speculators take note and invest in the token, increasing its demand. Since token supply typically has a ceiling, increasing demand leads to increasing value (assuming token holders hold on to the token for at least some time period), which attracts even more participants, customers, and speculators. A growing number of participants strengthens the network in two ways: it makes the network more decentralized and secure, and participants work to improve the network and develop services on the platform. This increases the utility of the system, which attracts more customers and further strengthens the feedback loop.

These same mechanisms can work in reverse, sending the token economy into a death spiral if there is a perceived loss of value.

A token’s perceived utility can take a hit for a variety of reasons: it may not create enough utility in the market, or a security weakness may open it up to a cyberattack. A perceived loss in utility can lead to a selloff in the market, which depreciates the value of the token. As the token’s value depreciates, more speculators start dumping the token. A depreciating token value also disincentivizes participants from developing the token protocol and services, which reduces the token’s utility for consumers, leading to reduced demand. The increased supply of tokens in the market further reduces the token’s value, perpetuating the negative feedback loop.
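
The dynamics of these two loops can be made concrete with a toy simulation. The sketch below is purely illustrative: the growth rates, the size of the demand shock, and the assumption that price simply tracks demand against a capped supply are made-up parameters, not a model of any real token.

```python
# Toy difference-equation sketch of the positive and negative feedback loops
# described above. All numbers are illustrative assumptions.

def simulate(steps=12, demand=1_000.0, supply=10_000.0, participants=100.0,
             shock_at=6, shock=0.5):
    price_prev = 0.0
    for t in range(steps):
        if t == shock_at:
            demand *= shock                       # perceived loss of utility triggers a selloff
        price = demand / supply                   # capped supply: price tracks demand
        rising = price > price_prev
        participants *= 1.05 if rising else 0.90  # builders join on the way up, leave on the way down
        demand = demand * (1.03 if rising else 0.90) + 0.1 * participants
        print(f"t={t:2d}  price={price:7.4f}  participants={participants:7.1f}")
        price_prev = price

simulate()
```

Running the sketch shows the self-reinforcing climb before the shock and the mutually reinforcing decline after it, which is the whole point of careful mechanism design.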

Token Economies Galore

Token markets originated with digital currencies for P2P payments, the first to market being Bitcoin, which was followed by a number of alternative digital currencies (so-called “alt-coins”) such as zCash, Litecoin, and Monero that address use cases for which Bitcoin was not the best solution. Tokenization moved to computing when Ethereum was launched for decentralized computing through smart contracts, and markets such as Storj, Filecoin, and Sia came into being for decentralized storage, an early-stage challenge to centralized cloud providers such as Amazon. Blockstack wants to go further by providing an entire decentralized infrastructure for building the decentralized applications of the future, effectively replacing the current internet-based architecture. Not even current-day platforms are safe: OpenBazaar offers a decentralized e-commerce marketplace, and Lazooz is a protocol for real-time ride sharing. Most of the tokens these economies operate with are “utility tokens” that are used to access a service the economy provides, e.g., using decentralized storage or buying something in an e-commerce marketplace. Attention is now turning to “security/asset-backed tokens,” which represent ownership in a company through cash flows, equities, futures, etc., or hard assets such as commodities or real estate. Asset-backed tokenization is in full swing, targeting commodities, precious metals, and real estate (Digix and Goldmint for precious metals, D1 and Cedex for diamonds, etc.). Security token offerings, like those enabled by Polymath, will provide investors exposure to token economies.

This is Just the Beginning

Just as the Cambrian period enabled the creation of a multitude of new life forms, the emergence of token economies is opening up a wide range of previously unavailable markets as well as new ways to compete against entrenched incumbents. Sure, many of the new token economies coming into being today will die out, but many of the ones that survive will be epic. Designing the right economic model for tokens is crucial, and token economics provides the necessary groundwork and framework for devising such models. More study in token economics is needed, especially since it involves technically complex mechanisms and market designs that are entirely new. As the industry experiments with tokens across various markets, it will learn more about what is merely great economic theory versus what really works amid the complexity of real-world interactions.

Understanding The Building Blocks of a Distributed Ledger System

Introduction to DLTs

Distributed ledger technology (DLT) is being hailed as a transformative technology, with comparisons being drawn to the Internet in its potential to transform and disrupt industries. As a “platform” technology for decentralized, trust-based peer-to-peer computing, DLT helps shape new “domain” capabilities, just as computer networking enabled the Internet and the creation of capabilities across communication, collaboration, and commerce. Like the Internet, it will have far-reaching consequences for the enterprise architectures of the future. Not only will DLT transform the technology stack of established domains (witness how Blockchain is transforming identity management infrastructure in the enterprise), but it will also give rise to new architecture paradigms as computing moves to decentralized, trust-based networks, for example in how an enterprise interacts with its business partners, suppliers, and buyers. The Internet took 30 years to have disruptive effects in the enterprise, and DLT’s full impact is expected to play out over a similar time frame.

DLT represents a generic class of technologies (Blockchain is a prominent example), but all DLTs share the concept of the distributed ledger: a shared, immutable database that is the system of record for all transactions, current and historic, maintained by a community of participating nodes that have some sort of incentive (usually a token or cryptocurrency) to keep the ledger in good standing. The emergence of DLTs can be traced back to the original blockchain applications, Bitcoin and Ethereum. Various other distributed ledger applications have emerged to solve specific industry and domain issues: R3’s Corda in financial services, Ripple for payments, and so on. Innovation in the DLT space is proceeding at a feverish pace. The well-established DLT-based networks can essentially be segmented along two dimensions: how ledger integrity is guaranteed through validation, and whether the ledger is private or public.

DLT and Enterprise Architecture

As participants in DLT-based networks developed by industry utilities or consortiums, organizations may not have a strong need to master the internal architecture design and trade-offs associated with such a platform. However, the architecture community in those organizations will still be required to understand how the networks they participate in work, to the extent required to understand the implications for their organizations. Furthermore, as intra-company applications of DLT become mainstream, enterprise architects will increasingly be called upon to provide perspectives on the optimal design of the underlying technology. As DLT moves from innovation labs into the mainstream enterprise, architects will need to start preparing their organizations to accept DLT-based applications into the organizational landscape. A good place for enterprise architects to start is understanding just what the DLT technical architecture encompasses: what building blocks comprise a DLT system, and what architectural decisions need to be made.

The Building Blocks of a DLT System

To understand a complex technology such as DLT, it may be helpful to draw parallels to the TCP/IP stack for computer networking, to which Blockchain has been compared in the past (The Truth About Blockchain). While there may not be a strict one-to-one correspondence between the Internet’s OSI model and the DLT architecture, drawing the parallel helps one understand conceptually how the building blocks fit together. The OSI model is a generic architecture that represents the several flavors of networking that exist today, ranging from closed, proprietary networks to open, standards-based ones. The DLT building blocks likewise provide a generic architecture that represents the several flavors of DLTs that exist today, and ones yet to be born.

In theory, it should be possible to design each building block independently, with well-defined interfaces, so that the whole DLT system comes together as one, with higher-level building blocks abstracted from the lower-level ones. In reality, architectural choices in one building block influence those in others; e.g., the choice of a DLT’s data structure influences the consensus protocol most suitable for the system. As common industry standards for DLT architecture and design develop (Hyperledger is an early effort spearheaded by The Linux Foundation) and new technology is proved out in the marketplace, a more standardized DLT architecture stack will perhaps emerge, again following how computer networking standards emerged. There is value, nevertheless, in being able to conceptually view a DLT system as an assembly of these building blocks to understand the key architecture decisions that need to be made.

Key Architectural Tradeoffs in DLT Systems

Architecting a DLT system involves making a series of decisions and tradeoffs across key dimensions.  These decisions optimize the DLT for the specific business requirement: for some DLT applications, performance and scalability may be key, while for some others, ensuring fundamental DLT properties (e.g., immutability and transparency) may be paramount.   Inherent in these decisions are architectural tradeoffs, since the dimensions represent ideal states seldom realized in practice.  These tradeoffs essentially involve traversing the triple constraint of Decentralization, Scalability, and Security.

Decentralization reflects the fundamental egalitarian philosophy of the original Bitcoin/Blockchain vision, i.e., the distributed ledger should be accessible, available, and transparent to all at all times, and all participating nodes in the network should validate the ledger and thus hold the full ledger data. Decentralization enables trustless parties to participate in the network without the need for central authorization. Scalability refers to the goal of having an appropriate level of transaction throughput, storage capacity for the DLT to record transaction data, and latency for a transaction to be validated and recorded once it is submitted. Scalability ensures that appropriate performance levels are maintained as the size of the network grows. Finally, Security is the ability to maintain the integrity of the ledger by warding off attacks and making it impossible to maliciously change the ledger for one’s benefit. Fundamentally, this dimension reflects a security design built into the fabric of how the ledger operates, rather than relying on external ‘checking’ to ensure safety.

Bringing It Together: DLT Building Block Decisions and Architectural Tradeoffs

Applying the architectural decisions to the DLT system allows one to come up with different flavors of DLT systems, each making tradeoffs to navigate the triple constraint described above. Traversing the sides of the triangle allows one to move between different DLT architecture styles, with the vertices of the triangle denoting pure architectural states seldom realized in practice. For example, systems like Bitcoin and Ethereum tend toward Vertex A, maximizing Decentralization through their decentralized P2P trustless model, and Security through consensus-building and validation methods that prevent malicious attacks (although both Bitcoin and Ethereum have been shown to have other security vulnerabilities), but they sacrifice much in terms of Scalability (Bitcoin’s scalability woes are well known, and Ethereum is only slightly better). On the other hand, permissioned DLTs, such as Corda, tend toward Vertex C, maximizing Scalability and guaranteeing Security, but sacrificing Decentralization (by definition, permissioned DLTs are not transparent, since they restrict access and validation is performed only by a set of pre-authorized validating nodes); they may also suffer other security issues (both the trusted nodes and the central authority in a permissioned DLT system can be attacked by a nefarious party). DLT variations such as Bitcoin’s Lightning Network and Ethereum’s Raiden tend toward Vertex B, aiming to use off-chain capabilities to improve the Scalability of traditional Bitcoin and Ethereum networks while preserving Decentralization (despite some recent concerns that these networks tend to become centralized in the long run), although their off-chain capabilities may require additional Security capabilities (they also partially move away from the Blockchain’s decentralized security apparatus). Let’s examine how these tradeoffs come into play at the level of DLT building blocks.

Layer 3: Ledger Data Structure

Ledger Data Structure encapsulates decisions about how the distributed ledger is actually structured and linked at a physical level, e.g., as a chain of blocks, a graph, etc. Additionally, it captures decisions about how many ledger chains there are, and whether nodes carry the entire ledger or just part of it. In traditional Blockchain, the ledger is structured as a global, sequential linked list of blocks, instances of which are replicated across all participating nodes. This design goes hand in hand with traditional Blockchain’s Proof of Work consensus protocol in ensuring high levels of Decentralization and Security, since each node has the current instance of the global ledger chain and there is decentralized consensus building for block validation (although a few security vulnerabilities with Blockchain have come to the forefront, and Proof of Work is susceptible to centralization due to economies of scale in mining). As we know, this design takes a toll on Scalability: Bitcoin’s Blockchain can process only a handful of transactions per second, and the time required to process a block is high (Bitcoin generates a new block roughly every 10 minutes).
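
To make the “global sequential linked list of blocks” idea concrete, here is a minimal sketch (not any real client’s data structure) in which each block commits to the hash of its predecessor, so any edit to history invalidates every later link.

```python
# Minimal hash-linked chain: editing an old block breaks every later prev_hash.
import hashlib, json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"height": len(chain), "prev_hash": prev, "txs": transactions})
    return chain

def verify(chain):
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

chain = []
append_block(chain, ["alice->bob:5"])
append_block(chain, ["bob->carol:2"])
print(verify(chain))                    # True
chain[0]["txs"] = ["alice->bob:500"]    # tamper with history
print(verify(chain))                    # False: the chain no longer verifies
```

Because every node holds and checks such a chain, tampering is easy to detect, which is exactly where the Decentralization and Security benefits (and the Scalability costs) come from.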

Some new designs are emerging with alternative data structures that improve Scalability and performance, such as NXT’s and SPECTRE’s DAG (directed acyclic graph) of blocks, which mine DAG blocks in parallel to allow for more throughput and lower transaction times, and IOTA’s Tangle, a so-called “blockless” DLT that gets rid of block mining altogether and relies on a DAG of transactions to maintain system state and integrity. These new designs have yet to be implemented and used at scale, and many of them have their own set of challenges (some claim they will continue to rely on some form of centralization to gain scale, and they also have security-related challenges). However, the DLT community’s interest has been high: IOTA’s Tangle has been creating a buzz in DLT circles as a possible serious contender in the IoT world (since its data structure and protocol are well suited to handling continual streams of data), and several blockless DLT startups have been born lately.

Tinkering with how ledger data is stored across nodes represents another opportunity to gain Scalability. For example, sharding, a concept fairly well established in the distributed database world, is coming to DLTs. Applied to DLTs, sharding enables the overall Blockchain state to be split into shards that are then stored and processed by different nodes in the network in parallel, allowing higher transaction throughput (Ethereum’s roadmap pairs Casper with sharding to drive scalability and speed). Similarly, Scalability can be improved by having multiple chains, possibly private, to enable separation of concerns: “side chains” enable processing to happen on a separate chain without overloading the original main chain. While such designs improve Scalability, they move away from DLT’s vision of enabling democratic access and availability to all participants at all times, and they also present Security-related challenges, part of the reason why widespread adoption of sidechains has been slow.
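
A rough sketch of the sharding idea follows: transactions are routed to shards by hashing an account identifier, so different groups of nodes can validate different shards in parallel. The shard count and routing rule are illustrative assumptions; real proposals also have to handle cross-shard transactions, which is where much of the difficulty lies.

```python
# Illustrative shard routing: partition transaction load by hashing the sender.
import hashlib

NUM_SHARDS = 4

def shard_for(account: str) -> int:
    digest = hashlib.sha256(account.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

txs = [("alice", "bob", 5), ("carol", "dave", 2), ("erin", "frank", 9)]
by_shard = {s: [] for s in range(NUM_SHARDS)}
for sender, receiver, amount in txs:
    by_shard[shard_for(sender)].append((sender, receiver, amount))

for shard, batch in by_shard.items():
    # each batch could be validated by a different group of nodes in parallel
    print(f"shard {shard} processes {batch}")
```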

Layer 2: Consensus Protocol

The consensus protocol determines how transactions are validated and added to the ledger, and decision-making in this building block involves choosing a specific protocol based on the underlying data structure and the objectives related to the triple constraint. Proof of Work, the traditional Blockchain consensus protocol, requires transactions to be validated by all participating nodes, and enables a high degree of Decentralization and Security, but suffers on Scalability. Alternative protocols, such as Proof of Stake, provide somewhat better Scalability by changing the incentive mechanism to align more closely with the good operation of the ledger. Protocols based on Byzantine Fault Tolerance (BFT), which have been successfully applied to other distributed systems, are applicable to private ledgers and depend upon a collection of pre-trusted nodes. Such protocols sacrifice Decentralization to gain Scalability.
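
To make the Proof of Work tradeoff concrete, here is a toy mining loop: finding a nonce whose hash has a given number of leading zero hex digits is expensive, while checking a claimed nonce takes a single hash. This is only a sketch of the principle, not Bitcoin’s actual difficulty rule or header format.

```python
# Toy Proof-of-Work: expensive to produce, cheap to verify.
import hashlib

def mine(block_header: str, difficulty: int = 4) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_header}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):   # ~16**difficulty attempts on average
            return nonce
        nonce += 1

header = "height=1|prev=00ab...|txroot=9f3c..."   # illustrative header contents
nonce = mine(header)
# any node can verify the work with one hash:
print(nonce, hashlib.sha256(f"{header}:{nonce}".encode()).hexdigest())
```

The asymmetry between mining and verification is what secures the ledger, and it is also why throughput suffers, which is the motivation for the alternatives above.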

Ethereum’s Raiden and Bitcoin’s Lightning Network are innovations that bring scalability to Ethereum and Bitcoin, respectively, by securely moving transactions off the main chain to a separate transacting channel and then moving back to the main chain for settlement purposes – the so-called “Layer 2” innovations. This design allows load to move off the main ledger; however, since transactions occurring on the channel are not recorded on the ledger, it sacrifices some Security (the transacting channels need additional security apparatus that is not part of the original chain) as well as Decentralization (since channel transactions are not accessible to all participants).
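
The payment-channel idea behind these Layer 2 designs can be sketched as follows: two parties lock funds on-chain, exchange any number of balance updates privately off-chain, and settle only the final balances on the main chain. This toy class ignores the signatures and dispute mechanisms that the real Lightning and Raiden protocols depend on.

```python
# Sketch of a two-party payment channel: many off-chain updates, one on-chain settlement.

class Channel:
    def __init__(self, deposit_a, deposit_b):
        self.balances = {"A": deposit_a, "B": deposit_b}   # funded by an on-chain deposit
        self.updates = 0

    def pay(self, frm, to, amount):
        assert self.balances[frm] >= amount
        self.balances[frm] -= amount                       # off-chain: nothing hits the ledger
        self.balances[to] += amount
        self.updates += 1

    def close(self):
        return dict(self.balances)                         # only this net result goes on-chain

ch = Channel(deposit_a=1000, deposit_b=1000)
for _ in range(600):
    ch.pay("A", "B", 1)
print(ch.updates, "off-chain updates settle as", ch.close())
```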

A number of other protocols and schemes to improve scalability and security are in the works, many of them variations of basic PoW and PoS, and many envisioning a future comprising not one single ledger chain but a collection of chains. Examples include Kadena, which uses PoW on a braid of chains; EOS, which uses delegated PoS; and Cosmos’s Tendermint, which uses BFT-based PoS across a universe of chains.

Layer 1:  Computation and App Data

DLT resources such as storage and computation come at a premium, and it costs real money to submit transactions in a DLT system. In the topmost layer, therefore, the architectural decisions deal with providing flexibility and functionality related to data storage and computation – essentially, how much of it should reside on-chain, and how much off-chain. Additionally, this layer deals with decisions about how to integrate the DLT with events from the real world.

For computation, the Bitcoin Blockchain and Ethereum provide constructs for putting data and business logic on-chain, and Ethereum is far more advanced than Bitcoin in this respect since it offers “smart contracts,” essentially code that is executed on the chain when certain conditions are met. There are obvious advantages to doing all computation on-chain: interoperability between parties and immutability of code, which facilitates trust building. There is, however, a practical limit to how complex smart contracts can be, and it is easily reached. Offloading complex calculation to off-chain capabilities allows one to leverage DLT capabilities in a cost-effective and high-performing manner. TrueBit, an online marketplace for computation, enables a pattern in which complex, resource-intensive computation is offloaded to a community of miners who compete to complete the computation for a reward and provide results that can be verified on-chain for authenticity. While this provides upside in terms of Scalability and Decentralization, there are Security-related implications of using off-chain computation, an area of active research and development.
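
The pattern described here, heavy work off-chain and cheap verification on-chain, can be illustrated with a deliberately simple example: sorting is the expensive off-chain job, while checking a claimed result needs only a linear scan. TrueBit’s actual verification game is far more elaborate; this sketch shows only the verify-cheaper-than-compute principle.

```python
# Off-chain solve, cheap on-chain verify: the check is much cheaper than the work.
from collections import Counter

def solve_off_chain(numbers):
    return sorted(numbers)                      # heavy work done by an off-chain solver

def verify_on_chain(numbers, claimed):
    in_order = all(a <= b for a, b in zip(claimed, claimed[1:]))   # O(n) check
    same_items = Counter(numbers) == Counter(claimed)              # O(n) check
    return in_order and same_items

data = [5, 3, 9, 1, 7]
result = solve_off_chain(data)
print(verify_on_chain(data, result))        # True: solver's result is accepted
print(verify_on_chain(data, [1, 3, 5]))     # False: bogus result is rejected/challenged
```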

What applies to computation also applies to data storage in the DLT world. While Blockchain and Ethereum provide basic capabilities for storing data elements, a more suitable design for managing large data sets in DLT transactions is to use off-chain data infrastructure or cloud storage providers while maintaining hashed pointers to these data sets on-chain. Solutions like Storj, Sia, and IPFS aim to provide a P2P decentralized, secure data management infrastructure that can hook into DLTs through tokens and smart contracts, and manage data and computation securely through technologies such as secure MPC (multi-party computation). Similar to off-chain computation, off-chain storage has upside in terms of Scalability and Decentralization; however, there are security- and durability-related implications.
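
The hashed-pointer pattern is simple enough to sketch directly: the bulky data lives with an off-chain storage provider, while only a small content hash is recorded on the ledger, so anyone can later check that the retrieved data matches what was committed to. The names below are illustrative.

```python
# Off-chain data, on-chain hash pointer: integrity is verifiable, storage stays cheap.
import hashlib

def commit(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()          # this digest is what goes on-chain

def verify(data: bytes, on_chain_hash: str) -> bool:
    return hashlib.sha256(data).hexdigest() == on_chain_hash

document = b"large dataset stored off-chain (IPFS, Storj, cloud, ...)"
pointer = commit(document)
print(verify(document, pointer))                      # True: data matches the pointer
print(verify(document + b" tampered", pointer))       # False: tampering is detectable
```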

What provides immutability to the distributed ledger (its deterministic method of recording transactions) is also its Achilles’ heel: it is difficult for the ledger to communicate with and interpret data it gets from the outside, non-deterministic world. Oracles, services that act as middlemen between the distributed ledger and the non-DLT world, bridge that gap and make it possible for smart contracts to be put to real-world use. Various DLT oracle infrastructures are in development – ChainLink, Zap, Oraclize, etc. – each providing varying features; choosing the right oracle architecture is thus crucial for the specific use case under consideration. Similar to off-chain data, oracles provide upside in terms of Scalability and Decentralization; however, there are security- and data-verifiability-related concerns.
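
The oracle pattern can be sketched as an attestation check: an off-chain service reports an external fact and authenticates it, and the contract accepts only data carrying a valid authentication from a registered oracle key. HMAC with a shared key stands in here for the digital signatures a real oracle network would use; the payload and key are illustrative.

```python
# Sketch of the oracle pattern: the "contract" only trusts attested external data.
import hmac, hashlib, json, time

ORACLE_KEY = b"key-registered-with-the-contract"     # illustrative assumption

def oracle_report(payload: dict) -> dict:
    body = json.dumps(payload, sort_keys=True).encode()
    return {"payload": payload,
            "sig": hmac.new(ORACLE_KEY, body, hashlib.sha256).hexdigest()}

def contract_accepts(report: dict) -> bool:
    body = json.dumps(report["payload"], sort_keys=True).encode()
    expected = hmac.new(ORACLE_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, report["sig"])

report = oracle_report({"pair": "ETH/USD", "price": 1840.25, "ts": int(time.time())})
print(contract_accepts(report))              # True: attested by the registered oracle
report["payload"]["price"] = 99999.0
print(contract_accepts(report))              # False: tampered data is rejected
```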

Conclusion

These are still early days for DLT, and many of the improvements needed to make it commercially implementable are yet to come. Beyond scalability and security, DLTs face a number of hurdles to enterprise adoption, such as interoperability, complexity, and a lack of developer-friendly toolkits. The future will probably hold not just one ledger technology here or there, but a multitude, each optimized for a specific use case within an organization, and even superstructures such as chains of chains connected with oracles, middleware, and the like. Nor will these structures replace existing technology architecture; they will exist alongside legacy technologies and will need to be integrated with them. Like networking, DLTs will give rise to new processes, teams, and management structures. Enterprise architects will play a central role in facilitating the development of DLT as a true enterprise technology.

The Five D’s of Fintech: Introduction

“Fintech” (a portmanteau of “financial technology” that refers to the disruptive application of technology to processes, products, and business models in the financial services industry) is coming of age: two of the most prominent Fintechers, OnDeck and Lending Club, have gone public, many more are processing transactions on the order of billions of dollars, and outfits providing market intelligence on fintech are cropping up – there is even a newly minted index to track activity in marketplace lending. Banks are increasingly taking note of the Fintech movement, partnering with startups, investing in them, or even acquiring them outright. Venture funding in fintech grew by 300% in one year, to $12 billion in 2014. According to Goldman Sachs’s “Future of Finance” report, the total value of the market that can potentially be disrupted by Fintechers is an estimated $4.3 trillion.

Fintech is a complex market, spanning a broad swath of finance across individual and institutional markets and including market infrastructure providers as well. It is a broadly defined category for upstarts who have a different philosophy around how finance should function and how it should serve individuals and institutions. While some Fintechers seek to reduce transaction fees and improve customer experience, others exist to provide more visibility into the inner working of finance. In spite of this diversity, there are some common threads and recurring themes around why Fintech firms exist and what their market philosophy is. The 5 D’s of Fintech – Democratization, Disaggregation, Disintermediation, Decentralization and De-biasing – represent common themes around the mission, business models, values, and goals of many of these firms. In this series of posts on Fintech, we will look at each of the 5 D’s of Fintech, starting with Democratization — the mission of many a Fintech firm.

The Five D’s of Fintech

Democratization

Technology has long enabled democratized access to financial services; Fintech, however, is taking the movement to another level by targeting specific market niches with customized value propositions. A central appeal of many Fintechers is their promise to bring to the masses resources and capabilities that have heretofore been the preserve of the wealthy, the elite, or the privileged. This has been made possible both by market opportunity and by internal capability: the opportunity of serving a market whitespace, and the ability to do so economically through the use of data and advanced technologies.

The financial inclusion that Fintechers are now enabling is driven by their ability to clear obstacles, remove barriers, and enable access where none existed before: serving the unserved or underserved SMBs that have typically been shunned by traditional banks (Funding Circle), providing credit to the underbanked segment lacking traditional credit scores (Kreditech), enabling investment advice without the need to rely on expensive financial advisors (Nutmeg or Betterment), or facilitating access to the capital markets by offering low-cost brokerage services (Robinhood). Financial services are now “for the people” and “by the people” as well: Quantiacs, a fintech startup aiming to revolutionize the hedge fund industry, is essentially a marketplace for quantitative trading strategies that enables anyone to market their quantitative skills and trading strategies. Or OpenFolio, an online community that allows users to link their portfolios and measure investment performance against their communities and relevant benchmarks. Wealth management is perhaps the market ripest for democratization, as shown by the rapid emergence of a raft of outfits such as HedgeCoVest and iBillionaire (platforms that allow investors to mirror the trades of hedge funds and billionaires, respectively), Loyal3 (which offers no-fee access to IPOs), and Algomi and True Potential (which remove trading obstacles for investors).

As Vikas Raj of Accion Venture Lab notes, the real potential of fintech lies in democratizing access to finance for the billions of low-income unbanked people in emerging markets. The high-complexity, low-scale nature of this market is exactly the kind Fintechers are good at capitalizing on, and this is evident from the long list of companies emerging in this market beyond Silicon Valley and New York. Where traditional finance and government agencies have failed, Fintech has the promise and the potential to excel.

Other industries can learn a lot by observing how Fintech is driving democratization in finance. Whether it is healthcare, education, media, or government services, there is potential value in currently un- or under-served market segments that a Fintech-like movement can unlock. Adopting the technologies underlying Fintech is part of the story; what is needed first is the recognition of the potential for change, support from the markets, and an entrepreneurial spirit to lead the movement.

A New Kid on the Blockchain

Fintech, the application of information technology to the world of finance, is the topic of The Economist’s latest special report on banking (Special Report on International Banking). Bitcoin was featured in one of the articles, but this time the focus is not on bitcoin the currency per se, but on the blockchain, Bitcoin’s underlying protocol, which enables distributed ledger management using cryptography and powerful computers spread across the world’s data centers.

The blockchain, since its invention by Satoshi Nakamoto (the pseudonymous inventor behind Bitcoin and its protocol), has taken the world of fintech by storm. The blockchain is being touted as the next big thing, not unlike the Internet and its underlying communication protocols, with the potential to revolutionize everything from money transfer to real estate transactions and the internet of things. Blockchain, as a concept, is being bastardized to serve multiple applications, including communication, agreements, asset transfers, record tracking, etc. Numerous startups are cropping up to provide value-added services on top of the original Bitcoin blockchain, such as CoinSpark, an Israeli startup that has devised a technology to add information and metadata to the blockchain, one application of which is providing “notary” services for agreements and documents recorded on the blockchain. There are other outfits, however, that are fundamentally trying to re-architect the original blockchain to make it better or to make it work for specific purposes.

Colored Coins, for instance, enables the storage and transaction of “smart property” on top of the blockchain. Smart property is property whose ownership is controlled via the blockchain using “smart contracts,” which are contracts enforced by computer algorithms that can automatically execute the stipulations of an agreement once predetermined conditions are activated. Examples of smart property could include stocks, bonds, houses, cars, boats, and commodities. By harnessing blockchain technology as both a ledger and trading instrument, the Colored Coins protocol functions as a distributed asset management platform, facilitating issuance across different asset categories by individuals as well as businesses. This could have a significant impact on the global economy as the technology permits property ownership to be transferred in a safe, quick, and transparent manner without an intermediary. Visionaries see many other exciting opportunities too, including linking telecommunications with blockchain technology. This could, for example, provide car-leasing companies the ability to automatically deactivate the digital keys needed to operate a leased vehicle if a loan payment is missed.
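
The car-leasing example can be sketched in ordinary code to show the intent of such a contract, though a real smart contract would run on-chain rather than in Python; the class, field names, and payment rule below are all illustrative assumptions.

```python
# Toy "smart property" rule: the digital key works only while the lease is paid up.
from datetime import date

class LeasedCar:
    def __init__(self, monthly_payment):
        self.monthly_payment = monthly_payment
        self.paid_through = None                 # (year, month) of the last payment received

    def record_payment(self, year, month):
        self.paid_through = (year, month)

    def key_active(self, today: date) -> bool:
        # the contract clause: key is active only if the current month has been paid
        return self.paid_through == (today.year, today.month)

car = LeasedCar(monthly_payment=400)
car.record_payment(2015, 4)
print(car.key_active(date(2015, 4, 20)))   # True: lease is current, key works
print(car.key_active(date(2015, 5, 3)))    # False: payment missed, key deactivated
```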

Ethereum is another outfit that has created technology for developing blockchain-based smart contracts. Ethereum – an open-source development project that provides a platform for developers to create and publish next-generation distributed applications – uses blockchain technology to facilitate the trading of binding smart contracts that can act as a substitute for conventional business documents. The technology allows the contracts to be traced and used to confirm business deals without the need to turn to the legal system. Then there are outfits such as Ripple Labs that are devising their own blockchain-like protocols to facilitate quick and secure money transfer.

Other blockchain innovation involves combining blockchain technology with conventional technologies. IBM and Samsung are developing a blockchain-powered backbone for Internet of Things products called ADEPT, which combines three protocols: BitTorrent (file sharing), Ethereum (smart contracts), and TeleHash (peer-to-peer messaging). ADEPT is a blockchain-powered secure communication and transaction protocol for devices. When a consumer buys a washing machine, for example, ADEPT will allow it to be automatically registered in the home network of things, not just sending messages to and receiving messages from other registered devices, but also automatically initiating and fulfilling transactions on its own, say replenishing the washing powder by placing an order with the local grocery store.

These innovations are at the leading edge of blockchain technology, and it will be several years before their use becomes widespread, if it ever does. In the meantime, more mundane applications of the blockchain have great potential to flourish. Future fintech entrepreneurs should not discount the blockchain as the grounds of their creative pursuits. All that is needed is a “killer app” that niftily applies the concept to solve a present-day problem. Just as Marc Andreessen’s Netscape Navigator unleashed a wave of innovation in the history of the Internet, so too will a blockchain startup in the world of distributed ledgers and asset registers.

The DevOps Movement

The DevOps movement has been resurgent in the past few years as companies look to improve their delivery capabilities to meet rapidly shifting market needs and business priorities. Many have been preaching that companies should become not just robust and agile, but in fact “antifragile,” with the ability to expect failures and adapt to them. The likes of Google, Amazon, and Netflix embody this agile and antifragile philosophy, and traditional businesses facing increasingly uncertain and competitive markets want to borrow a chapter from their books; DevOps is high on their list as a means to achieve that.

DevOps is a loose constellation of philosophies, approaches, work practices, technologies, and tactics to enable antifragility in the development and delivery of software and business systems. In the DevOps world, traditional software development and delivery, with its craft and cottage-industry approaches, is turned on its head. Software development is fraught with inherent risks and challenges, which DevOps confronts and embraces. The concept seems exciting, a lot of companies are talking about it, some claim to do it, but nobody really understands how to do it!

Much of the available literature on DevOps talks about everything being continuous in the DevOps world: Continuous Integration, Continuous Delivery, and Continuous Feedback. Not only does this literature fail to address how the concept translates into reality, but it also takes an overly simplistic view of the change involved: use Chef to automate your deployment, or use a Jenkins continuous integration server to do “continuous integration.” To be fair, the concept of DevOps is still evolving. However, much can be done to educate the common folk on the conceptual underpinnings of DevOps before jumping to the more mundane and mechanistic aspects.

DevOps is much more a methodology, process, and cultural change than anything else. The concept borrows heavily from existing manufacturing methodologies and practices such as Lean and Kanban, and extends existing thinking around lean software development to the enterprise. Whereas the traditional software development approach is based on a “push” model, DevOps focuses on building a continuous delivery pipeline in which things are “pulled” actively by different teams as required to keep the pipeline going at all times. It takes agile development and delivery methodologies such as Scrum and XP and extends them into operations, enabling not just agile development but agile delivery as well. And it attempts to turn the frequently cantankerous relationship between the traditionally separated groups of development and operations into a synergistic, mutually supportive one. Even within the development sphere, DevOps aims to bring various players together, including development, testing and QA, and build management, by encouraging teams to take on responsibilities beyond their immediate role (e.g., development taking on more of testing) and empowering traditionally relegated roles to positions of influence (e.g., the build manager taking developers to task for fixing broken builds).

We are still in the early days of the DevOps movement, and until we see real-life references and case studies of how DevOps has been implemented end-to-end, learning about DevOps will remain a bit of an academic exercise. Having said that, some literature does come close to articulating what it means to put into practice concepts such as Continuous Delivery and Continuous Integration. To the curious, I would recommend the Martin Fowler Signature Series of books on the two topics. Although agonizingly technical, the two books do a good job of getting down to brass tacks. My future posts on DevOps will attempt to synthesize some of the teachings from those books into management summaries.

Econinformatics?

For most business executives, the term “economics” conjures images of either the simplistic supply-demand graphs they may have come across in Economics 101, or theoreticians devising arcane macroeconomic models to study the impact of interest rates and money supply on national economies. Although businesses such as banks and financial institutions have maintained armies of economists on their payrolls, the economist’s stature and standing even in such institutions has been relatively limited, confined to providing advice on general market trends and developments, as opposed to actionable recommendations directly impacting the business bottom line. Even the most successful business executive would be stumped when asked how exactly economics is applied to improving their day-to-day business. All this may now be changing, thanks to a more front-and-center role of economics in new-age businesses that routinely employ economists to sift through all kinds of data to fine-tune their product offerings, pricing, and other business strategies. The textbook economists of yore are descending from their ivory towers and taking on a new role, one that is increasingly being shaped by the availability of new analytic tools and raw market data.

Economists, especially the macro kind, are a dispraised bunch, with a large part of the criticism stemming from their inability to predict major economic events (economists famously missed anticipating the 2008 market crash). For this and other reasons (not least the Lucas Critique), macroeconomic modeling focused on building large-scale econometric models has been losing its allure for some time. Microeconomic modeling, enabled by powerful data-driven microeconometric models focused on individual entities, has been transforming and expanding over the past few decades. The ever-expanding use of sophisticated micro-models on large data sets has led some to see this as laying the foundation for “real-time econometrics.” Econometrics, the interdisciplinary study of empirical economics combining economics, statistics, and computer science, has continually evolved over the past several decades thanks to advances in computing and statistics, and is yet again ready for disruption – this time due to the availability of massive data sets and easy-to-procure computing power to run econometric analyses. The likes of Google, Yahoo, and Facebook are already applying advanced microeconometric models to understand the causal statistics surrounding advertising, displays, and impressions and their impact on key business variables such as clicks and searches. Applied econometrics is but one feather in the modern economist’s cap: economists are also at the forefront of the sharing economy and “market design.”

A celebrated area of economic modeling and research that has found successful application in business is “market design” and “matching theory,” pioneered by Nobel prize-winning economists Al Roth and Lloyd Shapley. Market design and matching theory are concerned with optimizing the pairing of suppliers or providers with consumers in a marketplace based on “fit” that is driven by dimensions beyond just price. Al Roth successfully applied game-theory-based market design and matching algorithms to improve a number of marketplaces, including the placement of New York City’s high school students, the matching of medical students with residency programs, and kidney donation programs. The fundamentals of matching theory are widely applied by economists today: many modern online markets and sharing platforms such as eBay and Lyft are in the business of matching suppliers or providers with consumers, and economists employed by these outfits have successfully applied those fundamentals to improving their businesses, increasingly with the aid of multi-dimensional data available in real time. Other marketplaces, including LinkedIn (workers and employers) and Accretive Health (doctors and patients), have applied similar learnings to improve their matching quality and effectiveness. Airbnb economists analyzed data to figure out why certain hosts were more successful than others in sharing their space with guests, and successfully applied their learnings to help struggling hosts and to better balance supply and demand in many of Airbnb’s markets (their analysis pointed out that successful hosts shared high-quality pictures of their homes, which led Airbnb to offer a complimentary photography service to its hosts).
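
At the heart of this matching work is the deferred-acceptance algorithm of Gale and Shapley, which the sketch below illustrates for simple one-to-one matching: applicants propose in order of preference, and programs tentatively hold their best offer so far. The preference lists are made up, and real deployments (residency matching, school choice) handle capacities and ties that this toy version ignores.

```python
# Minimal deferred-acceptance (Gale-Shapley) sketch: applicants propose, programs hold.

def deferred_acceptance(applicant_prefs, program_prefs):
    rank = {p: {a: i for i, a in enumerate(prefs)} for p, prefs in program_prefs.items()}
    free = list(applicant_prefs)
    next_choice = {a: 0 for a in applicant_prefs}
    held = {}                                    # program -> applicant it tentatively holds
    while free:
        a = free.pop()
        p = applicant_prefs[a][next_choice[a]]   # a proposes to its best remaining choice
        next_choice[a] += 1
        current = held.get(p)
        if current is None:
            held[p] = a                          # program holds its first proposer
        elif rank[p][a] < rank[p][current]:
            held[p] = a                          # program trades up; previous holder re-enters
            free.append(current)
        else:
            free.append(a)                       # rejected; a will try its next choice
    return {a: p for p, a in held.items()}

applicants = {"ann": ["city", "mercy", "general"],
              "bob": ["city", "general", "mercy"],
              "cal": ["mercy", "city", "general"]}
programs   = {"city": ["bob", "ann", "cal"],
              "mercy": ["ann", "cal", "bob"],
              "general": ["cal", "bob", "ann"]}
print(deferred_acceptance(applicants, programs))   # applicant -> program
```

The resulting match is stable: no applicant and program would both prefer each other over their assigned partners, which is the property that makes the mechanism attractive in practice.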

Beyond market design, economics research is changing in a number of areas thanks to the availability of large data sets and analytic tools, as Liran Einav and Jonathan Levin of Stanford University outline in “The Data Revolution and Economic Analysis.” One such area is measurement of the state of the economy and economic activity, and the generation of economic statistics to inform policy making. The issue with macroeconomic measurement is that the raw data produced by the official statistical agencies comes with a lag and is subject to revision. Gross domestic product (GDP), for example, is a quarterly series that is published with a two-month lag and revised over the next four years. Contrast this with the ability to collect economic data in real time, as is being done by the Billion Prices Project, which collects vast amounts of retail transaction data in near real time to develop a retail price inflation index. What’s more, new data sets may allow economists to shine a light on places of economic activity that have heretofore been dark. Small businesses’ contribution to national economic output, for example, is routinely underestimated because certain businesses are excluded. Companies such as Intuit, which does business with many small outfits, now have payroll transaction data that can potentially be analyzed to gauge the economic contribution of such small businesses. Moody’s Analytics has partnered with ADP, the payroll software and services vendor, to enhance official private-sector employment statistics using ADP’s payroll data.
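
The real-time index idea can be illustrated with a minimal calculation: given prices scraped daily for a fixed basket of items, compute an index relative to a base day. The numbers and the equal weighting are illustrative assumptions; real indices such as the Billion Prices Project’s use careful sampling and weighting schemes.

```python
# Toy daily price index for a fixed basket, relative to a base day (= 100).

daily_prices = {                         # item -> list of daily prices (made-up numbers)
    "milk":  [3.10, 3.12, 3.15, 3.20],
    "bread": [2.50, 2.50, 2.55, 2.60],
    "eggs":  [2.00, 2.05, 2.05, 2.10],
}

def price_index(prices, day, base_day=0):
    # equally weighted average of each item's price relative to the base day
    ratios = [series[day] / series[base_day] for series in prices.values()]
    return 100.0 * sum(ratios) / len(ratios)

for day in range(4):
    print(f"day {day}: index = {price_index(daily_prices, day):.2f}")
```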

Conservatives and the old guard may downplay the role of data in applied economics, reveling in their grand macroeconomic models and theories. To be fair, empirical modeling would be lost without theory. However, data’s “invisible hand” in shaping today’s online markets and business models is perceptible, if not openly visible. Economists of all stripes would be well advised to pay attention to the increasing role of data in their field. Next time you see an economist, pass on Google Chief Economist Hal Varian’s counsel and ask them to go take a course on Machine Learning in the Computer Science department – it will be time well spent.

The Promise of Geospatial and Satellite Data

Stories of traders vying to compete and finding new ways to beat the market and eke out profits are nothing new. What has been interesting of late is the lengths to which trading houses can now take the competition, thanks to real-time data analytics supporting trading indicators and signals, purveyors of which include companies like Genscape and Orbital Insight. Orbital Insight, a startup specializing in analytics and real-time intelligence solutions, was featured in a recent Wall Street Journal write-up (Startups Mine Market-Moving Data From Fields, Parking Lots, WSJ, Nov 20, 2014). Genscape, a more established player, employs sophisticated surveillance and data-crunching technology to supply traders with nonpublic information about topics including oil supplies, electric-power production, retail traffic, and crop yields. Genscape and Orbital are but two players in a broad, developing market of “situational intelligence” solutions that provide the infrastructure and the intelligence for rapid, real-time, data-driven decision-making. These two companies are particularly interesting, however, because they provide a view into the promise of geospatial and satellite imagery data and how it can be exploited to disrupt traditional operational and tactical decision-making processes.

Geospatial data is simply data about things and related events indexed in three-dimensional geographic space on Earth (with temporal data collected as well for events taking place over time). Geospatial data sources fall into two types: GPS data gathered through satellites and ground-based navigation systems, and remote sensing data collected and transmitted in digital form by specialized devices (sensors, radars, and drones fall in this type). Geospatial data is of interest to private corporations and public entities alike. When triangulated with traditional data sources, personal data, and social media feeds, it can provide valuable insight into real-time sales and logistics activities, enabling real-time optimization. On the public side, geospatial data can provide valuable information for detecting and tracking epidemics, following the migration of refugees in a conflict zone, or gathering intelligence of geopolitical significance. These are but a handful of the use cases that such data makes possible.

Once the preserve of secretive governments and intelligence agencies worldwide, geospatial and satellite imagery data is slowly but surely entering the commercial and public domains, spawning an entire industry that ranges from outfits that build and manage the satellite and sensor infrastructure, to manufacturers and suppliers of the parts and components that make up the satellites, and not least entities such as Orbital Insight that add value to the raw data by providing real-time actionable information to businesses. Orbital Insight, for example, applies sophisticated machine learning algorithms and analysis to huge volumes of satellite imagery made available by DigitalGlobe’s Geospatial Big Data™ platform, allowing accurate, verifiable information to be extracted. Outfits such as DigitalGlobe, Planet Labs, and Blackbridge Geomatics are examples of companies investing to launch and manage the satellite and sensor infrastructure that collects detailed real-time geospatial data. Google, not to be left behind in the space race, jumped into the market with its acquisition of Skybox Imaging earlier this year. Skybox intends to build a constellation of twenty-four satellites that will collect anything and everything across the globe. What’s more, Skybox, unlike companies such as DigitalGlobe, intends to make all the data collected through its satellite constellation available for public and commercial use. But even companies such as Skybox are not alone in blazing the trail in the satellite business – numerous other start-ups are vying to put into orbit low-cost, disposable nano-satellites that will be much smaller and cheaper to launch and manage. These developments will only create and open up an even wider range of applications for private and public use than has been possible heretofore.

These are still very early days for the commercial application of geospatial and satellite imagery data, and exciting developments are still ahead of us. For one, the number and kinds of data sources that such applications may need to handle in the future will be exponentially greater: imagine a fleet of satellites, aerial drones, quadcopters, and ground-based sensors all providing various kinds of data that could potentially be collated and fused together. So too will we need new algorithms and new ways of storing and manipulating streaming data at mind-boggling scales, all of which may require a level of thinking beyond what we currently have.

What is a “Container” Anyway?

After reading a recent announcement by Google about enhancing Google App Engine to support Docker, a popular containerization technology, I started wondering exactly what containers were and why they were taking the world of computing by storm. The hallowed container is being touted as the next king of the virtualization world, getting ready to displace the mighty virtual machine (VM).

Containers are basically lightweight VMs that provide VM-like functionality, with some constraints and conditions, but do so in a manner that is more efficient than a VM. A traditional VM is supported by a hypervisor, a software layer that abstracts the full underlying physical hardware, which is a key reason why traditional VM infrastructure is resource intensive. The result is that only so many VMs can run at a given time on a hardware platform with finite resources (such as CPU and memory). Where containers prove to be clever is that they are supported by a container engine that abstracts only bits and pieces of the underlying platform, relying on the operating system for the other required functionality. This is a double-edged sword, however. The simplicity and light weight of containers do mean that many more containers (and therefore application workloads) can run on a piece of hardware than traditional VMs can. But because of the tight coupling between the container and the native operating system, all containers running on a given piece of hardware are forced to share the same operating system. Containers therefore do not allow operating system multiplicity, i.e., it is not possible to run application workloads on different operating systems on the same physical platform.

Containers in computing are not unlike the containers of the physical shipping and transportation world. Before containers arrived on the scene, the transportation and trade of goods was handled manually: longshoremen moved break-bulk cargo in and out of a ship’s hold, playing a game of maritime Tetris. The manual nature of loading and packing severely limited the efficiency and speed of transportation and trade. This situation of yore is akin to developing, deploying, and managing applications in a purely physical environment, where application code had to be painstakingly tested for functionality and performance on a raft of major platforms and their operating system variants. To be clear, trade and transportation were not all manual before standard shipping containers came on the scene: containerization had begun in bits and pieces as early as the 1800s. That containerization was, however, heavily fragmented, with each trading company devising its own practices, standards, and supporting infrastructure for loading, unloading, and securing containers. It enabled them to achieve efficiencies for the specific types of cargo and the specific modes (rail, road, or sea) they dealt with. This efficiency, however, came at a cost: the cost of having to maintain and manage their own infrastructure, practices, and standards for loading/unloading, packing/unpacking, and transporting. This is not unlike the situation we have today with VMs: although there is the flexibility of running workloads on different operating systems, there is the overhead of managing all the supporting infrastructure to enable that flexibility.

With the emergence of containers (the computing kind), we are moving into the realm of standardization, just as the physical world did when Malcolm McLean, the American trucking magnate, devised his revolutionary intermodal shipping container, which put the industry on a common set of standards and practices for the intermodal transfer and handling of goods.  Companies such as Docker are trying to do what McLean did: standardize and streamline the way workloads are encapsulated, ported, and run on a set of commonly used operating systems.  McLean’s intermodal container transformed the shipping industry.  It remains to be seen whether the likes of Docker can replicate success on that scale in the computing world.

Data Platforms Set Off a Cambrian Explosion

The Jan. 18, 2014, edition of The Economist featured “A Cambrian Moment,” a special report on tech start-ups. The report discusses how the digital marketplace is experiencing a Cambrian explosion of products and services brought to market by an ever-increasing plethora of new start-ups. Precipitating this explosion is the modern “digital platform” – an assembly of basic building blocks of open-source software, cloud computing, and social networks that is revolutionizing the IT industry. This digital platform has enabled technology start-ups to rapidly design and bring to market a raft of new products and services.


A similar evolution is taking shape in the world of data. Annabelle Gawer, a researcher at Imperial College Business School, has argued that “platforms” are a common feature of highly evolved complex systems, whether economic or biological. The emergence of such platforms is the ultimate result of evolving exogenous conditions that force a recombination and rearrangement of the building blocks of such systems. In the data world, this rearrangement is taking place thanks to the falling cost of information processing, the standardization of data formats, and the maturation of large-scale connectivity protocols. The emergence of such data platforms has significant implications for a range of industries, not least data-intensive industries like healthcare and financial services. Like the digital platform, the data platform will give rise to an explosion of data-enabled services and products.

Early evidence already can be seen in the pharmaceutical and drug manufacturing industries. Drug manufacturers need thousands of patients with certain disease states for late-stage clinical trials of drugs under development. This patient recruitment process can take years. The Wall Street Journal reported that to speed up the recruitment process, drug manufacturers have turned to entities such as Blue Chip Marketing Worldwide, a drug-industry contractor that provides data-enabled solutions based on consumer data sets obtained from Experian, a data broker and provider of consumer data. Blue Chip uses sophisticated big data analyses and algorithms to identify individuals with potential disease states. Blue Chip’s services have already enabled a drug manufacturer to cut patient recruitment time from years to months.

Trading is another industry where early indications of such movements can be seen. As also reported in The Wall Street Journal, traders, in their never-ending quest to gather market-moving information more quickly, have turned to outfits such as Genscape, a player in a growing industry that employs sophisticated surveillance and data-crunching technology to supply traders with nonpublic information on topics including oil supplies, electric-power production, retail traffic, and crop yields. Founded by two former power traders, Genscape crunches vast amounts of sensor data, video camera feeds, and satellite imagery to find patterns and make predictions about potential movements in the supply and demand of commodities such as oil and electricity.

We are in the early days of this evolutionary process. The evolution of the data platform will most likely mirror the evolution of the digital platform, although it is expected to proceed at a faster pace. As competing technologies and solutions evolve and mature, we will see consolidation and emergence of just a few data platforms that will benefit from the tremendous horizontal economies of scale. Information industry incumbents such as Google will be in the pole position to be leaders in the data platform space. For example, Google already has a massive base of web and social interaction data thanks to Google Search, Android, and Google Maps, and it is making aggressive moves to expand into the “Internet of things,” the next frontier of big data.

Concurrently, we will witness a broad emergence of a long tail of small vendors of industry-oriented data products and services. Blue Chip and Genscape are first movers in a nascent market. As the economics of harvesting and using data become more attractive, an increasing number of industry players will want to leverage third-party data services, which in turn will create opportunities for data entrepreneurs. The Cambrian explosion will then be complete and the start-up garden of the data world will be in full bloom.

 

Big Data Technology Series – Part 6

In the last few installments of the Big Data Technology Series, we looked at the evolution of database management, business intelligence and analytics systems, and statistical processing software. In this installment, we will look at the modern advanced analytical platform for big data analytics, which represents a confluence of those three evolutionary threads in the story of data management and analytics platforms.  We will look at the core capabilities of such platforms and at the major vendors in the marketplace.

The graphic below provides a view of the core capabilities of a modern advanced analytical platform.  There is a wide range of analytical platforms in the marketplace, each specializing in and excelling at a specific aspect of big data analytics.  This picture presents the logical universe of the capabilities found across these platforms; in other words, no single platform provides all the core capabilities described below in their entirety.

Big Data Analytics Platform - Capabilities

  1. Hardware – Hardware includes the data processing and storage components of the analytical platform stack, providing management and redundancy of data storage.  As we saw in Part 2 of the Big Data Technology Series, database management platforms and their associated hardware have continued to evolve ever since the first databases appeared on the market in the 1950s.  Database hardware was once a proprietary component of the stack that provided considerable value; however, it is increasingly becoming a commodity.  Hardware innovation includes advances in storage such as solid-state devices and massively parallel node configurations connected by high-speed networks.  Modern analytic platforms provide the flexibility to use configurations such as these, as well as configurations of commodity x86 machines for managing lakes of massive, unstructured raw datasets.
  2. Analytic Database – This is the software layer that provides the logic for managing the storage of datasets across the node cluster, covering aspects such as partitioning, replication, and optimal storage schemes (such as row or column orientation).  Analytical applications run most efficiently with certain storage and partitioning schemes (such as columnar data storage), and modern analytical platforms provide capabilities to configure and set up these schemes (a toy sketch contrasting row and column layouts follows this list).  Memory-based analytic databases such as SAP HANA have added one more dimension: deciding how and when data should be processed in memory and when it should be written to disk.  Advances in database management systems have enabled the modern analytic platform to have at its disposal a range of tools and techniques to manage all data types (structured, semi-structured, or unstructured) and all processing needs (data discovery, raw data processing, etc.).
  3. Execution Framework – The execution framework is a software layer that provides query processing, code generation capabilities, and runtimes for code execution.  Advanced analytical applications frequently involve complex query routines, so a framework that can efficiently parse and process those queries is critical to the analytic platform.  Furthermore, modern analytical platforms provide capabilities to structure advanced analytical processing using higher-level programming languages such as Java and R.  The execution framework provides the logic to convert such higher-level processing instructions into optimized queries that are then submitted to the underlying analytical database management system.  Advances in analytical platforms, as we saw in Part 3 of the Big Data Technology Series, have enabled these capabilities in the modern-day analytic platform.
  4. Data Access and Adaptors – Modern analytic platforms provide prebuilt, custom-developed, and DIY connectors to a range of data sources such as traditional data warehouses, relational databases, Hadoop environments, and streaming platforms.  Such connectors provide bi-directional data integration between these data repositories and the analytic data store, giving the analytic platform visibility into data no matter where and how it is stored.
  5. Modeling Toolkit – The modeling toolkit provides design-time functionality to develop and test code for running advanced analytics and statistical processing routines using higher-level languages such as Java, Python, and R.  This represents the third and final thread in our story of the evolution of big data analytic platforms – the evolution, rise, and ultimately the convergence of statistical processing software into the logical big data analytic platform.  The toolkit provides not only a range of pre-built and independent third-party libraries of statistical processing routines, but also a framework that can be used and extended as needed to run custom statistical processing algorithms.
  6. Administration – Like any database management or traditional warehousing platform, the modern analytics platform provides strong administration and control capabilities to fine-tune and manage the workings of the platform.  The rise of horizontal scaling using commodity machines has placed increased importance on being able to efficiently administer and manage large clusters of such data processing machines.  Modern analytic platforms provide intuitive capabilities to finely control data partitioning schemes, clustering methods, backup and restore, and so on.
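
As a companion to items 2 and 5 above, the following toy sketch in plain Python (deliberately vendor-neutral, not any platform's actual API) contrasts a row-oriented layout with a column-oriented one and then runs a simple statistical routine over the columnar form, much as an in-database analytic function would touch only the columns it needs.

    # Toy illustration only: row store vs. column store, plus a simple aggregate.
    from statistics import mean

    # Row store: each record is kept together, convenient for transactional lookups.
    rows = [
        {"region": "east", "sales": 120.0},
        {"region": "west", "sales": 95.0},
        {"region": "east", "sales": 143.0},
    ]

    # Column store: each attribute is kept together, convenient for analytic scans,
    # since an aggregate only has to read the columns it actually uses.
    columns = {
        "region": [r["region"] for r in rows],
        "sales": [r["sales"] for r in rows],
    }

    # "In-database"-style routine: average sales per region over the column store.
    by_region = {}
    for region, sale in zip(columns["region"], columns["sales"]):
        by_region.setdefault(region, []).append(sale)

    print({region: mean(values) for region, values in by_region.items()})
    # -> {'east': 131.5, 'west': 95.0}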

There are a range of players in the market for big data analytics platforms as depicted by the graphic below.

Big Data Analytics Platform - Market

There are roughly three categories of such product vendors:

  1. Type 1 (Traditional Data Warehousing Vendors) – This category includes vendors such as IBM, SAP, and Oracle that have traditionally done very well in the BI/data warehousing space.  These solutions have excelled at providing traditional analytic capabilities for mostly structured datasets.  These vendors are rapidly extending their product capabilities to provide advanced analytics for big data sets, either in-house or through acquisitions and joint ventures with niche vendors specializing in advanced big data analytics.
  2. Type 2 (SQL on Hadoop) – This category includes vendors providing solutions that extend traditional Hadoop environments to deliver big data analytics in a real-time, ad hoc manner using SQL.  Traditional Hadoop is well suited to large-scale batch analytics; however, the MapReduce architecture does not extend easily to real-time, ad hoc analytics.  Some products in this space do away with the MapReduce architecture completely in order to overcome these limitations.
  3. Type 3 (Independent Players) – This category includes vendors that have come up with proprietary schemes and architectures to provide real-time, ad hoc analytic platforms.  Some, such as 1010data and Infobright, have existed for some time, while others, such as Google, are newcomers providing new ways to deliver analytic capabilities (e.g., Google offers a web-based service for running advanced analytics).

Below is a detailed description of the offerings from some of the major vendors in these three categories.

Type 1 (Traditional Data Warehousing Vendors) – Traditional vendors of enterprise data warehousing platforms and data warehousing appliances that have acquired and/or developed capabilities and solutions for large-scale data warehousing and data analytics.

Teradata

  • Teradata’s Aster database is a hybrid row and column data store that forms the foundation of its next generation data discovery and data analytic capability; the data management platform can be delivered as a service, on commodity hardware or an appliance form factor
  • Teradata Enterprise Data Warehouse is its data warehousing solution; the EDW is marketed as the platform for dimensional analysis of structured data and standard data warehousing functions as part of its Unified Data Architecture, Teradata’s vision for an integrated platform for big data management
  • Teradata delivers the Hortonworks Hadoop distribution as part of its Unified Data Architecture vision; the Aster database supports native MapReduce-based processing for bi-directional integration with the Hadoop environment; SQL-H provides a SQL-based interface for higher-level analysis of Hadoop-based data
  • The Aster data discovery platform provides capabilities for advanced statistical and data mining through pre-packaged function libraries, a development environment for custom analytic functions and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently, Teradata does not have a known solution for event stream processing (it has announced it may enter into partnerships with independent vendors of event stream processors)

Pivotal

  • Pivotal is an independent big data entity spun off from EMC after its acquisitions of VMware and Greenplum; Pivotal’s data analytics platform is powered by the Greenplum database, a hybrid row-and-column, massively parallel data processing platform; the data management platform can be delivered as a service, on commodity hardware, or in an appliance form factor
  • Pivotal also offers an in-memory data management platform, GemFire, and a distributed SQL database platform, SQLFire; Pivotal does not currently have a known solution for regular data warehousing
  • Greenplum Hadoop Distribution is a Greenplum supported version of Apache Hadoop; Greenplum database supports native MapReduce based processing for bi-directional integration with the Hadoop environment; Greenplum HAWQ provides SQL based interface for higher level analysis of Hadoop based data
  • Through partnerships with analytics vendors such as SAS and Alpine Data Labs, Greenplum platform provides capabilities for advanced statistical and data mining through pre-packaged function libraries, a development environment for custom analytic functions and an execution environment that can execute such analytic functions as part of standard SQL
  • Currently, Pivotal does not have a known solution for event stream processing

IBM

  • IBM’s big data management platform is powered by Netezza, a massively parallel data storage and distributed data processing appliance; Netezza enables data warehousing and fast analysis of mostly structured large scale data
  • IBM’s PureData System for Analytics provides the foundation for big data analytics; IBM PureData System for Analytics is a data warehouse appliance; IBM Netezza Analytics is an advanced analytics framework incorporating a software development kit for analytic model development, third-party analytic libraries, and integrations with analytic solutions such as SAS, SPSS, etc. in support of in-database analytics
  • IBM PureData System for Operational Analytics focuses on analytics for operational workloads (as opposed to regular data warehousing workloads, which are more long-term and strategic in nature)
  • IBM Big Data Platform Accelerators provide analytic solution accelerators, i.e., pre-built examples and toolkits (for video analytics, sentiment analytics, etc.) that enable users to jumpstart their analytic development efforts
  • IBM provides a licensed and supported version of Apache Hadoop distribution as part of its InfoSphere BigInsights platform; BigInsights provides Jaql, a query and scripting language for unstructured data in Hadoop
  • IBM does not currently have a known solution for in-memory data management (like SAP HANA or Pivotal GemFire)
  • IBM provides InfoSphere Streams for data stream computing in big data environments

Oracle

  • Oracle’s big data management platform is supported by the Oracle database that provides columnar compression and distributed database management for analytic functions
  • Oracle offers a range of appliances for big data warehousing and big data analysis; Oracle Exadata is an appliance for data warehousing based on the Oracle database and Sun hardware
  • The Oracle Big Data Appliance is a packaged software and hardware platform for managing unstructured data processing; it provides a NoSQL database, the Cloudera Hadoop platform and associated management utilities, and connectors that enable integration of the data warehousing environment with the Hadoop environment
  • Advanced analytics are provided by Oracle R Enterprise, which provides database execution environment for R programs, and Oracle Data Mining, which provides data mining functions callable from SQL and executable within the Oracle data appliance
  • Oracle Exalytics also provides an in-memory database appliance for analytical applications, similar to SAP HANA
  • Oracle Event Processing and Oracle Exalogic provide capabilities for event stream processing

Type 2 (SQL on Hadoop) – Independent (i.e., not traditional data warehousing) solution providers that offer big data warehousing and analytics platforms and products architected using a proprietary design, delivered as a software solution, managed service, or cloud offering (although some offer appliances as well), and focused on a specific market niche.

Hadapt

  • Hadapt enables an analytical framework for structured and unstructured data on top of Hadoop by providing SQL based abstraction for HDFS, Mahout, and other Hadoop technologies
  • Hadapt also integrates with third-party analytic libraries and provides a development kit to enable the development of custom analytic functions
  • Hadapt encourages deployment on configurations of commodity hardware (as opposed to proprietary appliances and platforms encouraged by Type 1 appliance vendors)

CitusData

  • An analytic database based on PostgreSQL database that offers SQL querying capabilities
  • Also offers SQL querying capabilities for data in Hadoop clusters
  • Offers a software solution that can run on commodity hardware

Other Type 2

  • A number of vendors provide tools that enable SQL processing on top of Hadoop, allowing higher-level analytics and processing by business analysts (who may not have the ability or time to code complex MapReduce functions)
  • Hive is a data warehousing solution for Hadoop-based data that provides a SQL-like language
  • Greenplum HAWQ, Aster Data SQL-H, and Cloudera Impala all aim to achieve higher performance for standard SQL on Hadoop by rectifying the shortcomings and limitations of Hadoop MapReduce and Hive; a short sketch of issuing such a query appears below
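
To give a flavor of the Type 2 approach, here is a minimal sketch that issues ordinary SQL against Hadoop-resident data through PyHive, one of several client libraries for HiveServer2.  The endpoint (localhost:10000) and the page_views table are hypothetical placeholders; the same pattern applies to Impala or HAWQ with their respective drivers.

    # Minimal SQL-on-Hadoop sketch using PyHive (pip install pyhive).
    # Host, port, and the page_views table are assumptions for illustration.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()

    # The analyst writes familiar SQL; the engine underneath (Hive, Impala,
    # HAWQ, etc.) turns it into distributed work over data stored in Hadoop.
    cursor.execute(
        """
        SELECT country, COUNT(*) AS views
        FROM page_views
        WHERE view_date >= '2014-01-01'
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
        """
    )

    for country, views in cursor.fetchall():
        print(country, views)

    cursor.close()
    conn.close()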

Type 3 (Independent Players) – Independent (i.e., not traditional data warehousing) solution providers that offer big data analytics platforms and products architected using a proprietary, non-Hadoop design for big data analysis, delivered as a software solution on commodity hardware configurations, a managed service, or a cloud offering; this category also includes niche players.

1010data

  • Its proprietary database is a columnar, massively parallel data management system with advanced, dynamic in-memory capabilities for mostly structured data analytics
  • Delivers the solution in the cloud and in hosted environments
  • Provides capabilities to perform granular statistical and predictive analytic routines that can be extended using 1010data’s proprietary language and interface
  • Started in the financial services space, and is now expanding to manufacturing and retail

ParAccel

  • Software solution for columnar, compressed, massively parallel relational data management that is capable of all-in-memory processing (provides connectors to major traditional data warehousing platforms, operational systems, and Hadoop)
  • Supports on-premise and cloud based deployment; on-premise deployment is supported on select commodity hardware configurations
  • Provides advanced in-database analytic solutions and libraries for a range of common and industry specific use cases through partnership with Numerix and Fuzzy Logix (vendors of analytic solutions)

Infobright

  • Offers a columnar, highly compressed data management solution (integrates with Hadoop)
  • Niche focus on analytics for machine generated data
  • Delivered as a software solution and as an appliance

LexisNexis

  • Provides Roxie, an analytic database and data warehousing solution, and a development and execution environment based on a proprietary querying language ECL
  • Provides pre-built analytics products and solutions for government, financial services, and insurance, as well as third-party analytic packages
  • Software solution delivered on certified hardware configurations (managed service and cloud offerings are on the way)
  • Focused on providing analytics related to fraud and other risk management applications

Google Cloud Platform

  • As part of its cloud computing platform, Google has released BigQuery, a real-time analytics service for big data based on Dremel, a scalable, interactive ad hoc query system for the analysis of large datasets (a brief usage sketch follows this list)
  • Other projects modeled after Dremel include Drill, an open source Apache project led by MapR for interactive ad hoc querying and analysis of big data sets as part of its Hadoop distribution
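
For a feel of the service model, below is a minimal sketch using the google-cloud-bigquery Python client; it assumes application default credentials are already configured, and the query over Google’s public Shakespeare sample dataset is purely illustrative.

    # Minimal BigQuery sketch (pip install google-cloud-bigquery).
    # Assumes application default credentials are configured for a GCP project.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT corpus, COUNT(DISTINCT word) AS distinct_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY distinct_words DESC
        LIMIT 5
    """

    # BigQuery runs the scan as a fully managed service; there is no cluster
    # for the user to provision, size, or administer.
    for row in client.query(query).result():
        print(row.corpus, row.distinct_words)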
