A colleague asked me for a brief, high-level overview of the Big Data software ecosystem: why his company should use it, what for, and what's in it. These are the tidied-up notes I quickly put together.

Big Data is a convenient label for an ecosystem of computing tools, techniques and technologies that has grown up over the past decade to address many of the data processing challenges faced by modern enterprises with growing data assets.

This document takes a liberal view of what is inside and outside of a strict definition of the ecosystem because, in total, it represents the most complete set of freely available software options for many enterprises' data processing needs.

Big Data usage has been associated with processing huge data volumes (terabytes and more), but most of the tools do not intrinsically require large volumes of data and work well with more modest volumes, even volumes currently managed using tools such as Excel.

Big Data software, usually running on the Linux operating system and often on Cloud resources, has been revolutionary within the IT industry, facilitating many new ways of working with and processing data. The ecosystem continues to exhibit huge volatility and rapid expansion, with many competing solutions - the market has not yet decided who the "winners" will be, with de facto standards such as Hadoop under pressure from newer solutions like Spark.

Big Data has been associated particularly with commoditising advanced analytics (especially predictive analytics / machine learning) and visualisation technologies, bringing them within the reach of many enterprises and offering unprecedented input and insights to guide enterprise decision making.

The majority of the currently leading tools, techniques and technologies are open source and vendor independent / neutral. Although more and more suppliers are entering the Big Data market and selling premium solutions, the core tools, techniques and technologies are available freely from e.g. the Apache Foundation.

Many leading and well-known companies such as Yahoo, Twitter, Google, LinkedIn, Netflix and Facebook have donated key technologies to the public domain either directly on Github and/or under the umbrella of an Apache Foundation project.

Big Data does not have to be Big Ticket

Suppliers' premiums for their value-added offerings can be significant and act to deter small-scale exercises / trials of elements of the ecosystem that appear to offer the best potential dividends.

This needn't be the case. It is entirely possible to start small and simple, and grow as needed. It's very easy to prototype using existing resources and scale up to e.g. the Cloud if, as and when needed: Big Data does not have to be Big Ticket.

Not only is the Big Data ecosystem often the only practical way to process large volumes of data, it is also often the cheapest.

The benefits of using the Big Data software ecosystem

The benefits of using the Big Data software ecosystem are numerous; almost everybody will have their own list. Here are some of mine.

Scale with growing or existing large data volumes

Mainstream 20th Century tools, techniques and technologies for data processing ran out of steam when confronted with larger volumes of data. Even today, arguably the most popular tool for managing structured data, Microsoft's Excel, can hold only around a million rows.

Many enterprises may not be aware of how much data they have; it's not unusual to have terabytes if any websites are involved.

The Big Data software portfolio is designed to scale to handle data volumes becoming commonplace in many enterprises.

Both Batch and Real Time Data Processing

Historically, Big Data processing has been predicated on Hadoop's batch processing (e.g. overnight jobs).

Today, with the advent of technologies such as Apache incubator project Storm, there are solutions to process incoming data in real time, with hybrid solutions in the continuum between the two extremes.
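The distinction can be sketched in a few lines of plain Python (illustrative only - this is not the Storm API): a batch job recomputes its totals from the full stored dataset, while a streaming consumer folds each new event into a running aggregate as it arrives.

```python
# Batch vs. real-time aggregation, as a toy sketch (not the Storm API).
from collections import Counter

def batch_count(events):
    """Batch: process the full, stored dataset in one pass (e.g. overnight)."""
    return Counter(events)

def stream_count(events, totals=None):
    """Streaming: fold each new event into a running total as it arrives."""
    totals = totals if totals is not None else Counter()
    for event in events:
        totals[event] += 1   # constant work per event; results always current
    return totals

overnight = batch_count(["login", "click", "click"])           # last night's run
live = stream_count(["click", "purchase"], Counter(overnight))  # today's events
```

Hybrid architectures sit between these extremes, typically serving a merged view of last night's batch results and today's stream.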

Supports Structured and Unstructured Data

Big Data has a rich set of solutions for acquiring, storing, processing and integrating both structured and unstructured data.

Big Data databases scale to far larger data volumes than commonly available commercial relational databases.

Big Data also has some of the best solutions for mining unstructured text.

Prototyping data analyses previously too expensive to attempt

The availability of free software and "pay as you go" capacity in the Cloud has made the case for performing new and/or more in-depth analyses of data assets easier to cost and justify.

It's easy to do a proof of concept in the Cloud, with constrained costs, to help inform the decision on whether a full-blown implementation will bring the expected benefits and justify the full costs.

Perform data analyses previously too expensive to run regularly

Big Data software by and large runs on Linux on commodity Intel (x86) processors. Whether using an enterprise’s own infrastructure or a Cloud provider, compute cycles are very cheap today.

Performing data analyses previously too difficult to do

The Big Data ecosystem includes a range of tools for performing advanced analytics and machine learning. Techniques which were once only accessible to enterprises with very specialist staff such as statisticians and data scientists are now almost commoditised, with advice, example implementations and guidance on approaches to use easily available.

Performing new forms of data analysis

The innovations made in Big Data processing techniques by companies such as Twitter have opened the Industry's eyes to whole new ways of working with data at scale.

Minimising Supplier Risk

The availability of the source code for most of the software mitigates a whole class of supplier risk.

Mainstream Language Support

Most Big Data software is written in one of the most common commercial programming languages in use today: Java. More importantly, it is written in languages that run on the Java Virtual Machine (JVM).

The JVM is important because many of the other most popular languages in use (e.g. Python, Ruby), and up-and-coming languages (e.g. Scala, Clojure), also run on the JVM, facilitating easier interworking between code bases and effective "re-use".


Automation and DevOps

The Linux community has always strongly endorsed and supported automation tools, and new tools and ways of working (DevOps) have been developed to create, support and manage infrastructures, notably for Big Data processing, running on Cloud resources.

The outcome has been that even running 100s or 1000s of virtual servers in the Cloud is manageable at reasonable cost.


Low Cost

The majority of the most important parts of the Big Data ecosystem are freely available under liberal licensing (e.g. the Apache Foundation v2.0 licence).

The use of Linux and extraordinarily powerful commodity x86 processors (either an enterprise's own or in the Cloud) offers unprecedented processing capacity at reasonable cost. Hardware costs are likely to fall even further with the advent of competing ARM v8 64-bit processors entering the market ~2014.


Support and Skills

Even though the customary support process is informal, via the respective communities, many of the most established players do offer formal, paid support plans (e.g. Cloudera, MapR and Hortonworks).

The popularity of Big Data software means that skilled people are relatively easily available in the market.


Documentation and Books

The documentation for many Big Data projects is of remarkably good quality given it has been produced mostly by volunteers.

There is also a plethora of good books on many topics, sometimes written by senior contributors to the software, providing insights into how the software works along with worked examples of how to do many tasks.

Documentation and books are backed by online question-and-answer forums such as StackOverflow, which has become almost the "go first" site for questions on computing.

What's in the Big Data Software Ecosystem

The Big Data software ecosystem can seem a complex jigsaw of tools, techniques and technologies. But the components of the ecosystem can be broken down into a small number of categories.


MapReduce

Big Data has for some time been almost synonymous with MapReduce - a technique for processing large volumes of data in parallel, first articulated in Google's seminal paper published in late 2004.

Hadoop is the Apache Foundation’s open source implementation of MapReduce.
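The model itself is simple enough to sketch in plain Python (this is an illustration of the idea, not Hadoop's API): map emits (key, value) pairs, the framework shuffles the pairs by key, and reduce aggregates each key's values. Hadoop runs the same three phases distributed across many machines.

```python
# A toy, single-machine illustration of the MapReduce model: word counting.
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Group pairs by key, as the framework does between map and reduce.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    # Aggregate all the values seen for one key.
    return key, sum(values)

documents = ["big data big ideas", "big ticket"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs))
```

The power of the model is that the map and reduce steps are independent per key, so they parallelise across a cluster with no changes to the logic.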

Distributed Filesystems

Hadoop has its own high capacity, high performance and fault tolerant distributed filesystem called HDFS - the Hadoop Distributed File System.

HDFS stores data (files) across many commodity servers, enabling the virtual filestore to be much bigger than the capacity of an individual server.

NoSQL Databases

The Hadoop family also comes with its own NoSQL database called HBase - the Hadoop Database.

NoSQL databases are very different from the familiar commercial transactional databases such as Oracle or Microsoft’s SQL Server.

NoSQL databases are not relational and do not (directly) respond to SQL queries (although some can be made to do so - see below).

HBase has been designed to scale to terabytes of data - a stretch for conventional relational databases.

HBase is well-suited to read-heavy online analytical processing (OLAP) workloads. Another database in the Apache family, Cassandra, is tuned for writes and is therefore particularly suitable for online transactional processing (OLTP) workloads, including real-time processing of new data.

SQL Support

Many of the existing database-oriented tools in the enterprise use SQL to interrogate relational databases, so access to NoSQL databases using SQL is an important consideration.

Hadoop has a solution called Hive offering a variant of SQL called HiveQL.

The performance of Hive is admittedly modest and recently there has been a spate of new ways of accessing NoSQL databases from SQL tools, such as Cloudera's Impala, offering a "massively parallel SQL query engine".

Data Manipulation Languages and Pipelines

The MapReduce paradigm is essentially one of a pipeline of data transformation (map) and aggregation (reduce) steps.

Hadoop comes with Pig (sic). Pig enables rich and complex data processing pipelines to be built in a modular way, using off-the-shelf and custom building blocks, bound together using a specialised data manipulation language called Pig Latin.

Pig is not to everyone's taste and there are some excellent alternatives, notably Cascalog and Scalding (from Twitter).

Much of the work in acquiring and processing new data begins by capturing, cleaning, normalising and storing the data: the so-called extract, transform and load (ETL) stages. This is where Pig, Cascalog, etc. excel.
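The ETL stages these pipelines automate can be sketched in plain Python (the field names and cleaning rules here are invented for illustration; the "load" target is just an in-memory dict standing in for HDFS or a database).

```python
# A toy ETL pipeline: extract messy CSV, clean and normalise it, load it.
import csv
import io

RAW = """name, city
 Alice ,LONDON
Bob,
 Carol , paris """  # messy input: stray spaces, mixed case, a missing field

def extract(text):
    # Extract: parse the raw source into records.
    return list(csv.DictReader(io.StringIO(text), skipinitialspace=True))

def transform(rows):
    # Transform: trim whitespace, normalise case, drop incomplete rows.
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        city = (row.get("city") or "").strip().title()
        if name and city:
            cleaned.append({"name": name, "city": city})
    return cleaned

def load(rows, store):
    # Load: write the cleaned records to the destination store.
    for row in rows:
        store[row["name"]] = row["city"]

store = {}
load(transform(extract(RAW)), store)
```

Tools like Pig express the same extract-transform-load flow declaratively and run it in parallel over data far too large for one machine.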

Machine Learning

Big Data has triggered an explosion in interest and use of the three main areas of Machine Learning / Artificial Intelligence: recommendation, clustering and classification.

The Apache Foundation offering in this space is Mahout but there are many other options available such as R, Weka, scikit-learn, etc.
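To give a flavour of what these libraries do, here is a toy k-means clustering in pure Python - the kind of algorithm Mahout or scikit-learn provide at scale. It is illustrative only: one-dimensional data and fixed starting centres.

```python
# Toy k-means: alternate assigning points to their nearest centre and
# moving each centre to the mean of its assigned points.
def kmeans(points, centres, iterations=10):
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centres, clusters = kmeans(points, centres=[0.0, 5.0])
```

The real libraries add the hard parts: choosing starting centres well, handling many dimensions, and distributing the work across a cluster.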

Search

Although perhaps not thought of as part of the Big Data ecosystem, search engines are arguably the original Big Data technology.

Search deals mostly with unstructured text-like data (including, and especially, web pages).

The search engine from Apache, Lucene / Solr, is an incredibly powerful piece of technology powering many web sites’ search capabilities.
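At the heart of Lucene sits a simple idea, the inverted index: a map from each term to the documents containing it. A minimal sketch in Python (the real thing adds ranking, tokenisation, compression and distribution):

```python
# A toy inverted index supporting AND queries over small documents.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)     # term -> set of documents containing it
    return index

def search(index, *terms):
    # AND query: documents containing every requested term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "big data at scale", 2: "data pipelines in pig", 3: "search at scale"}
index = build_index(docs)
```

Because lookups touch only the postings for the query terms, not the documents themselves, queries stay fast even over very large collections.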

Some of the most interesting work in Big Data today combines machine learning and search to offer new and innovative insights into the contents of unstructured data.

Natural Language Processing - NLP

NLP is another technology that has “come of age” in the Big Data era of cheap processing.

NLP has proved itself notably applicable to sentiment analysis and named entity recognition.

Apache has its own toolkit, OpenNLP, but Stanford University has a well-received range of NLP tools, while NLTK has had traction in the Python community.
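A toy lexicon-based scorer gives the flavour of the simplest form of sentiment analysis (the word lists here are invented for the example; real lexicons, like those shipped with NLTK, are far larger and the serious tools use trained statistical models):

```python
# Toy sentiment analysis: count positive and negative words from a lexicon.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text):
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Applied at Big Data scale - e.g. across a firehose of tweets - even this crude approach can surface useful trends in opinion about a brand or product.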

Business Intelligence

For many enterprises, there is little point in processing vast volumes of data unless the results can be reported on and visualised. The industry is well served by many commercial products such as Tableau, but Mondrian is an open source gem.

Visualisation and Dash Boards

Computing is generally well served for visualisation tools and languages like R and Python have particularly strong support for graphics.

However, it is worth mentioning D3, a JavaScript library for explanatory visualisation in modern browsers, i.e. web-based.

Secondly, and also worthy of note, is Shopify’s Dashing for creating browser-hosted dashboards.


Statistics

As a discipline, statistics has been around and well-established for centuries. But the computing era, and more recently Big Data, has provided enormous volumes of data on which to predicate statistical analyses.

Undoubtedly the most widely used language for statistical computing is R, but Python has a strong following.

But the challenges of Big Data have revealed limitations in R's architecture. More recent statistical languages like Julia have the potential to remove these obstacles to processing really large volumes of data.

Data Serialisation and Interchange

Aka the plumbing: data serialisation and interchange technologies such as Apache's Thrift, Google's ProtoBufs and Apache's Avro have broken the back of data exchange between processing components.
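The core idea these technologies share: both sides agree a schema up front, so records travel as compact binary rather than verbose, self-describing text. A sketch using Python's struct module purely to illustrate - the real libraries generate this kind of code from schema definition files:

```python
# Schema-based binary serialisation, sketched with the stdlib struct module.
import struct

# "Schema": a user record is (id: uint32, score: float64, name: 8 bytes),
# in network byte order. Both producer and consumer must agree on this.
SCHEMA = "!Id8s"

def serialise(user_id, score, name):
    # Pad the name to the fixed 8-byte field width.
    return struct.pack(SCHEMA, user_id, score, name.encode().ljust(8, b"\0"))

def deserialise(payload):
    user_id, score, raw_name = struct.unpack(SCHEMA, payload)
    return user_id, score, raw_name.rstrip(b"\0").decode()

wire = serialise(42, 0.75, "alice")   # 20 bytes on the wire
```

Thrift, ProtoBufs and Avro add what this sketch lacks: variable-length fields, schema evolution, and bindings for many languages from one schema file.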

What’s Hot

Big Data is only a decade old but is already seeing earlier technologies (e.g. Hadoop) being usurped by more recent innovations.

For example, Hadoop is already under pressure from new kid on the block, Apache's incubating project Spark, offering "lightning-fast cluster computing".

Another Apache project, Mesos, looks to revolutionise the way compute cluster resources can be managed, offering resource-efficient dynamic partitioning as the next step on from the sort of static partitioning found in a Hadoop cluster.

Meanwhile Storm looks to make real-time analysis of new, incoming data a reality.

Nathan Marz, in his book Big Data, has articulated how batch and real-time can be integrated and used pragmatically to deliver real-time views of large and growing volumes of data.

On the "theory" front, but with important practical implications, there has been some amazing work by Twitter on an Algebra for Analytics, using associative monoidal semigroups to enable low-latency, parallelisable computation approaches applicable to a range of use cases, culminating in their Summingbird library.
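The key property is easy to demonstrate in plain Python (a sketch of the idea, not Summingbird itself): if an aggregation's combine operation is associative, partial results can be computed on different machines, in any grouping, and merged afterwards to the same answer.

```python
# Word counts form a monoid: an empty dict is the identity, and merging
# two partial counts is associative - so chunks can be aggregated in any
# order or grouping, which is exactly what makes the work parallelisable.
from functools import reduce

def combine(a, b):
    # Merge two partial word-count results.
    merged = dict(a)
    for word, n in b.items():
        merged[word] = merged.get(word, 0) + n
    return merged

def count_words(chunk):
    partial = {}
    for word in chunk.split():
        partial[word] = partial.get(word, 0) + 1
    return partial

chunks = ["big data", "big ideas", "data at scale"]
# Each partial count could run on a different machine, then be merged:
total = reduce(combine, map(count_words, chunks), {})
# Associativity means regrouping the merges gives the same answer:
regrouped = combine(combine(count_words(chunks[0]), count_words(chunks[1])),
                    count_words(chunks[2]))
```

Because the result is independent of how the merges are grouped, the same aggregation logic works unchanged for overnight batch jobs and for low-latency streaming updates.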