The emergence of the Internet more than two decades ago transformed business models and, in recent years, data has gained special relevance for decisions about the future of companies.

Along these lines, for some years now we have heard the term Big Data more and more frequently, but do we really know what it means?

Big Data

When we talk about Big Data we refer to large volumes of data, both structured and unstructured, that are generated and stored day after day. What really matters, though, is not the amount of data we have, but what we do with it and which decisions we make to help improve our business, based on the knowledge obtained after analyzing that data.

Directly related to this concept is the DIKW pyramid, which establishes that information, knowledge and wisdom are built on top of data, as we can see in the following image:

Image 1. The DIKW pyramid (wikipedia.org)

Big Data projects are carried out on distributed file systems, in many cases on HDFS (Hadoop Distributed File System), the distributed storage system of the Hadoop ecosystem.

HDFS is a distributed, scalable and portable file system written in Java, which was initially designed to be used together with Hadoop, an open source framework for distributed application development inspired by Google's File System and MapReduce papers.

For years Hadoop has been the most widely used framework for Big Data projects and a key element in the evolution that has brought Big Data to where it is today.

To place in time the appearance of Hadoop and of other technologies strongly related to Big Data, we will use the following image:

Timeline of Big Data technologies

In addition to the technologies of the Hadoop ecosystem (Hadoop, Hive, HBase, etc.), Spark stands out for its decisive role in the evolution of Big Data.

Spark is a distributed data processing engine that can handle large volumes of information. It could be understood as an evolution of Hadoop MapReduce, offering, among others, the following advantages over it:

- In-memory processing, which avoids writing intermediate results to disk and makes jobs considerably faster.
- A richer, higher-level API (RDDs, DataFrames, Datasets) available from Scala, Java, Python and R.
- A single engine that covers batch, streaming, SQL, machine learning and graph processing.
- A better fit for interactive and iterative workloads.
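As a quick illustration of how concise that API is compared with a classic MapReduce job, here is a minimal word count sketch in PySpark; the input path is only a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-example").getOrCreate()

# The path is a hypothetical placeholder; any text file on HDFS or local disk works.
lines = spark.read.text("hdfs:///data/sample.txt")

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())   # split each line into words
          .map(lambda word: (word, 1))              # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))         # sum the counts per word, in memory

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```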

In the timeline above we can see two Greek letters, in the years 2012 and 2014, which give their names to the architectures we are going to talk about next.

Lambda and Kappa Architectures

Given that companies have an increasing volume of data and need to analyze and obtain value from it as soon as possible, there is a need to define new architectures to cover use cases different from the existing ones.

The most common architectures in these projects are mainly two: the Lambda Architecture and the Kappa Architecture. The main difference between them is the data processing flows involved, but we will see what each one consists of in more detail.

A couple of concepts that we have to define before looking at the characteristics of each are batch processing and streaming processing. Batch processing works on a bounded dataset that is completely available before the job starts, while streaming processing operates continuously on events as they arrive, with no defined end; the sketch below illustrates the difference.
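A minimal sketch with Spark, assuming a hypothetical directory of JSON events and an invented schema, just to make the two modes concrete:

```python
# Sketch: the same kind of source consumed as a batch (bounded) and as a stream (unbounded).
# Paths and schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("event_time", TimestampType()))

# Batch processing: the whole dataset is available before the job starts.
batch_df = spark.read.schema(schema).json("hdfs:///events/history/")
batch_df.groupBy("user_id").count().show()

# Streaming processing: new files are processed continuously as they arrive.
stream_df = spark.readStream.schema(schema).json("hdfs:///events/incoming/")
query = (stream_df.groupBy("user_id").count()
         .writeStream.outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```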

Lambda architecture

The Lambda Architecture, represented by the Greek letter λ, appeared in 2012 and is attributed to Nathan Marz. He defined it based on his experience with distributed data processing systems during his time at BackType and Twitter, and it is inspired by his article “How to beat the CAP theorem”.

Its objective was to have a robust system, tolerant of both human errors and hardware failures, that was linearly scalable and allowed writing and reading with low latency.

Nathan's solution to this problem is an architecture whose high-level diagram appears in the following image:

Image 2: Lambda architecture

The characteristics of the Lambda Architecture are:

- An immutable master dataset that stores all the raw data and to which new data is only ever appended.
- A batch layer that periodically recomputes views over the complete master dataset.
- A speed (streaming) layer that processes new data as it arrives, compensating for the latency of the batch layer.
- A serving layer that exposes the batch and real-time views so they can be queried with low latency.

In short, this type of architecture is characterized by using different layers for batch processing and streaming.
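A very simplified sketch of what those two layers might look like with Spark: a batch job that recomputes a view from the master dataset, and a streaming job that maintains a low-latency view of recent events. Paths, the Kafka broker and the topic name are assumptions for illustration (and the Spark Kafka connector is assumed to be on the classpath).

```python
# Simplified Lambda sketch: a batch layer and a speed (streaming) layer
# producing separate views that a serving layer would later merge.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# --- Batch layer: recompute the complete view from the immutable master dataset.
master = spark.read.parquet("hdfs:///master/pageviews/")
batch_view = master.groupBy("page").agg(F.count("*").alias("views"))
batch_view.write.mode("overwrite").parquet("hdfs:///views/batch/pageviews/")

# --- Speed layer: keep a low-latency view with only the most recent events.
recent = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "pageviews")
          .load())
realtime_view = (recent.selectExpr("CAST(value AS STRING) AS page")
                 .groupBy("page").count())

(realtime_view.writeStream
 .outputMode("complete")
 .format("memory")          # in-memory table standing in for a real serving store
 .queryName("realtime_pageviews")
 .start())
```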

Kappa architecture

The term Kappa Architecture, represented by the Greek letter κ, was introduced in 2014 by Jay Kreps in his article “Questioning the Lambda Architecture”.

In it, he points out possible "weak" points of the Lambda Architecture and how to solve them through an evolution. His proposal is to eliminate the batch layer, leaving only the streaming layer.

This layer, unlike the batch layer, does not have a beginning or an end from a temporal point of view and is continuously processing new data as it arrives.

As a batch process can be understood as a bounded stream, we could say that batch processing is a subset of streaming processing.

This evolution is a simplification of the Lambda Architecture in which the batch layer is eliminated and all the processing is done in a single layer, called the Real-time Layer, which supports both batch and real-time processing.
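A minimal sketch of what such a single layer could look like with Spark Structured Streaming reading from Kafka: the same job serves real-time needs and, by replaying the log from the earliest offset, full reprocessing as well. The broker address, topic and paths are assumptions.

```python
# Kappa-style sketch: one streaming job covers both "batch" and real-time needs.
# To reprocess everything after a logic change, restart the job with
# startingOffsets = "earliest" and a fresh checkpoint directory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-events")
          .option("startingOffsets", "earliest")   # replay the whole log when needed
          .load())

processed = events.selectExpr("CAST(key AS STRING) AS user_id",
                              "CAST(value AS STRING) AS payload",
                              "timestamp")

(processed.writeStream
 .format("parquet")
 .option("path", "hdfs:///views/kappa/user-events/")
 .option("checkpointLocation", "hdfs:///checkpoints/kappa/user-events/")
 .start()
 .awaitTermination())
```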

The architecture diagram would be represented by the following image:

Image 3: Kappa architecture

We can say that its four main pillars are the following:

- Everything is a stream: batch operations are a subset of streaming operations, so everything can be treated as a stream.
- Immutable data: the raw data is stored, unmodified, in a log from which it can always be re-read.
- A single processing engine: only one analytics framework is needed, which keeps the code base simple.
- Replayability: any processing can be relaunched from the stored log whenever the logic changes.

As a prerequisite, it must be guaranteed that events are read and stored in the order in which they were generated. This way, we can re-run a given processing job against the same version of the data.
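With Kafka this guarantee is usually obtained per partition, by keying every event of the same entity with the same message key so that they all land in the same partition. A small sketch with the kafka-python client, where the broker, topic and event fields are assumptions:

```python
# Sketch: keying messages so events for the same user always go to the same
# Kafka partition, preserving their relative order.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "42", "antenna_id": "A-17", "ts": "2014-07-02T10:15:00Z"}

# Same key -> same partition -> order is preserved for this user.
producer.send("user-events", key=event["user_id"], value=event)
producer.flush()
```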

Which architecture best fits our needs?

Now that we have seen what each of the architectures consists of, the hard part is deciding which one best fits our business model.

As in most cases, there is no single optimal solution for all problems, something usually summed up by the expression "one size does not fit all". The Lambda Architecture is more versatile and can cover a greater number of cases, many of which even require real-time processing.

One question we must ask ourselves in order to decide is: are the analysis and processing that we are going to carry out in the batch and streaming layers the same? If so, the most appropriate option would be the Kappa Architecture.

As a real example of this architecture, we could take a system that geolocates users by their proximity to a mobile phone antenna. Each time a user approaches an antenna that gives them coverage, an event is generated. This event is processed in the streaming layer and used to draw on a map the user's displacement with respect to their previous position.
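A deliberately naive sketch of that streaming layer, consuming antenna events with kafka-python and emitting a displacement whenever a user changes antenna; the topic, broker and event format are assumptions:

```python
# Naive sketch of the streaming layer for the geolocation example:
# consume antenna events and report each user's move relative to the
# previous antenna seen.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "antenna-events",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

last_antenna = {}  # user_id -> last antenna seen

for message in consumer:
    event = message.value                     # e.g. {"user_id": "42", "antenna_id": "A-17"}
    user, antenna = event["user_id"], event["antenna_id"]
    previous = last_antenna.get(user)
    if previous and previous != antenna:
        # A real system would update the user's position on the map here.
        print(f"user {user} moved from antenna {previous} to {antenna}")
    last_antenna[user] = antenna
```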

However, at other times we will need to access the entire dataset without penalizing performance, and in those cases the Lambda Architecture can be more appropriate and even easier to implement.

We will also lean towards a Lambda Architecture if our batch and streaming algorithms generate very different results, as can happen with heavy processing operations or in Machine Learning models.

A real use case for a Lambda Architecture could be a system that recommends books according to users' tastes. On the one hand, it would have a batch layer in charge of training the model and improving the predictions; on the other, a streaming layer capable of handling real-time scoring.
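A rough sketch of how those two layers could be split with Spark: the batch layer periodically trains an ALS model on historical ratings and materializes top-N recommendations, while the streaming layer joins incoming user events against that precomputed view. Paths, topic and column names are assumptions, not a description of any particular production system.

```python
# Lambda sketch for the book recommender: batch training + streaming lookup.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("book-recs-lambda").getOrCreate()

# --- Batch layer: retrain the model and materialize recommendations.
ratings = spark.read.parquet("hdfs:///master/book_ratings/")   # userId, bookId, rating
als = ALS(userCol="userId", itemCol="bookId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(10) \
     .write.mode("overwrite").parquet("hdfs:///views/recommendations/")

# --- Streaming layer: enrich live events with the precomputed recommendations.
recs = spark.read.parquet("hdfs:///views/recommendations/")
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "book-page-views")
          .load()
          .selectExpr("CAST(CAST(key AS STRING) AS INT) AS userId"))

(events.join(recs, on="userId", how="left")   # stream-static join
 .writeStream
 .format("console")
 .start()
 .awaitTermination())
```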

Conclusions

To conclude, we must point out how quickly the use cases we want to cover with our Big Data solutions evolve, which means we must adapt to them as soon as possible.

Each problem has its own particular conditions, and in many cases we will have to evolve the architecture we have been using so far. As they say: "renew or die".
