Most of the data produced today is generated continuously (sensors, transactions, interactions, user activity…). Processing Big Data quickly is becoming increasingly important.
The most common way to analyse all this information is to keep it in stable storage (HDFS, DBMS…) and analyse it later, periodically, through batch processing.
The main characteristic of stream processing engines is that they can analyse this information as it arrives. Here, real time means processing a data stream and analysing its contents in the shortest time possible.
Among the recently emerged tools, Samza (developed by LinkedIn), Storm and Flink stand out. This article focuses on what is behind Flink, how and where it emerged, and how we can use it in projects that require the shortest response times.
What are the origins of Flink?
The Flink project began as a collaboration between several European universities in a research project called “Stratosphere: Information Management on the Cloud”. Flink is a fork of this project, and it entered the Apache Incubator in March 2014. December 2014 was a key date because Apache accepted Flink as a Top-Level Project. Today the framework is supported and developed by the start-up Data Artisans.
Apache Flink is an open-source platform for scalable stream and batch data processing. It is not just another Big Data analytics framework: its design includes many technical innovations and a different vision that sets it apart from the rest.
Why use Flink?
The original design of Flink is based on the concepts of MapReduce, MPP (Massively Parallel Processing) databases, and data-flow systems. Flink can work independently of existing technologies like Hadoop, but it can also run on top of HDFS and YARN.
Stream processing can also simplify the infrastructure, minimising the number of components that must be maintained and orchestrated in our architecture.
Apache Flink includes the following features:
- Low latency (results in milliseconds).
- High throughput (millions of events per second).
- Consistency (correct results even when failures occur).
- Fault tolerance through a system of distributed snapshots.
- Out-of-order events (events are processed according to their associated timestamp).
- Very flexible streaming windows system.
- A single system for batch processing and streaming.
- Intuitive multilingual APIs (Scala, Python, and Java) similar to those of the batch model.
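To make the event-time and windowing features above more concrete, here is a minimal sketch in plain Python. It does not use Flink's API. The function name, the sensor keys and the timestamps are all hypothetical, chosen only for illustration. It shows how a record that arrives late is still counted in the window its own timestamp belongs to.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size_ms):
    """Count events per key in fixed-size (tumbling) windows by event time.

    `events` is an iterable of (event_time_ms, key) pairs in arrival order,
    which may differ from event-time order.
    """
    counts = defaultdict(int)
    for event_time_ms, key in events:
        # The window is chosen from the event's own timestamp,
        # not from the moment the record arrives.
        window_start = event_time_ms - (event_time_ms % window_size_ms)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical readings: the record stamped 900 ms arrives after the one
# stamped 1200 ms, yet it is still assigned to the window starting at 0.
arrivals = [
    (100, "sensor-a"),
    (1200, "sensor-a"),
    (900, "sensor-a"),
    (1500, "sensor-b"),
]
result = tumbling_window_counts(arrivals, window_size_ms=1000)
```

A real engine must also decide when a window can be considered complete and its result emitted. This sketch sidesteps that question by processing a finite list.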
Earlier systems approached the problem of stream processing in different ways. Apache Storm was a pioneer of real-time processing with a pure streaming model, and did for real-time processing what Hadoop did for batch processing.
Apache Spark found an intelligent way of approximating real-time processing through micro-batching. Apache Flink accomplishes pure streaming by implementing features like in-memory processing, native support for iterations, automatic optimisation, and advanced time window support. The following graphic shows the performance improvements with regard to Apache Storm:
How is micro-batching different from pure streaming? With pure streaming, the input arrives as a single sequence of records, and the output is needed as quickly as possible. With micro-batching, the input is divided into batches by record count or by time.
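The difference can be illustrated with a small simulation in plain Python. It does not use the API of any of these engines. It assumes that processing itself takes no time and that batches are cut at fixed time intervals; the function names and arrival times are hypothetical.

```python
def per_record_emit_times(arrival_times_ms):
    """Pure streaming: each record is handled as soon as it arrives."""
    return list(arrival_times_ms)


def micro_batch_emit_times(arrival_times_ms, batch_interval_ms):
    """Micro-batching: a record waits until the batch it falls into closes."""
    return [
        (t // batch_interval_ms + 1) * batch_interval_ms
        for t in arrival_times_ms
    ]


# Hypothetical arrival times in milliseconds.
arrivals_ms = [10, 250, 490, 510]
streaming = per_record_emit_times(arrivals_ms)
batched = micro_batch_emit_times(arrivals_ms, batch_interval_ms=500)

# Extra waiting time that batching adds to each record.
added_delay_ms = [b - s for s, b in zip(streaming, batched)]
```

Under these assumptions a record can wait for almost a full batch interval before it is processed, which is why the batch interval sets a lower bound on latency in a micro-batching system.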
After seeing all the features of Flink, what kind of projects is it appropriate for? How could we use it in our projects? How can we get the best performance? For example, we can implement efficient distributed systems that respond rapidly to computationally complex questions (machine learning, statistics…), clean and pre-filter huge amounts of information, detect anomalies, and build real-time monitoring or alerting systems, IoT projects, etc.
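As an illustration of the anomaly detection use case, the following plain Python sketch flags readings that deviate strongly from the running mean of the values seen so far. It is not Flink code. The function name, the threshold, the warm-up length and the sample readings are assumptions made for the example. A production job would run comparable logic inside a streaming engine.

```python
import math


def detect_anomalies(values, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean.

    A reading is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the earlier, non-flagged readings.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for index, value in enumerate(values):
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(value - mean) > threshold * std:
                yield index, value
                continue  # keep the outlier out of the running statistics
        # Welford's online update of the mean and the sum of squared deviations.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)


# Hypothetical temperature readings with one obvious spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1]
alerts = list(detect_anomalies(readings))
```

Because the statistics are updated one record at a time, the detector never needs the full data set in memory, which is the property that makes this kind of check suitable for a stream.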
Do you need very fast response times, or to show data in real time? Perhaps the tools based on micro-batching don’t fit your needs. Flink offers a different approach to real-time processing that delivers results with less delay.
Without a doubt, Flink is going to be a relevant technology in the Big Data world in the near future. It has brought a new focus to real-time processing, and its innovative approach will be very appealing for use cases where real time is a decisive factor.