Amazon Web Services, Google Cloud Platform and Microsoft Azure have become the leading providers of cloud technology today. Among the many IaaS and PaaS solutions these providers offer, the components aimed specifically at the Big Data field stand out. In this post, we’ll analyse the main Big Data-oriented tools offered by these three providers, and clarify the different components, such as storage, processing or intelligence solutions.
DynamoDB is Amazon’s managed NoSQL database service, capable of storing large volumes of data while offering latencies of less than 10 milliseconds, regardless of the storage size. To deliver this performance, it always runs on SSD storage.
Redshift is AWS’s quintessential data-warehouse service, ideal for Business Intelligence and expandable up to several petabytes. It features a standard ODBC/JDBC interface and simple integration with S3 for loading data and storing backups.
S3 is the backbone object storage system of AWS, allowing files of up to 5 TB each. It is a highly scalable system with encryption capabilities and very high durability, reaching up to eleven nines (99.999999999%). It is the main system used for importing data or storing results.
EMR is a simple way to analyse large volumes of data through classic MapReduce. For this purpose Amazon uses Apache Hadoop or Apache Spark to distribute and process the data over EC2 instances, supporting both batch jobs and real-time processing (Spark Streaming).
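To make the MapReduce model concrete, here is a toy, single-machine sketch of the word-count job that Hadoop or Spark would run distributed across EC2 instances (the data and function names are invented for the example):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop/Spark mapper would.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key; in a real cluster the framework does
    # this between the map and reduce phases, across machines.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as the reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data on aws", "aws emr runs big jobs"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 2, counts["aws"] == 2
```

The point of the model is that map and reduce are pure, per-record operations, which is what lets the framework spread them over many instances.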
Kinesis is the platform Amazon offers us for managing data in real time, divided into three components: Kinesis Firehose, for loading data into S3 or Redshift; Kinesis Streams, for managing streams and communicating with other applications; and Kinesis Analytics, for running simple in-flight analytics that transform or aggregate the data using SQL.
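The kind of in-flight aggregation Kinesis Analytics expresses in SQL can be pictured as a windowed count over a stream. A minimal local simulation, with an invented record layout and a 10-second tumbling window chosen for the example:

```python
from collections import defaultdict

records = [
    {"sensor": "a", "ts": 1},
    {"sensor": "a", "ts": 4},
    {"sensor": "b", "ts": 7},
    {"sensor": "a", "ts": 12},
]

def tumbling_count(records, window_seconds):
    # Assign each record to a fixed, non-overlapping time window
    # and count records per (window, sensor) pair.
    windows = defaultdict(int)
    for rec in records:
        window_start = (rec["ts"] // window_seconds) * window_seconds
        windows[(window_start, rec["sensor"])] += 1
    return dict(windows)

counts = tumbling_count(records, 10)
# counts[(0, "a")] == 2, counts[(10, "a")] == 1
```

In Kinesis Analytics the same logic would be a `GROUP BY` over a window in SQL; the service handles the windowing and state for you.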
Data Pipeline helps us orchestrate data between the different services offered by AWS, and allows us to apply transformations to the data in transit. This service, like some others from Amazon, also has an on-premises option available.
Amazon Machine Learning is a service that lets developers create models in a simple way without having to worry about which algorithm to use. Our task consists of providing the data to train the model via S3, Redshift, or RDS. Once the model is trained, we just send it our prediction requests.
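The train-then-predict workflow described above can be sketched locally. This is not Amazon’s actual algorithm, just a hand-rolled one-variable linear fit standing in for the model the service would build for you:

```python
def train(samples):
    # Ordinary least squares for y = a*x + b on (x, y) pairs.
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict(model, x):
    # Analogous to sending a prediction request to the trained model.
    a, b = model
    return a * x + b

model = train([(1, 2), (2, 4), (3, 6)])  # invented, perfectly linear data
# predict(model, 10) == 20.0
```

With the managed service the two steps are the same shape, only the training data comes from S3/Redshift/RDS and predictions go through an API call.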
QuickSight is the Business Intelligence service, tightly integrated with Redshift as a storage layer, though of course it also allows integration with other data sources. It is oriented towards business users and includes an in-memory data engine called SPICE.
This service offers us a managed Elasticsearch engine. It comes with the Elasticsearch-Logstash-Kibana (ELK) stack and also exposes the Elasticsearch API for more specific queries or for indexing the information we need.
Bigtable is the scalable, managed NoSQL service that Google offers, with low latency and high performance even under heavy load. Besides its own API, it is compatible with the HBase interface, and it allows capacity at the petabyte level.
Datastore offers us a scalable NoSQL storage system. It is a hosted service which automatically manages sharding and replication, and it offers ACID transactions, indexes, and an SQL-like query interface.
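The value of those ACID transactions is all-or-nothing updates. A local illustration of the semantics, using an invented in-memory “datastore” and a money-transfer example rather than Datastore’s real API:

```python
datastore = {"alice": {"balance": 100}, "bob": {"balance": 20}}

def transfer(store, src, dst, amount):
    # Snapshot the state so we can roll back on failure (atomicity).
    snapshot = {key: dict(entity) for key, entity in store.items()}
    try:
        store[src]["balance"] -= amount
        if store[src]["balance"] < 0:
            raise ValueError("insufficient funds")
        store[dst]["balance"] += amount
    except Exception:
        store.clear()
        store.update(snapshot)  # either both updates apply, or neither
        raise

transfer(datastore, "alice", "bob", 30)
# datastore["alice"]["balance"] == 70, datastore["bob"]["balance"] == 50
```

In the real service the transaction is expressed through the client library, and the store, not your code, guarantees the rollback.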
Dataproc allows us to manage Big Data processing clusters based on open-source technology such as Apache Hadoop or Apache Spark in a simple way. It relies on Google Compute Engine for execution but relieves us of the complexity of managing the cluster, so that we can concentrate on developing and running our jobs.
Dataflow is a tool designed for defining distributed execution pipelines, very useful for ETLs and for both batch and continuous (streaming) processing. It is based on the Apache Beam programming model, whose pipelines can also run on engines such as Apache Spark or Apache Flink.
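A pipeline in this model is just a chain of transforms applied to a collection. A toy local version (stage names invented; Beam’s real API adds distributed execution, windowing and I/O connectors on top of this idea):

```python
from functools import reduce

def run_pipeline(data, stages):
    # Feed the output of each stage into the next, left to right.
    return reduce(lambda collection, stage: stage(collection), stages, data)

parse = lambda lines: [int(x) for x in lines]          # ingest raw records
keep_even = lambda nums: [n for n in nums if n % 2 == 0]  # filter step
double = lambda nums: [n * 2 for n in nums]            # transform step

result = run_pipeline(["1", "2", "3", "4"], [parse, keep_even, double])
# result == [4, 8]
```

Because each stage is a self-contained transform, the runner is free to distribute and parallelise them, which is exactly what Dataflow does for you.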
Pub/Sub offers a high-performance message queue service, which allows us to decouple the different components of our system by means of asynchronous processing. It is designed to offer low latencies and is capable of processing more than 1,000,000 messages per second. Internally, Google uses this component in a large number of its own applications, from Google Ads to Gmail.
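The decoupling works because the publisher never waits for the consumer. A minimal local sketch of that pattern with a thread and a queue (names invented; the real service is a managed API with topics and subscriptions):

```python
import queue
import threading

topic = queue.Queue()   # stands in for a Pub/Sub topic
received = []

def subscriber():
    # Drain messages asynchronously until a stop sentinel arrives.
    while True:
        message = topic.get()
        if message is None:
            break
        received.append(message)

worker = threading.Thread(target=subscriber)
worker.start()

for i in range(3):
    topic.put(f"event-{i}")  # publisher returns immediately
topic.put(None)              # signal the consumer to stop
worker.join()
# received == ["event-0", "event-1", "event-2"]
```

Replacing the in-process queue with a managed service gives you the same shape, plus durability and fan-out to many subscribers.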
Datalab is a tool based on Jupyter that gives us a powerful interactive environment to explore, analyse, and visualise data. We can easily connect it to BigQuery or Cloud Storage, and with Datalab we can combine code, documentation, results, and visualisations together in a notebook format.
Google offers us a hosted Machine Learning service based on its open-source library TensorFlow. It allows us to train models and make predictions in a simple manner, without worrying about the computing resources this requires, and it integrates naturally with Datalab.
This high-level tool offers us a REST API through which we have a very powerful text-analysis service. We can extract information and perform sentiment analysis, content classification, and entity-relationship graphs. Internally, Google uses Deep Learning technology to provide this service.
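To show the kind of result a sentiment endpoint returns, here is a deliberately naive lexicon-based scorer. The word list and scale are invented for illustration; the real service uses Deep Learning models, not a lexicon:

```python
# Invented toy lexicon: positive words score > 0, negative < 0.
LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def sentiment(text):
    # Average the scores of the known words; 0.0 if none match.
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

score = sentiment("the service was good but the latency was terrible")
# score == (0.5 - 1.0) / 2 == -0.25
```

The API’s response has the same flavour: a single polarity score per document (plus magnitude and per-entity detail), obtained from a far richer model.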
SQL Data Warehouse is Microsoft’s managed solution for a data warehouse aimed at analytics. It offers us an SQL interface for querying the data and integrates easily with the rest of the Azure components, such as Azure Active Directory for managing security. It can also integrate with other BI tools like Tableau or Qlik, and even with Microsoft Excel.
Azure’s Data Lake offers us a hosted service based on Hadoop HDFS in the cloud, where we can store structured or unstructured data in its original format. Thanks to its distributed nature we get high performance for batch processing and real-time queries, and it internally manages replication, which guarantees the durability of the data.
Stream Analytics is the product for analysing data streams in real time using simple SQL scripts. Combined with Event Hubs, it can generate, for example, sales figures or system-monitoring alerts.
HDInsight is a distributed processing service based on open-source technology such as Apache Spark or Apache Storm. It allows us to scale the capacity of the cluster on demand and thus process terabytes or petabytes of information, and we can develop our jobs in .NET or Java.
Cognitive Services encompass a series of high-level APIs that simplify the application of Artificial Intelligence in our systems. We have services geared towards language processing, speech, vision, search, and knowledge.
Microsoft offers an interactive, workflow-style system for live analysis, with a complete set of data transformations, which can also be consumed via API by external applications. For non-expert users it provides an automatic system for generating optimal models.
In summary, we can say that the main providers of cloud technology are offering a growing range of tools geared towards Big Data, although without a doubt they are still far from full Big Data platforms like Stratio, Cloudera or Hortonworks, which cover all the necessities end to end in a seamless flow of ingestion, storage and Big Data processing.
On the other hand, those Big Data platforms always give us the option of running on-premises or in the cloud via IaaS, something the cloud tools don’t offer us. In any case, the cloud tools are a good solution on many occasions, offering us the possibility of solving concrete problems cost-efficiently and with a reduced time to market.