Big Data in the Cloud

Amazon Web Services, Google Cloud Platform and Azure have become the main cloud providers today. Among the many IaaS and PaaS solutions these providers offer, the components targeting the Big Data field stand out. In this post, we'll analyse the main Big Data-oriented tools offered by these three providers, and clarify the different kinds of component they provide, such as storage, processing and intelligence solutions.

Amazon


DynamoDB

A managed NoSQL database service, capable of storing large volumes of data while offering latencies below 10 milliseconds, regardless of storage size. To deliver this performance, the service runs entirely on SSD storage.
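A minimal sketch of writing and reading an item with boto3 (the table name and its `user_id` partition key are hypothetical; the client calls require configured AWS credentials):

```python
def build_item(user_id, attributes):
    # Assemble a DynamoDB item for a hypothetical "events" table
    # whose partition key is "user_id".
    return {"user_id": user_id, **attributes}

def put_event(table_name, item):
    # Write the item; requires boto3 and AWS credentials.
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    table.put_item(Item=item)

def get_event(table_name, user_id):
    # Single-item reads by key are where the sub-10 ms latency shows.
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    return table.get_item(Key={"user_id": user_id}).get("Item")
```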

Redshift

This is the quintessential data-warehouse service, ideal for Business Intelligence and expandable up to several petabytes. It offers a standard ODBC/JDBC interface and simple integration with S3 for data loading and backups.
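Loading from S3 typically uses the `COPY` command over that same JDBC/ODBC connection; a sketch with made-up table, bucket, and role names:

```sql
-- Hypothetical table, bucket, and IAM role; the role must allow S3 reads.
COPY sales
FROM 's3://my-bucket/sales/2017/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV
GZIP;
```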

S3

S3 is the backbone object storage system of AWS, allowing files of up to 5 TB each. It is a highly scalable system with encryption capabilities and very high durability for this type of system, reaching up to eleven nines (99.999999999%). It is the main system used for importing data and storing results.
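A hedged boto3 sketch of an encrypted upload (bucket and key names are made up; the upload call requires AWS credentials):

```python
def extra_args(encrypt=True):
    # ExtraArgs for upload_file; SSE-S3 ("AES256") enables
    # server-side encryption at rest.
    return {"ServerSideEncryption": "AES256"} if encrypt else {}

def upload(path, bucket, key):
    # upload_file switches to multipart uploads automatically for
    # large files, which is how objects can reach 5 TB.
    import boto3
    boto3.client("s3").upload_file(path, bucket, key, ExtraArgs=extra_args())
```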

EMR

This is a simple way to analyse large volumes of data with the classic MapReduce model. Amazon uses Apache Hadoop or Spark to distribute and process the data over EC2 instances, enabling both batch jobs and real-time processing (Spark Streaming).
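On EMR this logic would be submitted as a Hadoop or Spark step; a minimal local sketch of the MapReduce model itself, using the classic word count:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair per word, as a Hadoop job would.
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    # Reduce phase: sum the counts grouped by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    # The framework shuffles map output to reducers; chaining the
    # iterables stands in for that step here.
    return reducer(chain.from_iterable(mapper(l) for l in lines))
```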

Kinesis

This is Amazon's platform for managing data in real time, divided into three components: Kinesis Firehose, for loading data into S3 or Redshift; Kinesis Streams, for managing streams and communicating with other applications; and Kinesis Analytics, for simple in-flight analytics that transform or aggregate the data using SQL.
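Producing to a stream is a single boto3 call; a sketch with a hypothetical stream name (the partition key decides which shard receives the event):

```python
import json

def build_record(stream, event, partition_key):
    # Shape a Kinesis record: the payload travels as bytes, and the
    # partition key routes the event to a shard.
    return {
        "StreamName": stream,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }

def send(record):
    # Requires boto3 and configured AWS credentials.
    import boto3
    boto3.client("kinesis").put_record(**record)
```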

Data Pipeline

This helps us orchestrate the movement of data between the different services offered by AWS, and lets us perform transformations on the data in transit. Like other Amazon services, it can also execute tasks on-premises.

Machine Learning

A service for developers that makes it possible to create models in a simple way, without having to worry about which algorithm to use. Our task consists of providing the training data via S3, Redshift, or RDS; once the model is trained, we just send it our prediction requests.

QuickSight

This is the Business Intelligence service with high integration with Redshift as a storage layer, though of course it also allows for integration with other data sources. Oriented for business users and with an in-memory data system called SPICE.

Elasticsearch Service

This service offers us a managed Elasticsearch engine. It comes with the Elasticsearch-Logstash-Kibana (ELK) stack on the client side, and also exposes the Elasticsearch API for more specific queries or for indexing the information we need.

Google Cloud


Bigtable

This is the scalable, managed NoSQL service that Google offers, with low latency and high performance even under heavy load. It is compatible with the HBase interface as well as accessible via its own API, and supports capacities at the petabyte level.

Datastore

Datastore offers us a scalable NoSQL storage system. It is a hosted service which automatically manages sharding and replication. It offers ACID transactions, indexes, and an SQL-like query interface.

BigQuery

BigQuery is a hosted data-warehouse service that scales to the petabyte level. It supports queries in standard SQL (SQL:2011) and lets us define user-defined functions (UDFs) in JavaScript. Data can easily be loaded from Cloud Storage, Cloud Datastore, or Cloud Dataflow, or from files through serverless ETLs.
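A sketch of a standard SQL query with a temporary JavaScript UDF (the project, dataset, and column names are made up):

```sql
-- Standard SQL with a JavaScript UDF; table and column names are hypothetical.
CREATE TEMP FUNCTION normalize(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return name ? name.trim().toLowerCase() : null;
""";

SELECT normalize(customer_name) AS customer, COUNT(*) AS orders
FROM `my_project.sales.orders`
GROUP BY customer;
```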

Dataproc

Dataproc allows us to manage Big Data processing clusters based on open-source technology like Apache Hadoop or Apache Spark in a simple way. It relies on Google Compute Engine for the execution, but frees us from the complexity of managing the cluster, so that we can concentrate on developing and running the jobs.

Dataflow

This is a tool designed for defining execution pipelines in a distributed manner, very useful for ETLs, batch processing, and continuous or streaming processing. It is based on Apache Beam; pipelines run on Google's managed execution service, and the same Beam code can also run on other engines such as Apache Spark or Apache Flink.
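A minimal Beam pipeline sketch (requires `apache-beam` installed; on GCP it would be launched with the Dataflow runner, and the file paths are hypothetical):

```python
def parse_event(line):
    # Transform applied inside the pipeline: "user,amount" CSV line
    # becomes a (key, value) pair.
    user, amount = line.split(",")
    return user, float(amount)

def run(input_path, output_path):
    # Requires apache-beam; by default this uses the local DirectRunner.
    import apache_beam as beam
    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText(input_path)
         | beam.Map(parse_event)
         | beam.CombinePerKey(sum)   # total amount per user
         | beam.io.WriteToText(output_path))
```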

Pub/Sub

This offers a high-performance message queue service, which allows us to decouple the different components of our system by means of asynchronous processing. It is designed to offer low latencies and is capable of processing more than 1,000,000 messages per second. Internally, Google uses this component in a large number of its applications, from Google Ads to Gmail.
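Publishing a message with the `google-cloud-pubsub` client is a short sketch (project and topic names are hypothetical; the publish call requires GCP credentials):

```python
def encode_message(payload):
    # Pub/Sub message bodies are bytes; JSON is a common encoding.
    import json
    return json.dumps(payload).encode("utf-8")

def publish(project_id, topic_id, payload):
    # Requires google-cloud-pubsub and configured GCP credentials.
    from google.cloud import pubsub_v1
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, encode_message(payload))
    return future.result()  # blocks until the message id is available
```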

Datalab

Datalab is a powerful interactive tool based on Jupyter for exploring, analysing, and visualising data. We can easily connect it with BigQuery or Cloud Storage, and combine code, documentation, results, and visualisations together in a notebook format.

Machine Learning

Google offers us a hosted Machine Learning service, based on its open source library TensorFlow. It allows us to train models and make predictions in a simple manner and without worrying about the computing resources necessary for this. It integrates naturally with Google Datalab.

Natural Language API

This high-level tool offers us a REST API that provides very powerful text analysis. We can extract information and perform sentiment analysis, content classification, and relational graphs. Internally, Google uses technology based on Deep Learning to offer this service.

Azure


SQL Data Warehouse

SQL Data Warehouse is Microsoft's managed data-warehouse solution aimed at analytics. It offers us an SQL interface for querying the data, and integrates easily with the rest of the Azure components, such as Azure Active Directory for managing security. It can also integrate with other BI tools like Tableau or Qlik, and even with Microsoft Excel.

Data Lake Store

Azure’s Data Lake offers us a hosted service based on Hadoop HDFS in the cloud. We will be able to store structured or unstructured data in its original format. Thanks to its distributed nature we will get high performance to carry out batch processes or queries in real-time. It internally manages the replication, which guarantees the durability of the data.

Stream Analytics

This is Azure's product for analysing data streams in real time using simple SQL scripts. Combined with Event Hubs, it can generate sales figures or system-monitoring alerts on the fly.
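A sketch of a Stream Analytics query, with hypothetical input and output aliases, aggregating sales per product over one-minute tumbling windows:

```sql
-- Hypothetical Event Hub input and output aliases.
SELECT
    ProductId,
    SUM(Amount) AS TotalSales
INTO [output-alerts]
FROM [eventhub-sales] TIMESTAMP BY EventTime
GROUP BY ProductId, TumblingWindow(minute, 1)
```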

HDInsight

A distributed processing service based on open-source technology like Apache Spark or Apache Storm. It allows us to scale the cluster's capacity on demand and thus process terabytes or petabytes of information. Jobs can be developed in .NET or Java.

Cognitive Services

These services encompass a series of high-level APIs that simplify the application of Artificial Intelligence in our systems. We have services geared towards language processing, speech, vision, searching, and knowledge.

Machine Learning

Microsoft offers an interactive workflow-style system for visual analysis with a complete set of data transformations, also usable via API by external applications. For non-expert users it provides an automatic system for generating optimal models.

Conclusion

In summary, we can say that the main cloud providers are including a growing range of tools geared towards Big Data, although without a doubt they are still far from full Big Data platforms like Stratio, Cloudera, or Hortonworks, which cover all the necessities from start to finish in a seamless flow of Big Data ingestion, storage, and processing.

On the other hand, these Big Data platforms always give us the possibility of running on-premises or in the cloud via IaaS, something the cloud tools don't offer. In any case, the cloud tools are a good solution on many occasions, letting us solve concrete problems cost-efficiently and with a reduced time to market.



Professional who has participated in digital projects for different sectors in different parts of Europe: from the development of software for the optimisation of aeronautical structures to projects investigating neuronal interactions in the cerebral cortex, working in multicultural and multidisciplinary teams. Currently a member of Paradigma, dedicated to the technical definition of products in the telecommunications sector.

