When we ask what Big Data is and which roles are associated with it, we find endless definitions that often confuse rather than clarify.
In this post we will not give a formal definition, but one that fits our point of view and our experience in Big Data. Nor will we draw up a long list of profiles; we will focus only on those that play a key role in the Big Data universe.
From the basic definition of Wikipedia:
“Big data, massive data, data intelligence or large-scale data is a concept that refers to data sets so large that traditional data processing applications are not enough to deal with them, and to the procedures used to find repetitive patterns within those data”.
… we extract 2 clear ideas:
- Nowadays, data sets of such immense volume are being generated that the tools that have always served to store and process them have become obsolete.
- The data is used to obtain useful information based on these “repetitive patterns”, which serve to analyze past behaviors and to predict future behaviors.
On the other hand, to get an idea of the immensity of the volume mentioned in the first point, an article published by IDC forecasts that by 2025 the world's total data volume will reach 163 zettabytes (one zettabyte is 1,000,000,000,000 gigabytes).
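To put that figure in perspective, the conversion can be worked out directly. This is a small sketch that assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one zettabyte is 10^21 bytes; the function name is ours and purely illustrative.

```python
# Decimal (SI) units are assumed: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * BYTES_PER_ZB // BYTES_PER_GB


if __name__ == "__main__":
    forecast_zb = 163  # the IDC forecast for 2025 quoted in the text
    gigabytes = zettabytes_to_gigabytes(forecast_zb)
    print(f"{forecast_zb} ZB = {gigabytes:,} GB")
    # prints: 163 ZB = 163,000,000,000,000 GB
```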
That is, on the one hand we have the processing of large volumes of data and on the other the analysis of such data. And that’s it? Is this Big Data? Not so fast!
Big Data is a technological revolution. Because it spans so many areas, it is hard to define: it is many things in general and none in particular.
Its definition is further complicated by the fact that it is an ecosystem in constant evolution. Each year brings new tools, improvements and concepts that increase the complexity of the Big Data world and, with it, the diversity and complexity of its roles.
What roles are played in this world?
Anyone who claims to be an expert in Big Data is like someone who claims to be a computer expert. The next question should be: “An expert, yes, but in what branch?”.
Suppose the person in question simply repeats that they are an expert in Big Data. There are three possibilities: they are a superior being, they are lying to us, or they do not want to explain what they actually do, since saying “I am a Data Scientist” or “I am a Data Engineer” usually provokes a puzzled look followed by “And what is that?”.
The answer to that question is what we will try to develop as briefly and concisely as possible in this article (note that this post may become obsolete as the world of Big Data keeps evolving).
Many roles can be defined: as many as there are people who decide to write an article giving their opinion on the subject.
As part of the development team of Paradigma in the Aura project in Telefónica, we will give our humble opinion trying to break down the roles, based on the two ideas we have drawn at the beginning of the article: the storage/processing of data and its analysis.
Data Analyst (DA)
Focusing first on profiles more oriented to data analysis, the Data Analyst is a profile that came before the Data Scientist. In some cases they are referred to as “Junior Data Scientists”.
They have a fairly generalist role, covering a wide range of functions that include mining, obtaining and/or retrieving data as well as its processing, advanced study and visualization.
The study or advanced analysis of data is done based on algorithms, mathematical and statistical methods. Therefore, this profile mainly requires knowledge of maths and statistics applied to data mining and machine learning.
The latter means that it is also essential to know how to develop software (at least in current projects). Although their specialty is Machine Learning, using data analysis and statistical libraries such as pandas requires in-depth knowledge of how each algorithm works, as well as of the basic functionality of the corresponding language, in this case Python. Another common language for a Data Analyst is R.
In addition to the concepts of Machine Learning and the Python and R languages, Data Analysts stand out for their knowledge in the use of notebooks such as Jupyter, as well as knowledge of the Big Data environment in which they work, such as Spark or Hadoop.
Knowledge of SQL databases and traditional Business Intelligence is also highly valued.
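To make the Data Analyst's mix of SQL and statistics a little more concrete, here is a minimal sketch that uses only Python's standard library (`sqlite3` and `statistics`). The table, the column names and the sample rows are invented for illustration; a real project would more likely query a production database and analyse the result with pandas or R.

```python
import sqlite3
import statistics

# Hypothetical sample data: (customer, amount). Invented for illustration.
ROWS = [
    ("ana", 30.0),
    ("ana", 50.0),
    ("luis", 20.0),
    ("luis", 40.0),
    ("luis", 60.0),
]


def spend_per_customer(rows):
    """Load rows into an in-memory SQLite table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE spend (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO spend VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, COUNT(*), AVG(amount) "
            "FROM spend GROUP BY customer ORDER BY customer"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def describe(values):
    """Return the mean and the sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    for customer, purchases, average in spend_per_customer(ROWS):
        print(customer, purchases, average)
    mean, stdev = describe([amount for _, amount in ROWS])
    print(f"mean={mean:.2f} stdev={stdev:.2f}")
```

The SQL query does the grouping, while the descriptive statistics are computed in Python, which mirrors the split between traditional BI skills and statistical analysis described above.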
Data Scientist
This profile is the “evolution of the Data Analyst”. In many cases they are considered the same profile with a different approach. For us, it is a more specific role and less aligned with the business vision.
Like the DA, it requires knowledge of mathematics, statistics and Machine Learning, programming languages such as R or Python, the use of notebooks and Big Data ecosystems, but what we believe differentiates the Data Scientist is that they are responsible for extracting value from data.
They also obtain, process and visualize data, although with a more focused role in prediction, based on the behaviors learned.
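As a toy illustration of prediction based on learned behaviour, the sketch below fits a straight line to past observations with ordinary least squares and extrapolates one step ahead. It is deliberately far simpler than a real machine learning model, and the data and the names `fit_line` and `predict` are invented for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate a fitted (slope, intercept) model at x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical history: data usage (GB) observed over four months.
    months = [1, 2, 3, 4]
    usage_gb = [10.0, 12.0, 14.0, 16.0]
    model = fit_line(months, usage_gb)
    print("predicted usage for month 5:", predict(model, 5))
```

The "learning" step is `fit_line`, which summarises past behaviour as two numbers; `predict` then applies that summary to a future point.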
If we consider the Data Scientist a more modern version of the Data Analyst, they are more likely to use recent libraries such as TensorFlow for Deep Learning techniques based on neural networks.
Many of their developments are also linked to Artificial Intelligence techniques and natural language processing (NLP). But, once again, these are quite similar profiles, and no technology belongs strictly to one role or the other.
In the case of Data Scientists that use tools such as SAS Enterprise Miner to perform statistical analysis, there is a perception on the part of many that the tool itself does not require programming knowledge, a perception with which we currently disagree.
Although it is true that SAS in many cases provides a much more graphic and visual modeling capacity, it is still required to know how the algorithms behind each operation work, and in many cases, it will also be necessary to know the SAS programming language.
Data Engineer
Turning now to the storage and processing of data, we find the role of Data Engineer. This is our role in the Aura project at Telefónica, which is one of the reasons why we are going to give it a lot of importance.
Perhaps the most relevant is that it provides the Big Data project with a value very different from the one provided by a Data Scientist or Data Analyst.
We know that the latter are the ones that work with the data, but where do they get it from? How does the environment in which they do their analysis work? It is the task of the Data Engineer to prepare the entire ecosystem so that others can obtain their data clean and prepared for analysis.
The Data Engineers are those who design, develop, build, test and maintain the data processing systems in the Big Data project.
They must know how data is modeled and have broad knowledge of SQL databases, which are not excluded from the Big Data world and in many cases are still the origin of the data. The two simply complement each other.
They build and schedule data ingestion (for example, from a relational model to a Spark processing engine). They also run cleaning, validation, data quality and aggregation processes so that the information reaches the Data Scientist as expected, and they configure the Spark cluster (number of nodes and cores per node, GB of RAM) so that the statistical models are executed optimally.
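The cleaning, validation and aggregation steps mentioned above can be sketched in a few lines. This is a minimal, single-machine illustration in plain Python, not a Spark job; the record layout, the validation rules and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a relational source.
RAW = [
    {"user": " Ana ", "country": "es", "amount": "10.5"},
    {"user": "luis", "country": "ES", "amount": "4"},
    {"user": "", "country": "es", "amount": "3"},         # missing user
    {"user": "marta", "country": "pt", "amount": "n/a"},  # unparsable amount
    {"user": "joao", "country": "PT", "amount": "-2"},    # negative amount
    {"user": "rui", "country": "pt", "amount": "7"},
]


def clean(record):
    """Normalise whitespace and casing; return None if the record is invalid."""
    user = str(record.get("user", "")).strip().lower()
    country = str(record.get("country", "")).strip().upper()
    try:
        amount = float(record.get("amount", ""))
    except (TypeError, ValueError):
        return None
    if not user or not country or amount < 0:
        return None
    return {"user": user, "country": country, "amount": amount}


def run_pipeline(raw_records):
    """Clean and validate records, then aggregate amounts per country."""
    totals = defaultdict(float)
    rejected = 0
    for record in raw_records:
        cleaned = clean(record)
        if cleaned is None:
            rejected += 1  # a data quality metric worth monitoring
            continue
        totals[cleaned["country"]] += cleaned["amount"]
    return dict(totals), rejected


if __name__ == "__main__":
    totals, rejected = run_pipeline(RAW)
    print(totals, "rejected:", rejected)
    # prints: {'ES': 14.5, 'PT': 7.0} rejected: 3
```

Counting rejected records instead of silently dropping them is a deliberate choice here: it gives the team a simple data quality signal to track between runs.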
What technologies do they use? A Data Engineer should know Linux and Git, much like an engineer working on any software project; Hadoop and Spark at the environment level; MapReduce at the level of computational models; HDFS for distributed storage; and MongoDB and Cassandra at the level of NoSQL technologies.
In terms of programming languages it is essential to know SQL, since the relational model is still an important part in the generation and query of data.
It is also usually required to know one or two of the following languages: Python for data processing (sometimes through PySpark), Scala as the native language of Spark and, in many cases, Java.
Should a Data Engineer know the models used by the Data Scientist in depth? In principle, they should know what using one model or another implies for the environment, and which architecture is ideal for the models to run on.
In summary, the Data Engineer is in charge of the Big Data infrastructure. How important can this be? According to an article by Todd Goldman, which is based on a Gartner study, only 15% of Big Data projects go into production. It seems clear that basic architectural work is being overlooked.
This is key to understanding why the remaining 85% do not reach production. The slowness with which the data is loaded, the failure to load it automatically and incrementally, the inability to query it and the lack of agility to migrate from the testing environment to the production environment are problems that the inclusion of more Data Engineers would help solve.
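To illustrate what loading data "incrementally" can mean in practice, here is a minimal sketch of a watermark-based load: each run copies only the rows that changed since the previous run. The row layout (`id`, `updated_at`) and the in-memory dictionary standing in for the target store are assumptions made for the example, not a description of any particular tool.

```python
def incremental_load(source_rows, target, watermark):
    """Copy only rows newer than `watermark` into `target`.

    `target` is a dict keyed by row id, so reloading a row overwrites it
    (an upsert). Returns the new watermark to persist for the next run.
    """
    new_rows = [row for row in source_rows if row["updated_at"] > watermark]
    for row in sorted(new_rows, key=lambda r: r["updated_at"]):
        target[row["id"]] = row
        watermark = row["updated_at"]
    return watermark


if __name__ == "__main__":
    target = {}

    # First run: everything is new.
    source = [
        {"id": 1, "value": "a", "updated_at": 1},
        {"id": 2, "value": "b", "updated_at": 2},
    ]
    watermark = incremental_load(source, target, watermark=0)
    print(len(target), watermark)  # prints: 2 2

    # Second run: one new row and one updated row; unchanged rows are skipped.
    source = [
        {"id": 1, "value": "a2", "updated_at": 4},
        {"id": 2, "value": "b", "updated_at": 2},
        {"id": 3, "value": "c", "updated_at": 3},
    ]
    watermark = incremental_load(source, target, watermark)
    print(len(target), watermark)  # prints: 3 4
```

Because the function takes and returns the watermark, it can be scheduled to run automatically, with each run picking up exactly where the previous one stopped.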
The Data Engineer plays a key role when it comes to converting a Big Data PoC into a real and tangible project. That is, from prototype to production.
At this point many may wonder what a Data Architect would be then. According to our point of view, a Data Architect is a Data Engineer with a more global vision, and more oriented to the integration, centralization and maintenance of all data sources.
We are aware that we may have left out some profiles that someone considers important: the MIS Reporting Executive, the Business Analyst, the statistician, the Machine Learning Engineer, or even the Data Translator.
There are also traditional profiles such as the Oracle DBA, the Teradata Business Analyst or the “All-terrain Java dev” that have been recycled and also have their function here. But with this article we have tried to talk more about the roles that are played in the world of Big Data and not profiles or certifications.