Paradigma promoted the Big Data Spain conference that took place on Friday November 16th at the ETSIT-UPM in Madrid. I attended the conference along with other colleagues from Paradigma. These are the main things I learned from the conference.
You do not need Terabytes of data to have a Big Data problem
There is no single definition of “Big Data”. It means different things to different people. In a nutshell, Big Data is a term referring to technologies and approaches for making sense of lots of data. However, “lots of data” could mean large amounts of stored data, or smaller amounts that must be processed very fast. “Large” here is anything from Gigabytes to Exabytes.
Jordan Tigani, from Google, said at the conference that Big Data is not only about the size of the data, but about having problems processing it. In some cases data arrives faster than it can be processed, or its size is growing very rapidly. In other cases, the data format, the lack of structure or the data architecture makes the system hard to scale.
Big Data is more about architecture than volume
Big Data involves the analysis of data and, crucially, the storage and management of large scalable systems. To scale up, a system must be designed so that adding more machines does not require any further one-off investments. In other words, its costs should grow linearly with its size: if you multiply your infrastructure by ten, your costs should increase tenfold at most, not more. This is easy to say but very hard to do, and it requires a lot of planning.
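To make the idea of linear cost concrete, here is a small arithmetic sketch in Python. Both cost models and all the figures are hypothetical and chosen only for illustration: the first charges a fixed price per machine, while the second adds an assumed coordination overhead that grows with the square of the cluster size.

```python
def linear_cost(machines, unit_cost):
    """Cost grows in direct proportion to the number of machines."""
    return machines * unit_cost


def superlinear_cost(machines, unit_cost, overhead):
    """Hypothetical model: an extra overhead that grows with machines squared."""
    return machines * unit_cost + overhead * machines ** 2


def growth_factor(cost_fn, base_machines, scale, **params):
    """How many times the cost grows when the infrastructure is multiplied by `scale`."""
    return cost_fn(base_machines * scale, **params) / cost_fn(base_machines, **params)


# Multiply a 10-machine cluster by ten under each (made-up) cost model.
linear_growth = growth_factor(linear_cost, 10, 10, unit_cost=1000)
superlinear_growth = growth_factor(superlinear_cost, 10, 10, unit_cost=1000, overhead=5)

print(f"linear model: costs grow {linear_growth:.1f}x")
print(f"superlinear model: costs grow {superlinear_growth:.1f}x")
```

In the first model a tenfold increase in machines costs exactly ten times more. In the second, the same increase costs more than ten times as much, which is the situation a scalable architecture tries to avoid.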
Big Data was not invented yesterday
Marketing specialists, industries and governments have been collecting data about individuals for decades. Aggregating data of all kinds is not new, as Jon Bruner, at O’Reilly Media, explained in his brilliant talk at the conference. He illustrated his claims with examples such as the traffic lights system of Los Angeles. Thus, Big Data does not necessarily mean “software Big Data”. Then, why does it sound like a new software concept?
For starters, hardware has become significantly cheaper in recent years, meaning more computing power and more storage capacity for less money. Therefore, processing Terabytes of data is not only possible, but affordable.
More software means more data
The other important change is that more data are produced now, because there is more software in our lives. Paraphrasing Bruner, the various social networks produce a lot of information that can also be processed, and the new generation mobile phones are full of sensors that are constantly producing data. Your mobile phone knows where you are, with whom you have talked and maybe, it also knows when you wake up or when you go to bed. As Brendan McAdams, from MongoDB, pointed out, “software eats data and excretes data.” More data produced needs more software to process it, which produces more data, and so on in an infinite cycle.
In summary, Big Data seems like something new because more data is being produced and the technology to analyze and store it has become affordable.
Another cause for the recent “fame” of Big Data is the emergence of the MapReduce paradigm. MapReduce applies the principle of “divide and conquer”. This implies preparing and “chunking” your data so that it can be processed in parallel. When you use the MapReduce paradigm, you first divide the processing into independent tasks (map) and then combine all the partial results in a final stage (reduce). With this strategy, many small commodity computers can process humongous quantities of data without the need for a High Performance Computer.
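As a rough illustration of the map and reduce stages, here is a minimal word-count sketch in plain Python. It runs on a single machine and is not how Hadoop or any other framework implements MapReduce. The function names and sample lines are made up for the example. The point is that each chunk is processed independently, so the map calls could in principle run on different machines.

```python
from collections import Counter
from functools import reduce


def chunk(lines, size):
    """Split the input into independent chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(lines):
    """Map step: count the words of one chunk without looking at any other chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2):
    """Run the map step over every chunk, then fold the partial counts together."""
    partials = map(map_chunk, chunk(lines, chunk_size))
    return reduce(reduce_counts, partials, Counter())


lines = ["big data is big", "data is data", "software eats data"]
result = word_count(lines)
print(result.most_common(3))
```

Because `map_chunk` never shares state between chunks, the built-in `map` could be swapped for a parallel or distributed equivalent without changing the reduce step.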
Big Data is not just MapReduce
Before attending the event, I thought that every Big Data problem could be addressed with any NoSQL database and some MapReduce implementation, such as Hadoop. What most speakers said is that Big Data requires using the right tool for each problem, just as in any other engineering area. Jonathan Ellis, from DataStax, explained that no database is suited for every task. Not everything can or should be solved using MapReduce. It is great for taking slow batch tasks and making them much faster. However, for some tasks faster is not enough. Many problems require almost real-time solutions, and for those MapReduce has too much latency.
Use memory, not disk
The caveats of using just MapReduce were explained by many speakers, for example by Don Rochette (CTO of AppFirst). He said that “Tape is dead. Disk is tape”, which means that Big Data processing should limit writing to disk. An issue with MapReduce is that it needs to write the result of each step to disk: first map, write to disk, then reduce, write to disk. Moreover, if you want to do additional processing you are forced to add more complete MapReduce cycles, with all their disk writes.
For real-time Big Data applications, writing to disk should be used very sparingly and only for “checkpoints”, and not every step can be a checkpoint. The major actors in Big Data are now using MapReduce only for those tasks that are well suited to it. A good example is that in the next version of Hadoop, MapReduce will be only one of multiple data-processing strategies that can be applied when using Hadoop.
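The difference between writing every step to disk and writing only occasional checkpoints can be sketched as follows. This is a toy, single-process illustration, not the design of any particular framework. The `run_pipeline` function, its `checkpoint_every` parameter and the sample steps are all hypothetical.

```python
import json
import os
import tempfile


def run_pipeline(records, steps, checkpoint_path, checkpoint_every=2):
    """Apply each step in memory, writing to disk only every `checkpoint_every` steps."""
    data = list(records)
    disk_writes = 0
    for number, step in enumerate(steps, start=1):
        data = [step(value) for value in data]  # intermediate result stays in memory
        if number % checkpoint_every == 0:
            with open(checkpoint_path, "w") as handle:
                json.dump({"step": number, "data": data}, handle)
            disk_writes += 1
    return data, disk_writes


# Four made-up processing steps applied to three made-up records.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    result, disk_writes = run_pipeline([1, 2, 3], steps, path)
    with open(path) as handle:
        last_checkpoint = json.load(handle)

print(result, disk_writes, last_checkpoint["step"])
```

With four steps and a checkpoint every two steps, the data is written to disk twice instead of four times. A failure would mean recomputing at most the steps since the last checkpoint, which is the trade-off behind using checkpoints sparingly.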
Complexity is the next Big Challenge
As more tools are employed to tackle Big Data issues and the systems grow more and more complex, it is crucial to keep them scalable. The foremost objective of Big Data is to make costs linear with volume. Alan Gates, from Hortonworks, explained that, before Big Data, each component was deployed on its own machine. For instance, the database would run on a dedicated MySQL server, the web server on another machine and maybe some applications on a third. With Big Data, each machine hosts many components.
Therefore the next big challenge is to make these very complex systems easy to scale, making them easy to manage and easy to deploy and maintain. A company should be able to multiply the number of systems it uses, and do it with relatively low effort in time and money.
The cloud makes things cheaper, not simpler
In this regard, “the cloud” will clearly help a great deal, as few companies can own every computer they need. Nevertheless, deploying a complex system within a vendor platform makes it even more complex. Also, as Nati Shalom (CTO of GigaSpaces) explained, a cloud system is just another tool, so it is important to choose the right cloud for your needs, and to prepare your system so that it can be moved to another cloud if needed, or even run on more than one cloud.
There is no “magical solution” that can be applied to every problem. Therefore the challenges are: first, to define a methodology to cope with Big Data issues; second, to identify the best-fitting Big Data tools for each issue; and last but not least, to create a way to manage complex systems that scale out while maintaining a linear cost.
This is just the beginning
The conclusion is that Big Data is where software engineering is headed: the technologies, philosophies and techniques that make everything scalable and easily maintainable. There is still a lot to develop in Big Data. Not everything has been invented, and it is exciting to see that you can take part in this endeavor without being Amazon or Google.
I really do think that Big Data will evolve greatly in the coming years, and I believe that Big Data architectures will be used in every aspect of software engineering.