Among the new storage systems appearing within the Big Data universe, Cassandra is one of the most interesting and significant. Cassandra is defined as a distributed and massively scalable NoSQL database, and this, from our point of view, is its greatest virtue: the capacity to scale linearly.
Additionally, Cassandra introduces very interesting concepts, such as multi-data-center support and peer-to-peer communication between its nodes. In this article, we will take a closer look at these and other characteristics that make Cassandra so special.
History and origins
Cassandra’s original development started at Facebook, where it was designed to power the inbox search feature. In 2008, it was released as an open-source project; in February 2010, it became a top-level Apache Software Foundation project. It is inspired and influenced by the 2007 Amazon Dynamo and 2006 Google Bigtable papers. Nowadays, it is developed by the Apache Cassandra community, with DataStax among its main contributors.
Its name is inspired by the priestess Cassandra of Greek mythology, who had the gift of prophecy and predicted the Trojan Horse deception.
Architecture and characteristics
Cassandra offers partition tolerance and availability at the expense of consistency, in the terms of the CAP theorem. The consistency level is tunable to our needs, down to the level of an individual query.
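To make the idea of tunable consistency more concrete, the following sketch works through the usual rule of thumb: a read is guaranteed to see the latest acknowledged write when the replicas contacted for the read and for the write overlap, that is, when R + W > RF. ONE, QUORUM and ALL are real Cassandra consistency levels, but the function names below are hypothetical and the code is a simplified model of the arithmetic, not Cassandra's implementation.

```python
def quorum(replication_factor):
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Replicas that must answer for a given consistency level (simplified)."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError("level not covered by this sketch: " + level)


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when read and write replica sets must overlap (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    for read, write in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(read, write, reads_see_latest_write(read, write, 3))
```

With a replication factor of 3, reading and writing at QUORUM touches two replicas each, so the two sets always share at least one node; reading and writing at ONE trades that guarantee for lower latency.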
It is distributed, which means that the information is scattered across the various nodes of the cluster. Furthermore, it offers high availability, so that if any of the nodes fails, the service doesn’t drop or degrade.
It is linearly scalable, which means that the throughput increases linearly as we add nodes. For example: if we can support 100,000 operations per second with 2 nodes, we will support 200,000 with 4. This gives our systems great predictability.
It is horizontally scalable, which means we can scale out our system by adding new nodes built on low-cost commodity hardware.
It implements a peer-to-peer architecture, which eliminates single points of failure and dispenses with the master-slave patterns of other storage systems. This way, any of the nodes can take on the role of query coordinator. The driver decides which node takes on the role of coordinator for each query.
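As an illustration of how a driver might spread the coordinator role over the nodes, here is a minimal round-robin sketch. The class and method names are hypothetical, and real drivers offer more elaborate load-balancing policies; this only shows that, in a peer-to-peer cluster, every node is a valid candidate.

```python
from itertools import cycle


class RoundRobinCoordinatorPicker:
    """Hands out nodes in turn; any node may coordinate a query."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._ring = cycle(nodes)

    def next_coordinator(self):
        """Return the node that will coordinate the next query."""
        return next(self._ring)


if __name__ == "__main__":
    picker = RoundRobinCoordinatorPicker(["node-a", "node-b", "node-c"])
    for _ in range(4):
        print(picker.next_coordinator())
```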
The data is distributed across the cluster based on a single token, calculated for each row by a hash function. The nodes evenly split the range of tokens, which goes from -2^63 to 2^63 - 1; the node whose range contains a row's token is the primary node for that row. Internally, Cassandra replicates the data across the nodes according to the policy we define; for example, we can set the replication factor. Additionally, it supports the concept of data center to logically group the nodes and keep the data closer to the user.
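The token mechanism can be sketched in a few lines. The example below is a simplified model under stated assumptions: it uses MD5 from the standard library as a stand-in for the hash function (Cassandra's default partitioner is based on Murmur3), it splits the token range evenly between the nodes, and it places the extra replicas on the next nodes around the ring. All function names are hypothetical.

```python
import hashlib
from bisect import bisect_left

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(partition_key):
    """Map a key to a signed 64-bit token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def even_upper_bounds(node_count):
    """Upper token bound of each node when the range is split evenly."""
    span = 2**64  # number of tokens between MIN_TOKEN and MAX_TOKEN
    return [MIN_TOKEN + (span * (i + 1)) // node_count - 1
            for i in range(node_count)]


def replicas_for(partition_key, nodes, replication_factor):
    """Primary node first, then the next nodes around the ring."""
    bounds = even_upper_bounds(len(nodes))
    primary = bisect_left(bounds, token_for(partition_key))
    copies = min(replication_factor, len(nodes))
    return [nodes[(primary + i) % len(nodes)] for i in range(copies)]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d"]
    for key in ["user:1", "user:2", "user:3"]:
        print(key, token_for(key), replicas_for(key, cluster, 3))
```

The same key always produces the same token, so any node can work out where a row lives without consulting a central directory.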
Cassandra Query Language (CQL) is the language used for accessing data in Cassandra, derived from SQL. In Cassandra, the data is denormalized, and therefore the concept of joins or subqueries doesn’t exist.
We can interact with Cassandra using CQL through the CQL shell, cqlsh. We can also use graphical tools like DevCenter, or the supported drivers for multiple programming languages.
It also combines properties of a key-value database with those of a column-oriented one. As we can see in the following diagram, the information is organized in such a way that each row has a single key and a series of column-name/value pairs. It is important to keep these characteristics in mind when we design our data model.
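One way to picture this layout is as a map of maps: the outer key is the row key, and each row holds its own set of column-name/value pairs. The sketch below is a purely conceptual model in plain Python, with hypothetical names; it does not reflect how Cassandra stores data on disk.

```python
class WideRowStore:
    """Toy model: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Set one column of one row, creating the row if needed."""
        self._rows.setdefault(row_key, {})[column] = value

    def get_row(self, row_key):
        """Return the row's columns, ordered by column name."""
        return dict(sorted(self._rows.get(row_key, {}).items()))


if __name__ == "__main__":
    store = WideRowStore()
    store.put("user:1", "name", "Ana")
    store.put("user:1", "email", "ana@example.com")
    store.put("user:2", "name", "Luis")  # rows need not share columns
    print(store.get_row("user:1"))
    print(store.get_row("user:2"))
```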
When we design our model, we must be guided by the data-access patterns. We must first analyze the queries we want to implement in our system; this way, we will be able to design an efficient model that makes the most of Cassandra’s advantages.
It is also important to define the partition key of our data carefully, because Cassandra uses this key to distribute the data across the cluster. If we want to make the most of our cluster, we must plan the distribution of the data to avoid bottlenecks. It is also advisable, given our query pattern, to minimize the number of partitions we must access for any single read.
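A quick way to reason about a candidate partition key is to check how evenly it would spread the rows. The sketch below compares two candidate keys over made-up sample data; the field names, the data and the function name are all hypothetical, and the measure (share of rows in the largest partition) is just one simple indicator of a potential hotspot.

```python
from collections import Counter


def largest_partition_share(rows, partition_key):
    """Fraction of all rows that land in the most populated partition."""
    counts = Counter(row[partition_key] for row in rows)
    return max(counts.values()) / len(rows)


if __name__ == "__main__":
    # Made-up events: most come from one country, users are evenly spread.
    events = [
        {"country": "ES" if i % 10 else "PT", "user_id": i % 100}
        for i in range(1000)
    ]
    print("by country:", largest_partition_share(events, "country"))
    print("by user_id:", largest_partition_share(events, "user_id"))
```

A key that concentrates most rows in one partition sends most of the traffic to the same replicas, while a higher-cardinality key spreads the load; the right choice still depends on which queries must be served from a single partition.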
Cassandra is a brilliant solution for many of the use cases we can find in the world of Big Data. Nevertheless, it is not well suited to hosting a conventional data warehouse.
Ideally, we want to have from the start a clear understanding of the use and type of queries needed, so that we can design the database coherently. This way, we will be able to handle large data volumes and make the most of this powerful distributed database.