Back in 2018, I had the opportunity to work on a project that required deploying an AWS Redshift cluster. At that time, Redshift was no longer a new service (it had been available since February 2013), but due to its cost, it wasn’t the most sought-after service by clients.

For those unfamiliar, AWS Redshift is AWS's data warehouse solution. As a fully managed service, it frees users from administrative tasks such as hardware provisioning, software updates, node and disk monitoring, failure recovery, and backups.

Looking back, I remember finding Redshift inflexible and expensive:

Over the years, I kept seeing AWS announcements about Redshift improvements:

Recently, I had the chance to explore these new features and improvements firsthand. Many of them were crucial for the service's survival, given the fierce competition from:

As Redshift continues to evolve, it remains one of the top choices for cloud-based data warehousing. In the following sections, we’ll take a closer look at how these enhancements make it a strong contender in the market.

What Has Changed? Key Innovations in Redshift

To set the stage, let's first note that the architecture of Redshift (excluding its Serverless flavor) has remained largely unchanged since my first experience with the service. As the documentation shows, it is still based on the cluster model, built around two main components:

  1. Leader node: receives connections and queries from client applications, parses and plans them, and coordinates their execution across the compute nodes.
  2. Compute nodes: execute the query segments in parallel and hold the data, with RA3 nodes caching it locally while persisting it in Redshift Managed Storage on S3.

While this fundamental structure remains unchanged, AWS has introduced significant enhancements to Redshift’s ecosystem, making it more efficient, flexible, and cost-effective. Let’s explore some of these key innovations.

Architecture of an Amazon Redshift RA3 cluster, from top to bottom: clients connect to the leader node, which distributes work across the compute nodes (compute node 1, 2 and 3), while the data is persisted in Amazon Redshift Managed Storage on Amazon S3 (exabyte-scale data storage).

Query Editor V2: A Usable Console at Last!

At first glance, it might seem that the core of Redshift hasn't changed much. But has it really? Let's take a look at how we interact with data now, especially with the introduction of Redshift Query Editor V2.

To access it, simply click on the "Query Data" button in your Redshift cluster and select Query Editor V2, as shown in the following image:

Editor v2

This interface is much more intuitive and allows us to execute commands, such as creating new tables or loading data, in a visual way. You can simply select the data you want to import, define its type, and configure other parameters, greatly simplifying these operations.

Query editor v2

As I mentioned earlier, one of the main issues with Redshift was its cost. Over time, various benchmarks have actually favored Redshift on cost against some of its competitors, but the real game-changer has been the introduction of Redshift Serverless and the ability to put clusters on standby (pause and resume): depending on the use case (particularly with Redshift Serverless), these options are considerably cheaper than keeping a cluster running at all times.

On the other hand, AWS Redshift’s strongest advantage has always been its native integration with other AWS services, and that remains true today.

S3, the Backbone of Storage in AWS

S3 has always been the foundation of storage for data platforms in AWS, and with the COPY command, loading data into Redshift is fast and straightforward. The auto-copy functionality simplifies this even further: it removes the need to build a pipeline that continuously loads data from S3 into Redshift tables, since auto-copy jobs monitor the S3 location and load new files automatically. In other words, auto-copy not only reduces technical complexity but also eliminates the cost of developing and maintaining manual pipelines.
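To make this concrete, here is a minimal sketch of both approaches; the table, bucket, and IAM role below are placeholders rather than values from a real setup:

-- One-off load from S3 into an existing table
COPY sales
FROM 's3://my-data-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Auto-copy: the COPY JOB keeps watching the prefix and loads new files as they land
COPY sales
FROM 's3://my-data-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1
JOB CREATE sales_auto_copy_job
AUTO ON;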

Moreover, if we want to query and analyze tables stored in S3 from Redshift without loading them into the cluster (external tables), we can use Redshift Spectrum with commands like CREATE EXTERNAL SCHEMA, CREATE EXTERNAL TABLE, and CREATE EXTERNAL VIEW, as demonstrated in this comprehensive tutorial.
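As a rough sketch of that flow, loosely based on the Spectrum getting-started tutorial (the schema, columns, role, and S3 location are illustrative):

-- Register an external schema backed by the Glue Data Catalog
CREATE EXTERNAL SCHEMA myspectrum_schema
FROM DATA CATALOG
DATABASE 'myspectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over files that stay in S3
CREATE EXTERNAL TABLE myspectrum_schema.sales(
    salesid INTEGER,
    qtysold SMALLINT,
    pricepaid DECIMAL(8,2),
    saletime TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-data-bucket/spectrum/sales/';

-- Query it like any local table, without loading it into the cluster
SELECT COUNT(*) FROM myspectrum_schema.sales;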

Finally, security integrations with S3 make managing permissions easier, without requiring complex bucket policies or one-off AWS IAM roles: identity provider (IdP) users and groups can be mapped directly to access permissions using Amazon S3 Access Grants.

Easy and Fast Access to the Glue Catalog!

Just like with Redshift Spectrum, Redshift integrates fully with the Glue Catalog, allowing users to browse its databases and tables directly from Redshift. In this simple example, we can see how to grant a named user access to the Glue Catalog from the Redshift editor and explore its tables directly within the catalog, using straightforward commands like the ones shown below.
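The grant itself is a single statement; analyst_user is a hypothetical IAM identity mapped into Redshift:

-- Allow the user to query the Glue Data Catalog through the special awsdatacatalog database
GRANT USAGE ON DATABASE awsdatacatalog TO "IAM:analyst_user";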

-- List the schemas (Glue databases) exposed through the Data Catalog
SHOW SCHEMAS FROM DATABASE awsdatacatalog;

-- Inspect the columns of a catalog table
SHOW COLUMNS FROM TABLE "awsdatacatalog"."myspectrum-db"."sales";

-- Query the table directly from Redshift
SELECT * FROM "awsdatacatalog"."myspectrum-db"."sales";

Zero ETL with DynamoDB, RDS, or Aurora: The Swiss Army Knife of Serverless ETLs

This is where I’ve noticed the most significant change. Back in the day, using Redshift meant loading data from S3 with COPY, and there weren’t many other options. However, thanks to AWS’s Zero ETL concept (which aims to minimize the need for traditional Extract, Transform, and Load pipelines), data loading processes from various AWS services into Redshift have been greatly simplified.

Now, Zero ETL integrations are available for accessing data from DynamoDB, RDS, and Aurora in Redshift. This allows organizations to leverage data from multiple services without additional processing, accelerating data-driven decision-making.
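Once an integration has been created (from the console or the CLI), the consumer side in Redshift boils down to a single statement; the database name and integration ID below are placeholders, and depending on the source engine a DATABASE clause pointing at the source database may also be required:

-- Expose the replicated data as a local database in Redshift
CREATE DATABASE aurora_sales_db FROM INTEGRATION '12ab34cd-5678-90ef-aaaa-bbbbccccdddd';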

But What About Data Streaming?

When it comes to streaming data ingestion in AWS Redshift, direct ingestion from streaming sources like Amazon Kinesis Data Streams (KDS) and Amazon Managed Streaming for Apache Kafka (MSK) has been available for a while.
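With Kinesis, for example, ingestion is configured entirely in SQL: you map the stream through an external schema and expose it as a materialized view (the stream and role names below are illustrative):

-- External schema pointing at Kinesis Data Streams
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyStreamingRole';

-- Materialized view that ingests records from the stream as they arrive
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."my-click-stream";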

Recently, AWS also announced its integration with Confluent Cloud and Apache Kafka, expanding its capabilities for real-time data ingestion into Redshift.

Data Exchange: How to Easily Share Your Datasets

The data market is increasingly demanding secure and controlled access to share datasets with third parties. To address this need, AWS introduced AWS Data Exchange in 2019, allowing users to discover, subscribe to, and query third-party datasets while combining them with their own data for more comprehensive analysis.

Naturally, AWS Redshift integrates seamlessly with this service, enabling federated access to datasets directly from the AWS Data Exchange catalog.
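Under the hood this builds on Redshift data sharing: once you subscribe to a product on AWS Data Exchange, the provider's datashare can be mounted as a local database. A hedged sketch, with hypothetical names and identifiers:

-- See which datashares (including AWS Data Exchange ones) are available to this consumer
SHOW DATASHARES;

-- Expose a subscribed datashare as a queryable database
CREATE DATABASE exchange_db FROM DATASHARE marketdata_share
OF ACCOUNT '111122223333' NAMESPACE 'aa11bb22-cc33-dd44-ee55-ff6677889900';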

Redshift and SageMaker: The Strategic Integration

In the era of Artificial Intelligence, Amazon SageMaker has become AWS's flagship service for building, testing, training, and deploying Machine Learning models in a scalable, production-ready way. All of these models need data for training and validation, and thanks to Redshift's native integration with SageMaker, it is possible to work directly with data from the Redshift data warehouse using SQL.
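This integration surfaces as Redshift ML: a CREATE MODEL statement hands the training data off to SageMaker (Autopilot by default) and registers a SQL prediction function in the cluster. The table, columns, role, and bucket below are hypothetical:

-- Train a model on warehouse data; SageMaker handles training behind the scenes
CREATE MODEL customer_churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-artifacts');

-- Once trained, the prediction function can be used like any other SQL function
SELECT customer_id,
       predict_customer_churn(age, tenure_months, monthly_spend) AS churn_prediction
FROM customer_activity;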

Conclusions

Over the years, AWS has continuously invested in Redshift as its flagship data warehouse solution, consistently updating, improving, and expanding its capabilities in terms of security, integration with other services, cost-efficiency, and performance.

And it’s not just AWS: other key players in the data architecture world recognize its value. That’s why platforms like Denodo, Starburst, Trino, and Presto offer connectors for Redshift, acknowledging its position among the top data warehouses.

If you haven’t explored Redshift or its latest features yet, now is the perfect time to discover how it can benefit your business, and at Paradigma, we can help you make the most of it.
