AWS Lake Formation is a service launched by AWS on August 8, 2019. The goal of this service is to simplify data governance and security in a centralized manner, enabling the secure sharing of data across various AWS storage, analytics, and ML services.

Figure: AWS Lake Formation structure

The core capability of this service, in summary, is to assign permissions to available resources in AWS Glue Data Catalog via roles, using a format similar to traditional database "grants". These permissions can be set at the database, table, column, row, and even cell level, allowing for a high degree of granularity. The AWS Glue Data Catalog centralizes data administration and management, while also acting as a connector to other catalogs, such as a Hive metastore. Additionally, it integrates seamlessly with AWS services like Amazon Athena, AWS Glue, Amazon Redshift Spectrum, and Amazon EMR, all of which can use Lake Formation to access S3-registered data.
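To make the "grant" analogy concrete, here is a minimal sketch of a column-level grant expressed as the arguments to the boto3 `grant_permissions` call on the Lake Formation client. The role ARN, database, table, and column names are hypothetical, and the helper `column_grant_request` is ours, not part of any AWS library.

```python
from __future__ import annotations

from typing import Any


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: list[str]) -> dict[str, Any]:
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical names, for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/analyst",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)

# With AWS credentials configured, the grant would be applied with:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Building the request separately from the API call keeps the grant definition easy to review and version before it touches a real catalog.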

Since AWS Lake Formation is a native AWS service, it integrates seamlessly with AWS IAM for user authentication and role-based data access. However, it also supports integration with other governance tools and engines, such as Starburst, Dremio, Privacera, and Collibra.

Moreover, thanks to its cross-account and cross-region capabilities, data can be securely shared even when distributed across multiple AWS accounts, organizations, and regions. This feature facilitates the creation of modern data architectures, such as Data Mesh, without requiring all data to be centralized in a single location—reducing the need for excessive data movement and improving efficiency.

Figure: AWS, Starburst, Privacera, Dremio, Collibra

Another highly useful feature of Lake Formation is its ability to scale permission management. With the ever-growing volume of data, data products, and the increasing need to securely and systematically share those products, large organizations require a secure way to scale the assignment of these permissions. Thanks to Lake Formation and tag-based access control, administrators can assign LF-Tags to data catalog resources and grant permissions to users or applications based on those LF-Tags. These tags can be dynamically managed using services like AWS Glue Sensitive Data Detection, which identifies sensitive data and tags specific resources in the data catalog accordingly.
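The tag-based flow described above can be sketched as three boto3 Lake Formation calls: define an LF-Tag, attach it to a table, and grant permissions on every table matching a tag expression. The tag key, tag value, resource names, and role ARN below are hypothetical, and the two helper functions are our own illustration rather than an AWS-provided API.

```python
from __future__ import annotations

from typing import Any


def lf_tag_requests(tag_key: str, tag_value: str, database: str, table: str,
                    principal_arn: str) -> dict[str, dict[str, Any]]:
    """Map each Lake Formation operation name to its keyword arguments."""
    tag = {"TagKey": tag_key, "TagValues": [tag_value]}
    return {
        # 1. Define the LF-Tag and its allowed values.
        "create_lf_tag": {"TagKey": tag_key, "TagValues": [tag_value]},
        # 2. Attach the tag to a catalog table.
        "add_lf_tags_to_resource": {
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "LFTags": [tag],
        },
        # 3. Grant on every table matching the tag expression.
        "grant_permissions": {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {
                "LFTagPolicy": {"ResourceType": "TABLE", "Expression": [tag]}
            },
            "Permissions": ["SELECT", "DESCRIBE"],
        },
    }


def apply_requests(client: Any, requests: dict[str, dict[str, Any]]) -> None:
    """Call each operation, in insertion order, on a Lake Formation client."""
    for operation, kwargs in requests.items():
        getattr(client, operation)(**kwargs)


# Hypothetical values, for illustration only.
requests = lf_tag_requests(
    "confidentiality", "public", "sales_db", "orders",
    "arn:aws:iam::111122223333:role/analyst",
)

# With AWS credentials configured:
#   import boto3
#   apply_requests(boto3.client("lakeformation"), requests)
```

Because the grant targets a tag expression rather than a named table, newly tagged tables become accessible without issuing another grant.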

Additionally, AWS Lake Formation integrates with AWS Data Exchange, allowing users to discover, subscribe to, and leverage third-party data in the cloud. It also enables organizations to share their own data products with external businesses without needing to move or duplicate them.

One of Lake Formation’s goals is to provide users with better visibility into the data stored within their AWS Glue catalog. By combining both services, users can perform searches based on name, content, sensitive data, or any other custom tags assigned to catalog resources.

Finally, Lake Formation offers audit log access through AWS CloudTrail. This significantly simplifies data access auditing, as it allows administrators to quickly see which users or roles have attempted to access specific data and at what time.
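As a rough illustration of such an audit, the sketch below counts Lake Formation calls per user and outcome from records shaped like the `Events` list returned by the boto3 CloudTrail `lookup_events` call. The sample records are fabricated for the example, and the helper `summarize_access` is our own; which events appear in practice depends on how CloudTrail is configured.

```python
import json
from collections import Counter


def summarize_access(events):
    """Count Lake Formation calls per (user, event name, outcome)."""
    summary = Counter()
    for event in events:
        if event.get("EventSource") != "lakeformation.amazonaws.com":
            continue
        # CloudTrailEvent holds the full record as a JSON string;
        # an "errorCode" field is present when the call failed.
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        outcome = detail.get("errorCode", "Success")
        key = (event.get("Username", "unknown"), event["EventName"], outcome)
        summary[key] += 1
    return summary


# Fabricated sample records, for illustration only.
sample_events = [
    {"EventSource": "lakeformation.amazonaws.com", "EventName": "GetDataAccess",
     "Username": "analyst", "CloudTrailEvent": json.dumps({})},
    {"EventSource": "lakeformation.amazonaws.com", "EventName": "GetDataAccess",
     "Username": "intern",
     "CloudTrailEvent": json.dumps({"errorCode": "AccessDeniedException"})},
    {"EventSource": "s3.amazonaws.com", "EventName": "ListBuckets",
     "Username": "analyst", "CloudTrailEvent": json.dumps({})},
]
summary = summarize_access(sample_events)

# Real records could be fetched with, for example:
#   import boto3
#   events = boto3.client("cloudtrail").lookup_events(
#       LookupAttributes=[{"AttributeKey": "EventSource",
#                          "AttributeValue": "lakeformation.amazonaws.com"}]
#   )["Events"]
```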

At this point, it's clear that Lake Formation is a critical component for any AWS data platform requiring access control segmentation within the AWS Glue data catalog. But how effective is Lake Formation? Does it truly deliver on its promises? And does it make sense for data access to be exclusively managed through Glue, given its associated costs? Next, we’ll dive deeper into its strengths and weaknesses based on real-world experience.

The Good, the Bad, and What We’d Love to See Next in Lake Formation

After taking this technology to production, we believe that Lake Formation has great potential, but there’s still plenty of room for improvement in certain areas. In this article, we want to share some of the challenges we encountered while implementing this data segregation model in large enterprises with over 100 AWS accounts and more than 100 TB in S3.

For each challenge, we will explain alternative solutions and features we’d love to see in future versions. With a bit of luck, AWS may have already rolled out native solutions by the time you read this article. If not, we invite you to explore the alternative approaches we’ve tested.

Alternative Ways to Consume Data

As we’ve mentioned, Lake Formation allows access control over data through services like Athena and Glue. However, some high-intensity workflows—such as training machine learning models—are better served with direct access to S3 or by combining Glue Catalog queries with direct S3 access. These two options tend to be faster and more cost-effective than processing via Athena.
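As an illustration of combining a Glue Catalog lookup with direct S3 access, the sketch below resolves a table's storage location with the boto3 Glue `get_table` call and splits it into bucket and prefix, so a training job can read the files itself. The helper names are ours, and the database and table names in the usage comment are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_location(location):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 location: {location!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def table_location(glue_client, database, table):
    """Look up a table in the Glue Data Catalog and return (bucket, prefix)."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    return split_s3_location(response["Table"]["StorageDescriptor"]["Location"])


# With AWS credentials configured (names are hypothetical):
#   import boto3
#   bucket, prefix = table_location(boto3.client("glue"), "sales_db", "orders")
#   # ...then read the objects under s3://bucket/prefix directly.
```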

Currently, Lake Formation does not provide an easy way to enforce segregation for direct access to S3. In this case, we had to complement Lake Formation with alternative solutions, including:

1 ABAC (Attribute-based access control)

ABAC is a method that involves tagging IAM principals (users and roles; IAM groups cannot be tagged) with IAM tags, which act as keys. Access policies are then defined on resources, referencing these principal tags as if they were locks. A principal can only perform certain actions on a resource if it carries the expected tag.

The main drawback of this approach is that it does not provide the same level of granularity as Lake Formation (LF), which allows restrictions at the row and column level in tables. However, ABAC’s prefix-based restrictions in S3, based on business unit, project, database, tables, or partitions, are often sufficient for meeting business department needs.

Example of a common ABAC configuration:

Imagine a critical data bucket with a second-level prefix structure organized by development squads. Below these prefixes, each squad sets up databases, tables, and experiments. Access to data under a given prefix is restricted to users or processes belonging to that squad.

Figure: ABAC by squads example

It seems simple when dealing with four squad prefixes, a dozen collaborative stakeholders, and two seasons. But now, imagine the project becomes a massive success—more seasons, prequels, sequels, spin-offs, remakes… Suddenly, we need to define cross-inheritance rules for 100+ people across 10+ prefixes within a single JSON file, constrained by a 20 KB Bucket Policy limit.

In this scenario, only ABAC can save us:

On one hand, we tag each user or process role with the corresponding IAM squad tags, allowing inheritance from multiple squads. On the other hand, in the Bucket Policy, we define short, flexible access rules using tag-based variables like “${aws:PrincipalTag/Got:squad_member}”.

With this method, we can scale independently of the project's success. We could even imagine scenarios where Principal tags are assigned based on AD groups or where usage segregation extends to other critical or costly AWS resources.
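A minimal sketch of such a bucket policy follows. It assumes a hypothetical bucket name, a first-level prefix called `data`, and a placeholder `Principal` element; only the `${aws:PrincipalTag/Got:squad_member}` variable comes from the setup described above. The point it illustrates is that the policy size stays constant no matter how many squads or users exist.

```python
import json

# Policy variable resolved by IAM to the caller's tag value at request time.
SQUAD_TAG = "${aws:PrincipalTag/Got:squad_member}"


def abac_bucket_policy(bucket, principal, first_level="data"):
    """Build a bucket policy that scopes access to the caller's squad prefix."""
    squad_prefix = first_level + "/" + SQUAD_TAG + "/*"
    return {
        "Version": "2012-10-17",  # required for policy variables
        "Statement": [
            {
                "Sid": "SquadObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::" + bucket + "/" + squad_prefix,
            },
            {
                "Sid": "SquadListing",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::" + bucket,
                "Condition": {"StringLike": {"s3:prefix": squad_prefix}},
            },
        ],
    }


# Hypothetical bucket and account, for illustration only.
policy = abac_bucket_policy(
    "critical-data", {"AWS": "arn:aws:iam::111122223333:root"}
)
policy_json = json.dumps(policy)
policy_size = len(policy_json.encode("utf-8"))
```

Adding a squad then means tagging its roles, not editing the policy, which keeps the document well under the 20 KB limit.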

2 S3 Access Points

S3 Access Points can be seen as virtual buckets or views over a specific prefix within a real bucket. Each Access Point has its own unique name that can be used for read and write operations, almost as if they were traditional bucket names.

Previously, we had to define all cross-account permission assignments in a complex JSON policy, which became an unnecessarily critical component. The current alternative is to create an S3 Access Point for each business unit or project, allowing for a more flexible and organized approach to managing S3 access.

Each S3 Access Point has its own access policy, specific to the project it serves, ensuring a cleaner and more manageable permission structure.
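As a sketch, a per-project access point policy might look like the following. The region, account ID, access point name, prefix, and role ARN are hypothetical, and the helper is ours; the ARN layout (`accesspoint/<name>/object/<key>`) follows the S3 access point resource format. For cross-account use, the underlying bucket policy must also allow the access, for example by delegating access control to the access points.

```python
import json


def access_point_policy(region, account_id, access_point, prefix, reader_arn):
    """Build a read-only policy for one project's access point."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    principal = {"AWS": reader_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                # Objects are addressed through the access point ARN.
                "Resource": f"{ap_arn}/object/{prefix}/*",
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": ap_arn,
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }


# Hypothetical values, for illustration only.
ap_policy = access_point_policy(
    "eu-west-1", "111122223333", "marketing-ap", "marketing",
    "arn:aws:iam::444455556666:role/marketing-reader",
)

# With AWS credentials configured, the access point could be created with:
#   import boto3
#   s3control = boto3.client("s3control")
#   s3control.create_access_point(
#       AccountId="111122223333", Name="marketing-ap", Bucket="critical-data")
#   s3control.put_access_point_policy(
#       AccountId="111122223333", Name="marketing-ap",
#       Policy=json.dumps(ap_policy))
```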

Visibility Over Existing Permissions

While Lake Formation does allow viewing permissions for each resource, verifying segregation requires an operations team member to manually list all the access permissions granted within the system. Whether it's for an audit, listing permissions for a specific user, or identifying all users with access to a particular resource (database, table, row, column), the process is not very intuitive or user-friendly.

The AWS console does not provide basic search filters, such as searching by tag value. Additionally, in a Data Mesh setup, where information is distributed across multiple accounts, managing permissions becomes even more challenging.

Our Solution

To overcome this limitation, we implemented a process executed via a Lambda function or a local Notebook, leveraging boto3 to extract permission data. This allowed us to version and store the information in S3, making it easier to analyze using Excel and Pandas' powerful filtering and aggregation capabilities.
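A simplified sketch of that extraction is shown below. It pages through the boto3 Lake Formation `list_permissions` call and flattens each entry into one row. The row layout and helper names are our own choices for illustration, not the exact process we run in production.

```python
import json


def list_all_permissions(client):
    """Yield every permission entry, following NextToken pagination."""
    kwargs = {}
    while True:
        page = client.list_permissions(**kwargs)
        yield from page.get("PrincipalResourcePermissions", [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def flatten(entry):
    """Turn one permission entry into a flat, spreadsheet-friendly row."""
    # The Resource dict normally holds a single key such as "Table".
    resource_type, detail = next(
        iter(entry.get("Resource", {}).items()), ("Unknown", {})
    )
    return {
        "principal": entry["Principal"]["DataLakePrincipalIdentifier"],
        "resource_type": resource_type,
        "resource": json.dumps(detail, sort_keys=True),
        "permissions": ",".join(sorted(entry.get("Permissions", []))),
    }


def permission_rows(client):
    """Return all permissions visible to the caller as a list of rows."""
    return [flatten(entry) for entry in list_all_permissions(client)]


# With AWS credentials configured:
#   import boto3
#   rows = permission_rows(boto3.client("lakeformation"))
```

The resulting rows load directly into `pandas.DataFrame(rows)` or a CSV file for filtering by principal, resource type, or permission.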

A Wishlist for AWS

An amazing improvement from AWS would be:

Infrastructure as Code

Over the past two years of working with Lake Formation, we've observed significant improvements in support for AWS CloudFormation and HashiCorp Terraform when managing Lake Formation-related resources. However, we still strongly recommend handling LF-Tags and permissions programmatically using boto3, ideally behind a custom CloudFormation resource or an AWS Service Catalog product.

This approach can help prevent unexpected resource replacements, which can be highly disruptive. Replacing a permission in Lake Formation is an operation that can cause a level of stress equivalent to having to remove and reinstall every lock and key in a bank in the middle of a release.
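One way to avoid wholesale replacement is to compute the difference between the permissions that exist and the permissions that are wanted, and apply only that delta. The sketch below shows the idea with plain Python sets. The tuple layout (principal, database, table, permission) and the sample values are assumptions made for the example.

```python
def plan_changes(current, desired):
    """Return only the grants and revokes needed to reach the desired state."""
    current, desired = set(current), set(desired)
    return {
        "grant": sorted(desired - current),
        "revoke": sorted(current - desired),
    }


# Hypothetical state, for illustration only.
current_state = {
    ("role-a", "sales_db", "orders", "SELECT"),
    ("role-b", "sales_db", "orders", "SELECT"),
}
desired_state = {
    ("role-a", "sales_db", "orders", "SELECT"),
    ("role-c", "sales_db", "orders", "SELECT"),
}
plan = plan_changes(current_state, desired_state)
```

In this sample, the untouched grant for `role-a` is never revoked and re-created, so existing consumers keep their access while the change is rolled out. Each planned item could then be sent through the Lake Formation `grant_permissions` and `revoke_permissions` calls.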

A Friendly Tool That Harnesses All This Power

Lake Formation is an incredibly powerful tool for data and security specialists. It eliminates the need for custom coding to manipulate AWS RAM (Resource Access Manager), bucket policies, Glue settings, KMS, and more. However, it is not particularly user-friendly for business users or even data scientists.

Business professionals and data scientists don’t want to grapple with highly technical or complex configurations—they have entirely different goals when they start their day. They require an abstraction layer that allows them to manage their data products as valuable business assets without added complexity.

AWS offers Amazon DataZone as a solution, providing a data catalog, automated discovery via ML, and easy access management. DataZone is integrated with Lake Formation, Redshift, Athena, and Glue.

Unfortunately, when we explored its demos, it lacked essential features such as sufficient IAM role integration. This was a dealbreaker since many enterprises rely on SSO roles to manage user access through centralized Active Directory groups.

That being said, we highly recommend keeping an eye on DataZone’s updates. It has tremendous potential due to its native integration with the AWS ecosystem and the intuitive concepts it introduces to facilitate Data Mesh architectures.

At the end of the day, Lake Formation has its strengths and weaknesses, but it has undeniably become a fundamental pillar for data governance and management in most modern data architectures deployed on AWS.
