Agencies are creating, storing, and analyzing more data than ever before. This data varies in type, size, source, and purpose. As artificial intelligence becomes increasingly prevalent, the ways agencies will need to securely store, mine, and leverage data will grow to include use cases we haven't yet conceived of.

For good reason, agencies turn to data lakes to manage their ever-growing data volumes. A data lake is a central repository where large amounts of data can be stored, processed, and secured, and it accommodates structured, semi-structured, and unstructured data alike. The advantage of a data lake is its ability to store any type of data in its native format, without the additional data modeling effort that relational data stores require. Moreover, data lakes can scale to virtually any size.
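
To make the "native format" point concrete, here is a minimal sketch of landing mixed file types in an object-store-based lake. It assumes an S3-compatible store accessed through the boto3 library; the bucket, paths, and file names are hypothetical.

```python
# Minimal sketch: landing heterogeneous files in a data lake in their
# native formats. Assumes boto3 and an S3-compatible object store;
# the bucket name and file paths are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "agency-data-lake"  # hypothetical bucket

# Structured, semi-structured, and unstructured data all land as-is,
# with no up-front relational modeling.
for local_path, key in [
    ("exports/claims.csv", "raw/claims/2024/claims.csv"),            # structured
    ("feeds/sensor_events.json", "raw/sensors/sensor_events.json"),  # semi-structured
    ("scans/report_0417.pdf", "raw/documents/report_0417.pdf"),      # unstructured
]:
    s3.upload_file(local_path, BUCKET, key)
```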

Why a Data Lake?

Due to their inherent scalability and flexibility, data lakes help agencies tackle the challenge of storing and analyzing data and overcome myriad data-silo issues. Data lakes are also ideal for cost-effective data storage, making them a superb choice for housing extensive and complex data sets. Additionally, data lakes can enhance data discovery and analysis by enabling seamless collaboration and by facilitating the research, piloting, and deployment of AI-enabled initiatives.

What Is a Data Lakehouse?

While data lakes have proven valuable for the rapid ingestion and analysis of large data volumes, they introduced a new data architecture. Agencies have decades of investment in relational data warehouse architectures, but the first generation of data lakes made no attempt to leverage those investments. In practice, data scientists accessed data lakes for AI/ML, while data analysts accessed data warehouses for dashboards and reporting, and moving data across the divide between the two required substantial effort.

Enter the data lakehouse. This architecture combines the data lake and the data warehouse so that queries can span both, pairing the low-cost storage of the lake with structured query language (SQL) and other data access methods. With a data lakehouse, agencies can maintain their mature data warehouses while augmenting them with data lake contents.

A data lakehouse gives agencies unified access to both their data lakes and their data warehouses, allowing them to query, integrate, and analyze both repositories through a common SQL interface. Furthermore, cloud-native data lakehouses offer a flexible data and insights platform that can scale on demand, handle different data types, enable sharing, and deliver rich analytics for deeper insights.
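
To illustrate the unified-access idea, here is a minimal sketch using DuckDB as a stand-in query engine: a single SQL statement joins a warehouse-style table with a Parquet file of the kind that typically lives in a data lake. The table, columns, and file are hypothetical, not a specific agency design.

```python
# Minimal sketch of the lakehouse idea using DuckDB: one SQL interface
# over both a warehouse-style table and file-based lake data (Parquet).
# All names and values here are hypothetical.
import duckdb

con = duckdb.connect()

# A warehouse-style relational table.
con.sql("CREATE TABLE programs (program_id INTEGER, name VARCHAR)")
con.sql("INSERT INTO programs VALUES (1, 'Flight Ops'), (2, 'Claims')")

# Lake-side data: write a small Parquet file as a stand-in for files
# already sitting in object storage.
con.sql("""
    COPY (SELECT 1 AS program_id, 42.0 AS metric UNION ALL
          SELECT 2, 17.5)
    TO 'lake_metrics.parquet' (FORMAT PARQUET)
""")

# One SQL query spanning the warehouse table and the lake file.
print(con.sql("""
    SELECT p.name, m.metric
    FROM programs p
    JOIN read_parquet('lake_metrics.parquet') m USING (program_id)
""").fetchall())
```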

Data Lakehouses in Action

The most effective data lakehouse solutions are cloud-native, tailored to each agency's unique mission and challenges, and paired with insightful reporting and analytics. GDIT has worked with customers across the federal government to deliver this type of data solution. We have deployed AWS-hosted data lakehouses for defense agencies and have used them to predict domestic air traffic to improve flight operations, identify policy issues in healthcare programs, and detect fraud across data sources.

To Move or Not to Move Data?

Unfortunately, distributed data has traditionally had to be moved to a central repository before it can be analyzed. Data movement is costly, duplicates data, locks data domains into rigid architectures, and makes data owners nervous: they worry their data might be transformed or used improperly, fall out of sync with the original, or be shared with inappropriate parties. Centralizing data for analytics also moves far too much data when, in most cases, the analyst needs only a small extract, an aggregate, or another derived product. These data products, which are typically smaller and more precise because they are curated by the data domain owners, are key to improved distributed data analytics.
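
As an illustration of such a data product, the following sketch (again using DuckDB, with hypothetical paths and column names) shows a domain owner publishing a compact monthly aggregate rather than shipping the raw records.

```python
# Minimal sketch: a data domain owner publishes a small, curated data
# product (an aggregate) instead of the raw detail records. Paths and
# column names are hypothetical.
import duckdb

duckdb.sql("""
    COPY (
        SELECT region,
               date_trunc('month', event_date) AS event_month,
               count(*) AS events,
               avg(amount) AS avg_amount
        FROM read_parquet('raw/events/*.parquet')
        GROUP BY region, event_month
    )
    TO 'products/monthly_events.parquet' (FORMAT PARQUET)
""")
# Consumers now move only the compact aggregate, not the raw records.
```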

What Is a Data Mesh?

In reality, agencies manage dozens to hundreds of data domains, each with its own data lakes, data warehouses, and data lakehouses, and each governed by data domain owners. The value of these distributed assets grows significantly when they can be located, combined, and used in analytical and AI/ML use cases. A data mesh architecture lets users find and access distributed data without moving it to a data lake or data warehouse. This approach reduces data replication, strengthens domain governance, and facilitates easier inter-agency and public data sharing. GDIT recently developed a data mesh, built on cloud services, that provides the essential functions needed to share distributed data products: data catalog and self-service discovery, data product publication and consumption, data access and governance, and data transport.
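
As a rough illustration of the first of those functions, the sketch below models a data catalog with self-service discovery in plain Python. The fields, registry, and helper functions are hypothetical and do not represent GDIT's actual implementation.

```python
# Minimal sketch of the catalog / self-service discovery function of a
# data mesh. All names and fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str             # owning data domain
    owner: str              # accountable data domain owner
    location: str           # where the product lives; data stays in place
    tags: list[str] = field(default_factory=list)

catalog: list[DataProduct] = []

def publish(product: DataProduct) -> None:
    """Domain owners register products; the data itself is never moved."""
    catalog.append(product)

def discover(tag: str) -> list[DataProduct]:
    """Consumers search the catalog, then request governed access."""
    return [p for p in catalog if tag in p.tags]

publish(DataProduct(
    name="monthly_events",
    domain="claims",
    owner="claims-domain-team",
    location="s3://claims-domain/products/monthly_events.parquet",
    tags=["claims", "monthly", "aggregate"],
))
print([p.name for p in discover("monthly")])
```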

Charting the Future

Together, data lakehouses and data meshes give agencies the flexibility to align data assets with consumption patterns in an economical and adaptable way.

The White House’s recent executive order on AI established new standards for the safe, secure, and trustworthy development and use of AI. Achieving the AI vision outlined in the executive order – which aims to protect Americans’ privacy; advance equity and civil rights; advocate for consumers and workers; promote innovation and competition; and advance American leadership globally – depends on seamless and effective data management. The data lakehouse and data mesh architectures are powerful and flexible paradigms that enable just that.