Reshaping the Data Lake


Traditional data lake query engines are based on brute force query processing, culling through all the data to return the result sets needed for application responses or analytics.

How explorable is your data lake?

The biggest advantage of data lakes is flexibility. Because data remains in its native, raw, and granular format, it is not modeled in advance or transformed in flight or at the target storage. The result is an up-to-date stream of data that is available for analysis at any time, for any business purpose. But data lakes advance an organization’s vision only when they help solve business problems through data democratization, reuse, and exploration with agile and flexible analytics. Access to the data lake becomes a real force multiplier when it is used thoroughly across a company’s business units.

Is your data lake strategy living up to its potential?

Most organizations have the best of intentions to fully leverage the power of their data lake architecture. However, even after a successful implementation, many enterprises use the data lake only on the fringes, running a limited number of ad hoc, high-value queries. As a result, they fall dramatically short of their data lake’s potential and experience poor ROI. Several obstacles prevent organizations from utilizing the power of their data lake stack, and all of them require organizations to rethink their data lake architecture in order to capitalize on their investment in big data and analytics.

Are you using compute resources effectively?

Research shows that 90 percent of compute resources are “wasted” on full scans. Traditional data lake query engines are based on brute force query processing, culling through all the data to return the result sets needed for application responses or analytics. As a result, the achievable SLAs are not sufficient for interactive use cases and realistically support only ad hoc analytics or experimental queries. To effectively support a wide range of analytics use cases, data teams have no choice but to revert to optimized data silos and traditional data warehouses. This unnecessary use of wildly excessive resources runs up significant costs.
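To make the cost of brute force processing concrete, here is a minimal Python sketch that compares a full scan with a scan that skips partitions based on min/max statistics. The `Partition` class, the function names, and the data are hypothetical and purely illustrative. Real engines keep such statistics in file or table metadata, and the row counts below say nothing about any real workload.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical chunk of lake data with min/max statistics on one column."""
    rows: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Computed once at creation, standing in for statistics that a real
        # system would store as metadata next to the data.
        self.min_value = min(self.rows)
        self.max_value = max(self.rows)


def full_scan(partitions, target):
    """Brute force: read every row in every partition."""
    rows_read = 0
    matches = []
    for part in partitions:
        for value in part.rows:
            rows_read += 1
            if value == target:
                matches.append(value)
    return matches, rows_read


def pruned_scan(partitions, target):
    """Skip any partition whose min/max range cannot contain the target."""
    rows_read = 0
    matches = []
    for part in partitions:
        if not (part.min_value <= target <= part.max_value):
            continue  # No row is read from this partition.
        for value in part.rows:
            rows_read += 1
            if value == target:
                matches.append(value)
    return matches, rows_read


# Eight partitions of 1,000 sorted values each (illustrative data only).
partitions = [Partition(list(range(i * 1000, (i + 1) * 1000))) for i in range(8)]

full_matches, full_rows = full_scan(partitions, 4242)
pruned_matches, pruned_rows = pruned_scan(partitions, 4242)
print(f"full scan read {full_rows} rows, pruned scan read {pruned_rows} rows")
```

Both functions return the same match, but the pruned scan touches only the one partition whose value range can contain the target. How much a real engine saves depends on how the data is laid out and on which statistics or indexes are available.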

Are you minimizing your DataOps and achieving observability?

Today’s enterprises need deep and actionable workload-level observability to gain a comprehensive understanding of how resources are allocated among different workloads and users, how and why bottlenecks occur, and how to allocate budgets accordingly. The workload perspective enables data teams to focus engineering efforts on meeting business requirements. To manage data analytics cost and performance efficiently, data teams should look to solutions that autonomously and continuously learn and adapt to users, the queries they’re running, and the data being used.
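As a rough illustration of the workload perspective, the following Python sketch rolls individual query records up into per-workload totals. The log format, field names, and figures are assumptions made for the example and are not taken from any particular query engine.

```python
from collections import defaultdict

# Hypothetical query log; the fields and values are illustrative assumptions.
query_log = [
    {"workload": "dashboards", "user": "ana", "cpu_seconds": 120.0, "bytes_scanned": 5_000_000_000},
    {"workload": "dashboards", "user": "raj", "cpu_seconds": 30.0, "bytes_scanned": 1_000_000_000},
    {"workload": "etl", "user": "svc_etl", "cpu_seconds": 400.0, "bytes_scanned": 40_000_000_000},
    {"workload": "ad_hoc", "user": "ana", "cpu_seconds": 15.0, "bytes_scanned": 2_000_000_000},
    {"workload": "ad_hoc", "user": "lee", "cpu_seconds": 5.0, "bytes_scanned": 500_000_000},
]


def summarize_by_workload(log):
    """Aggregate query count, CPU time, and bytes scanned per workload."""
    totals = defaultdict(lambda: {"queries": 0, "cpu_seconds": 0.0, "bytes_scanned": 0})
    for record in log:
        entry = totals[record["workload"]]
        entry["queries"] += 1
        entry["cpu_seconds"] += record["cpu_seconds"]
        entry["bytes_scanned"] += record["bytes_scanned"]
    return dict(totals)


summary = summarize_by_workload(query_log)

# Rank workloads by CPU time so the most expensive ones surface first.
ranked = sorted(summary.items(), key=lambda item: item[1]["cpu_seconds"], reverse=True)
for name, stats in ranked:
    print(f"{name}: {stats['queries']} queries, {stats['cpu_seconds']:.0f} CPU seconds")
```

Grouping by workload rather than by individual query or user is what lets a team see where resources and budget are actually going before deciding what to optimize.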

Does your data lake stack have what it takes to be analytics-ready?

Manual query optimization is time-consuming, and the optimization backlog grows every day, creating a vicious cycle that undermines the agility promised by data lakes. A lack of workload-level observability prevents data teams from identifying which workloads need priority based on business needs rather than on the needs of an individual user or query. Overcoming these obstacles to leveraging the power of the data lake demands a transition to an analytics-ready data lake stack, which is composed of:

Scalable and massive storage (petabyte to exabyte scale) such as AWS S3
Data virtualization layer that provides access to many data sources and formats
Distributed SQL query engine such as Trino (formerly PrestoSQL) or PrestoDB
Query acceleration and workload optimization engine that balances performance and cost and eliminates the disadvantages of the brute force approach
