Cloudera Infuses Value Across Data Ecosystem with Innovative Open Data Lakehouse Approach
by Daniel Newman and Ron Westfall | January 17, 2023

Data is exploding: the quantities and types of data that organizations generate are growing at exponential rates. Combined with the burgeoning need for accurate, real-time insights to help drive business decisions, this growth multiplies the challenges for teams tasked with delivering data and analytics.

Data volumes and formats have reached the point where organizations are struggling to keep up. IT teams face an avalanche of data from smart edge computing devices, sensors, and other machine-generated sources, and that data is oftentimes trapped in legacy data architectures that are costly to maintain.

We see Cloudera playing a pivotal role across the data architecture ecosystem in advancing an open-source approach to building and optimizing data lakehouses. For instance, Cloudera is a key contributor to Apache Iceberg, an expanding industry standard and a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of structured query language (SQL) tables to big data, while making it possible for engines such as Hive, Impala, Spark, Trino, Flink, and Presto to safely work with the same tables at the same time. Moreover, the Cloudera Data Platform is playing an essential role in swiftly expanding the ecosystem-wide influence of Iceberg and in supporting customer journeys to full adoption of open, cloud-native data lakehouses.
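To make the "same tables at the same time" point concrete, the following is a minimal sketch of snapshot-based optimistic concurrency, the general idea behind letting several engines commit to one table safely. It is a toy model in plain Python, not the Iceberg API: `ToyCatalog`, `Snapshot`, `append_file`, and the file names are all hypothetical names chosen for illustration.

```python
"""Toy model of snapshot-based optimistic concurrency (not the Iceberg API)."""
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of file names visible in this snapshot


class ToyCatalog:
    """Holds a single pointer to the table's current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = Snapshot(0, ())

    def current(self) -> Snapshot:
        return self._current

    def compare_and_swap(self, expected: Snapshot, new: Snapshot) -> bool:
        # Atomic pointer swap: succeeds only if nobody committed in between.
        with self._lock:
            if self._current is not expected:
                return False
            self._current = new
            return True


def append_file(catalog: ToyCatalog, file_name: str, max_retries: int = 10) -> Snapshot:
    """Commit one new data file, retrying if another writer got there first."""
    for _ in range(max_retries):
        base = catalog.current()
        candidate = Snapshot(base.snapshot_id + 1, base.data_files + (file_name,))
        if catalog.compare_and_swap(base, candidate):
            return candidate
    raise RuntimeError("too many concurrent commits")


catalog = ToyCatalog()
reader_view = catalog.current()  # a reader pins the empty starting snapshot

# Two hypothetical engines commit at the same time.
writers = [
    threading.Thread(target=append_file, args=(catalog, "spark-part-0.parquet")),
    threading.Thread(target=append_file, args=(catalog, "flink-part-0.parquet")),
]
for w in writers:
    w.start()
for w in writers:
    w.join()

final = catalog.current()
print(final.snapshot_id, sorted(final.data_files))
print(reader_view.data_files)  # the pinned reader still sees an empty table
```

In this sketch neither writer overwrites the other: a writer whose base snapshot is stale simply retries against the newer one, and a reader keeps a consistent view of whichever snapshot it started with.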

The Rise of the Data Lakehouse

Traditional data architectures, and even the more recent data warehouses, have proven unfit for purpose as organizations pivot to advanced artificial intelligence (AI) and machine learning (ML) data science. The new requirements have therefore led to a new architecture, namely the data lakehouse. The data lakehouse blurs the line between structured and unstructured data, enabling data scientists to store all types of raw data in one location, while still having a storage layer on top to provide transactional views of data and structured data management and analytics when needed.

Data engineers and data scientists need lightning-fast and simple access to shared, secure, connected data. The differences between the architectural approaches provided by data lakes and data warehouses in how they unify large volumes and varieties of data into a central location are significant. Warehouses are vertically integrated for SQL Analytics, whereas lakes prioritize flexibility of analytic methods beyond SQL.

The ideal approach realizes the benefits of both: the flexibility of analytics in data lakes, and simple and fast SQL in data warehouses. It is better still when it avoids the vendor lock-in that comes with many vendors' offerings, especially across traditional implementations.

Data engineers and data scientists need access to the same data for analytics and business intelligence. Unfortunately, even when they work against the same data, they most often operate in different silos: data engineers in a structured cloud data warehouse and data scientists in an unstructured data lake.

Organizations could architect a solution where large data stores that data science requires reside in a traditional data lake, but this would not be the whole equation. This inherent data architecture dichotomy oftentimes leads to duplication, inaccuracies, and inefficiencies, all of which lead to increased costs — and frustration throughout the organization.

We believe that a well-implemented lakehouse architecture approach collapses the silos between data and data professionals, delivering an architecture where everyone has access to the same shared, secure, connected data. As a result, different teams can work concurrently on the same analytics-ready datasets, which is incredibly beneficial. Through a shared system, data is more reliable, processes are more efficient, and cost overheads are contained.

Cloudera Uplifts Open Data Lakehouse Proposition

Organizations need to accelerate time to value for their data science teams while operating at the speed of innovation the business requires, which places a further undue burden on data teams.

Our research has shown that Cloudera’s portfolio can meet the challenge of enabling organizations to successfully transition away from closed, manually intensive data warehouses and ensure that their data lakehouses evolve on an open-source foundation.

Cloudera’s data lakehouse is entirely designed and implemented on the open source and open standards principles of the Apache Iceberg initiative. Apache Iceberg is aimed at cultivating broad developer community acceptance of building high-performance formats for massive analytics tables, including the storage of multiple data formats and the enablement of multiple engines to work on the same data, on an open ecosystem basis.

From our view, Cloudera ensures organizations can build an open lakehouse anywhere, on any public cloud, as well as within their own data center. Through a build-once, run-anywhere philosophy, Cloudera offers the same data services with full portability across all clouds, making it an attractive value proposition on numerous fronts.

Cloudera: Ensuring Agile, Secure Transitions to Open-Source Data Lakehouses

As noted, organizations have widely adopted both data warehouses and data lakes to bring together massive volumes and varieties of data into a unified location. Organizations developed data warehouses to provide vertical integration for SQL Analytics, while data lakes evolved to emphasize flexibility of analytic methods outside SQL.

To attain the benefits of both realms — flexibility of analytics in data lakes, and simple and fast SQL in data warehouses — organizations frequently deployed data lakes to complement their data warehouses, with the data lake feeding a data warehouse system as the last step of an extract, transform, load (ETL) pipeline. Through this arrangement, organizations accepted the inevitable: the resulting lock-in of their data in warehouses.

These lakes power mission-critical, large-scale data analytics, BI, and ML use cases, including enterprise data warehouses. In recent years, the term data lakehouse emerged to describe the architectural pattern of tabular analytics over data in the data lake. However, as lakehouse adoption gained momentum, users had to reconcile a paradox: while data lakes were open, lakehouses lacked that same openness.

We view the Cloudera Data Platform (CDP) as fully aligned with the Apache Iceberg mission, taking advantage of its open table, high-performance, and cloud native format that scales to petabytes agnostic to the existing underlying storage layer and access engine layer. This includes enabling smooth integration between processing and streaming engines, while maintaining data integrity between them.

In addition, through the integration of Iceberg into the Shared Data Experience (SDX), we find Cloudera can substantially ease the data lakehouse journey, including support for schema evolution, hidden partitioning, and the streamlined data management that is key to efficient lakehouse administration. Iceberg tables in CDP integrate with SDX, enabling unified security, fine-grained policies, governance, lineage, and metadata management across multiple clouds. This allows customers to focus on their core business data analysis rather than on burdensome legacy data mechanics.
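For readers less familiar with the two Iceberg features named above, here is a minimal sketch of the ideas behind them. Hidden partitioning means the partition value is derived from a column (here, the day of a timestamp), so queries filter on the timestamp itself and whole partitions can still be skipped. Schema evolution is modeled by tracking columns through stable field IDs, so a rename or an added column is a metadata change only. This is a toy model in plain Python, not the Iceberg or CDP API: `ToyTable`, its methods, and the sample rows are hypothetical.

```python
"""Toy model of hidden partitioning and ID-based schema evolution (not the Iceberg API)."""
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    # The partition value is derived from the row; writers never supply it.
    return ts.date()


class ToyTable:
    TS_FIELD_ID = 1  # assumption: field ID 1 is the timestamp used for partitioning

    def __init__(self):
        # The schema maps a stable field ID to the column's current name.
        self.schema = {1: "event_ts", 2: "user"}
        self.partitions = defaultdict(list)  # day -> rows keyed by field ID

    def append(self, row: dict):
        name_to_id = {name: fid for fid, name in self.schema.items()}
        stored = {name_to_id[name]: value for name, value in row.items()}
        self.partitions[day_transform(stored[self.TS_FIELD_ID])].append(stored)

    def rename_column(self, old: str, new: str):
        # Metadata-only change: stored rows are keyed by ID, so nothing is rewritten.
        fid = next(f for f, n in self.schema.items() if n == old)
        self.schema[fid] = new

    def add_column(self, name: str):
        self.schema[max(self.schema) + 1] = name

    def scan(self, start: datetime, end: datetime):
        """Return rows with start <= timestamp < end, plus the number of partitions read."""
        rows, partitions_read = [], 0
        for day, stored_rows in self.partitions.items():
            if day < start.date() or day > end.date():
                continue  # pruned using the derived partition value
            partitions_read += 1
            for stored in stored_rows:
                if start <= stored[self.TS_FIELD_ID] < end:
                    # Project by the current schema; columns added later read as None.
                    rows.append({n: stored.get(f) for f, n in self.schema.items()})
        return rows, partitions_read


table = ToyTable()
table.append({"event_ts": datetime(2023, 1, 16, 9, 0), "user": "ana"})
table.append({"event_ts": datetime(2023, 1, 17, 8, 30), "user": "raj"})
table.append({"event_ts": datetime(2023, 1, 17, 22, 15), "user": "li"})

table.rename_column("user", "user_name")
table.add_column("region")

rows, partitions_read = table.scan(datetime(2023, 1, 17), datetime(2023, 1, 18))
print(partitions_read, [r["user_name"] for r in rows], rows[0]["region"])
```

The query in this sketch never mentions a partition column, yet only one of the two daily partitions is read, and rows written before the rename and the added column are still readable under the new schema.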

As a result, CDP is well-suited to play an essential role in delivering cloud-driven Iceberg benefits and in supporting organizations' transitions to open, converged data lakehouse implementations.

Key Takeaways: Cloudera Ready to Fuel Open Data Lakehouse Adoption

Overall, we believe Cloudera Data Platform, in combination with Iceberg, warrants high-priority evaluation by organizations that need to advance their data architectures on an open source, cost-effective foundation. As described, top considerations include enabling organizations to select the engine of their choice according to what aligns best for their evolving and new use cases including ML, SQL analytics, data curation, and streaming applications.

From our perspective, Cloudera’s support of open and flexible formats, coupled with its work across the upstream Iceberg community, helps organizations avoid dreaded vendor lock-in while retaining enterprise-grade security and data governance, especially unified data authorization, lineage, and auditing. Together, these capabilities allow organizations to optimize their cloud journey to open data lakehouses.

Disclosure: Futurum Research is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

Analysis and opinions expressed herein are specific to the analyst individually, informed by data and other information that might have been provided for validation, and are not those of Futurum Research as a whole.

Other insights from Futurum Research:

The Attraction of Cloudera’s Open Source Data Platform – The Six Five Podcast Insiders Edition

The Six Five at Cloudera Evolve 2022: Strategic Advantages of Portable, Hybrid Data Lakehouses

The Six Five at Cloudera Evolve 2022: Data Innovation — Leading by Example

About the Authors

Daniel Newman is the Principal Analyst of Futurum Research and the CEO of Broadsuite Media Group. Living his life at the intersection of people and technology, Daniel works with the world’s largest technology brands exploring Digital Transformation and how it is influencing the enterprise.

Ron is an experienced research expert and analyst, with over 20 years of experience in the digital and IT transformation markets. He is a recognized authority at tracking the evolution of and identifying the key disruptive trends within the service enablement ecosystem, including software and services, infrastructure, 5G/IoT, AI/analytics, security, cloud computing, revenue management, and regulatory issues.