The Six Five at Cloudera Evolve 2022: Strategic Advantages of Portable, Hybrid Data Lakehouses
The Six Five “On The Road” at Cloudera Evolve NYC. Host Daniel Newman is joined by Wim Stoop, Sr. Director, Hybrid Data Platform & David Dichmann, Sr. Director, Data Warehouse & Lakehouse, Cloudera. They discuss the advancements & advantages of a portable, Hybrid Data Lakehouse.
Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.
Daniel Newman: Hey everyone. Welcome back to another Six Five on the road here at Evolve New York City. Excited to be here. Day of video content conversation. Solo for this one as my co-host, esteemed Mr. Patrick Moorhead is running around here at the event, but no less, even more excited because now I get to run the show. I don’t have to negotiate the questions, but I’m really excited about this conversation. Two great guests. We’re going to be talking about data lake house, fabric, but really that’s the technical term.
We’re going to be talking about a lot of complexity right now in a market full of rivers, lakes, streams, swamps, and basically that companies have a really tough situation ahead of them to get their arms around all the data that’s available to them and to be able to utilize it to get to those business outcomes that drive enterprises forward.
So, I’ve got David and Wim here. Before we jump into the conversation, going to give you guys a moment to do some quick introductions. Wim, I’ll start with you. Tell everybody about yourself, your role at Cloudera, mostly just that though. Don’t tell them too much about yourself.
Wim Stoop: Not too much? Okay. Well thank you first of all for having us on this call. My name is Wim Stoop. I look after product marketing from a platform perspective. Six years with Cloudera. It’s a blast.
Daniel Newman: Absolutely. David.
David Dichmann: Hi. And I’m David Dichmann, and I do product management focusing predominantly on data warehouse and Lakehouse topics.
Daniel Newman: Well, we talked off camera about maybe doing a, telling me the difference between product management and product marketing, but we’re not going to do that. I just wanted to say what we talked about, so everyone kind of gets a feel for what’s going on behind the scenes here, but we are having a lot of fun. But I just got back, I don’t know if you guys had a chance to sit in. Pat Moorhead, the normal, he actually opened up the event and your CEO, Rob Bearden, got up there and then we had Rob Thomas from IBM and a number of other great speakers. The Wordle guy was a lot of fun too. You guys do Wordle?
Wim Stoop: Yeah, I used to, but–
Daniel Newman: You quit?
Wim Stoop: Well, I couldn’t get it. English is not my first language.
Daniel Newman: You struggled.
Wim Stoop: Yeah, no.
David Dichmann: It’s tough.
Daniel Newman: Now. My children won’t let me quit.
Oh yeah. It’s a commitment in my household too. But we did have a lot of fun, but there were some common themes that were coming up over and over again. And I want to boil them down to maybe a big one. And Wim, I’m going to start with you, I’m going to have you kind of respond to this, but the one theme I really came away with is that there is no sort of standard right now in terms of how companies are approaching these data ecosystems. There’s a lot of different ways to arrive. So we’ve got this common desire, which is to get to better insights and outcomes, but it seems that the journey is really hard. So just in your view, where are we at? What’s kind of going on with this overall data journey that’s driving you guys to try to build out this next data lake and fabric that your product–
Wim Stoop: Yeah. How does it all tie together? Right. And I think you’re touching on a very good point. There is very little commonality. The only commonality that exists for all of our customers is they want to get better at getting value and insight from their data. But they have to do that in an environment where more and more data is being generated all over the place, more and more unstructured data is generated in all kinds of different infrastructures, and they need to be able to deploy more use cases faster than ever.
That is exactly the challenge because how do you as an organization start to tackle a problem like that without everything becoming an individual project? And that is exactly where architectures, not even technology yet, but that is where architectures can help these organizations get more agile and flexible, be able to do that same thing, whether it’s understanding your data or dealing with structured, unstructured data, being able to do that across all kinds of different infrastructures. That’s how we accelerate that with architectures. And that’s where data fabric and data lake house fit in. I don’t know if you wanted me to say that already.
Daniel Newman: No, we’ll get to that. I’m going to let David sort of start to touch on that. As an analyst I hear you. I talk to both the end customer and I talk to you and a number of your allies, partners, and EDM as well as competitors. But I also will admit, I get frustrated. I’m frustrated because I feel like there’s a lot of vernacular, a lot of terminology. You heard me talk about somewhere along the lines, we decided to use bodies of water as a parallel to data and it’s become a term. And then we start blending them together. It’s data warehouses and lakes, and now it’s data warehouse lake, and there’s data lake house. But the good news is it’s landed. Everybody sort of agrees. But there’s all this terminology, there’s all this flow into what it’s becoming. But separate for me, David, the architecture versus reality. Where are we in sort of creating these distinct different pools of data and what are the things that these companies should be focusing on?
David Dichmann: Sure. And so I think there’s two ways to look at this. The first way that we look at this is we look at what people have done and what we’re calling the data lake house. So we’ve done a really good job of building traditional data warehouses, and this is where we’re handling the data where you count it once, only once, and don’t forget to count it. We’re using precision here to pay the right tax to a government kind of thing. And then we built these data lakes and their job was to be the place where you put everything, every shape, every size, structured and unstructured. And the questions you asked there were more statistically motivated, more or less. Do our customers like what we’re doing? More or less, is that equipment about to fail? And get into that more statistically motivated predictive analytics.
The challenge is that when we really start looking at things like doing these new predictive workloads, where we need to bring AI and ML on top of things, and we’re doing things with faster-moving data with data flow and streaming, we need to combine all the well-curated data and all the raw data together in the same place. Otherwise, we can’t train our models effectively and test those hypotheses against actual data.
We can’t get that fast-moving data into the right place at the right time to be able to do that. And we want these AI and ML predictive models to happen on all of the data available, not just the part that we carved off and shoved in the corner, as these things are becoming more first-class citizens in our organizations as opposed to something we’re just trying off to the side. So with these things happening, the lakehouse gives us the ability to get that best of both worlds. But here’s the catch. We’re not going to build one lakehouse for everything. What we’re going to end up with is purpose-built lake houses around different domains, solving different parts of the business problem, which leads to the silo situations that Wim was talking about or alluding to a little bit earlier. And that’s where things like the fabric come into play.
Now, the promise of the fabric is all of your data can lie wherever it lies, and the fabric will get the right data to the right place at the right time. The reality is there are enabling technologies to help us do that, but we’re still learning how to actually make this happen. So we can do things like data replication, data federation, data distribution. There’s a lot of different techniques that we employ, but we look at these as what’s the right solution for the right use case at the right time for a given question, for a given problem space. But make sure that the fabric gives us visibility to all of the data that we have so we know where it is, what it is, and how we can use it. And we don’t leave it in that 80 or 90% of data that remains untapped today.
Daniel Newman: So Wim, how are the customers reacting to the data lake house? You’re leading the product marketing of taking this out. And like I said, I think a lot of customers are generally positive about the idea of getting all the data: structured, unstructured, different types of configurations, getting it flowing in edge to cloud and prem. But it seems to be a big undertaking.
Wim Stoop: It is, and it isn’t a big undertaking. I think for a lot of organizations, a lot of customers of Cloudera, they may actually have been doing this. They may have been using our platform and technology to implement effectively a lakehouse architecture, to implement a data fabric architecture without calling it out. And that was several years ago.
Now, as these terms are trending, customers are seeking that conversation more of, I know you provide a great platform, a hybrid data platform, but how can you help me implement these architectures? How can you implement these blueprints? So we’re in an extremely fortunate position there, in that not only do we have the platform to support organizations implementing it, but we’ve got the tremendous experience in organizations doing this. That’s how the conversation is landing. That’s how the messaging is landing. Customers are actively seeking it out. But also when we talk to them about, it’s not just about having, choosing a better data lake house or choosing a better data fabric in order to solve your challenges, it’s the fact that we bring that together in a single platform. Because otherwise you’re just in the same position as when you’ve chosen a better data warehouse in order to address your challenges.
Daniel Newman: Are you getting any feedback though on why the Cloudera data lake house? Because obviously you’re not the only one, and in many organizations now though, you are in a situation where you’re competing with a number of different data offerings. And so you guys have moved aggressively towards cloud, but a lot of companies started dabbling in cloud before Cloudera really started there. So maybe they started using some other technologies, kind of what are they saying to you? Don’t worry, I’m going to get to you in just a second. What do they say to you about why they’re picking and sticking? Is it because the technology is better, because they have larger data sets already with you, that you’ve built a better mousetrap?
Wim Stoop: Yeah. There are two elements to that, and you’ll use both of us in order to answer it. I will answer it from the macro level, which is why is our platform so uniquely suited to be able to do this? Well, it is because we work across hybrid cloud. You can deploy us in the same sense, in the same fashion, the same capabilities, build once, deploy everywhere, shift your analytics as and when they’re needed. But with consistent security and governance. That is the highest level of our platform that makes us so incredibly unique, which is why our customers are choosing this because as you just said, the data is generated. The data is born on all these different infrastructures. And although sometimes you may want to move it to where it is needed, other times you want to bring those same analytics to where the data actually lives. But then very specifically in the technology we’ve built, we integrated some great capabilities that make us extremely suited to delivering a very perform– I’ll let you talk about that.
David Dichmann: Sure. And so this comes in again to why are we doing lake houses in the first place? When our customers talk to us, the reason why they want a lake house is they want to bring in streaming data. They want to do AI and ML on more data. And so by having an integrated platform where data flow and streaming, data warehousing, data engineering, AI and ML are all pre-integrated, and, as Wim says, across the hybrid cloud, wherever your data’s born, migrated to, used or ultimately analyzed, having all of that available in the multi-function analytics part of what the Cloudera data platform delivers means that you don’t just have a better place to put your stuff. You have all the tools you need to unlock the insight that stuff delivers to you.
Daniel Newman: Okay. So here’s a question though that I think a lot of people will be interested in getting both of your takes. And David, I’ll give you the first stab at this one, but fabric is the other side of this story here. You talked about the lake house itself. There are different offerings. I mentioned, I’m here in New York for this, but I was also yesterday at Google Cloud Next. And Google has, they have a pretty compelling data story right now. And one of the things they really focus on is this kind of open data approach that they believe that they’re bringing in many data players. Some of them, like I said, your colleagues and some of them your competitors.
But they’re kind of saying, Well, we’re just going to go open and in fact, you can do open on any cloud and we’re going to make any workload, any cloud, any type of data, and it’s just all going to work and it can all just be accessed in Google’s cloud. It’s an interesting and compelling story, but I think we’ve heard that before, by the way, from a hyperscaler that, and it hasn’t necessarily always materialized, but as time goes, things get better, things get more mature. I mean, your fabric is approaching or trying to solve that same problem: all the data. So how are you guys thinking about fabric in terms of getting all this data from all these sources to actually work fluidly?
David Dichmann: Sure. So the first part of fabric and Wim can explain a little more, but the first part of fabric is knowing what you’ve got. And that’s all about metadata management and data cataloging across the hybrid cloud.
So where we see a lot of organizations starting with their fabric approach is just getting that metadata. Either automatically collecting it, getting some of that passive metadata, but also getting in there and working the metadata, getting subject matter experts involved.
And that’s where pushing things into domain expertise makes a lot of sense, where you can get some additional metadata collected, what the business meanings of things are and how they’re used across domain. So knowing where you’ve got your stuff is an important part. And that’s something that’s an integral part of a hybrid open data platform that we have with Cloudera, that we have one shared data experience across all of the different experiences that you’ve got, whether you’re doing data flow and streaming or data engineering, one security, one governance, one metadata, one schema understanding across all of that, wherever it may lie in the hybrid cloud. So by being able to describe and define your data in one place and know where it’s used and what it means in another, gives you great visibility to all of the data across these different domains where we’re going to end up with all of these different lake houses serving those analytics needs.
Daniel Newman: So Wim anything, before I hit you guys up with the final, I got a really hard question. I’m going to end. Anything you want to add to that about the fabric? I know David indicated you might have some more color.
Wim Stoop: I think the fabric is going to have to be the departure point for many organizations, especially those organizations that don’t do much in the direction of proactive data governance. That’s really what it comes down to. This is like inventorizing what you have so you can make it available.
Data management is such an incredibly important part of a data fabric, but it also bleeds through into a data lake house. If I don’t understand my data, if I don’t gather that information in a data fabric, I can’t give it on a data lake house. And that’s why together a data fabric and a data lake house really bring the foundation of making FAIR data, making data findable, accessible, interoperable and reusable. Whereas from a data fabric, you make it findable and you make it accessible because you connect to your source systems, you know and understand them, you make them available in a self-service manner across all of the infrastructures that you have. But through a data lake house, you make it interoperable. You use structured and unstructured data, the same analytics, and you make those same data sets reusable for different analytics. And again, do that across hybrid cloud. So do I have anything to add? Well, actually we’re talking about the same thing.
Daniel Newman: Yeah, no, it’s okay. I think the reiteration’s important. I always kind of think about data in the enterprise and the organization’s data a little bit like a skyscraper. I mean, here in New York it’s appropriate, but first of all, a lot of people don’t think about it. But the skyscraper wasn’t built really at sea level. Generally speaking, that thing gets built way down into the ground to create balance. And then there’s all kinds of engineering and design that needs to make sure that that skyscraper can stand. And then again, you got the winds, you know, you got a lot of different things. You got the weight and the pressure. But in an organization, kind of what you guys are alluding to is so many of the fundamental things that companies could have been doing for decades now that maybe either haven’t been done or haven’t continued to evolve, become the limiters in terms of how effectively they’re able to move to a lake house or move to a fabric. Because they didn’t do the basics right, they didn’t do the management exactly right.
Wim Stoop: It’s limiting how quickly you can bring to bear all the data that you have in order to gain value and insight. 80% of data, and today I saw 90% of data, is just not used. It’s generated, but it’s not used for any analytics. And if every piece of data represents a nugget of value, why would you not want to use all of your data? Data fabric is a great way of bringing all that together.
Daniel Newman: Yeah, it sounds like a really great thing. And I think hygiene’s the word that we like to use in marketing.
Wim Stoop: Hygiene?
Daniel Newman: Hygiene.
Wim Stoop: The moment you say hygiene, I also immediately have to think about regulatory compliance and data privacy that is increasing the world over for just about every single industry that I know. And a defense of “I didn’t know what was in my data yet I still used it” is not something that will actually stand up in court when it comes to compliance.
Daniel Newman: Yeah, compliance.
Wim Stoop: Ignorance is not a defense.
Daniel Newman: Residency, sovereignty. So much.
Wim Stoop: So if you don’t do it for any other reason than that, know and understand your data, so you can stay on the right side of the law. And for customers in the financial and insurance industries, that is the ticket to be in business and continuing to be in business.
Daniel Newman: That’s such a huge thing. So as we wrap up here, maybe just a quick, we like to do these speed rounds at the end here at Evolve. What is sort of the one thing, David, that you hope the customers and the partners that are here listening come away with?
David Dichmann: Well, I think the big thing I want folks to come away from this event with is understanding that what got you where you are today, your data warehouses, your data lakes, whatever infrastructure you have today, however amazing it was to get you where you are today, is not going to get you to that next level in the future. When our customers are building insights that are transforming the enterprises that they’re in, giving them greater competitive edge, giving them the ability to hold on to customers, making a difference, saving lives in the medical field, these kinds of new insights must come from a new way of looking at your data, a new way of treating your data and a new platform to be able to enable the technology today that gets you there.
Daniel Newman: Wim?
Wim Stoop: I think I would come from a slightly different angle. I’d come from the business side of things, and I want anyone attending this conference to come away with, I need to ask myself the question, what is it that I want to achieve? What does my business want to achieve? What is it that I can’t do today that I wish I could? And then together with Cloudera and our partners, we’ll be able to help you achieve just that goal and objective. Data’s going to play a role.
Daniel Newman: And maybe there it is, the difference between product management and product marketing.
Wim Stoop: There you go.
Daniel Newman: Wim, David, thank you both so much for joining me.
David Dichmann: My pleasure.
Wim Stoop: Thank you for having us.
Daniel Newman: Everyone out there. Thanks for tuning in to The Six Five here on the road at Evolve New York City, brought to you by Cloudera, IBM, Intel. Great partnership, great event, great interview. Appreciate y’all tuning in. Check out the other interviews, hit the subscribe button. We’ll see you all soon. Bye now.
Daniel Newman is the Chief Analyst of Futurum Research and the CEO of The Futurum Group. Living his life at the intersection of people and technology, Daniel works with the world’s largest technology brands exploring Digital Transformation and how it is influencing the enterprise.