ETL/ELT Your Business Data Easily With Databricks' LakeFlow - Interview With Product Leader - EP101
Estimated read time: 1:20
Summary
In Episode 101 of the "Josue Bogran Channel," the conversation centers on Databricks' LakeFlow, a product that promises to simplify ETL/ELT processes. Featuring Bilal, a senior product manager at Databricks, the discussion covers the structure and utility of LakeFlow, which comprises three main components: LakeFlow Connect, LakeFlow DLT, and LakeFlow Jobs. Bilal explains the declarative nature of the system and its benefits in making data engineering tasks more effective and manageable. He also highlights the competitive but collaborative ecosystem of complementary tools and partners.
Highlights
LakeFlow Connect is a game-changer for efficiently ingesting data into Databricks.
LakeFlow DLT builds on the concept of Delta Live Tables, streamlining the ETL process.
LakeFlow Jobs acts as an orchestrator, ensuring seamless workflow execution.
The product reflects a balance between innovation and customer feedback, learning from experiences with Delta Live Tables.
Databricks' partnership strategy ensures robust support for varied data sources without crowding out other ecosystem tools.
A focus on price performance and developer experience improves product usability and affordability.
Key Takeaways
LakeFlow simplifies data engineering with its three key components: LakeFlow Connect, LakeFlow DLT, and LakeFlow Jobs.
Bilal emphasizes the declarative nature of LakeFlow, which allows engineers to focus on 'what' to do rather than 'how' to do it.
LakeFlow is designed to be backward compatible, enhancing existing Databricks workflows while introducing new flexibility.
Databricks collaborates with other industry players such as Fivetran and Prophecy, offering a variety of ecosystem tools.
Listening to customer feedback is crucial in shaping the future of LakeFlow, focusing on user experience and functionality.
Overview
In this episode of the Josue Bogran Channel, Bilal from Databricks dives into the details of LakeFlow, Databricks' data engineering product. LakeFlow consists of three components, LakeFlow Connect, LakeFlow DLT, and LakeFlow Jobs, each playing a distinct role in the data processing workflow.
Bilal shares insights into the declarative model adopted by LakeFlow, which removes much of the complexity usually associated with data engineering tasks. The discussion covers how those tasks become more intuitive, backed by APIs for change data capture that keep data flowing smoothly.
The conversation also touches on the importance of maintaining compatibility with existing systems while fostering strong partnerships within the tech ecosystem. Bilal stresses how customer feedback shapes the platform's evolution, with the goal of a genuinely user-friendly experience.
Chapters
00:00 - 00:30: Introduction and Guest Introduction The chapter begins with an introduction to the guest, Bilal, who joins the conversation from the Netherlands. There is a light-hearted exchange about job titles and how official titles can be unnecessarily elaborate. Bilal is introduced as a senior director, but he identifies simply as a product manager.
00:30 - 01:00: Bilal's Role at Databricks and Overview of LakeFlow Bilal is a product manager who specializes in data engineering at Databricks, and LakeFlow falls partly or entirely under his remit. The discussion starts with a simple question: what is LakeFlow?
01:00 - 03:00: LakeFlow Explained: Ingestion, ETL, and Jobs LakeFlow is a data engineering product from Databricks that consists of three main components: ingestion, ETL, and jobs.
03:00 - 06:00: LakeFlow Features and Enhancements The chapter discusses the features and enhancements of LakeFlow. It highlights incremental connectors, which capture only the latest changes from data sources, a crucial property for anything running at scale. These connectors cover platforms such as Salesforce, Workday, NetSuite, and Google Analytics 4, as well as databases, with more to come. Bilal prefers this approach over hand-built CDC setups such as Debezium, which he describes as complicated.
06:00 - 10:30: Comparison with Other Products and Partners The chapter discusses LakeFlow Connect's point-and-click setup, which spares users from hand-building connections, and its use of Unity Catalog for secure, governed connection management. Once configured ('set it and forget it'), ingestion is meant to be cost-effective and efficient. The chapter then moves to what happens after data lands: it needs cleaning, filtering, and aggregation, which is the ETL (Extract, Transform, Load) part of the system.
10:30 - 16:00: Product Management Insights and Lessons Learned This chapter discusses LakeFlow DLT, previously known as Delta Live Tables: a declarative framework where you declare what you want done without worrying about orchestration, along with APIs for change data capture, and LakeFlow Jobs, which serves as the orchestrator (a small orchestration sketch follows this chapter list). It notes that while DLT and Jobs may sound familiar, LakeFlow Connect is a new addition.
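Before the transcription, a brief editorial aside on the orchestration piece mentioned above: the sketch below shows one way a job that refreshes an existing DLT pipeline could be defined with the Databricks Python SDK. It is not taken from the episode, and the job name and pipeline ID are placeholders.

```python
# Editorial sketch: a job that refreshes an existing DLT pipeline,
# defined with the Databricks Python SDK. Names and IDs are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace host and credentials from the environment

created = w.jobs.create(
    name="nightly_lakeflow_refresh",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="refresh_dlt_pipeline",
            # Points at an existing DLT pipeline; replace with your pipeline's ID.
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),
        )
    ],
)
print(f"Created job {created.job_id}")
```

In practice the same job could carry additional tasks (notebooks, SQL, ingestion) with dependencies between them, which is the orchestration role the interview attributes to LakeFlow Jobs.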
ETL/ELT Your Business Data Easily With Databricks' LakeFlow - Interview With Product Leader - EP101 Transcription
00:00 - 00:30 [Music] So I am here with Bilal, joining us all the way from the Netherlands. Bilal is a senior director. What's your title? I feel like you have one of those titles that's never actually the title, because you can keep adding stuff to it. I'm a product manager.
00:30 - 01:00 I've been a product manager my whole career, and I work on data engineering. Product management is fun; I've done it for a while, and that's what I do at Databricks. Let's talk about LakeFlow. LakeFlow is one of those things that is either in part or fully under your realm of things you're working on. That's right. So let's start off with the simple question: what is LakeFlow?
01:00 - 01:30 Great question. LakeFlow is the unified, streamlined, and performant data engineering product from Databricks. I'm going to tell you exactly what that means in a moment, but really it's got three parts to it. One part is ingestion, and the idea there is that we want to make it super easy and cost-effective to get all the world's data into Databricks so you can do cool things with it. That includes structured data as well as unstructured data. So for structured data we have built
01:30 - 02:00 native incremental connectors, and by incremental I mean truly taking only the latest changes from the source, which matters for anything running at scale. That covers Salesforce, Workday, NetSuite, Google Analytics 4, and so on, and the list will grow, as well as databases. That one is my favorite, because if anybody here has set up Debezium or systems like that to capture CDC from binary logs, it's super complicated stuff. With LakeFlow Connect, that's
02:00 - 02:30 the ingestion portion, and it's just point and click: connect to it, everything is governed in Unity Catalog, connections are stored securely, and then the data just starts showing up. The whole idea behind LakeFlow Connect, the first piece, is set it and forget it, and hopefully it's cost-effective and performant for you and you love it. Okay, that's the first piece. The second piece is what you do with the data once you get it: you clean it, filter it, aggregate it. That's actually the ETL piece of Lake
02:30 - 03:00 Flow, and that's called LakeFlow DLT. You may have known that as Delta Live Tables in the past. And by the way, that's a declarative framework. What that means is that you just declare what you want to do; you don't have to worry about orchestration. There are a bunch of cool APIs there for change data capture and SCD type 1 and type 2 records. And then finally there's LakeFlow Jobs. LakeFlow Jobs is the orchestrator. Now the smart people among you, those who are using Databricks, will say, "Wait a second. One of these three is new. The other two sound familiar." That's exactly right. LakeFlow Connect is brand
03:00 - 03:30 new, built from scratch on the Arcion acquisition. DLT is the evolution, fully backwards compatible, of Delta Live Tables. And Jobs is the evolution of Workflows, which some people also know as Jobs. So if you're using DLT and Jobs today, keep using them. They're just going to get better, much better, very soon. And if you're not using Connect, you should.
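The change data capture and SCD APIs Bilal mentions are exposed through DLT's Python interface. As a rough, editorial sketch (not taken from the episode), assuming hypothetical table names, keys, and an event timestamp column, a CDC flow into a streaming table might look like this:

```python
# Illustrative sketch of DLT's change data capture API; names are hypothetical.
# Runs inside a DLT pipeline, where the `spark` session is provided by the runtime.
import dlt
from pyspark.sql.functions import col

@dlt.view
def customers_cdc_feed():
    # A stream of raw CDC events; the source table name is a placeholder.
    return spark.readStream.table("raw.customers_cdc")

# Target streaming table that will hold the current (or historized) records.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",
    keys=["customer_id"],          # primary key of the source records
    sequence_by=col("event_ts"),   # ordering column so out-of-order events resolve correctly
    stored_as_scd_type=2,          # 1 = overwrite in place, 2 = keep full history
)
```

Switching `stored_as_scd_type` between 1 and 2 is what gives the type 1 versus type 2 records mentioned above, without writing any merge logic by hand.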
03:30 - 04:00 I know you basically equated pipelines to DLT, which I think is a little bit of an oversimplification. Can you perhaps explain how DLT in its current form is similar, and how it is different, when it comes to the experience of working with DLT inside of LakeFlow pipelines? Let me start with what's the same. The fundamental idea is still the same, and the APIs are still the same. The fundamental idea behind this whole thing is that it's just better to do declarative ETL. And what is
04:00 - 04:30 declarative ETL? If you use streaming, streaming is declarative. SQL is declarative. The idea behind a declarative system is: don't tell the system how to do something, just tell it what to do. And we use declarative systems in our lives every single day. So that part is still the same; there's still a declarative API for doing ETL.
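To make the "what, not how" point concrete, here is a minimal editorial sketch of a declarative pipeline using DLT's Python API. The dataset names and the quality rule are invented for the example; the point is that each table is declared as a function and the framework works out dependencies, ordering, and incremental processing:

```python
# Illustrative sketch of declarative ETL with DLT; table and column names are hypothetical.
# Runs inside a DLT pipeline, where the `spark` session is provided by the runtime.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Bronze: raw orders as they arrive.")
def orders_bronze():
    return spark.readStream.table("raw.orders")

@dlt.table(comment="Silver: cleaned orders.")
@dlt.expect_or_drop("positive_amount", "amount > 0")  # declarative data quality rule
def orders_clean():
    # Declare the desired result; DLT schedules and incrementally maintains it.
    return dlt.read_stream("orders_bronze").where(col("customer_id").isNotNull())
```

Nothing here says when or in what order to run the steps; that is exactly the orchestration work DLT takes off the engineer's plate.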
04:30 - 05:00 Okay, so what's different? There are a lot of new things under the covers, and we've really gone out of our way to make sure these are backwards compatible. Let me go through a few. First of all, there's been a lot of focus on price performance, which is a fancy way of saying it should be more affordable. The second piece is adding some new APIs for flexibility, for more use cases, and I can get into that. DLT is very opinionated, and sometimes it's been the wrong kind of opinion; it sort of boxes you in. Sometimes it's like: I got started with it, I wish I could do this thing with it, and you can't, it's too opinionated. So we've gone ahead and opened some
05:00 - 05:30 doors here to make it more flexible. The third thing is we've been working on the new developer experience. That's one I'm really excited about. There's a brand new developer experience coming, with super cool stuff that really makes it easier for data engineers to just get their jobs done, and there are some cool new surprises in there. And the other new thing we've been doing is really making sure that the materialized views and streaming tables that DLT creates behave just like normal tables and views. There are a
05:30 - 06:00 lot of superpowers that streaming tables and MVs have, but in the past you couldn't do certain Delta operations on these STs and MVs. Now you can: you can liquid cluster them, you can predictively optimize them. All those things will just work, which means you don't have to think about them anymore. Those are the things that are new and different that just make it a better platform.
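As a rough illustration of the operations Bilal lists, the statements below show what liquid clustering and predictive optimization can look like when applied to a table from Python. The catalog, table, and column names are placeholders, and exact syntax and support on DLT-managed streaming tables and materialized views can vary by Databricks release, so treat this as a sketch rather than a recipe:

```python
# Hypothetical sketch: Delta maintenance operations mentioned in the episode.
# Names are placeholders; syntax support can vary by Databricks release.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Liquid clustering: cluster the table by a frequently filtered column.
spark.sql("ALTER TABLE main.sales.orders_clean CLUSTER BY (customer_id)")

# Predictive optimization: let the platform schedule OPTIMIZE/VACUUM automatically.
spark.sql("ALTER TABLE main.sales.orders_clean ENABLE PREDICTIVE OPTIMIZATION")
```

The "just works" claim in the interview is precisely that such operations no longer need special-casing for DLT-created tables.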
06:00 - 06:30 And for me at least, having gotten a little bit of the behind-the-scenes of it, I would say that from the outside the developer experience is what feels very different. It's like, oh, this is actually much more organized, which is a gift. And the whole idea behind it is that the best developer experience just gets out of your way, right? It lets you get your job done, it helps you at key steps, but it mostly stays out of your way. So we're excited about that. So, obviously, Databricks has a few different partners in the
06:30 - 07:00 ecosystem, right? In the ecosystem you have the Fivetrans, you have the Prophecies, you have all these different companies. How is LakeFlow as a whole similar in terms of what they have to offer, and how is it different? What are the areas where there's value to using both? Okay, yeah, that's a great question. I'll give a high-
07:00 - 07:30 level answer and then get into specifics. At a high level, the reason these different products exist is that there are so many different personas and so many different types of users. The way I look at the world is that there are the classic data engineers: they're writing Python, they're writing SQL, but really they're building pipelines in a modularized, testable, productionized way. And then on the other end of the spectrum you also have someone like me, who
07:30 - 08:00 might just drop in one day, do a little analysis, and it kind of takes off; maybe it's built visually. And then there's the entire range in between, right? There are people who build pipelines that ingest data using code and testing and monitoring, the Grafana kind of thing, and then there are users who use, say, Fivetran, and it's point-and-click, set it and forget it. So, long-winded answer: there's lots of space, and lots of space for partners. The fundamental idea
08:00 - 08:30 behind LakeFlow is to unify and simplify what Databricks has in the box, right? That's the fundamental idea. It actually removes very little space from the ecosystem. Let's take a case study, let's take ingestion for a second. The average Databricks customer, I'd be shocked if they have fewer than 40 or 50 data sources, right? There are often common ones like Salesforce and some sort of database, but then there are all kinds of legacy systems. Some guy
08:30 - 09:00 wrote a system 20 years ago, we still have data in it, and we need to get that in; it's a proprietary SOAP API of some kind, because we all used to do web RPC back in the day. So if you think about Connect, LakeFlow Connect will definitely help you with the Salesforce connector; it'll help you with the SQL Server connector. Then there's this long tail of things that it won't help you with, and it probably doesn't make sense for Databricks to build and maintain those connectors anyway; there are hundreds, maybe thousands, of them. That's where Fivetran is a great choice.
09:00 - 09:30 And if you're already using Fivetran, you continue to use it. I can take another cut at it and say, well, how about dbt? Isn't DLT just like dbt? That comes up a lot, and the answer is absolutely not. They're both declarative frameworks, but dbt has a massive open source community, it's very warehouse-centric, and it's very batch-processing-centric, whereas DLT is a declarative framework that's native to Databricks; it's not super warehouse-centric, but it does unify
09:30 - 10:00 batch and streaming. And the answer here is a little bit nuanced, because you can actually use materialized views and streaming tables inside dbt projects, so we look for opportunities, where we can, to make things work better together. Long answer short, I think there's a lot of space in ingestion, transformation, and orchestration for different partners. We actually work really closely with all the companies I've mentioned, including Prophecy. So if you're looking for a great low-code tool and you want to superpower your team, go
10:00 - 10:30 use Prophecy. It's a great tool, and it works natively with Databricks. But definitely there will be areas where they overlap; I'm not going to lie about that. There are definitely areas of overlap, and that's pretty normal for products across the industry. Yeah, and I have more than a few good friends at Prophecy. I have lots of love for them; very awesome people, and it's a great product. So, thinking away from the pure product and Databricks side, and more specifically within the
10:30 - 11:00 realm of product management: when you look back at how Delta Live Tables was originally released, compared to where things are headed with LakeFlow, what are some of the lessons the team has learned, from a pure product standpoint as well as from a customer feedback standpoint? That's a great question. Having spent the last couple of years on this, I can share a few reflections from a product perspective.
11:00 - 11:30 I think one is that paradigms, declarative as a paradigm for example, take a long time to deliver on their value, but what they deliver is massive. So one of my learnings is that to build something that is a fundamental change in the industry, be prepared to double and triple down on it; it's just going to take you a while to get there. And I'm not talking about revenue growth, which has been phenomenal for DLT, and I'm not talking about adoption, which is superb. It's more that there's
11:30 - 12:00 always more to do, and you always feel, gosh, I'm almost there, but because it's such a fundamental shift in the way we work and its promise is so large, it just takes time to get there. I would say the second one is going to sound pedestrian, but listen to your customers. One of the things I'm really proud of is that this team listens a lot to customers and we talk to them a lot, including you, by the way; we talk a lot to get feedback from you. But the key here is to get unvarnished feedback, right? So don't pitch to your customers.
12:00 - 12:30 Just listen. Just ask them what's wrong, what's broken, what's not working for you. When that happens, it's really magical, so just do more of that. I would say the third lesson, a product management lesson learned, is that things that simplify the experience often sit at the top of the stack. For example, DLT sits at the top of the stack: if you go all the way down, it's Spark and compute and things like that. But DLT abstracts a lot of
12:30 - 13:00 that away, and customers don't care. So when there's a problem with Spark or Auto Loader or compute, to customers using DLT, it's DLT's problem; it's the DLT team's problem. So the lesson for product managers is that customers don't care that the issue is five teams away or five time zones away. You have to make sure that they get the right experience, and also that their issues get resolved. And I will take that even a step
13:00 - 13:30 further and say customers don't even care whether it's DLT or whatever; they care that Databricks is not working. That's exactly right. So simple things often hide a lot of complexity under the hood, and that's been really fun to build out. Doubling down and tripling down, but at the same time listening to your customer feedback: how do you balance between the two? Because sometimes as a
13:30 - 14:00 company you do make bets that turn out to be, "Oh shoot, that was a dumb bet that we made." How do you balance the vision that you have, coming perhaps all the way up from the Alis, the Arsalans, the Reynolds, and the Mateis, with, okay, this is what the customers are telling us? I'll be a bit transparent with you. We've heard feedback about DLT, right? So,
14:00 - 14:30 for example, one piece of feedback we heard in a rising crescendo was: hey, the errors are really bad, they're super obscure and hard to understand, right? So the really important thing here is that Ali doesn't matter, Bilal doesn't matter; the customer actually matters the most. So I think what you do is you become truth-seeking and you stay truth-seeking, and what that really means is
14:30 - 15:00 that the customer's voice is king and the customer's experience is what ultimately matters. Vision doesn't count for a whole lot if you can't execute on it. So the doubling and tripling down is actually on customer feedback. And there's another side to the customer feedback, which is: gosh, this is so useful, this just took away 20 hours' worth of work. Very common feedback on DLT is: oh my gosh, building this would have taken me three more engineers and twice the time if
15:00 - 15:30 this was Airflow or PySpark, but the errors are really bad, or I wish it was a better developer experience, can you guys fix that? So that's the truth-seeking, and that's the doubling and tripling down. You have to have the upside, right? You have to be able to say, "Hey, customers really love it, but we've got to fix these sharp edges, we've got to file them away." That comes from being truth-seeking, right? You can't take something where customers are saying, "Gosh, this is terrible. It
15:30 - 16:00 doesn't even work." That's probably not something you should double or triple down on. But if they are saying it's phenomenal, but I need you to fix these things, then you can double and triple down on it. That's fair. [Music]