Understanding the Role and Impact of Data Engineering
What Is Data Engineering - Why Is Data Engineering Important?
Estimated read time: 1:20
Summary
Data engineering is becoming increasingly important due to the rising demand for data-driven insights. While many associate it with tools like data warehouses and ETL processes, the core goal is to transform raw data into accessible, reliable, and high-quality information for stakeholders. This involves creating systems for data integration, ensuring consistent and reliable datasets, and simplifying complex data structures into more analysis-friendly models. Data engineers also act as intermediaries between teams to resolve data inconsistencies and support the company's data-driven decision-making.
Highlights
- Data engineering is witnessing a surge in job demand across various sectors.
- The goal isn't just about tools like ETL or data lakes; it's about making data usable.
- Creating a 'source of truth' is instrumental for accurate analytics.
- Data engineers facilitate integration of complex datasets.
- Adopting models like star schemas simplifies data analysis.
- High emphasis on ensuring data is both accessible and trustworthy.
Key Takeaways
- Data engineering is key to transforming raw data into usable information.
- Core goal: Make data accessible, reliable, and high-quality.
- Data integration harmonizes disparate datasets for cohesive analysis.
- Simplifying data models helps analysts and data scientists.
- Data engineers ensure data is also secure and compliant.
Overview
Data engineering's significance is expanding with increasing job opportunities and demand for data-oriented solutions. Many might initially mistake the role for its tools, such as ETL or data lakes, but the essence lies in transforming data into a usable form. By focusing on refining raw data into reliable and consistent information, data engineers enable informed decision-making across companies.
One of the central tasks of data engineers is to integrate diverse datasets, which often requires creative solutions for joining data with inconsistent keys. By consolidating data into coherent, reproducible datasets, data engineers let analysts use them without delving into the underlying complexity. Structures like star schemas make data intuitive to analyze, significantly lowering the technical barrier for end users.
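As a rough illustration of joining on an inconsistent key, the sketch below normalizes an email address before matching records from two systems. The record layouts, field names, and the `normalize_email` and `join_on_email` helpers are all hypothetical; real integrations usually need more than trimming and lowercasing.

```python
# Hypothetical records from two systems that describe the same customers
# but format the shared key (email) differently.
crm_customers = [
    {"email": " Ada@Example.com ", "name": "Ada Lovelace"},
    {"email": "grace@example.com", "name": "Grace Hopper"},
]
billing_accounts = [
    {"email": "ada@example.com", "plan": "pro"},
    {"email": "GRACE@EXAMPLE.COM", "plan": "free"},
    {"email": "alan@example.com", "plan": "pro"},
]


def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase so equivalent emails compare equal."""
    return raw.strip().lower()


def join_on_email(left, right):
    """Inner-join two lists of dicts on the normalized email key."""
    right_by_key = {normalize_email(r["email"]): r for r in right}
    joined = []
    for row in left:
        key = normalize_email(row["email"])
        match = right_by_key.get(key)
        if match is not None:
            joined.append({"email": key, "name": row["name"], "plan": match["plan"]})
    return joined


customers_with_plan = join_on_email(crm_customers, billing_accounts)
```

Keeping the normalization in one shared function is the point: every team that joins these datasets applies the same rule and gets the same result.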
Beyond mere data transformation, data engineers focus on the larger picture of data management, including security, compliance, and ensuring that data aligns with business needs. This holistic approach ensures companies not only collect data but transform it into strategic assets, which can drive insights and growth. A data engineer's work forms the foundation for a company's data-driven culture.
Chapters
- 00:00 - 00:30: Introduction to Data Engineering The chapter 'Introduction to Data Engineering' highlights the growing demand for data engineering jobs as evidenced by various surveys and studies. It acknowledges that many, especially recent college graduates, may not be familiar with the term or the duties of a data engineer. The chapter emphasizes the technologies and tools commonly associated with data engineering, such as data warehouses, data lakes, and ETL processes.
- 00:30 - 01:00: Understanding the Core of Data Engineering This chapter focuses on demystifying what data engineering truly is, beyond just the tools and methods used in the profession. It delves into the real goal or essence of data engineering, which is succinctly defined by Joe Reis in his book 'Fundamentals of Data Engineering.' The definition highlights that data engineering involves the development, implementation, and upkeep of systems and processes for transforming data from its raw state into usable formats.
- 01:00 - 02:00: Goals and Objectives of Data Engineering The chapter titled 'Goals and Objectives of Data Engineering' focuses on the primary aim of data engineering, which is to produce high quality and consistent information for downstream uses, such as analysis and machine learning. It emphasizes that data engineering involves transforming raw data sets to make them accessible and easy to work with for stakeholders. Additionally, the chapter highlights the importance of ensuring data is reliable, robust, and reusable, allowing for repeatability in processes to facilitate ongoing usability.
- 02:00 - 03:30: Integration of Data Sets This chapter discusses the challenges and solutions related to integrating data sets. It highlights issues such as duplicate data, hard-to-access data, and data modeling that isn't user-friendly for less technical individuals. It emphasizes the need for simplifying data integration across various data sets and domains, filling in missing values and business logic, and making data more usable and accessible.
- 03:30 - 05:00: Challenges in Data Integration The chapter titled 'Challenges in Data Integration' focuses on the ultimate objective of data integration, which is to make data accessible to everyone in a company, not just those with technical expertise. It highlights the various tools and methods used in data integration, such as streaming, ETLs (Extract, Transform, Load), ELTs (Extract, Load, Transform), data warehouses, and data lakes. These are presented as means to reach the end goal of accessibility. The discussion emphasizes the presence of raw data sets representing business domains, departments, and transactions, which need to be effectively managed and integrated.
- 05:00 - 06:30: Role of Data Engineers This chapter covers the crucial role data engineers play in creating a 'source of truth' for businesses, particularly from the analytics perspective. While the ultimate source of truth lies with the business applications, data engineers strive to establish a reliable dataset that analytics and data science teams can utilize. The inherent challenge is that data does not naturally exist in this optimal shape, necessitating transformation efforts.
- 06:30 - 07:30: Data Modeling and Remodelling The chapter discusses the challenges and strategies in data integration, particularly focusing on how different data sets might be joined even when not inherently compatible. It highlights examples from the real world, stressing the importance of organic collaboration among data scientists and analysts. Teams may use various keys, such as address or email, to merge data sets effectively. The narrative underscores the creative and problem-solving aspects of data modeling.
- 07:30 - 08:30: Historical Data Tracking and Security This chapter focuses on the intricacies of historical data tracking and security in the domain of data engineering. It highlights the challenges associated with connecting disparate customer data, such as using first and last name combinations, which can lead to incorrect and unreliable results. The chapter emphasizes the importance of creating reproducible data sets that provide clear guidance on how data should be joined effectively, thus preventing the pitfalls of forced data connections.
- 08:30 - 09:00: Conclusion In this chapter, the author discusses a challenge faced in integrating data between two disparate project management systems within a large enterprise. One system was responsible for tracking employee hours, while the other managed higher-level financial information, such as budget spend and accounting data. The task involved creating a method to specifically match and attach employee hours to corresponding financial line items in the budget.
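The transcript describes the matching problem in more detail: one system keyed projects by a clean project ID, while the other stored a free-text "project number" that might hold one ID, several comma-separated IDs, or something unrelated such as a budget amount. A minimal sketch of triaging such a field is shown here. The `P-` followed by digits ID format, the sample rows, and the `extract_project_ids` helper are assumptions made for illustration, not details from the video.

```python
import re

# Assumption: valid project IDs look like "P-" followed by digits.
PROJECT_ID_PATTERN = re.compile(r"\bP-\d+\b", re.IGNORECASE)


def extract_project_ids(project_number: str) -> list[str]:
    """Return every recognizable project ID found in a free-text field."""
    return [m.upper() for m in PROJECT_ID_PATTERN.findall(project_number)]


# Hypothetical rows from the hours-tracking system.
hours_rows = [
    {"project_number": "P-101", "hours": 8.0},
    {"project_number": "p-101, P-102", "hours": 6.0},  # several IDs in one field
    {"project_number": "250000", "hours": 4.0},        # a budget amount, not an ID
]

matched, needs_review = [], []
for row in hours_rows:
    ids = extract_project_ids(row["project_number"])
    if len(ids) == 1:
        # Exactly one ID: safe to attach the hours to that project.
        matched.append({"project_id": ids[0], "hours": row["hours"]})
    else:
        # Zero or multiple IDs: ambiguous, so route to a human instead of guessing.
        needs_review.append(row)
```

Rows that land in `needs_review` are the ones to take back to the teams entering the data, which matches the arbiter role the speaker describes.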
What Is Data Engineering - Why Is Data Engineering Important? Transcription
- 00:00 - 00:30 data engineering continues to grow in terms of jobs you can see this across multiple surveys and studies I'm going to put a few up here from various people kind of showing that there is increasing demand for people with data engineering skill sets but if you're like me and you just got out of college you had no idea what the term data engineering was or what they do if you look across all the videos including my own you'll see lots of Technologies you'll see lots of tools you'll hear words like data warehouse data Lake ETL and a whole bunch of other
- 00:30 - 01:00 stuff and maybe you get confused and think that's what data engineering is and yes those are the tools and methods that we use to do our job but that's not the goal or core of what data engineering is about so let's dive into understanding what is the goal of data engineering one great definition is from Joe Reis and his book Fundamentals of Data Engineering where he describes it as the development implementation and maintenance of systems and processes that take data from a raw state and
- 01:00 - 01:30 produce high quality consistent information that supports Downstream use cases such as analysis and machine learning now let me consolidate that even more and say the goal of data engineering is to take raw data sets and make them easy to work with for our stakeholders yes there's other things like we want to make sure that they're reliable they're robust and we can continue to use them moving forward by making them repeatable but overall the goal is to take data that is often very difficult to work with for various
- 01:30 - 02:00 reasons sometimes it's because there's a lot of duplicate data sometimes it's because the data itself was hard to access or maybe just modeled in such a way that wasn't conducive to someone who is maybe either less technical or doesn't have the time to actually sit there and process it to actually use it very easily as well as you know maybe filling in missing values missing business logic whatever is missing and making it easy to integrate across various data sets and domains
- 02:00 - 02:30 um and that really is kind of the end goal everything else around it whether it be streaming ETLs ELTs data warehouses data Lakes all of that is just tools and methods that we use to actually get to this end goal of making data accessible to everybody in the company not just people who are technical the way I think about it is usually that there are all of these uh raw data sets that represent business domains and departments and transactions
- 02:30 - 03:00 that are occurring and what we want to do is create some sort of core data set oftentimes this is referred to as a source of Truth for the business at least for the analytics standpoint and that is the goal I always say that it's always a goal source of Truth is a goal it's rarely a full-on destination because things always change and really the source of Truth are the applications themselves but we're really trying to create a source of truth that the analytics teams and data science teams can work with and the problem is that data is not in that shape let's give a
- 03:00 - 03:30 few examples from my own career as well as just in general let's talk about integration or being able to join different data sets even when they don't really work well together well if you let this happen organically between let's say data scientists or data analysts they might find some sort of key that they could join two data sets together maybe for one data science team they'll join two data sets together via address and another data science team they'll join via email or possibly
- 03:30 - 04:00 first last name combinations and trying to join them because they want to somehow Connect customer data together that doesn't naturally connect what can end up happening here is then you end up with very different results and so the goal for a data engineering standpoint is to create data sets that are reproducible that everyone can rely on that don't have questions in terms of how you should join them but instead there is some clear path forward in terms of how data sets that don't work well fit together another good example that I usually give in terms of a bad uh
- 04:00 - 04:30 case of this was when I had to kind of create some way to integrate data between two different systems that were managing project data if you are at a large Enterprise there's often multiple project management systems in this case there was one that tracked hours and one that tracked higher level information like budget spend different kind of accounting information and they wanted to attach hours specifically to that line item like employee time and
- 04:30 - 05:00 contractor time in the other project management system and then serve those up in reports now the problem was one system had a clear ID of where this was and now the problem was one system had a clear ID that was Project ID and the other system used something that was called project number the problem with project number was it was an open text field meaning anyone could put anything they wanted there was no drop down that would only limit you to the set of IDs so we had a whole cluster of types of inputs some people did use product
- 05:00 - 05:30 other people did things like project ID comma project ID comma project ID in a single field and still others thought the field meant completely different things like the amount of budget that the specific project had meaning we were very limited on how we could integrate and Report across those systems so as the data engineer I had to go in and not only obviously add in the technical aspect of it and trying to join this data and create data sets that were connectable between two different systems but also then going to teams and
- 05:30 - 06:00 going to the PMO and talking to them and being like hey we need to work on how we input this data here in order for us to do reporting later on so we kind of sometimes even have to act as some sort of arbiter between two teams to make sure that communication is correct so that way when the data finally does get to the analysts it is in a format that everyone can understand there are still other issues that data Engineers have to work on in order to make data more accessible for the end user this includes remodeling the data so we often
- 06:00 - 06:30 do all this crazy reprocessing like Star schema models or snowflake models or whatever model you're picking uh today so we often do all this remodeling whether it's something as simple as a star schema where we create a fact table and all the dimensions around it so that analysts and data scientists don't have to deal with the raw state of the model generally most data models that come from an operational system are in some third normal form you know they're heavily normalized this means that you're gonna have to do tons of joins to
- 06:30 - 07:00 kind of get data um into one single table and that means you need to know how to do said joins but a star schema approach or something similar is a much more simplified approach where there is a centralized fact table that often represents the core focus of what you're trying to do and then everything around it dimensionally is all about generally what you're slicing and dicing on maybe it's region location or maybe it's something like stores and product type whatever it might be this kind of makes
- 07:00 - 07:30 it a little bit of a simpler uh approach for analysts to work with and even then we might even go one step farther and create a plus type table at least that's what I've called it in the past where you just join all that information together up front so there's no confusion whatsoever in terms of how this data works together and another Point you'll often always hear is adding historical information and being able to track it not all systems track historical information correctly especially when it comes to things like dimensional information like customers maybe a customer moves or if we're
- 07:30 - 08:00 referring to employees maybe they change job type and you might lose that historical information over time and thus data Engineers will often put some layers in that will help you track any of that historical information but all of this and again whether it's in a data warehouse a data lakehouse whether it's being an ETL that runs the data or an ELT all of this is focused on making the data easier to access now on top of this data Engineers do often do other things including having to manage things like security data management in general orchestration and all of these
- 08:00 - 08:30 other tasks but even all of this the end goal is providing some sort of layer of data that an analyst can access and yes there's other things we take into consideration like security and now with you know GDPR other forms of compliance and privacy but the end goal of what we do is really trying to make data as accessible and trustworthy and reliable at companies as possible yes there's a bunch of fancy tools and best practices you can sit on top of that data but really that is what I would say
- 08:30 - 09:00 is the quintessential goal with that guys I want to say thanks so much for watching this video and I will see you next time thanks and goodbye [Music] thank you [Music]
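The star schema idea from the 06:00 to 07:30 segments can be sketched with Python's built-in `sqlite3` module: a central fact table, dimension tables around it, and a pre-joined wide view so analysts do not need to write the joins themselves. The table names, columns, and sample rows are hypothetical and chosen only to mirror the region, store, and product type examples the speaker mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: what analysts slice and dice on.
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_type TEXT);

    -- Fact table: the core measurements, keyed to the dimensions.
    CREATE TABLE fact_sales (
        store_id INTEGER REFERENCES dim_store(store_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );

    INSERT INTO dim_store VALUES (1, 'West'), (2, 'East');
    INSERT INTO dim_product VALUES (10, 'Grocery'), (11, 'Apparel');
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 35.0), (2, 10, 15.0);

    -- Pre-joined wide view so analysts do not have to write the joins.
    CREATE VIEW sales_wide AS
    SELECT s.region, p.product_type, f.amount
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    JOIN dim_product AS p ON p.product_id = f.product_id;
    """
)

# An analyst-style query against the wide view: no joins required.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales_wide GROUP BY region ORDER BY region"
).fetchall()
```

The `sales_wide` view plays the role of the up-front joined table the speaker describes; in a real warehouse it might be materialized as a table rather than defined as a view.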