Azure End-To-End Data Engineering Project (From Scratch!)

Estimated read time: 1:20

    Summary

    In this engaging end-to-end data engineering project led by Ansh Lamba, you'll explore the world of Azure Data Engineering. The project covers all in-demand Azure tools and technologies like Azure Data Factory, Azure Databricks, and Azure Synapse Analytics from scratch. The video is crafted to help you crack interview questions by simulating real-world scenarios that highlight the skills required for a data engineering role. Furthermore, you will learn to implement dynamic pipelines, create a medallion architecture (bronze, silver, and gold layers), and make Power BI connections. This project is perfect for those aiming to deepen their data engineering knowledge while preparing for real-life applications.

      Highlights

      • Uncover how to dynamically create and manage data pipelines using Azure Data Factory 🧑‍💻.
      • See how real-time scenarios can highlight your skills to potential employers and set you apart in interviews 🎯.
      • Learn from detailed step-by-step walkthroughs on creating robust systems using dynamic parameterization in Azure 💼.
      • Get hands-on with Azure Databricks and perform big data transformations using Spark 🤓.
      • Understand how to create secure API connections and manage identities across Azure services 🔐.
      • Visualize data with Power BI, showcasing your transformed datasets effectively in real-time dashboards 📈.

      Key Takeaways

      • Learn to build dynamic pipelines to efficiently manage data transfers between Azure services 🚀.
      • Discover how to use Azure Data Factory to orchestrate complex ETL workflows with ease 🛠️.
      • Understand the Medallion architecture's bronze, silver, and gold data layers and the role they play in data organization 💾.
      • Explore Azure Databricks for powerful big data transformations and how to integrate it within your data pipeline 💡.
      • Gain insights into connecting Azure Synapse Analytics with Power BI to display data in insightful dashboards 📊.
      • Master the complexities of Azure Data Lake and optimize data storage and access for large datasets 🌐.

      Overview

      The project kicks off by introducing the Azure environment and setting up the necessary resources. You'll start by creating dynamic pipelines in Azure Data Factory, which let you manage data transfers efficiently across services. The main goal is to learn how to automate processes and reduce manual intervention, setting a strong foundation for your data engineering tasks.
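
      To make "dynamic" concrete: the pattern the project builds in Data Factory is a Lookup activity that reads a small config, a ForEach activity that loops over it, and a parameterized Copy activity that moves each file over HTTP into the bronze container. The sketch below mirrors that pattern in plain Python so the logic is easy to see; the repository URL, file names, and folder layout are illustrative placeholders, not the exact values used in the project.

```python
import requests  # pip install requests
from pathlib import Path

# Hypothetical config, playing the role of the JSON a Lookup activity would return:
# each entry parameterizes one copy (source file -> folder inside the bronze zone).
COPY_CONFIG = [
    {"rel_url": "AdventureWorks_Products.csv",   "sink_folder": "AdventureWorks_Products"},
    {"rel_url": "AdventureWorks_Customers.csv",  "sink_folder": "AdventureWorks_Customers"},
    {"rel_url": "AdventureWorks_Sales_2017.csv", "sink_folder": "AdventureWorks_Sales"},
]

BASE_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/data/"  # placeholder source
BRONZE_ROOT = Path("bronze")  # local stand-in for the ADLS bronze container


def copy_file(rel_url: str, sink_folder: str) -> None:
    """Pull one file over HTTP and land it unchanged in the bronze zone."""
    response = requests.get(BASE_URL + rel_url, timeout=30)
    response.raise_for_status()
    target_dir = BRONZE_ROOT / sink_folder
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / rel_url).write_bytes(response.content)


# The ForEach equivalent: iterate the config instead of hard-coding one pipeline per file.
for item in COPY_CONFIG:
    copy_file(item["rel_url"], item["sink_folder"])
```

      Adding a new source file then means adding one line of config rather than a new pipeline, which is exactly the benefit of parameterizing the Copy activity in ADF.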

        Next, you'll delve into Azure Databricks to handle big data transformations with ease. This section covers real-time data streaming and batch processing, highlighting the practical skills expected in data engineering roles. Through hands-on demonstrations, you'll build pipelines that process data in a medallion architecture: the bronze, silver, and gold layers that keep data clean and organized at each step.
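
        The Databricks part of the project is essentially this read-transform-write step repeated per table. Below is a minimal PySpark sketch of a bronze-to-silver notebook cell, not the video's exact code: the storage account name, paths, column names, and the specific cleanups are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

# Placeholder ADLS Gen2 paths; in Databricks these use the abfss:// scheme plus
# whatever authentication (e.g. a service principal) the workspace is configured with.
bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/AdventureWorks_Sales"
silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/AdventureWorks_Sales"

# Bronze: the raw CSV exactly as it was copied from the source.
sales_raw = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(bronze_path)
)

# Silver: a few illustrative cleanups -- proper date type and simple derived columns.
sales_silver = (
    sales_raw
    .withColumn("OrderDate", F.to_date(F.col("OrderDate")))  # string -> date
    .withColumn("OrderYear", F.year(F.col("OrderDate")))
    .withColumn("OrderMonth", F.month(F.col("OrderDate")))
)

# Write the cleaned data to the silver zone in a columnar format.
sales_silver.write.mode("overwrite").format("parquet").save(silver_path)
```

        The gold step usually repeats the same cycle from silver into serving-ready tables (often Parquet or Delta) that the warehouse layer can query.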

          Finally, the course rounds off by establishing a connection between Azure Synapse Analytics and Power BI. This helps in visualizing the transformed data effectively and creating insightful dashboards that support business decision-making. You'll gain the confidence to face interview questions about setting up robust, scalable data solutions with Azure technologies.
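
          Under the hood, Power BI simply issues SQL against the Synapse workspace's SQL endpoint, so one quick way to verify the connection a data analyst would use is to query that endpoint yourself. Here is a hedged sketch using pyodbc; the server name, database, credentials, and view name are placeholders rather than the project's actual values, and in Synapse serverless the gold objects are typically exposed as views or external tables over the data lake.

```python
import pyodbc  # pip install pyodbc; also needs the Microsoft ODBC Driver for SQL Server

# Placeholder serverless SQL endpoint of a Synapse workspace and a gold-layer view.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "Database=gold_db;"
    "Uid=<sql-user>;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

# Effectively the same query Power BI runs once the connection is established.
query = "SELECT TOP 10 * FROM gold.sales_view;"

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    for row in cursor.execute(query):
        print(row)
```

          If this query works, pointing Power BI's Synapse/SQL connector at the same server and database should work too, which is the connection scenario the course walks through.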

            Chapters

            • 00:00 - 00:30: Introduction This chapter, titled 'Introduction', is about a project that played a significant role in securing multiple job offers, landing the narrator a data engineer position. It focuses on an Azure data engineering project aiming to teach in-demand Azure tools and technologies including Azure Data Factory, Azure Databricks, Azure Synapse Analytics, managed identities, and API connections. The chapter hints at three unique aspects of this project that are not covered in other YouTube videos.
            • 00:30 - 01:30: Overview of Project Structure This chapter provides an overview of the in-demand Azure tools and technologies such as Azure Data Factory, Azure Databricks, Azure Synapse Analytics, and more. It highlights that all these topics are covered comprehensively from scratch within a single video. Additionally, the chapter emphasizes covering real-time scenarios commonly asked in interviews and includes some interview questions to better prepare viewers.
            • 01:30 - 02:30: Introduction to Data Engineering Project The chapter introduces a comprehensive data engineering project designed to prepare individuals for interviews and roles in the field. It is presented as an invaluable resource for tackling real-world problems and is aimed at providing an end-to-end solution. The speaker notes the personal benefits of this project, mentioning its role in securing their first data engineering job.
            • 02:30 - 03:30: Importance of Data Architecture The chapter titled 'Importance of Data Architecture' emphasizes the critical role of data architecture in data projects. The chapter begins with the author sharing a personal experience from a project, highlighting how showcasing real-time scenarios during interviews can effectively demonstrate one's skill set. The author expresses enthusiasm about sharing knowledge and contributing to the data community. Subsequently, the chapter transitions into discussing the key stages of a data project, particularly stressing the importance of the data architecture stage, suggesting it to be a foundational step in ensuring success.
            • 03:30 - 04:30: Understanding Data Sources and Tools The chapter titled 'Understanding Data Sources and Tools' focuses on the importance of having a solid data architecture blueprint for the success of any project. It emphasizes that a well-defined architecture provides the right path to succeed without much hassle. The chapter then delves into the specifics of the data architecture for this particular project, which uses an HTTP connection as its source.
            • 04:30 - 05:30: Using Azure Data Factory The chapter 'Using Azure Data Factory' focuses on real-world scenarios of data extraction. It introduces the process of pulling data directly from APIs, specifically using a GitHub account as an example. The aim is to teach how to fetch data directly from APIs, offering a practical and engaging learning experience.
            • 05:30 - 06:30: Dynamic Pipelines and Real-Time Scenarios This chapter introduces Azure Data Factory, a highly demanded orchestration tool in the field of data engineering. It emphasizes its powerful capabilities despite being a low to no code tool. The narrator shares a personal affinity for working with Azure Data Factory, and mentions its presence within Synapse Analytics, known as Synapse Data Pipelines.
            • 06:30 - 07:30: Understanding Medallion Architecture The chapter stresses the importance of being familiar with tools like Data Factory and with the medallion architecture, especially for data engineers, and hints at the real-time scenarios covered later and the challenges the author faced while preparing for data engineering roles.
            • 07:30 - 08:30: Introduction to Databricks In the 'Introduction to Databricks' chapter, the focus is on addressing the lack of real-time scenarios often asked about in interviews. The chapter covers moving beyond static pipelines to dynamic ones built with parameters and loops, and introduces Databricks as the next, more powerful and flexible tool in the stack, promising significant practical, hands-on knowledge of it.
            • 08:30 - 09:30: Writing Transformations in Databricks The chapter discusses the concept of the Medallion Architecture in data engineering solutions, focusing on data transformation in Databricks. It mentions the bronze layer as the initial phase where raw data is landed. The architecture organizes data into three zones: raw (bronze), transformed (silver), and refined (gold). The approach is key in structuring and processing data efficiently.
            • 09:30 - 10:30: Working with Storage Accounts and Containers The chapter discusses the concept of data storage layers, specifically focusing on the raw or bronze layer. This initial layer is where data from the source is stored in its original form without any transformations. An example given is creating an exact replica of a file from the source in the raw zone, maintaining its integrity and original state (a short programmatic sketch of this storage setup appears just after the chapter list).
            • 10:30 - 11:30: Setting up Azure Data Factory This chapter discusses the importance of Azure Data Factory in the realm of data engineering, and also highlights the power of the Spark clusters behind Databricks. It emphasizes Databricks' position as a leading tool due to its capabilities in handling big data, suggesting that Databricks is a pivotal element in data engineering today, with strong demand in the industry.
            • 11:30 - 12:30: Building and Running Pipelines The chapter discusses the rising demand for Databricks and Spark developers. It promises comprehensive learning about Databricks through the project, including real-time scenarios. The focus will be on extracting data from the bronze layer and processing it with the tool.
            • 12:30 - 13:30: Introduction to Medallion Architecture The Introduction to Medallion Architecture begins by discussing the transformation process when moving data to the Silver layer. It emphasizes the application of various transformations and functions in data engineering, specifically using Spark. The chapter promises to cover these transformations from scratch, ensuring a comprehensive understanding. It briefly mentions the next step, which is the serving layer, indicating that after transformations and data cleaning, it is essential to prepare the data for serving. The chapter sets the stage for exploring these processes in detail.
            • 13:30 - 14:30: Working with Databricks Clusters The chapter focuses on the role of different stakeholders such as data analysts, data scientists, and analytics managers in the process of building a data warehouse. It highlights Azure Synapse Analytics as a popular solution for data warehousing, emphasizing its importance as the gold layer, where data is prepared and ready to be served to stakeholders or other end-users.
            • 14:30 - 15:30: Introduction to Power BI and its Use The chapter "Introduction to Power BI and its Use" begins with a discussion between developers about the essential skills required for a data engineer, particularly in relation to Power BI. The developer highlights the importance of learning how to establish connections in Power BI, emphasizing that it is a necessary skill in an engineer's toolkit. The conversation addresses a scenario where a data analyst encounters issues in setting up a connection and seeks assistance from a data engineer, illustrating the practical need for data engineers to be equipped with a comprehensive understanding of end-to-end solutions.
            • 15:30 - 16:30: Setting Up Azure Free Account The chapter discusses the importance of connecting Azure services to Power BI. It emphasizes understanding linked services and the process of pulling tables and data warehouse facts and dimensions into Power BI. While the focus isn't deeply on Power BI, it includes a brief overview that the reader will appreciate.
            • 16:30 - 17:30: Setting Up Azure Portal and Resources The chapter introduces fundamental terms necessary for setting up Azure Portal and resources, crucial for any data engineer working on implementing solutions.
            • 17:30 - 18:30: Data Source Exploration and Analysis The chapter emphasizes the importance of mastering fundamental areas of data source exploration and analysis, which are sometimes overlooked but crucial for interviews. It encourages enthusiasm and prompt action in acquiring these skills. The chapter begins by listing prerequisites for a project, starting with a basic requirement of having a laptop, PC, or a MacBook.
            • 18:30 - 19:30: Creating Azure Resources and Configurations The chapter discusses the basic requirements for working with Azure, highlighting the need for a laptop or PC with a stable internet connection and an Azure account. It reassures readers not to worry if they don't have an Azure account, as it explains how to create a free Azure account. The chapter emphasizes that this is not a promotional or commission-based offering; rather, Azure provides free accounts to encourage learning and using their services.
            • 19:30 - 20:30: Establishing API Connections The chapter titled 'Establishing API Connections' begins with encouraging learners to create a free Azure account, emphasizing the importance and excitement of learning in-demand technologies related to data solutions. The transcript suggests that the readers will be exploring multiple services and assures them that the preliminary information provided was essential. The chapter concludes by motivating readers to initiate the project from scratch, signaling a hands-on learning approach.
            • 20:30 - 21:30: Pulling Data from GitHub API This chapter provides a detailed guide on how to create a free Azure account. It starts by instructing users to search for 'Azure free account' on Google. Clicking the first link will direct them to the Microsoft page where they can set up their account.
            • 21:30 - 22:30: Using Data Factory for Data Movement The chapter begins with guidance on accessing Azure's free trial offer. It emphasizes the importance of selecting the 'try Azure for free' option instead of the 'pay as you go' to avoid charges. The transcript reassures users that a Microsoft email account is not necessary to sign up for the trial; a Gmail account or any other email service can be used.
            • 22:30 - 23:30: Introduction to Medallion Architecture Layers This chapter provides an introduction to the Medallion Architecture layers, starting with the steps to create and register an account with Microsoft, either by creating a new one or using an existing Outlook account. The process is straightforward, requiring just a few clicks to set up your account.
            • 23:30 - 24:30: Bronze Layer Data Transformation The chapter titled 'Bronze Layer Data Transformation' discusses the initial setup process for accessing certain services. It guides the user through filling out a form with personal details such as name and address, including verifying a phone number. Once signed up, users have access to a free account for one year, with an additional $200 credit available for the first 30 days. This credit must be utilized within the specified timeframe.
            • 24:30 - 25:30: Usage of Databricks for Big Data The chapter outlines the utilization of Databricks for handling big data projects, assuring readers that the provided $200 credit will suffice to complete the exercises in the project. It also notes that banking details are requested during sign-up and advises readers not to worry about them.
            • 25:30 - 26:30: Setting Up Databricks Workspace This chapter focuses on setting up a Databricks workspace using a trial account provided by Microsoft Azure. It emphasizes that there will be no charges during the trial period, even if Azure prompts to convert the account to a pay-as-you-go service. Users are assured that they will not be charged without explicit consent and are advised to just trust Azure's procedures. After the trial period, it is explained that all resources used will be disabled or removed if no further action is taken.
            • 26:30 - 27:30: Understanding Real-Time Scenarios in Databricks The chapter titled 'Understanding Real-Time Scenarios in Databricks' discusses setting up and utilizing a free Azure account for 30 days. The narrator assures users that they will not be charged and encourages them to trust Azure. Once the Azure account is created, users are guided to navigate to the Azure portal.
            • 27:30 - 28:30: Data Transformation Techniques The chapter begins with a guide on accessing the Microsoft Azure portal. It provides instructions to search for 'portal.azure.com' on Google. Upon accessing the portal, the user needs to enter their credentials to log in. The chapter highlights that the appearance of the Azure portal may vary among users. It serves as a preliminary step before delving into specific data transformation techniques on Azure.
            • 28:30 - 29:30: Advanced Transformation Functions In the chapter titled 'Advanced Transformation Functions', the focus is on understanding the initial steps in data architecture within the Azure portal. It emphasizes the importance of becoming familiar with your data source to ensure a proper grasp of the data being handled, suggesting that users may have different views based on their existing resources. The chapter guides the readers through accessing their Azure account and starting with the data source, setting the foundation for further data transformations.
            • 29:30 - 30:30: Building Dynamic Pipelines In the chapter titled 'Building Dynamic Pipelines', the discussion starts with exploring the reasons for choosing a specific data set. It touches upon 'crazy questions' related to the topic, and introduces the data source being used, which is the widely recognized Adventure Works data set. This data set is popular in the fields of data engineering, analytics, and data science. The chapter emphasizes the comprehensive nature of this data set, as it is filled with numerous tables, providing a distinct advantage when presenting data projects. This diversity and depth make it particularly valuable for showcasing complex data projects.
            • 30:30 - 31:30: Learning Lakehouse Architecture The chapter discusses the importance of working with multiple tables in data projects. It highlights how experience with complex data sets and operations such as joins and lookups enhances one's ability to handle real-world scenarios effectively. The use of a diverse array of tables is emphasized as crucial for building robust solutions.
            • 31:30 - 32:30: Successfully Running Pipelines The chapter titled 'Successfully Running Pipelines' provides an overview of the various tables and datasets involved in managing business data. It begins with the calendar table, which contains date columns, followed by the customers table containing details like name, address, and birth date of customers, essential for aggregation. It also discusses the product categories table, which is meant to be part of a larger dataset. This chapter aims to familiarize the reader with the structure and components of the data needed to set up and execute effective data pipelines.
            • 32:30 - 33:30: Celebrating Success in Pipelines Execution This chapter discusses the process of building a comprehensive data solution involving sales data, product categories, product information, territories, and customer information. It emphasizes the importance of applying necessary lookups to integrate various data tables, including products, product subcategories, and product categories. Additionally, the chapter mentions that the solution covers data over a span of three years (2015, 2016, and 2017) and includes a unique aspect of handling return data.
            • 33:30 - 34:30: Optimized Data Transformation Strategies The chapter titled 'Optimized Data Transformation Strategies' focuses on working with end-to-end solutions using a selected data set. The data set is comprehensive enough to perform various joins and transformations, making it ideal for learning purposes. The chapter discusses working with three years of sales data and returns data, emphasizing the potential for in-depth analysis and transformation of these data tables to extract valuable insights. The content aims to guide readers through advanced techniques for effective data handling and analysis.
            • 34:30 - 35:30: Introduction to Azure Synapse The chapter begins with a discussion on returns data, identifying them as the fact tables. It highlights the importance of dimensions in relation to fact tables, particularly for performing aggregations and providing contextual information. The focus will be to elaborate on these aspects in detail.
            • 35:30 - 36:30: Setting up Synapse Analytics Workspace The chapter titled 'Setting up Synapse Analytics Workspace' discusses building a model with a focus on sales data and returns data, described as 'the center,' with lookups referred to as dimensions. The process is presented as an engaging task. The chapter also mentions the use of Azure Data Factory for loading data into Azure. To facilitate the process, the data is pre-uploaded to a GitHub account for easy access and loading.
            • 36:30 - 37:30: Data Warehousing and Synapse Analytics In this chapter titled 'Data Warehousing and Synapse Analytics', the discussion is focused on pulling data from a GitHub account using an API and pushing this data to the designated storage zone, the raw or bronze zone. This forms the first phase of the project, which involves transferring data from the source to the data lake's bronze zone. The chapter also covers setting up the necessary resources in Azure, starting with the creation of a resource group and other essential components to facilitate the process.
            • 37:30 - 38:30: Creating Databases and Tables in Synapse The chapter begins by instructing users to not immediately click on the 'create a resource' button in the Azure portal due to the importance of setting up a resource group first. Creating a resource group is critical as it serves as the organizing container for all resources, emphasizing careful planning and organization in the resource setup process.
            • 38:30 - 39:30: Introduction to Lakehouse Architecture The chapter provides an introduction to lakehouse architecture. It begins with an overview of the recent services and resources shown in the portal and highlights the ease of navigating and creating resources using the search bar. It stresses the importance of using the right search terms to find resource groups and manage them through the search functionality, a user-friendly approach that streamlines resource management within the Azure portal.
            • 39:30 - 40:30: Establishing Connections Between Tools In this chapter titled 'Establishing Connections Between Tools', the focus is on creating a new Resource Group. The process is straightforward, as it involves clicking a 'create' button and filling out a configuration window. The essential step is to provide the Resource Group name as part of the setup. Existing Resource Groups do not interfere with creating a new one.
            • 40:30 - 41:30: Real-Time Data Analysis with Synapse The chapter discusses real-time data analysis using Synapse in a cloud environment. It involves setting up a unique Resource Group within a subscription, emphasizing the necessity for unique naming conventions to avoid conflicts. The chapter indicates a process of verification to ensure that the chosen Resource Group name is available and meets organizational requirements.
            • 41:30 - 42:30: Understanding Serverless and Dedicated Pools The chapter titled 'Understanding Serverless and Dedicated Pools' discusses the importance of selecting the nearest region, since a nearby region gives lower latency. It suggests picking a region the user is familiar with; in this instance East US is selected because of the speaker's prior experience. The process then moves on to the tag selection step.
            • 42:30 - 43:30: Using Lakehouse for Data Analysis The chapter discusses the use of tags in the context of organizing resources within a data analysis framework. Tags help to categorize resources based on different criteria such as billing structures or categories. The speaker emphasizes that while they are crucial for organization and categorization purposes, at this stage of lakehouse data setup, they can be ignored if there isn't a defined purpose for them yet. The suggestion is to focus on validating and creating setups first, without getting bogged down in tagging details.
            • 43:30 - 44:30: Creating External Tables and Views The chapter discusses the process of creating external tables and views. It starts with instructing the user to click on the 'create' button to form a resource group quickly. It is noted how fast the process can be. Once the resource group is ready, the user should click on it to view its contents. At this stage, the resource group will appear empty because no resources have been added yet. The chapter then notes that there are two methods available for creating resources, although these methods are not detailed in the provided transcript.
            • 44:30 - 45:30: Finalizing Data Layers and Structures The chapter covers the process of finalizing data layers and structures. It starts with the creation of a resource group, which acts as a default container for resources. The process involves automatically picking and organizing resources within this group. An alternative method is introduced by navigating through a home tab to search for the first resource manually, showcasing a different approach to organizing data structures.
            • 45:30 - 46:30: Integration with Power BI The chapter introduces the data lake as the very first resource to create. It hints at an interview setting, posing questions related to data redundancy and storage accounts, and suggests potential questions that could arise in an interview context, aiming to prepare the reader for these scenarios. Although fragmented, the transcript guides the reader through the initial steps and considerations when setting up the data resources, emphasizing the avoidance of data redundancy.
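
            As a companion to the storage chapters above (see the note under the 09:30 - 10:30 chapter), here is a minimal sketch of the same setup done programmatically with the Python SDKs: create the resource group, create a storage account with the hierarchical namespace enabled (which is what turns it into a Data Lake Gen2 account), and create the bronze/silver/gold containers. It assumes the azure-identity, azure-mgmt-resource, azure-mgmt-storage, and azure-storage-file-datalake packages; the subscription ID, names, and region are placeholders, and the video itself performs these steps through the portal UI.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters
from azure.storage.filedatalake import DataLakeServiceClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
credential = DefaultAzureCredential()

# 1. Resource group: the container that holds every other resource in the project.
resource_client = ResourceManagementClient(credential, SUBSCRIPTION_ID)
resource_client.resource_groups.create_or_update("aw-project", {"location": "eastus"})

# 2. Storage account with hierarchical namespace enabled (an ADLS Gen2 data lake),
#    locally redundant (LRS) and on the hot access tier, mirroring the portal choices.
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)
poller = storage_client.storage_accounts.begin_create(
    resource_group_name="aw-project",
    account_name="awstoragedatalake",  # must be globally unique, lowercase, no spaces
    parameters=StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,   # the 'hierarchical namespace' checkbox
        access_tier="Hot",
    ),
)
poller.result()  # wait for the long-running create to finish

# 3. One container (file system) per medallion layer, plus an example nested folder,
#    which is exactly what a flat blob account without HNS cannot give you.
lake = DataLakeServiceClient(
    account_url="https://awstoragedatalake.dfs.core.windows.net",
    credential=credential,
)
for layer in ["bronze", "silver", "gold"]:
    lake.create_file_system(file_system=layer)
lake.get_file_system_client("bronze").create_directory("AdventureWorks_Sales/2015")
```

            The three containers map one-to-one onto the medallion layers described earlier, and the nested directory illustrates why the hierarchical namespace matters for organizing raw files.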

            Azure End-To-End Data Engineering Project (From Scratch!) Transcription

            • 00:00 - 00:30 This project helped me to crack multiple offers and I got placed as an Azure data engineer with the help of this project. In this end-to-end data engineering project we will learn all the in-demand Azure tools and technologies, such as Azure Data Factory, Azure Databricks, Azure Synapse Analytics, along with managed identities, API connections and many more. But what's so special in this particular video which is not available in any other video on YouTube? Well, there are three reasons, let's uncover them. Reason
            • 00:30 - 01:00 number one: all the in-demand Azure tools and technologies such as Azure Data Factory, Azure Databricks, Azure Synapse Analytics and much more are covered in one single video, and do you know the best thing? All are covered from scratch. Reason number two: instead of showing just a simple approach, we will be covering some real-time scenarios which are asked by the interviewers. Reason number three: we will be covering some interview questions as well, because the best way
            • 01:00 - 01:30 to prepare for the interview is covering those questions in the project. So without delaying, let's get started with this amazing project. So welcome, welcome, welcome. This is your one-stop solution for an Azure data engineering project in which we will be covering an end-to-end solution. Yes, this project helped me a lot to crack my first data engineer role, because in this particular project I
            • 01:30 - 02:00 covered real-time scenarios which caught the eyes of the interviewers, and it definitely highlighted my skill set. I know you are really excited to cover this project, so I am doubly excited to share this knowledge with you all, because this channel is dedicated to contributing to my data community. So without wasting any time, let's get started. Do you know what is the most important step, the most important stage, in any data project, in
            • 02:00 - 02:30 any project? Yes, you guessed it right, it's the architecture. So in our scenario we're going to look at our data architecture for this project. If you have a good blueprint for your project, you can succeed easily without any hassles because you will have the right path, right? So let's discuss the data architecture of our project in detail. As you can see, we're going to use an HTTP connection as our source. We could
            • 02:30 - 03:00 have easily used some manual uploading of CSV files into the data lake, but I wanted to show you some real-world scenarios where we will be pulling data directly from the APIs. So in this scenario it will be a GitHub account, and we will be pulling data directly from that API, so that you can learn how to fetch data directly from the APIs. This is going to be fun, because we will be using this
            • 03:00 - 03:30 powerful orchestration tool, and it's called Azure Data Factory. Azure Data Factory is one of the most in-demand orchestration tools right now in the world of Azure data engineering because of its massive powers. It's a low-to-no-code tool and yet it is really, really powerful. I personally love working with Azure Data Factory, and yes, if you are familiar with Synapse Analytics, we have Azure Data Factory there as well, under the name Synapse data pipelines or something similar
            • 03:30 - 04:00 to that, but it's basically the same thing: Azure Data Factory is there. Azure Data Factory is everywhere; in any Azure data engineering solution, Azure Data Factory is there, so you should be very familiar with this tool. And don't worry, because I have added real-time scenarios while working with this tool, because when I was preparing for my data engineering interviews I was facing lots of problems, because there were
            • 04:00 - 04:30 not many real-time scenarios available of the kind being asked by the interviewers. So this time we're going to cover some real-time scenarios. You want a hint? Okay, so instead of building static pipelines we will be building dynamic pipelines, we will be using parameters, we will be using loops and much more, so you will be learning a lot about this powerful tool. Okay, what will we be doing using this Azure Data Factory tool? So we will be
            • 04:30 - 05:00 pulling this data, that is our source, and we will land our data in our bronze layer. What is this bronze, this medallion, what is it? We don't know anything yet, okay? So basically it is a kind of architecture called the medallion architecture. This is the kind of approach that we follow in data engineering solutions. What do we do in this approach? We make our data travel through three different zones: raw, silver, gold. Okay, you can also call it bronze, silver, gold; you can also call it
            • 05:00 - 05:30 raw, transformed and serving layers. There are so many names but the fundamental idea is the same: three layers. The first layer is the raw layer or bronze layer, in which we keep our data as it is, exactly as it is available in the source. What does that mean? Let's say we have one file in the source; we will create an exact replica of that particular file in our raw zone. We do not want to apply any transformation. That's simple, right? Okay, once our data
            • 05:30 - 06:00 lands in the bronze layer, what's next, bro? The next is this Hulk. This is called the Hulk of Azure data engineering, or any data engineering, because of its powerful Spark clusters. Databricks is one of the most important tools right now because its power is really, really insane while working with big data. Databricks is dominating the data engineering world, trust me, I love working with Databricks. Plus, the demand for Databricks
            • 06:00 - 06:30 is rising exponentially; companies are going crazy after Databricks developers or PySpark developers. So do not worry at all, because we will be learning everything about Databricks in this particular project, and I have added some real-time scenarios as well, so you will be familiar with this tool, and from scratch, which is the best thing, right? So what will we be doing with this tool? We will pick the data from the bronze layer and we will push the data
            • 06:30 - 07:00 to the silver layer with some transformation, because when we push our data to the silver layer we apply some transformations. So we will look at some crazy transformations, crazy functions available in the world of data engineering using Spark. Don't worry at all, we will be covering everything from scratch. Okay, what's next? The next is the serving layer. What are we going to do in this layer? After applying transformations, after applying so much cleaning, it's time to serve our data to
            • 07:00 - 07:30 our stakeholders, and in our scenario that can be data analysts, data scientists, maybe some analytics managers as well. So we will be building a data warehouse, and the most popular data warehousing solution right now is Azure Synapse Analytics, so we will be learning a lot about this technology as well, and this will be our gold layer, where the data is ready to be served to the stakeholders or any other
            • 07:30 - 08:00 developer. Okay, that sounds amazing, what's next? So we will be covering a little bit about Power BI as well, so that you will learn how to establish connections. Now you will say, is this really required, bro? The answer is yes, because if you want to become an efficient data engineer you should know end-to-end solutions. Let's suppose you are a data engineer and you have prepared a data warehouse; a data analyst comes to you: hey, I'm facing trouble, I'm facing a problem while establishing a
            • 08:00 - 08:30 connection between Synapse and Power BI. You should be there to help that person. So it's really important to know how to build connections, how to build, you can say, linked services, and how to actually pull tables and data warehouse facts and dimensions into Power BI. So it's really important to learn that as well, and that's why I have covered a little bit about Power BI too, not in much detail, but yes, a small part, so you will love learning that part as well. Okay, so that was all about our
            • 08:30 - 09:00 data architecture, and now I know you are doubly excited to actually implement the solution. But before that I would like to discuss some of the fundamental terms that will be used throughout the project; if you are already familiar with those terms it will be a good revision for you, and if you do not know them, then it is for you. So let's just quickly cover some of the prerequisites, some of the fundamental terms that we should know as an Azure data engineer, and trust me, you
            • 09:00 - 09:30 can expect some interview questions from these fundamental areas as well, because we sometimes overlook these areas, but you need to master them too. So without wasting any time, let's get started. Here are some of the prerequisites that we need to complete this project; don't worry, these are really, really easy to cover. First of all, obviously you need a laptop or PC, or maybe a MacBook. I don't have a MacBook, so I don't know if we can actually feel that vibe, so it's up
            • 09:30 - 10:00 to you, laptop or PC, we are simple people, bro: a laptop or PC with a stable internet connection, plus an Azure account. Don't worry, don't worry, wait: if you do not have an Azure account, I will tell you how you can create an Azure account for free. Yes, for free. It's not a promotion, it's not a kind of, you can say, commission thing; Azure actually provides a free Azure account so that you can use their services and actually learn Azure. So don't worry, I will tell you how
            • 10:00 - 10:30 you can create your Azure account for free, don't worry at all. And then, obviously, you are learning one of the most important, one of the most in-demand technologies, so you should have some excitement to learn an Azure data solution, and that is an end-to-end solution, not just one or maybe two services; we'll be learning a lot more. Let's start the project, because enough information has been provided. I accept that, but it was required, it was important. So finally it's time to create this project from scratch, and as
            • 10:30 - 11:00 I promised, I will tell you how you can create your free Azure account. So first let's create that, and if you already have that account, that's good. So without wasting any time, let's create the Azure free account. To create your Azure free account the steps are really, really easy, let me show you. You first need to go on Google and just type 'Azure free account' and you will have your first link on your screen; just click on it and then you will land on the Microsoft
            • 11:00 - 11:30 website, where you can clearly see the 'Try Azure for free' tab. Just click on it; don't click on the 'pay as you go' one, because that is a paid account and it will charge you as much as you use. So just click on 'Try Azure for free'. It will ask you to put in your email ID. If you do not have a Microsoft email account, do not worry, you do not need to have microsoft.com at the end of your email address; you can use a Gmail
            • 11:30 - 12:00 account, but you need to register that account under Microsoft as well. So if you do not have one, just click on 'create one'; if you already have your Outlook account, just use that one, it will work. I already have one, so you can just click on 'create one'; the steps are very easy, just create your first account if you do not have one and click on next. Let's say I just enter my account, like demoaccount@gmail.com, and just click on next.
            • 12:00 - 12:30 So after putting in your credentials you will land on this page. This is the kind of form that you need to fill in; just put in all the details that you have, like name, address, everything, verify your phone number, and then just hit this sign-up button. And let me show you what you can expect with this account. This account is free for like one year, you can use these services, plus you will get $200 of credit that you need to use in the first 30 days, so that you can
            • 12:30 - 13:00 create all the resources that require credits to actually use those services. And do not worry, these $200 of credits are sufficient to complete this project, so do not worry at all, just go with it. One thing that I would like to mention: once you hit this sign-up page, this one, it will ask you to put in your details, like banking details and all. Do not worry at all, because this
            • 13:00 - 13:30 is Microsoft Azure; they're not going to charge you at all, because this is your trial account. They will ask you to convert it into a pay-as-you-go account, but if you do not want to do that you simply decline, and if you do not take any action they still cannot charge you. So just trust Azure and go with it. And now the question is: what will happen after 30 days, or maybe after a year? All your resources will be disabled, or you can say removed, from your
            • 13:30 - 14:00 account, but you will not be charged, so do not worry at all. Just put in all the information that is required and you are good to use your Azure free account for 30 days. I have personally created one, so do not worry at all, just trust Azure, and here is your Azure free account. So once you create your Azure account we simply need to go to the portal, that is the Azure portal, and how we can do that, let me show you. So it's time to go to the Azure portal, and
            • 14:00 - 14:30 here is how you can get there, because now you have your Azure account with you. Simply go on Google and type portal.azure.com; once you write it, just hit enter and you will land on your Microsoft Azure portal. Obviously you're going to enter your credentials again; I have already entered them, so it just took me to my Azure portal account. So this is our home screen. Don't worry at all if your screen looks different, because I
            • 14:30 - 15:00 already have some resources built in my account, so do not worry at all if your screen looks different. This is the kind of UI you can expect in your Azure portal. Now hold on: you have your Azure account, now it's time to follow the right path according to our data architecture. The first step is the data source, so just keep your Azure account aside and let's explore our data source, so that we will have a good understanding of which data we are working with and
            • 15:00 - 15:30 why we picked that data; we will cover some crazy questions there. Let's see our data source. So this is our data source, okay, Adventure Works. This is a very popular data set in the world of data engineering, data analytics and data science. Why did I pick this data set? Let me tell you: this data set is full of tables, and this gives you an advantage when you showcase a project that involved
            • 15:30 - 16:00 so many tables, because they know that this person has applied so many joins, so many lookups, while completing this project. So this gives you an advantage, and that's why I also took this data set, and it gave me an advantage too, because in real-world scenarios we work with so many tables while building one solution. That's why it is really important to work with many different tables. And let's have a look,
            • 16:00 - 16:30 let's have a look, let's have a look. First of all we have Calendar.csv; obviously in the calendar table what can you expect, just a date column, right? Okay, then second we have the customers table, in which we have all the information related to customers, like their name, address, so many things, birth date and all, so we will be aggregating so much data, don't worry at all. I'm just giving you an overview of the data that we have. Then we have product categories; obviously we are building a data set, or
            • 16:30 - 17:00 building a solution, in which we have sales data, and product categories, product information, territories and customer information are required to apply the necessary lookups. So as you can see we have the products table, product subcategories, product categories, and then, yes, we have the past three years of data, 2015, 16 and 17. Then we have an interesting table as well: we have returns data. So this
            • 17:00 - 17:30 is an ideal data set if you want to work on an end-to-end solution, because we can perform so many joins, we can perform so many transformations, and we can do a lot more with these tables. That's why I picked this data set, and you will learn so much with these tables. So as you can imagine what we're going to do: obviously we have sales data, right, as you can see we have the past three years of sales data, and we have returns data as well,
            • 17:30 - 18:00 okay, we have returns data, and these are basically our fact tables, don't worry, we'll discuss it in detail. So these two are our fact tables, and we can build dimensions around these tables. Why do we need dimensions? Because if we want to perform aggregations, if we want to provide contextual information to our fact tables, we need dimensions. So we will just be
            • 18:00 - 18:30 building a model by keeping the sales data, and you can say the returns data, in the center, and the lookups as our dimensions. So this is going to be fun while working with the data. So are you excited to load all this data into Azure? And yes, we will be using Azure Data Factory for that. Just to make things easy, I have already loaded all this data to my GitHub account, and now we will just be
            • 18:30 - 19:00 pulling the data from the GitHub account using the API, and we will push this data to the raw zone, or bronze zone, right? So this is phase one of the project, where we will be pulling data from the data source and pushing this data to our data lake bronze zone. Without wasting any time, let me take you to Azure and let's create the necessary resources, the resource group and much more. Let's start working with Azure. So we are
            • 19:00 - 19:30 here in our Azure portal. As you can see, there is the 'create a resource' plus button; do not click on it, because I know you are really excited to create a resource, but before that we first need to create a resource group. It is really important to create a resource group, because that is the place where you will keep all your resources, right? So how can you create a resource group? You simply need to click on this button; if it is not visible to you, because it just displays the
            • 19:30 - 20:00 recent services and resources that we have used, simply go to this search bar. This search bar will help you navigate all your existing resources, plus if you want to create any new resource you can create one from here, do not worry. So for you all I will just click on this search bar and type resource, oops, resource groups, and you should see the popup here; just click on that and it will show you all the
            • 20:00 - 20:30 existing resource groups. Do not mind my resource groups, because I already have some, so sorry for that. So we will first create the new resource group by simply clicking on this create button, simple, just click on it. Okay, so this is the window, a kind of configuration that we need to do before creating our resource group. It is very simple: you simply need to provide the resource group name, because obviously
            • 20:30 - 21:00 you should have your resource group name, right? Then this is our subscription, so don't worry about that as well, because this is your subscription, and in your case it will be just one subscription, because you are just one single organization having just one department, just kidding. So for this resource group I can just give, let's say, aw project, and it will check whether it is available or not, because our name should be unique, so we cannot reuse a name. So I can just put aw project, and now
            • 21:00 - 21:30 here comes the region part. This is really important, because it is recommended to pick the nearest region, since it will give you low latency; and our workload is not that big or critical, so you can pick any region. Let's pick East US, because I have already worked with this region and it works fine, so don't worry about that. Just click on next, that is tags. Now what are these tags? We
            • 21:30 - 22:00 can totally ignore them, but you should have some knowledge of why tags are there. If we want to categorize our resources or resource groups, maybe by billing structure, billing categories and so on, for any purpose, if you want to categorize them, we just add tags. So tags are for that purpose, and for now we do not need to worry about that; just skip it and click on 'review + create'. It will validate it, and finally
            • 22:00 - 22:30 it will ask you, hey bro, please click on the create button. So you just need to click on create and it will create your resource group. I think it has already been created; Azure is really fast. Okay, now we can see our resource group is ready, just click on it and you will see this window, right? Here you can see it is empty, because we do not have any resources yet. Now there are two ways to create resources:
            • 22:30 - 23:00 one is to just click on this create button; what it will do is automatically pick this resource group, choose it as our default option, and put all the resources inside this resource group. But I also want to show you the other method, so let's close this and click on this Home tab, this one, click on the Home tab, and similarly we will just search here. What should be our first resource? Just
            • 23:00 - 23:30 tell me, just tell me the first resource, the very first resource. Yes, you guessed it right, it is the data lake. Okay, so now here comes the interview part as well, the why and how; let me show you the possible questions that you can expect. One question we have already covered, that is data redundancy, but I will show you how you can navigate it; the second question I will tell you once I create this data lake. So I will simply type storage account, because
            • 23:30 - 24:00 that is a storage account, right? Here comes a question for you: if you are already familiar with Azure, you should be able to answer this question, and if you are new, don't worry at all, bro, because I will tell you each and every thing. Obviously it will show us all the existing storage accounts; these are my existing storage accounts, don't worry at all, just ignore them. So simply click on create, and once you create your storage account it will ask you for some of the
            • 24:00 - 24:30 configurations. See, now you can see the resource group is already selected as our aw project. If you do not have your resource group ready while creating a resource, you can click on 'create new' and create it from here as well, but as a recommended practice you should have your resource group ready before creating resources. Okay, now we want to give the storage account a name. What name should we pick? It totally
            • 24:30 - 25:00 depends upon you. I would like to pick a name like aw storage data lake. Because the name should be unique, we cannot have the same names, so you cannot pick this exact name; if you want to, you can take aw storage data lake and just put your own name as a suffix, that's the only option if you want to keep the name as it is, otherwise you can pick any name, really any name.
            • 25:00 - 25:30 Okay, now the primary service: obviously we want to use it as Azure Blob Storage or a data lake; if you do not pick it, no need to worry, it will still create an Azure storage account. Let me show you, let me deselect it... okay, it is not giving us the option to deselect it, okay, just keep it like this. Now just keep the performance option as standard,
            • 25:30 - 26:00 then we have redundancy, which we just discussed in our prerequisites. This is the data redundancy. By default it will put your data under GRS, geo-redundant storage, or you can say geo-replication storage, because it replicates your data across different regions, different geographies; so I call it geo-replication storage, you can call it geo-redundant, both are the same thing. But we do not want to do that, because we
            • 26:00 - 26:30 will just keep it as LRS, that is locally redundant storage, which will store the replica of your data in the same data center, simple. Then here is the question, okay, the question is: this is my storage account, right? By default it creates a blob storage account. What does it create? A blob storage account; it does not create a data lake. So if we want to create a data lake out of this storage
            • 26:30 - 27:00 account, what do we need to do? You have 3 seconds, you can just comment and let me see if you know this. Okay, so now it's time to disclose it: if you want to create a data lake you just need to tick a small box and it will do the magic. It's not magic, but it's a small option that you need to pick, and only then will you have your data lake, otherwise it will just create blob storage. I know lots of data enthusiasts have the
            • 27:00 - 27:30 question: what is the difference between the two? Let me show you, bro, let me show you, I'm here to tell you each and every thing, don't worry. Okay, so this is the configuration, that is hierarchical namespace. Hierarchical namespace is the option that we need to pick, this is the one. What does it do? Okay, let me first click it; when I click on it, it will create a data lake, yes, then it
            • 27:30 - 28:00 will create a data lake, otherwise it will create blob storage. So what is the difference between the two? Let me tell you. Let's suppose this is our blob storage account, right, and this is our data lake, okay. In both accounts we have containers, right; let's say we have container number one, container number two, container number three, and within these containers we save our files, let's say a customers.csv file, right, let's say
            • 28:00 - 28:30 a customer.csv file, maybe any files, but we cannot create a hierarchy of folders. What does that mean? Let's say I am in my data lake: I can create this container, that is like the first level of hierarchy, but within that container I can create folders as well. See, I can create one folder, within that folder I can create a second folder, within that folder I can
            • 28:30 - 29:00 create a third folder, and then I will save my CSV file. This feature is not available in blob storage. So if you are working with data where you need to analyze your data, where you need to build tables on it, you should pick the data lake, you should pick the data lake, yes. And that was the difference; if you already knew that, it's good, because it was a quick revision for you, and if you didn't, this is a game changer, because this is like the most important question
            • 29:00 - 29:30 that you can expect, and it's not just about the questions, it's a fundamental thing that you should be aware of while working with Azure. So now you have enabled hierarchical namespace, so now you have a data lake, don't worry about that. Okay, now we just need to look at the access tiers. These are the access tiers: we have hot, cool, cold. Hot means if we want to frequently deal with the data we keep it as hot; if we do not want to frequently use the
            • 29:30 - 30:00 data and we want low-cost storage, a low storage cost for our data, then we keep it as cool; if we want just the minimum cost to store our data we keep it cold, but if we want to use it we need to pay more, so that one is for archiving data. So it totally depends upon the requirements; we will be frequently using our data, so we will keep it as hot. Okay, let's check the configuration one
            • 30:00 - 30:30 more time, and yeah, all set, just click on next. This is networking: our networking has public access enabled from all networks, okay; we have options like virtual networks, and like disabling public access and using private access. This will be very relevant once you enter an organization as a data engineer, because some data will be restricted and cannot be accessed from, oh sorry, from public
            • 30:30 - 31:00 endpoints, so there you will be creating private endpoints. For now everything is sorted, just click on 'review + create' and wait for the validation to be completed; once it is done you are ready to click on the create button. Just click on create and then it will create your resource, okay, it will take some time because it will create a data lake, so you can expect a few seconds, let's wait,
            • 31:00 - 31:30 let's wait let's wait let's have some protein shake H yes it's done so now we can see go to Resource tab that means our resource is ready and we can go and check that but we don't want to do that now because first we will create our aure
            • 31:30 - 32:00 Data Factory, and then we will set up the containers inside our storage account. So here comes our second resource. How do we create it? Simply type "Data Factory" in the search bar. Currently I do not have any Azure
            • 32:00 - 32:30 Data Factory in my portal, so I click Create. Again we need to pick a resource group — as I told you, it's required — and we keep the same one, because we want all our resources in one place, so select the AW project resource group. Then we obviously need to name our ADF, the same way we named our
            • 32:30 - 33:00 storage account. I'll try "adf-aw-project" — oh, it's already taken. That's the error you will see if you try to use the same name I'm using, so pick something unique; I'll go with "adf-aw-project-ansh", and yes, that one is available.
            • 33:00 - 33:30 People named Ansh are limited — because they are really, really talented, just kidding. Click Review + create and it will create your Azure Data Factory. That was quick and simple. After this we will create containers: we dedicate one container to each zone — one for bronze, a second
            • 33:30 - 34:00 container for silver, and a third for gold. Our validation is complete, so let's click Create; it will deploy shortly. With that we have all the resources required to complete phase one of the project. Are you excited? I was really excited when I was building these projects —
            • 34:00 - 34:30 keep that excitement; don't treat it as a burden or just another piece of study. It's fun, and it's a real skill. Let's go to our resource group, and there it is: Azure Data Factory. I love working with Azure Data Factory because it was the first real resource I learned in the Azure environment — technically the Data Lake came first, since we just picked data and stored it there, but ADF was the first resource I used for
            • 34:30 - 35:00 actual ETL, so it was my first love (after college). Click on it and let me show you the Azure Data Factory Studio; after that we will create our zones — bronze, silver, gold. Click Launch Studio and it opens Azure Data Factory Studio, and if
            • 35:00 - 35:30 you are not already familiar with ADF, you are going to love the UI. This is the homepage you should see. Don't worry about the GitHub repository prompt right now, and don't worry about all the tabs — we'll cover everything. So this is the Home tab we are on, and then we have the most
            • 35:30 - 36:00 important one, the Author tab, because we will build all our pipelines there. We only have four tabs and all of them matter. Next is the Monitor tab, where we monitor our pipelines: their status, and if a pipeline failed, the reason why — everything lives under Monitor. It's like the class monitor;
            • 36:00 - 36:30 it keeps an eye on everything. Then we have the Manage tab, where we manage things such as repositories, GitHub and DevOps connections, and linked services — wait, what is a linked service? You will learn all of that, don't worry, I'm just giving you an overview. Finally the Learning Center, where you can read resources and documentation and pick up sample datasets. Simple and
            • 36:30 - 37:00 sorted. Let me just show you the Author tab: this is where we find the ADF building blocks — pipelines, activities, functions, notebooks — everything summarized in one area. Okay, that's enough of an overview of ADF;
            • 37:00 - 37:30 now it's time to prepare our zones, and then we will start pulling data with ADF. So let's jump to the storage account: I clicked my Azure portal account, then the Home tab, then the resource group, and
            • 37:30 - 38:00 then our Data Lake. Focus on this area, because it's what we care about right now: click Containers. A storage account actually gives us four storage services. First, Containers — with hierarchical namespace enabled this is what people popularly call the Data Lake. Then File shares, a one-stop place to store files across an organization or project. Then Queues,
            • 38:00 - 38:30 which store messages — often JSON, but they can be anything — and are useful for streaming-style workloads: you drop messages into a queue and process them later. And Tables, a NoSQL ("not only SQL") store,
            • 38:30 - 39:00 which is handy if you want to work with semi-structured data as key-value pairs. Sorted — click Containers, because that's what we want. Now we create three containers, as I mentioned: one for bronze, a second for silver, and a third for gold. That's how we create our three zones. Simple
            • 39:00 - 39:30 and sorted, bro. Don't feel like this is difficult; just try to absorb the information, and if something is unclear, rewatch that part — it really is easy. Click Container, and first we create the bronze container: type the name and click Create. Similarly you can
            • 39:30 - 40:00 create the silver and gold containers. I'm keeping the naming convention bronze / silver / gold; if you are more familiar with raw / transformed / serving layers, use those names instead — it depends on the requirements and the convention you're used to. And when I say "bros", that includes the girls as well. Now the third one, gold — let's
            • 40:00 - 40:30 create it as well. Perfect, all three zones are ready, and now we can load data into this layer. As we mentioned, we will be pulling from an API, so it's time to create a linked service. Let's go to our ADF tab and click on
            • 40:30 - 41:00 it. Are you excited to create your first pipeline? To create a pipeline we need a source and a destination, because we want to load data from the source into the destination. What are the prerequisites? Let me show you the activity first so you can relate. Click Pipelines — you won't see anything yet because we don't have any pipelines, but that
            • 41:00 - 41:30 will change soon. Click the three dots and you will see the option New pipeline; click it. This is our canvas. First we name the pipeline — you can rename it from the properties pane — and I will call it "Git to Raw",
            • 41:30 - 42:00 because that's the pipeline I want. Then click the small button to hide the properties pane; we've already named it, so we don't need it taking up space, and now we have a much bigger canvas. The activity we need is called the Copy activity. To find it, open the Move and transform group; it shows all the
            • 42:00 - 42:30 activities in that category. There it is — Copy data — so drag it onto the canvas. Don't worry, bro, we'll cover all of these parts. First, name it, because that matters: we named the pipeline "Git to Raw", but the activity needs a name too, so click on it and let's
            • 42:30 - 43:00 say "Copy raw data". As I mentioned, a copy activity needs a source and a destination: imagine this activity reading data from the source and pushing it to the destination. The destination is called the sink — sink is just a fancy name
            • 43:00 - 43:30 for "destination". When I was learning this I also wondered what a sink was — it's simply the destination. So what is our source in this scenario? It's a GitHub account, and within that account there is a folder called Data. Let me show you: this is my GitHub repository, this is the Data folder, and inside it are all the CSV files we want to fetch
            • 43:30 - 44:00 directly from GitHub. Let me show what one looks like: I pick this Products.csv file, and to get at the data itself I click Raw — that gives me the raw URL, and that URL is what we will use to pull the data. Simple, yes, but first we need to
            • 44:00 - 44:30 create some connections. So this is the copy activity, and now we want a linked service. What is a linked service, bro? It's basically a connection. Why is it necessary? This is Azure: it needs to read the data from GitHub and push that data into the Data Lake, so it
            • 44:30 - 45:00 needs a connection to the source and a connection to the destination. Those connections are what we call linked services — again, just a fancy name for a connection. After the linked service we need to create a dataset. What is a dataset? Okay, let's say you have created your
            • 45:00 - 45:30 linked service to the GitHub account — let's clear the clutter first. Fine: the connection exists, but inside that GitHub Data folder there are many files, so how would ADF know which file to pick, which data to read? That's the role of the dataset: it gives
            • 45:30 - 46:00 us the detailed location of the data — which piece of data to pick through that connection. That is the whole architecture behind linked services and datasets; it's easy, so let's do it. For the linked service we obviously need a URL so it can locate the GitHub account. Go to the Products.csv file — any file will do — because we need its URL: first we
            • 46:00 - 46:30 will load one file, the AdventureWorks Products.csv, because first we build a static pipeline and only then a dynamic one — you should have good hands-on experience with a static pipeline before going dynamic; we grow step by step, bro, no rushing. Click the Products.csv file, click Raw, and you need this URL. Now, this URL actually has two
            • 46:30 - 47:00 parts: a base URL and a relative URL. The base URL is everything up to the domain (up to the ".com/"); the rest of the URL is the relative URL. Simple. Now let me create two linked services: one to my GitHub account and one to my storage account. I could just click the copy activity's source and create a linked
            • 47:00 - 47:30 service from there, but let me show you the recommended way. We should always have our linked services ready before building anything, because it gives us a clear roadmap before any activity — connections are a must if we want to move data around. So go to the Manage tab, since it manages all of our resources,
            • 47:30 - 48:00 and the very first entry there is Linked services. Click it, click + New, and you get the full list of connectors. ADF provides a huge catalog — you can connect to Azure services, AWS, Apache Impala, Amazon Redshift and many more; that's why I love it
            • 48:00 - 48:30 here: databases, Google Ads, HTTP connections, everything is ready, and we use these connectors to actually build the connection. In our scenario we want to connect to a GitHub account over HTTP, as you just saw from the raw link. There are two candidates, the REST connector and the HTTP connector, but since
            • 48:30 - 49:00 we are downloading a file from a plain HTTP endpoint rather than calling a JSON REST API, the HTTP connector is the better fit — it is built for exactly this. So search for HTTP and pick it; no code needed, this is all low-code/no-code. Click Continue (oops, I hit Escape, sorry) — so again, HTTP, then
            • 49:00 - 49:30 Continue. Now we name the linked service; since it's our HTTP connection I'll call it "HTTP linked service" (you can name it anything). Give the base URL — everything up to the domain, without the trailing path — and set the authentication type to Anonymous. Then test
            • 49:30 - 50:00 the connection with the Test connection button and wait... yes, it's successful, so the connection is established and we can use this linked service for our dataset. But as I mentioned, we'll also create a second linked service, for the Data Lake, because whenever you build an efficient data solution you should have all your linked services ready.
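For reference, this is roughly what that HTTP linked service looks like if you export its JSON definition from ADF. The name is just the one chosen above, and exact property layout can vary slightly between ADF versions — treat it as a sketch, not the canonical export:

```json
{
  "name": "HttpLinkedService",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://raw.githubusercontent.com/",
      "authenticationType": "Anonymous",
      "enableServerCertificateValidation": true
    }
  }
}
```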
            • 50:00 - 50:30 So click Create, and you can see our linked service listed there. Now click + New again to create the second linked service, the one for the Data Lake. What should we pick? Data Lake — search for "ADLS Gen2", that's the one, click it and
            • 50:30 - 51:00 click Continue. I'll name it something like "storage datalake linked service". Then you need to pick the storage account you want to connect to, because I have several storage accounts in my portal, so how would Data Factory know which one
            • 51:00 - 51:30 to use? Here I tell it: pick this storage account. That's the value of having linked services in place. Again run Test connection — yes, successful — and click Create. Both of our linked services are now created.
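And a sketch of what the ADLS Gen2 linked service definition can look like — shown here with account-key authentication as one possible option (picking the account from the subscription dropdown in the UI achieves the same connection); the account name and key are placeholders:

```json
{
  "name": "StorageDataLakeLinkedService",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<your-storage-account>.dfs.core.windows.net",
      "accountKey": {
        "type": "SecureString",
        "value": "<storage-account-key>"
      }
    }
  }
}
```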
            • 51:30 - 52:00 Now it's time to create the datasets out of them. Go back to the Author tab and click on the copy activity; we start with the dataset for our source. We don't have any dataset yet, so click + New, and again we pick where the data lives: our source dataset comes from HTTP,
            • 52:00 - 52:30 and the file format is CSV, so click CSV and Continue. Now it asks for a dataset name. I use "DS" as an abbreviation for dataset, so I'll call it "ds_http", which tells me this dataset belongs to the HTTP connection. Then, as you can see, we have to pick the linked service — that's why
            • 52:30 - 53:00 I told you to always create your linked services first. Open the dropdown and pick the HTTP linked service; it only shows linked services based on HTTP, which is why you don't see the second one we created. Select it, tick "First row as header", and now it asks for the relative URL: go back to the source, copy the relative part of the raw URL, and paste it
            • 53:00 - 53:30 here, then click OK. Our source dataset is ready. If you want to preview the data there's a Preview data button — I click it and I can see my Products table; looks good, so we are good to go with the source.
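A rough JSON sketch of that static source dataset — the relative URL is a placeholder for whatever you copied from the raw GitHub link, and the dataset/linked-service names follow the assumptions above:

```json
{
  "name": "ds_http",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "HttpLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": "<owner>/<repo>/<branch>/Data/Products.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```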
            • 53:30 - 54:00 Now we do the same thing for the sink — treat it as homework: do it on your own, on your own laptop, without copying me, and only watch this part if you hit a mistake or an error. Again we create a dataset, because we don't have one for the sink — and don't pick the existing one, that's our source dataset. Click + New, and where does this
            • 54:00 - 54:30 dataset live? Azure Data Lake Storage Gen2, then Continue. In which format do we want to land the data? CSV again, because we are not applying any transformations and the recommended approach is to keep an exact replica of the data — so pick DelimitedText (CSV). Then the name: I'll say "ds_raw", a dataset
            • 54:30 - 55:00 pointing at the raw container — or bronze container; raw and bronze mean the same thing here, don't worry. Then pick the linked service: we already created one for the storage account, so select it from the dropdown. Now we need the file path: click the small Browse button and pick the bronze container;
            • 55:00 - 55:30 inside it we just click OK because there is no folder yet, but we will create one via the Directory box — a directory is just a folder inside the container. I'll call it "products", because thanks to the hierarchical namespace we can create real folders, and I like a dedicated folder per file. Then the file name: "products.csv",
            • 55:30 - 56:00 and keep "First row as header" checked since it's a CSV. Click OK — and you'll see "schema import failed". Don't worry: just select None and click OK. It shows that error simply because the file doesn't exist yet; we are about to copy it there.
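And the matching sink dataset, pointing at bronze/products/products.csv in the Data Lake — again a sketch, with names following the choices made above:

```json
{
  "name": "ds_raw",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "StorageDataLakeLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "bronze",
        "folderPath": "products",
        "fileName": "products.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```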
            • 56:00 - 56:30 Now comes the mapping part. If you already had a file in the destination and wanted to match schemas — for example if column names differ between source and sink — you could click Import schemas and map them manually; it depends on the requirements. In our scenario there is no file there yet, so everything is fine. Now it's time to actually run this. Excited to run your first pipeline? Click the Debug button and it will load the
            • 56:30 - 57:00 data from GitHub into your bronze layer. Let's run it together: 3, 2, 1, go. It takes a few seconds, and once the data lands it's time to celebrate, because we will have built our first pipeline, pulling data from the API into the bronze layer. Let's
            • 57:00 - 57:30 see... I see a green flag, I love it. So we have successfully loaded data from GitHub into Azure using a pipeline. Want to see that data? Of course. Go to the Azure portal and click on the bronze
            • 57:30 - 58:00 container: there's a products folder and inside it a products.csv file. That's our data — we pulled it from the GitHub API into our bronze layer. Click on it and you'll see some configuration; click Edit — not because you should edit the data (you shouldn't), it's just a quick way I use to preview it. And
            • 58:00 - 58:30 the data looks good. Whenever you see green flags in your pipelines it's a heartwarming thing. So our pipeline works, but this is only the beginning of your Azure journey, because what we built is a static pipeline. Before celebrating, click Publish all — it saves all your progress, which matters,
            • 58:30 - 59:00 because publishing persists your work to the Data Factory; when you come back tomorrow or the day after, everything will still be there, and if you don't publish, you'd have to rebuild it. Publishing is complete. Now comes the serious part: we have built a static pipeline and successfully copied one file, but we have around 8 to 10 files in total. One way of
            • 59:00 - 59:30 doing this is what everyone else on YouTube shows, but we want to go further: instead of relying on a static pipeline, we will build a dynamic pipeline. First, the static approach: you could simply create the same copy activity again and again and again. That is not the recommended way at all — you cannot build that kind of solution
            • 59:30 - 60:00 in a real-world scenario, and you can't present it in an interview either — I'm serious. You should not repeat the same activity over and over when we have iteration and conditional activities for exactly this; that's why we use dynamic pipelines. This part is not so much tricky as it is dense: you need to be focused,
            • 60:00 - 60:30 and you may need to watch it two or three times — don't hesitate, because when I was learning this I watched it at least eight to ten times, and I'm not kidding. The part coming up is the harder one (this part was easy), so don't feel demotivated; it's part of learning. Over the next few minutes we'll learn how to build dynamic pipelines, so be a
            • 60:30 - 61:00 little more serious here — we have fun throughout the video, but a few phases need extra focus. Let's first discuss the architecture of a dynamic pipeline, and trust me, bro, once you master this technique you will feel far more confident in your ADF skills, because 90% of real solutions require you to
            • 61:00 - 61:30 work with scenarios like this, where you iterate a pipeline, use parameters, and drive the pipeline dynamically. So it's time to really focus: we will be learning how to pass parameters dynamically into our pipelines — that is what makes a pipeline dynamic.
            • 61:30 - 62:00 So what exactly are we trying to do? Let me explain. As you know, we have these files in our GitHub repository — our source. We want to pull all of that data, and one way is to repeat the copy activity again and again, which we don't want, because that is a static approach and not
            • 62:00 - 62:30 how you'd do it in a real scenario. The alternative is a loop. If you come from a programming background you know the for loop; if not, a loop simply performs iterations, and in our scenario we have many iterations to perform — one per file: 1, 2, 3, 4, 5, 6 and so on. So we will
            • 62:30 - 63:00 create a single copy activity — the same activity we just built — but instead of passing its information statically, we make three pieces of information dynamic. Number one is the relative URL, because each
            • 63:00 - 63:30 file has a different relative URL, so that's the first thing that changes per copy. Number two is the folder where we store the data: if I pick Products.csv I push it into a folder called "products", so the folder is a second dynamic value
            • 63:30 - 64:00 that changes with every file. And the third is the file itself: inside the products folder we want to write products.csv. So only these three values change for the copy activity. Instead
            • 64:00 - 64:30 of hard-coding them, we create three parameters and keep changing their values, so every iteration gets its own set of three values. Now you're wondering how we actually do that — in Azure Data Factory we
            • 64:30 - 65:00 have an activity equivalent to a for loop. It isn't literally named "for loop" — it's the ForEach activity — which is why we call it the iteration activity: whenever we want to drive an activity dynamically, we use ForEach. So we create a ForEach
            • 65:00 - 65:30 activity and place the copy activity inside it, and it keeps iterating until all the iterations are complete. That's the scenario, that's the architecture we will build in Azure Data Factory. You don't need to
            • 65:30 - 66:00 worry if you didn't understand 100% of it, because it will click once we actually build it in Data Factory; this was just the overview I wanted to give before doing it, so that when I perform the steps you can feel exactly what we are doing. That's parameterization in a nutshell — let's get started. We are back in the Azure Data Factory portal, and this
            • 66:00 - 66:30 time we want to build a dynamic pipeline. Go to the three dots, click New pipeline, and rename it — let's say "Dynamic Git to Raw". Now we want a copy activity again, but this
            • 66:30 - 67:00 time we won't hard-code anything; we will pass parameters into it. So let's create a parameterized copy activity: search for "copy", drag the activity onto the canvas, and rename it — "Dynamic copy" (oops, spelling mistake). Perfect. Now we pick the dataset, and as
            • 67:00 - 67:30 mentioned, a parameterized copy activity means we pass parameters into the dataset too, because we want one dataset for all the CSV files — one dataset that can pick every file, instead of a separate dataset per file. That's what
            • 67:30 - 68:00 parameterization means. Let's see how. Click + New; in our case the data sits behind the HTTP source and the format is CSV. Name it — say "ds_git_dynamic". The linked service stays the
            • 68:00 - 68:30 same, and here is why. Open any file — say Returns.csv — and click Raw: in every file the base URL up to the domain stays the same, because that part is the connection to my GitHub account, while the rest changes — the folder and file part, Returns.csv versus Products.csv — so only the relative URL
            • 68:30 - 69:00 changes. That's why we don't parameterize the linked service; we parameterize the dataset. Pick the existing linked service, and now it asks for the relative URL. We know the URL, and I could simply paste it, but I don't want to — now you get it — because we want a parameter so we can
            • 69:00 - 69:30 pass a different relative URL on every iteration. Click the Advanced section to create the parameter inside the dataset — Advanced, then "Open this dataset" — which opens the detailed configuration window for the dataset. Here, instead of hard-coding the value, I will
            • 69:30 - 70:00 use a parameter: click in the Relative URL box and you'll see the option "Add dynamic content". Clicking it opens a window where we can use system variables, functions, parameters and more. So click Add dynamic content, and as you can see we now have the option to create parameters — now you're getting it.
            • 70:00 - 70:30 Click the + sign and create a new parameter: I'll name it "p_rel_url", type String, and we don't need a default value — keep it simple and save. Before clicking OK we obviously need to put something in the box, so select the parameter we just created;
            • 70:30 - 71:00 it automatically populates the expression. That is the syntax we use to reference a parameter: an @ sign, then dataset() dot the parameter name. Click OK, and now the relative URL box holds a parameter instead of a hard-coded value — this is a parameterized dataset, and it's done.
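Exported as JSON, the parameterized source dataset looks roughly like this — note how the relative URL is now an expression referencing the dataset parameter rather than a hard-coded string (the parameter spelling p_rel_url is my assumption of what was typed):

```json
{
  "name": "ds_git_dynamic",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "HttpLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "p_rel_url": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": {
          "value": "@dataset().p_rel_url",
          "type": "Expression"
        }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```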
            • 71:00 - 71:30 Back in the pipeline, on the copy activity's Source tab: is the source done? Not quite — it's saying "you gave me a parameter, fine, but what is its value?" And we answer: wait, bro, because a loop will feed the value into this parameter. Now things are getting clearer in your
            • 71:30 - 72:00 head. So click on Sink and do the same thing — try to complete it on your own, and only watch this part if you get stuck, because for the next few minutes we'll be repeating the same steps. Select Azure Data Lake Storage Gen2, because
            • 72:00 - 72:30 that's where we push the data, click Continue, the data is stored in CSV format, and name it — I'll say "ds_sink" and add "dynamic" to make it unique: "ds_sink_dynamic". The linked service is the same one, because our Data Lake connection doesn't change — we push to the same Data Lake, so pick it.
            • 72:30 - 73:00 Now it asks for the path. As we discussed, we will use two parameters here — one for the folder, one for the file — so instead of browsing for the path, we use parameters. The rules are the same: click Advanced, then "Open this dataset". First type "bronze" for the file system, because our container doesn't change; then for the
            • 73:00 - 73:30 directory we create a parameter: Add dynamic content, the + button, name it "p_sink_folder", no default value, save, and select it so the directory box holds the parameter. Same thing for the file name: Add dynamic content, +, and you can see the
            • 73:30 - 74:00 parameter we already created; now add another one called "p_file_name" (no default), save, and this time select p_file_name. So our sink dataset has two parameters — the folder and the file name. Sorted.
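And a sketch of the parameterized sink dataset: the container stays hard-coded as bronze, while the folder and file name are expressions over the two dataset parameters (names assumed as above):

```json
{
  "name": "ds_sink_dynamic",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "StorageDataLakeLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "p_sink_folder": { "type": "string" },
      "p_file_name": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "bronze",
        "folderPath": { "value": "@dataset().p_sink_folder", "type": "Expression" },
        "fileName": { "value": "@dataset().p_file_name", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```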
            • 74:00 - 74:30 Now let's go back to the copy activity and see how it looks. The source is done and the sink is done, but both are parameterized: the source asks for a value for its one parameter, and the sink asks for values for its two, so in total we need to supply values for three parameters. Now you're
            • 74:30 - 75:00 asking: where is the ForEach, the loop activity? That's the game changer, because the ForEach is what will feed these values. Let me drag it in: click an empty area of the canvas and open the
            • 75:00 - 75:30 "Iteration & conditionals" group — this is where all the condition and iteration activities live — and we want ForEach. Drag it in and name it, say "ForEach Git". In its Settings there is a small "Sequential" checkbox; I always tick it because I like my iterations to run in order — it keeps things tidy.
            • 75:30 - 76:00 Then comes the Items field. What do we pass in Items? Bro, you have to give the loop something to iterate over — a list, an array — so the activity has elements to walk through one by one. For that we need an
            • 76:00 - 76:30 array carrying our parameter sets, one element per file — think of each element as a small dictionary. The easiest way is to create a JSON file. Yes, you heard that right — and don't worry, I will provide that JSON file in the description so you can download it, but I'll still show you how to build one, because in real-world scenarios you will create your own. Let me switch
            • 76:30 - 77:00 to VS Code. I already have a dummy JSON open, but let's create a new file and call it "git.json". I need an array, so first create an empty list — square brackets — and
            • 77:00 - 77:30 within this list we pass the values for our parameters. What do I mean? The first iteration should carry three values for one file: its relative URL, its folder, and its file name; the second iteration carries the same three values for the next file, and so on. So
            • 77:30 - 78:00 inside the array, create a dictionary of key-value pairs. The first key of the first entry is "p_rel_url", and its value is the relative URL, which I copy from the raw GitHub
            • 78:00 - 78:30 link — oops, I didn't copy it correctly, let me do that again — and paste it in. That's the p_rel_url for my first entry. The second key is
            • 78:30 - 79:00 "p_sink_folder"; since we are saving the Returns data, its value is simply "returns". And the third key is "p_sink_file", with the value "returns.csv". That is one complete set of values — one
            • 79:00 - 79:30 iteration. For the second file I simply copy this block and repeat the steps — you're getting the idea now. One entry, holding all three parameters, migrates one file; for the second entry I just change the relative URL, folder and file, then a third, a fourth, a fifth. The ForEach will feed these three values into the copy
            • 79:30 - 80:00 activity on each pass and push the data into Azure dynamically. That's the architecture — it seems easy now, right? But we still need to finish it, run the pipeline without errors, and then celebrate together.
            • 80:00 - 80:30 So let me quickly complete the JSON file for all the files. Done — I have finished all the entries; a sketch of what the finished file looks like is below.
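A sketch of what the finished parameters file can look like — one object per CSV file. Only three example entries are shown and the relative-URL paths are placeholders; the real file has an entry for every file in the Data folder. Note the third key is p_sink_file, exactly as typed above (that spelling will matter shortly):

```json
[
  {
    "p_rel_url": "<owner>/<repo>/<branch>/Data/Products.csv",
    "p_sink_folder": "products",
    "p_sink_file": "products.csv"
  },
  {
    "p_rel_url": "<owner>/<repo>/<branch>/Data/Returns.csv",
    "p_sink_folder": "returns",
    "p_sink_file": "returns.csv"
  },
  {
    "p_rel_url": "<owner>/<repo>/<branch>/Data/Calendar.csv",
    "p_sink_folder": "calendar",
    "p_sink_file": "calendar.csv"
  }
]
```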
            • 80:30 - 81:00 Save the file and upload it to Azure, because we will pull this data directly from the Data Lake. Go to the Azure portal, open the storage account, go to the Containers tab, and create a new container just for parameter files — call it "parameters". Into this container we upload our file:
            • 81:00 - 81:30 this is my git.json; I upload it (you can download mine and upload it too, or write your own). Click Upload and it's there. Now let's jump back to Azure Data Factory and pull this data in, which means learning a new activity: the Lookup activity. What is it and why do we use it? When we want to use the contents of a file as an output — when we
            • 81:30 - 82:00 just want to read it — we use the Lookup activity; it returns the data as the activity's output. Drag it in (search "lookup"), name it "Lookup Git", and in its Settings we obviously need a dataset, because pulling any data requires one. Create it: + New, the data lives in Azure Data Lake Storage Gen2, Continue, and this time the data is
            • 82:00 - 82:30 JSON, not CSV, so pick JSON and Continue. Name it something like "ds_git_parameters". The linked service is the same one, since it's the same storage account. For the file path use the Browse button — the file is in the parameters container — select it, click OK, and
            • 82:30 - 83:00 it will pull the data. Now, one thing is very important for this activity, and it can come up in interviews too: we have around 8 to 10 entries in that file, but by
            • 83:00 - 83:30 default the Lookup activity has "First row only" enabled, so it returns just the first entry. If you fed that into the loop it would run only once, and we want to run through all the values, so uncheck that box. Keep this in mind — it really matters.
            • 83:30 - 84:00 I want to show you what its output looks like, so before running the whole thing let's run only this activity. To skip the others, click each remaining activity, go to General, and set Activity state to Deactivated — if an activity sits on your canvas but you don't want it to run, deactivating it (you'll see it grayed out) keeps it out of the run. Do the same for the other one, and now only
            • 84:00 - 84:30 the Lookup will run. I want you to see its output because that output is exactly what we will pass into the ForEach activity. Let me debug it — things are getting clearer by the minute, so stay with this part for a few more minutes and everything will become crystal clear, because this is the crucial bit,
            • 84:30 - 85:00 and I struggled with it too when I was learning; rewatch it as often as you need, but once you master it your ADF skills will skyrocket. It succeeded! To look at the output, click the small output button on the activity run, and
            • 85:00 - 85:30 under the "value" key is all the information we need. That's the part we want to use, because — as you can clearly see — it is a list, and a list (array) is exactly what the ForEach activity expects. So, comment below if you guessed it: we will use this "value"
            • 85:30 - 86:00 key as the array we pass into the ForEach, and it will iterate over those entries — each one holding the relative URL, the sink folder, and the sink file — one after another. How do we wire that up? It's
            • 86:00 - 86:30 simple: close this output view and reactivate the other activities. To make one activity depend on another we use the little nodes on its edge — Skipped, Succeeded, Failed, Completed — and we'll use "Succeeded": if the Lookup succeeded, run the ForEach. Connect them, and
            • 86:30 - 87:00 now it's time to fill in Items. What do we write there? Dynamic content, because we are not hard-coding anything — we are referencing an activity's output. Click Add dynamic content and pick the activity output, because we want to use the output of an activity as our items;
            • 87:00 - 87:30 in our scenario a few activities are listed, but the one we want is — you guessed it — the Lookup activity. Click it, and then you need to add one small thing, which is exactly why I showed you the output: if I leave the expression as just "output", it won't work, because the output object doesn't
            • 87:30 - 88:00 hand us the array directly — the array lives under the "value" key. The output has several keys, but we only want that one, so we write output.value. That was the catch, and it's another interview question, because these are small, scenario-based details you won't find in the usual
            • 88:00 - 88:30 interview-question PDFs — you learn them by actually doing the work. So use ".value" and click OK: the ForEach now receives its array from the Lookup output. Done.
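Here is a sketch of how the ForEach ends up wired: it runs only if the Lookup succeeded, and its items expression is the Lookup's output.value array. The copy activity gets nested under it in the next step, so it is omitted here; activity names are the assumed ones from above:

```json
{
  "name": "ForEachGit",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "LookupGit",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "isSequential": true,
    "items": {
      "value": "@activity('LookupGit').output.value",
      "type": "Expression"
    }
  }
}
```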
            • 88:30 - 89:00 The remaining steps are easy: we just need to embed the copy activity inside the ForEach and feed the loop's values into those parameters. Select the copy activity and press Ctrl+X to cut it — plain Ctrl+X / Ctrl+C / Ctrl+V, don't overcomplicate it. Click the ForEach activity, open its Activities tab (it asks which activities to run on each iteration), and click the
            • 89:00 - 89:30 pencil icon; it opens a new canvas, and now we are inside the ForEach. Paste the copy activity. The source needs the value for the relative URL, and this time we have it, because the values come
            • 89:30 - 90:00 from the loop. Click the source parameter box, choose Add dynamic content (since we are using parameters), and use the item: as we discussed, each complete dictionary in the array is one item, and each iteration gets a different item — 1, 2, 3, 4, 5. So we use
            • 90:00 - 90:30 @item(). The item holds three keys, and for the source the key we need is p_rel_url, so the expression is @item().p_rel_url — now you've got it. Click OK; the source is done. Click Sink and do the same on your own: for the sink folder it's the item again, and of its three
            • 90:30 - 91:00 keys we pick p_sink_folder, so @item().p_sink_folder. Now the third one, the file name: same thing, take the whole item and pick a key for the file name — I'll type @item().p_file_name.
            • 91:00 - 91:30 Simple. That is the end of the solution we set out to build — rewatch this part a few times even if you followed everything, just as revision, so it really sinks in. Now we are all set to run the pipeline, and we want it to succeed without any errors;
            • 91:30 - 92:00 once it does, we celebrate together. Click Debug, fingers crossed, and let's see what we get. Trust me, once you are comfortable with this concept you can handle pretty much any real-time
            • 92:00 - 92:30 scenario or question, because it exercises logic building and real understanding. Ooh — we see a lot of red crosses. Okay, this was expected — no, I'm not kidding. What caused the failure? The key "p_file_name" does not exist in the JSON file
            • 92:30 - 93:00 — I did that intentionally, sorry, but I wanted to make the point. Look at the JSON file: the key there is named "p_sink_file", while our dataset parameter has a different name, "p_file_name". That is allowed — the parameter name and the key name do not have to match, you can use any key
            • 93:00 - 93:30 as long as you reference it correctly. So let's fix it and turn those reds into greens. Open your copy activity, and instead of @item().p_file_name the expression must be @item().p_sink_file, because p_sink_file is what each ForEach item actually carries. This is a confusion many candidates run into, which is why I
            • 93:30 - 94:00 introduced it on purpose — sorry again, but it's for your own good: now you know this kind of error can happen, and that the parameter name and the JSON key name don't need to be identical. So type p_sink_file and save.
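After the fix, the dynamic copy activity looks roughly like this. The dataset parameters are p_rel_url, p_sink_folder and p_file_name, while the values come from the item keys p_rel_url, p_sink_folder and p_sink_file — which is exactly the point being made: the names don't have to match. Source/sink typeProperties are trimmed to the essentials, and names are the assumed ones from above:

```json
{
  "name": "DynamicCopy",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    }
  },
  "inputs": [
    {
      "referenceName": "ds_git_dynamic",
      "type": "DatasetReference",
      "parameters": {
        "p_rel_url": { "value": "@item().p_rel_url", "type": "Expression" }
      }
    }
  ],
  "outputs": [
    {
      "referenceName": "ds_sink_dynamic",
      "type": "DatasetReference",
      "parameters": {
        "p_sink_folder": { "value": "@item().p_sink_folder", "type": "Expression" },
        "p_file_name": { "value": "@item().p_sink_file", "type": "Expression" }
      }
    }
  ]
}
```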
            • 94:00 - 94:30 Now let's run it again. This time it should run fine, and if another error appears, that one is not intentional and we'll debug it together. Errors are part of learning — if you never see errors while learning, you're probably just copy-pasting. When you build your own logic and use your own brain, you hit errors, and once you know how to
            • 94:30 - 95:00 work through them you get better and better; honestly, I love seeing errors. So don't worry if we hit more right now — we'll debug them, bro. Let's see... okay, I can see two greens; let me see more and more greens. I was genuinely
            • 95:00 - 95:30 nervous, because we'd obviously have to debug anything red, and last time it was intentional but this time it isn't. Why do I keep repeating this? Because this part is crucial — when I was learning it I thought about skipping it and doing it later, and I realized afterwards how important these things are; they are what
            • 95:30 - 96:00 make you a pro, and they are the reason you do real-time scenarios and end-to-end projects: you've done your tutorials, now it's time to gain pro-level skills. On my channel you can expect more scenarios, projects and tutorials like this, because I love messing around with my projects, so you will learn a lot — which I think
            • 96:00 - 96:30 is a signal to hit that subscribe button; it's for your own benefit, because this channel is full of knowledge and support for Azure data engineering, Databricks, end-to-end projects, tutorials and more. And we'll keep having fun like this, because I don't like boring classes — I was the studious-but-backbencher kind of student, so you can imagine the environment I like to study in. Don't mind
            • 96:30 - 97:00 me — and I can see many greens, just one more activity pending. Let's watch the status, and tell me honestly in the comments which color you are seeing, red or green, because I did see plenty of red earlier. Oh, one more succeeded — and after this, the first phase of
            • 97:00 - 97:30 the project is complete, and then we jump to the next phase: Databricks. I hope you've learned a lot, bro — everything succeeded, all greens, I love it. Now let's validate it: go to the bronze layer and we should see all the folders. From the Home tab, open the
            • 97:30 - 98:00 storage account, go to Containers, open the bronze container — and I have all the folders, bro. Inside each of those folders there should be a CSV file: Calendar.csv, Customers.csv, and so on. So now it's time to celebrate — nothing wild, just a healthy congratulations to
            • 98:00 - 98:30 everyone who got this far. Trust me, if you have built this dynamic pipeline you have already learned a lot of ADF, so pat yourself on the back — good job. Our first phase of the project is done, and now we are entering phase two. Are you
            • 98:30 - 99:00 excited to use Databricks? Let me show you how to use Databricks in your project. It's time to create another resource: Azure Databricks — show some excitement, because Azure Databricks is genuinely in demand right now. Without wasting time, let's get started: click the Home button, and by now you know the drill —
            • 99:00 - 99:30 click Create a resource and this time type "databricks". Search it and pick the Azure Databricks offering from Microsoft (oops, let me retype that — this Escape key is a pain). Click Create, and it's the same steps again: pick your
            • 99:30 - 100:00 resource group — "aw project" — and a workspace name, say "adb-aw-project". Nice name. Then the pricing tier: Standard, Premium, or Trial — any of them works. If you are on a free account you can keep the Trial tier; you don't need to keep your data beyond the 14-day trial anyway, since this is
            • 100:00 - 100:30 just for practice, so don't stress over it; I'm keeping Premium. Next is the managed resource group. What is that? In Databricks all our transformations run on clusters, but we are not the ones looking after those clusters — Databricks manages them — and it creates a managed resource group where it
            • 100:30 - 101:00 keeps all the cluster infrastructure: the VMs (virtual machines), the VNet (virtual network), and the rest of the cluster configuration. A managed resource group is still just a resource group, so I recommend giving it a name that mirrors the workspace name so it's easy to find — no stress, simply write
            • 101:00 - 101:30 "managed-adb-aw-project". Then Next through Networking, Next through Encryption, Next through Security, Next, Next, and finally Review + create. It will validate, and then we click
            • 101:30 - 102:00 Create. I was laughing because it reminded me of my childhood, when we got our first family computer and my friends told me that to install any game you just click Next, Next, Next and then Install at the end —
            • 102:00 - 102:30 we never read anything written in those game instructions, we just hit Next until Install and then played computer games without internet. Anyway, it is being deployed, and once that finishes we'll go deeper into Databricks. Let's wait — just a few more seconds — and then
            • 102:30 - 103:00 we will create our first cluster, notebook and workspace folder, everything from scratch. After covering this part you can confidently mention Databricks on your resume, because you will be genuinely familiar with it, and I will
            • 103:00 - 103:30 also show you some transformations so that you can sit in interviews and confidently say "I know how to perform these transformations." A lot of it is a game of confidence; no one knows everything, not even the interviewer. If you don't know something, just say so, and add that you can still solve it because you know how to debug,
            • 103:30 - 104:00 because in real-world scenarios you won't remember everything anyway; you will use online resources and documentation. It's all about how you present yourself. Okay, enough philosophy, let's start
            • 104:00 - 104:30 working with Databricks. Click on "Go to resource" and then launch the workspace by clicking "Launch Workspace". It will take us into the Databricks workspace. This is the UI, by the way; they recently updated it, it used to be dark
            • 104:30 - 105:00 gray or black and now it's white (I preferred the previous one). So this is Databricks, one of the most in-demand technologies right now, and we will cover all the information you need, don't worry. First of all, the most basic thing is compute, because we will be applying
            • 105:00 - 105:30 transformations and processing our data using compute, and compute here is nothing but a cluster. So let's create the cluster first. The other items in the sidebar are straightforward: Workspace is just like the folders we created in Data Factory; Catalog is where data catalogs, external locations and so on are defined,
            • 105:30 - 106:00 ignore that for now; Workflows is for orchestration; and then there are the SQL editor, queries, dashboards, alerts, job runs and Delta Live Tables, forget about those for now. Just create the compute and we will start working with it. Click on Compute and you will see this interface; obviously you won't have any compute ready yet, so
            • 106:00 - 106:30 click on "Create compute". This is the configuration we need to complete to create a new cluster for our workloads. First there is the policy, and you can change the cluster name as well; I'll call mine "aw project cluster" and keep the policy as Unrestricted. Choose Single node, because we do not want to spend
            • 106:30 - 107:00 much and we can easily process this amount of data on a single node. Access mode is Single user (you can keep it as No isolation shared if you want). Then you need to pick the runtime,
            • 107:00 - 107:30 one with long-term support; LTS means long-term support. Currently there is 15.4 LTS, but free-account users may not have it, so I'm picking 14.3 LTS. If you have 15.4, go with it, it's the same. For the node type I'll use the cheapest general-purpose option, Standard_DS3_v2, which gives four cores, and that's enough. Then there is
            • 107:30 - 108:00 "Terminate after", which is really important if you want to save cost. Free-account users don't need to worry, but on a paid account like mine it automatically terminates the cluster when it sits idle; the default of 120 minutes is a big window, so I'll set it to 20 minutes. I don't want to
            • 108:00 - 108:30 add any tags, and I also don't want Photon acceleration, so this will be a basic cluster. Click on "Create compute" and that's it. As you can see, it is creating our compute, and you can also check it from the Spark UI. Just wait; once our cluster is ready
            • 108:30 - 109:00 we can actually start working with Databricks, because a running cluster is the first thing you need in your Databricks account. The Databricks Community Edition also provides a cluster, and you can use that too, but obviously not with Azure; we will create dedicated Databricks videos in the future, don't worry. For this project it matters, because it is a totally different experience when
            • 109:00 - 109:30 you actually connect Databricks to Azure storage, or any cloud storage. In the Community Edition you don't get that feel of connecting to cloud data, pulling it and pushing it, so this exposure is really important, and when you get the opportunity to work with these services for free, you should take advantage of it. And here I am, your friend, your bro,
            • 109:30 - 110:00 to guide you on how to use it; when I was learning I had to dig through so many scattered resources, so now it's time to provide everything in one place. That's the whole intent. So first of all, congratulations on completing phase one of our project, which was ingesting the data dynamically. You should be proud of yourself, because you learned how to build dynamic data
            • 110:00 - 110:30 pipelines in Azure Data Factory; phase one is complete, good job. Now we move to the next phase, and the next phase is all about Azure Databricks, this masterpiece. As we all know, Azure Databricks is one of the most in-demand technologies right now for dealing with big data, and we will be learning everything. This particular phase of the
            • 110:30 - 111:00 project will be full of Databricks, Databricks and more Databricks, because it is the tool responsible for the data transformations in our project, and it is the tool responsible for data transformations in the real world as well. I picked this technology because it really helped me crack data engineering interviews. So what can you expect in
            • 111:00 - 111:30 this phase of the project? We will pick the data stored in the raw (bronze) layer and ingest it into the silver layer with some transformations. That's phase two, and it will be a lot of fun. Without wasting any time, let's get started, because we have
            • 111:30 - 112:00 already created the Azure Databricks workspace. I'm really excited to walk you through the ins and outs of the things you should know before working with Databricks, because some of them are genuinely important and you cannot start working with Databricks without them. What are they? I'm talking about storage access. What's
            • 112:00 - 112:30 that? Don't worry, let me explain. It's simple, but it's really important, because as a data engineer this part is yours to own. As we all know, our data is residing in the data lake. Sorted. Now we want to work with it from Azure Databricks. Okay, but just
            • 112:30 - 113:00 tell me one thing: the data lake is a resource that can be used independently, right? It is not linked to Databricks in any way. So how will Databricks access the data stored in the data lake?
            • 113:00 - 113:30 Obviously we need some permissions, some access-level permissions, some authority, some credentials, to access the data stored in the data lake. How do we get them? This is where data access comes in; think of it as a kind of key. What we will do is create an
            • 113:30 - 114:00 application, a service principal application, and this application will be granted access to the Azure Data Lake. Databricks will then use this application. Let me give you an example: suppose you want to enter a museum; you need a ticket to get in, right? This application acts as that ticket, as a
            • 114:00 - 114:30 credential that Databricks must hold. When Databricks goes to the Azure Data Lake and says "hey, Data Lake, I want to access the data you have stored", the data lake says "show me the ID card, because I cannot allow everyone in, I am the security
            • 114:30 - 115:00 guard here." Databricks shows the credential, "this is my ID card", and the data lake says "okay, now you can enter my zone and pick up the data." Sorted. That's the simplest example I can give. That is all data access is, and it's really easy. Let's jump into the Azure portal and actually set it up. That was the architecture, the behind-the-scenes, and you are the owner
            • 115:00 - 115:30 of this step, so you need to do it. Let's get started; it is the official first step of phase two of the project. I am in my Azure portal, and the first thing you need to do is go to the search bar and search for "Microsoft Entra ID". Click on
            • 115:30 - 116:00 it and you will see an area with different sections such as users, groups, external identities, roles and administrators. Don't worry about those; focus on "App registrations", because we will be registering an application. Click on it, and then click on "+ New registration". Here we
            • 116:00 - 116:30 just need to name it, let's say "aw project app". You don't need to change anything else; just click Register. It will create the application. And what will we do with this application? Hold your breath, take a deep breath. Our application is created, very good. As you know, our
            • 116:30 - 117:00 second step is to assign this application a role that can contribute to our data lake. Before doing that, I want to save some information from this page. Which information, and why? Because when we reference this application in Databricks we need to pass these IDs, so I need to save them
            • 117:00 - 117:30 somewhere. Let me copy them and open a Notepad where I can keep them.
            • 117:30 - 118:00 I copied the Application (client) ID, so I will simply write "app ID = <value>". Then I'll copy the object ID as well; click "copy to clipboard" and note it down as "object ID". Don't overthink it, just save this
            • 118:00 - 118:30 information. The third thing we need is a secret, because Databricks will use it to authenticate. How do we create a secret? Simply click on "Certificates & secrets", then "New client secret", and give it a description, let's say "aw
            • 118:30 - 119:00 project". The recommended expiry is 6 months, but I believe my viewers, my data lovers, my data community will finish this project well before six months, so don't worry at all, just click Add. Here we have the secret, and we need the Value of this secret, so copy the Value, not the Secret ID: click
            • 119:00 - 119:30 copy next to Value and save it; I'll label it "secret". I'm actually not sure whether it is the Value or the Secret ID that we need; we will test it, don't worry, we just need to store the information. Let me store the other one as well, because precaution is better than cure. I'll call it "secret
            • 119:30 - 120:00 two", and if the first one doesn't work we will use the second one. Okay, step number one, creating an application, is done. Step two is assigning a role to this application so that it can access the data later. How do we do that? Go to your home tab and select your storage account, because the role needs to be assigned on the storage account. How do we do
            • 120:00 - 120:30 that? Simply click on "Access control (IAM)". Click "Add", because we want to add a role, then "Add role assignment". There are many roles, and the role we want is Storage Blob Data Contributor: it grants both read and write,
            • 120:30 - 121:00 so you can either write or read, which is the right level of access to give here. Search for "Storage Blob Data Contributor", pick it, and click Next. Now it asks you to select members, so click "Select members", because we need to select our application. Search for "aw project" and
            • 121:00 - 121:30 you will see your application; pick it, click Select, and then click "Review + assign" (twice). Now it adds the role; you can see the "adding role assignment" notification. It can take around 10 to 15 minutes to fully propagate, even if it already shows as applied, but don't
            • 121:30 - 122:00 worry, we have some work to do in the meantime. Our two steps are done; now comes the third step, and for that we go to Databricks, the loveliest part. Open your Azure Databricks workspace and launch it. You already know we have already created the cluster, which is great; it automatically turns off after
            • 122:00 - 122:30 some idle time, so if your cluster is turned off, turn it on; if it is already running, even better, no worries at all. Now it's time to actually start working with Databricks. Are you excited to create your first notebook? I was really excited when I created mine, and I hope my data lovers are too. Click on Workspace, because first we will create a folder,
            • 122:30 - 123:00 a workspace for ourselves. The folder is empty because we haven't created anything yet, so click Create in the corner and create a folder. What name do I want to give it? "aw project". Create. Now, within this aw project folder, I want to create a notebook, so click Create again and this time create a
            • 123:00 - 123:30 notebook. Woohoo, this is our notebook. If you are familiar with pandas or with Jupyter, you will recognize this user interface; if not, don't worry, it's the kind of environment where, instead of running the whole program at once as you would in VS Code, we run the code in chunks, cell by cell. You will get the hang of it.
            • 123:30 - 124:00 First of all, I like renaming my notebook right away, because that is the best way to navigate your notebooks later if you forget to name them. I will call it "Silver Layer". Simple. The notebook is renamed; you can see the File,
            • 124:00 - 124:30 Edit, View, Run and Help menus, and this is the cell in which we will write our code. Next I will turn on my cluster, because it was turned off. How? Click the button called Connect; it shows all the available clusters, in my case just one, so I'll click it and start it. You can see it says "starting", so it is turning on my
            • 124:30 - 125:00 cluster, and while it starts I will add some headings and clean up my notebook, because I'm planning to upload this notebook to my GitHub repository so that you can keep it open next to your own notebook and refer to the code; it will be good for your learning. See, I do care about you all. So first I will add a heading. How do we add headings,
            • 125:00 - 125:30 given that a cell normally expects code? This is one of the things I really like about notebooks: we can add headings and titles too. There are two ways. One way: hover over your code cell and you will see an option labelled Python; click it and you will see multiple choices, Markdown, SQL, Scala, R, meaning you can use all of these languages. But wait, Markdown is not a language; when you click Markdown, you can
            • 125:30 - 126:00 now write headings. Let's say I want to write "Silver Layer Script", and to make it a heading I add hashes: three hashes mean an H3 heading, as in HTML, if you're familiar with that. To run the cell, simply press Shift+Enter.
            • 126:00 - 126:30 Simple. As you can see, it rendered the heading. Now I will add another heading, "Data Loading", and run it with Shift+Enter. Let's make this heading a bit bigger by removing two
            • 126:30 - 127:00 hashes, so now it is an H1 heading. Everything is set, and I think my starter, sorry, my cluster, is about to start. Once it starts we can begin loading our data. One important thing: our step three is still pending. Before loading data we need to wire up our application inside Databricks.
            • 127:00 - 127:30 Strictly speaking we do not create the application in Databricks; we just write some code that pulls in the application's credentials and uses them. You don't need to memorize this code, it is all in the documentation, so you just copy and paste it. Let me show you. First I'll add one cell here in between: click "+ Code", then switch it to Markdown
            • 127:30 - 128:00 again, and this time I'll write "Data Access Using App". This time I'll hit Alt+Enter: it runs the cell and adds a new cell below it. It's up to you whether you use Shift+Enter or Alt+Enter; my duty is to tell you everything. Now it's time to look at the documentation.
            • 128:00 - 128:30 To get the code, simply search on Google for "access data lake using Databricks" and look for the documentation page "Connect to Azure Data Lake Storage Gen2 and Blob Storage". Click it. Whenever you want to access the data, you just copy the snippet from the documentation; you don't need to learn it by heart, it exists purely for data access.
            • 128:30 - 129:00 Scroll down and you will see the service principal section, which is exactly what we were looking for. You use the following format to set the Spark configuration (the cluster itself is already configured; Databricks does that for us). So now we just need to copy this code. I'll click Copy and go back to my Databricks notebook
            • 129:00 - 129:30 and paste it there. Do not run it yet, wait, because we need to fill in some values; this is a generic snippet and it asks for a secret, a storage account, an application ID and a few other things, so we need to pass in our own values. Don't worry, I'll show you how
            • 129:30 - 130:00 to do that. First, the service credential line that reads from a secret scope: remove it, because we already saved the credential in our Notepad. Next, the storage account placeholder: highlight it and remove it carefully, without deleting anything else, and replace it with your storage account name.
            • 130:00 - 130:30 To get your storage account name, go to your Azure home tab and copy it from there; do not copy and paste my storage account name, use your own. I'll copy it and paste it in place of the first placeholder, then again in the second one, then again in the
            • 130:30 - 131:00 third, the fourth and the fifth, because the placeholder appears in every line. Along with this we need to pass two more things. The application ID: we saved it earlier, so copy your application ID (again, yours, not mine) and paste it in. While filling in this information you can pause the video and cross-check
            • 131:00 - 131:30 everything; do not be in a hurry. Then we need to pass the directory ID. I think that was this one, the object ID... yes, let's use that. If anything fails, don't worry, we can just go back to the app registration and check the information again; it's not as if we clicked away and can never return.
            • 131:30 - 132:00 Just paste it in. Now the service credential: that's the secret value we saved. Let's try the first one, and if it fails we'll use the second. We just need to wrap it in double quotes. Now I'm going to run this; take a screenshot or pause the video if you need to.
            • 132:00 - 132:30 I'm going to run it, and if it throws any error we will tackle it together, don't worry, because we are only pasting code that is already in the documentation. Oh, it ran successfully, wow. So what have we actually done here? We have employed, or you could say called, our application to allow us, meaning allow Databricks, to access the data, because
            • 132:30 - 133:00 Databricks cannot go and access the data directly. We go in indirectly, with a credential in hand, saying "hey, I (Databricks) have the required identity card, please allow me to access the data stored in the data lake." Sorted. A sketch of what this configuration cell ends up looking like is shown below.
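For reference, here is a minimal sketch of that service-principal configuration cell, following the pattern from the Databricks documentation the video copies. The storage account name, application (client) ID, directory (tenant) ID and secret value are placeholders you must replace with your own; the video pastes the secret as a plain string, although a Databricks secret scope would be the safer choice.

```python
# Minimal sketch of the ADLS Gen2 service-principal (OAuth) configuration.
# Placeholders: replace <storage-account>, <application-id>, <directory-id>, <secret-value>.
storage_account = "<storage-account>"        # e.g. your data lake storage account name
application_id = "<application-id>"          # Application (client) ID from the app registration
directory_id = "<directory-id>"              # Directory (tenant) ID, not the object ID
service_credential = "<secret-value>"        # client secret Value (ideally via dbutils.secrets.get)

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", application_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", service_credential)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{directory_id}/oauth2/token")
```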
            • 133:00 - 133:30 So now it's time to test this connection, and it will really be tested once we load some data, so let's quickly read dataset number one. What is dataset number one? Let me check: go to the storage account, click Containers, click bronze, and the first one I'll read is AdventureWorks_Calendar. How do we do that? First let me write a heading. Here's the second way of
            • 133:30 - 134:00 adding a heading: put "%md" at the top of the cell. md is short for markdown, and the percent sign is the syntax of what's called a magic command. I'll write "Read Calendar Data", because I want this notebook to be neat and clean so you can refer to it and learn better. So let's read the calendar data; the code is
            • 134:00 - 134:30 very easy. Simply write df (or any variable name for your DataFrame): df = spark.read.format(...). First, understand this: we say spark.read.format and then pass the format the data is in; it's CSV, so put "csv". Then we say .option, and the option we want is
            • 134:30 - 135:00 header equals true, so we write option("header", True). Then another option: inferSchema, written as option("inferSchema", True). Don't worry, I will explain what all these terms mean; let me just test the connection and load the data first, and then I'll explain them. Then I write .load,
            • 135:00 - 135:30 and now we need to give the location, and there is a specific format for the URL of data stored in the data lake. Let me quickly put it here first (you can copy it from the notebook), and then I'll explain how to remember the format, because it can be confusing and an interviewer can ask you about it; this URL is not readily available in the data lake UI, so you either
            • 135:30 - 136:00 refer to the documentation or remember it from practice, which I do by now. Let me quickly write it: abfss://bronze@ followed by the storage account name, which is my aw storage data lake account (we'll explain it in just a few seconds), then ".dfs
            • 136:00 - 136:30 .core.windows.net". Simple. Let me run it... okay, I can see some errors. What is that error? Oh, I see, let me also mention the folder name. And what was the folder name?
            • 136:30 - 137:00 AdventureWorks_Calendar. Let me copy it and add it to the path. Let's read this again... error: the HTTP connection failed. Oh, I get it, it wasn't about that particular folder; the connection we made with the application failed. Now I want to dig deeper into why, into which value we put in wrong. My intuition says we have mis-entered one of these
            • 137:00 - 137:30 values... and I see it: we copied the object ID instead of the tenant ID. Just a silly mistake, and you know how to correct it; let me show you from the beginning. Go back to your
            • 137:30 - 138:00 Microsoft Entra ID, select App registrations, and click "All applications" to see the list. Click on the aw project application, the one I just created (the other two are from my other projects, don't mind them). It shows the same overview page, and instead of the object ID we need the Directory (tenant) ID. Just a silly mistake: we just
            • 138:00 - 138:30 need to replace that object ID value (the one starting with e70 something) with our tenant ID. Let's do that, rerun the config cell, and rerun the read cell as well. Now I should see some data, and if not we will tackle the errors, because that's how you become a pro. Feel happy when you
            • 138:30 - 139:00 see errors; I might be the first developer you've heard say that, but trust me, it helps a lot in your learning journey. It's better to see errors in your room than in front of an interviewer, right? And as you can see, we have successfully loaded the data. To display the data there is a
            • 139:00 - 139:30 .display() method: just write df.display() and it shows the data. As you can see, we have just one column in this data source, so obviously it gives us only that one column. Congratulations! Now let me explain what we did in the reading code. The code is really simple, so let me break it down for
            • 139:30 - 140:00 all of you. First, we have already covered that "csv" is the format. Then there is the header option: we set header to true because we want the first row to be treated as the header. Then what is this inferSchema? By default, when data is read from CSV, all columns are treated as text, so we want Spark to infer the schema, which means
            • 140:00 - 140:30 deciding the schema on its own by looking at the data, instead of us defining the schema explicitly again and again. We don't pass any schema argument; based on the data Spark predicts the schema, which is very handy when you don't want to hand-write schemas for your DataFrames over and
            • 140:30 - 141:00 over. I use inferSchema a lot. Then there is the load command, which, as the name suggests, loads the data. Now for your favourite part: how to remember the format of the URL. It's very simple. First, let me move the path to the next line, because I don't like keeping all the code on
            • 141:00 - 141:30 one line (yes, I know I'm the one who wrote it, sorry). You can move your code onto the next line by adding one small thing: a backslash. Do not use a forward slash, which is the regular slash; the backslash is the key just above Enter. With that, Python lets you continue the statement on the next line.
            • 141:30 - 142:00 So what is the format? Very simple: first you write abfss, the secure Azure Blob File System scheme, then your container name (ours is bronze), then "@" followed by your storage account name (mine, not ours: my aw storage data lake account), then ".dfs.core.windows.net" exactly as it is,
            • 142:00 - 142:30 and then your folder name. That's the whole format. Don't feel overwhelmed; when you are just starting with Databricks it is hard to remember, but you can copy and paste it, and over time you will be able to form the URL yourself just by looking at the storage account and container names. A sketch of the full read is shown below.
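As a reference, here is a minimal sketch of the calendar read described above. The storage account name "awstoragedatalake" and the folder name "AdventureWorks_Calendar" are assumptions based on the names used in the video; swap in your own.

```python
# Read the calendar CSV from the bronze container of the data lake.
# header=True       -> use the first row as column names
# inferSchema=True  -> let Spark guess column types instead of reading everything as strings
df_cal = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("abfss://bronze@awstoragedatalake.dfs.core.windows.net/AdventureWorks_Calendar")
)

df_cal.display()  # Databricks convenience; df_cal.show() works in any Spark environment
```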
            • 142:30 - 143:00 For now, just copy it from here; I've paused the video so you can grab it. Okay, sorted. Now let's load the other DataFrames as well, since we have successfully loaded the calendar data. What we can
            • 143:00 - 143:30 do is simply copy and paste the code we used for reading, which saves a lot of time, but keep one thing in mind: add a suffix to the variable name, say df_cal, so that it is identifiable later that this one holds the calendar data. Also delete the display call here, because we will display the data later. Let me change the heading from "Read Calendar Data" to
            • 143:30 - 144:00 "Reading Data", so it is generic for all the data reading. Then I copy the code and read the next dataset, which is, let me check by clicking Containers and then bronze, yes, customers. I just need to change the folder name, because everything else stays exactly the
            • 144:00 - 144:30 same, and since this is customers I want the DataFrame to be called df_cus. Read it, yes, successful. Then I'll read the next dataset, product categories, by changing the folder name to the product categories folder
            • 144:30 - 145:00 and running it. Oops, I forgot to change the variable name, so I'll call it df_procat and rerun it (and rerun the previous cell too). Then the next one; this is just the data-reading step, we are reading all the datasets one by one. This time it's products, so I change the folder name to
            • 145:00 - 145:30 products (just add the s). Then another one, returns: I'll call it df_ret, point it at the returns folder, and run that as
            • 145:30 - 146:00 well. Then we have sales for 2015, 2016 and 2017. Let me show you one thing: instead of writing 15, 16 and 17 again and
            • 146:00 - 146:30 again, I can just put an asterisk after the common prefix. Let's try it... it ran. What we have done is tell Databricks to pick every folder that matches the naming convention AdventureWorks_Sales*, whether it is 2015, 2016 or 2017, and read all the files. Just a pro tip, and a potential interview question as well: how can you read files recursively with a wildcard? A sketch is shown below.
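A minimal sketch of that wildcard read, assuming the sales folders share an "AdventureWorks_Sales" prefix (for example AdventureWorks_Sales_2015/2016/2017; adjust the pattern to your actual folder names):

```python
# The * wildcard makes Spark read every folder matching the prefix,
# so the 2015, 2016 and 2017 sales files land in a single DataFrame.
df_sales = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("abfss://bronze@awstoragedatalake.dfs.core.windows.net/AdventureWorks_Sales*")
)
```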
            • 146:30 - 147:00 Now we can read territories as well: I'll call it df_ter and point it at the territories folder. I also need to fix the name on the sales cell, because that one is for sales, and rerun df_pro as well, since I reused that DataFrame name by mistake. Do not make that mistake, and if you do, accept it and
            • 147:00 - 147:30 correct it. Then we have subcategories, just one more, df_subcat for subcategories, and then... oh, I see one error, don't worry, we'll correct it.
            • 147:30 - 148:00 Then we have products. Don't mind this extra product file; that's the one we pulled when we were testing our static pipeline back in the Data Factory phase, which is why its naming convention looks so different from the others. Ignore it, because we already have the products file. So I know we have an
            • 148:00 - 148:30 error; what is it? I think I messed up the spelling of the subcategories folder. Let me just copy the folder name from the portal... oh, I didn't add "Product". Okay, let me run
            • 148:30 - 149:00 it again. What? Oh, it's "Product_Subcategories", not "AdventureWorks_Subcategories"; okay, makes sense. We are humans, we make mistakes; the difference is that we correct them. So now all of our DataFrames are loaded. If you prefer something more compact than copy-pasting the same read over and over, a loop-based sketch is shown below.
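This is not what the video does, but as an optional, more compact alternative you could drive all of these reads from a small dictionary of folder names. The folder names and the storage account below are assumptions; match them to what you actually see in your bronze container.

```python
# Optional: read every bronze dataset in one loop instead of copy-pasting cells.
# Keys become DataFrame names, values are the folder names in the bronze container.
base = "abfss://bronze@awstoragedatalake.dfs.core.windows.net"
folders = {
    "df_cal":    "AdventureWorks_Calendar",
    "df_cus":    "AdventureWorks_Customers",
    "df_procat": "AdventureWorks_Product_Categories",
    "df_pro":    "AdventureWorks_Products",
    "df_ret":    "AdventureWorks_Returns",
    "df_sales":  "AdventureWorks_Sales*",          # wildcard picks up 2015/2016/2017
    "df_ter":    "AdventureWorks_Territories",
    "df_subcat": "Product_Subcategories",
}

dfs = {
    name: spark.read.format("csv")
                .option("header", True)
                .option("inferSchema", True)
                .load(f"{base}/{folder}")
    for name, folder in folders.items()
}
```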
            • 149:00 - 149:30 Congratulations, you have successfully loaded all the DataFrames. Now we will perform some transformations and simultaneously push this data to the silver container. Are you excited to learn some of the handy functions available in Databricks? Let's transform our data to some extent; obviously we are not doing hardcore
            • 149:30 - 150:00 transformations, because these open datasets are already clean, but we will try to cover as many functions as we can so that you get familiar with the transformations available in Spark and PySpark. So let's start transforming our data, pushing it to the silver layer, and completing phase two successfully. Love you, my data community. Let me put a heading here and then we'll start.
            • 150:00 - 150:30 The heading is "Transformations". Perfect. First of all, let's transform the calendar data. You'll ask: what is there to transform, it has just one column? We will still transform it. First I'll display the data with df_cal.
            • 150:30 - 151:00 display()... LOL, it's saying that DataFrame doesn't exist, "name not defined". Really? We defined it. Let me rerun the read cell... yeah, now display works. So, as you can see, we just have one column. Do you want to learn some date functions? Let's say we have this date and we want to
            • 151:00 - 151:30 create a column called month that holds the month of every date. That can absolutely be the scenario; I have worked on requirements where we just needed to fetch the month or the year. So let's create two columns, Month and Year, and by applying this transformation you will learn the date functions. A special edition just for you. How do we do it? It's very simple:
            • 151:30 - 152:00 I simply write df_cal.withColumn(...). If you are not aware of withColumn, it is a function available in the Spark library that either creates a new column or modifies an existing one. How do we tell the difference? If I keep the column name the same, say I write "Date" and we already have a Date
            • 152:00 - 152:30 column, it will modify that column. But we don't want that here; we want to create a new one, so I pick a name that does not exist in the DataFrame: "Month". Then a comma, and then the transformation we want to apply to this DataFrame. I want to use a function called month; yes, PySpark's
            • 152:30 - 153:00 functions library has a month function. But I just noticed something fishy: I didn't see any autocomplete suggestion for the function, and that's because I haven't imported the library. So the first thing you should always do is add a code cell and pull in the modules: from pyspark.sql.functions import * and from pyspark.sql.types import *.
            • 153:00 - 153:30 Run that, successful. Now when I type "month" I should see the suggestion; see, that's a quick tip for you: with the imports in place you can spot errors before the compiler does, you are your own compiler. So I simply write month(
            • 153:30 - 154:00 and now I need to tell it which column to take the month of. I use col; col is the column object we use with a column name, and we pass our column name, which is "Date". Done. To keep it, I assign it back: df_cal = df_cal.withColumn(...), and to display it I call .display(). Now you will see the
            • 154:00 - 154:30 new column. Perfect; it has given us the month, 1, 1, 1, 1, because these rows are all January, and when I scroll down I can see 8, 9, 10 as well. Similarly I will create another column; I'll use a backslash for the next line, as you all know now, and again say .withColumn with
            • 154:30 - 155:00 "Year" and the year function; yes, there is a year function as well, very good. We just repeat the same steps: the col object and then the column name "Date". Don't make things complicated; just remember to put single quotes around the column name inside col(). Let me rerun it, and now I see the two extra columns, Month and Year. A sketch of this cell is shown below.
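A minimal sketch of the calendar transformation cell described above; the column names Date, Month and Year are assumed to match your data.

```python
from pyspark.sql.functions import col, month, year

# Add Month and Year columns derived from the existing Date column.
df_cal = df_cal.withColumn("Month", month(col("Date"))) \
               .withColumn("Year", year(col("Date")))

df_cal.display()
```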
            • 155:00 - 155:30 As you can see, the Year comes out as 2015, 2016, 2017. Simple. Now I want to push this data to the silver layer (let me remove the display call). How do we do that? The code is a little similar to the reading code, but slightly different. Start with your DataFrame name, df_cal, then write
            • 155:30 - 156:00 df_cal.write; this is the PySpark writer API. Then mention the format in which you want to push the data to silver, and I'd say let's use the Parquet format. What is Parquet? Parquet is a columnar format, it is very much in demand, and we work with Parquet files all the time; I know we also use Delta, but Delta is built on top of Parquet files, so Parquet is very
            • 156:00 - 156:30 much in demand and it is very well optimized for reads precisely because it is a columnar file format. So now you are familiar with Parquet as well: write format("parquet"), then a backslash, and then, before the option, I'll add .mode(). What is this mode in the writer API? There are basically four modes available; let
            • 156:30 - 157:00 me write them down for you, because these are things interviewers ask and you should have them at your fingertips: ask me which mode is used for which purpose and the answer should come instantly. The four modes are: append,
            • 157:00 - 157:30 overwrite, error, and ignore. What do they do? If data is already stored in the folder and we want to append the new data, that is, merge it with the existing data like a union, we use append. If we
            • 157:30 - 158:00 want to replace the data that is already there and store fresh data, we use overwrite. What about error? If data is already there and we try to write to the same folder, it says "hey, there is data here, I cannot write" and throws an error. And what does ignore do? It just ignores things... no, I'm kidding; well, basically it does
            • 158:00 - 158:30 ignore: if data is already in the folder it will not throw an error, but it will not write the new data either, it simply skips the write. I have explained all four modes in the easiest way I can, so keep them in mind and answer confidently whenever someone asks what a particular
            • 158:30 - 159:00 this mode for just say hey bro we already know so I will just use append mode okay when I use append mode I will just use back slash enter Then I need to use option what is this option so in this option command we will give the path where we need to store the data so I will say path comma and then same same data format yeah the the format that we used
            • 159:00 - 159:30 while reading the data same same path not same I mean to say same type same format obviously the path is different because that is a source this is a destination so the source this time is oh sorry destination this time is abfs colon double slash and then this time container is silver not bronze so silver at the rate what is the storage account name aw let me copy this one because rest of things are same we
            • 159:30 - 160:00 just need to save this data in silver container instead of bronze container and we have successfully put the URL here simple one last thing do not forget to mention that dot save so it will just save your data let's run this okay so it has successfully return the data so now I want to check this
            • 160:00 - 160:30 let's go to your storage account and check your silver folder not folder container yes we have returns data why we have returns data oh because we just picked the location of returns sorry let me just change it let me just change it see this is the thing this is the advantage of data validation I instantly validated it and I got the issue just remove this data and just reload it uh let me just rerun
            • 160:30 - 161:00 it simple so now I should see the correct data let me just refresh it yes now we have the data so this was all about date functions and this was all about calendar now the next data frame that we want to transform and push to Silver layer is customers let's do that let's do that I think I should create a heading for you guys so that you you will not feel confused like which transformation is for which data frame let's do that so I will simply
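A minimal sketch of the write cell, assuming the same "awstoragedatalake" account and a silver container with an "AdventureWorks_Calendar" destination folder:

```python
# Write the transformed calendar DataFrame to the silver layer as Parquet.
# mode("append") adds to whatever is already in the folder;
# the other writer modes are "overwrite", "error" and "ignore".
df_cal.write.format("parquet") \
    .mode("append") \
    .option("path", "abfss://silver@awstoragedatalake.dfs.core.windows.net/AdventureWorks_Calendar") \
    .save()
```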
            • 161:00 - 161:30 So I'll add "Calendar" above the calendar cells, so you can easily navigate through this notebook, and then another heading, "Customers". First I want to see my customer data by running df_cus.display(),
            • 161:30 - 162:00 and then we will come up with some scenarios where we can apply transformations on top of this DataFrame. The customers DataFrame lets us apply lots of transformations; it is one of the best DataFrames you could ask for while practising Spark transformations, because there is so much to use. Here is what it looks like. We will cover three transformations on it, and these three
            • 162:00 - 162:30 transformations will be really nice, trust me; nothing crazy, but genuinely handy and helpful for you. The first transformation is text based: we have already covered some date transformations, so it's time for some string or text transformations. What will we do? We have the prefix column (Mr, Mrs, Miss), then we
            • 162:30 - 163:00 have the first name, then the last name. We will create a column called fullName in which we concatenate all three columns. Let's build that transformation; it works like this: df_cus = df_cus.with
            • 163:00 - 163:30 Column("fullName", ...), where fullName is our new column name, followed by the transformation we want to apply. There are basically two ways to do this, which is exactly why I picked this transformation: it is one of interviewers' favourite questions. There are two concat functions available. One is plain concat, where we pass all the columns and the separator ourselves; let me show you that first, and then I'll show you the advanced one. So first
            • 163:30 - 164:00 I write concat, then the first column I want to concatenate, prefix (let me double-check the column name: yes, Prefix), and after that column I want a space, so I add it, similar to how you would in Excel. Then another comma and the second column, FirstName,
            • 164:00 - 164:30 then another comma, a space again, and then the third column name. This is the normal approach that everyone follows, and afterwards I'll show you the advanced approach. The third column is LastName. Perfect; let me display this first, and we will not save this version,
            • 164:30 - 165:00 because we will save the advanced one. So let me display it... what is this error? Oh, I get it; here is another learning for you: I forgot something. The thing is, whenever you want to add a constant, meaning any constant, a space, a number, a letter, whatever, you
            • 165:00 - 165:30 have to wrap it in the lit function. I forgot, because I hadn't used that function in so long; the error said there is no column with that name, and I thought "I didn't use any column name with that space", and then it clicked: I need the lit function. So wrap the space in lit(" "), because we cannot pass a constant directly. Now it should
            • 165:30 - 166:00 be fine, fingers crossed. Let's look at our new column... it worked, yes. Next I'll show you the "advanced" approach; not really advanced, but a function that not everyone knows about. A sketch of this plain-concat version is shown below.
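A minimal sketch of the plain concat version, assuming the customer columns are named Prefix, FirstName and LastName:

```python
from pyspark.sql.functions import concat, lit, col

# Constants such as the " " separator must be wrapped in lit();
# a bare string would be interpreted as a column name.
df_cus.withColumn(
    "fullName",
    concat(col("Prefix"), lit(" "), col("FirstName"), lit(" "), col("LastName"))
).display()
```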
            • 166:00 - 166:30 You have just seen that we kept repeating that space again and again. I don't want to do that; I want to give my delimiter just once and then list the column names. For that we have a function called concat_ws. So I write df_cus = df_cus.withColumn("full
            • 166:30 - 167:00 Name", concat_ws(...)); yes, it's concat_ws. When you open the parentheses you can see in the suggestion that it first asks for the separator and then for the column names. So for the separator I'll use a space, and here I can use the space without
            • 167:00 - 167:30 any lit function, at least I think so, because I used it a few months back; let's test it, and if not we can add lit, not a big deal. Then just add all your columns: col("Prefix"), col("FirstName"), col("Last
            • 167:30 - 168:00 Name"). Now let's run it... see, we only needed to give the space once, no lit required, because this function is built specifically for the case where you don't want to repeat the separator again and again. A sketch of this concat_ws version is shown below.
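And the concat_ws version, which is the one the notebook keeps (same assumed column names; the first argument is the separator):

```python
from pyspark.sql.functions import concat_ws, col

# concat_ws takes the separator once, then the columns to join.
df_cus = df_cus.withColumn(
    "fullName",
    concat_ws(" ", col("Prefix"), col("FirstName"), col("LastName"))
)

df_cus.display()
```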
            • 168:00 - 168:30 Now let me show you the data: go to the next cell and write df_cus.display(). See, now you are learning the advanced things. I should see the column; yes, it's perfect, and this time I have saved the column back to the DataFrame, so it will be pushed to the silver layer as well. Time to write this data: just copy the write code from the calendar section above, and do not forget to change the folder name, the mistake we made earlier. This time I'll say
            • 168:30 - 169:00 customers, and our DataFrame is df_cus. Run the command and we are good; perfect. Now it's time to pull in the third DataFrame. Let me refresh the silver container first; okay, I can see the customer data there. The third one is product
            • 169:00 - 169:30 categories, or rather the subcategories dataset. I don't think there is much data there, so I don't think we need to apply any transformations; we will just read it and write it straight to the silver layer, simple. First I'll write a %md heading, "Subcategories", and then df_
            • 169:30 - 170:00 subcat equals... let me just copy the read code. Oh, we just need to display it first. Yeah, we have only about three columns, so we can simply write this data to the silver layer without any transformation; I really don't think one is needed here, this one is simple.
            • 170:00 - 170:30 So copy the write code from above and write this data with df_subcat, and do not forget to change the column name, sorry, the folder name, to the subcategories folder. Just
            • 170:30 - 171:00 hit Shift+Enter, simple. Now it's time to read the products data. I'll write a %md heading, "Products", and then let's see what we have in the products DataFrame... oh, we have many more columns
            • 171:00 - 171:30 here. So what are we going to do with this particular DataFrame? As you can see, we have all the product-related information, very good, and I have a very nice scenario for this table. There is a ProductSKU column, and this is a real-time scenario, because these kinds of requirements come up
            • 171:30 - 172:00 on normal working days as well: from the ProductSKU I just want to fetch the first two letters, maybe as a mapping to categories or anything like that. That is the requirement. Or, to level it up a bit, I want all the characters
            • 172:00 - 172:30 before the hyphen (there could be two or three of them). Yes, let's do that; that is requirement number one. Requirement number two concerns the ProductName column: as you can see, this column has
            • 172:30 - 173:00 all the product names, and I just want to fetch the first word of every product name, not the first letter, the first word, the part before the space. You can see values like "LL", "HL", "ML", then "Long", "AWC", "Mountain" and so on; I just want that very first word. Here we will learn a very popular and very powerful
            • 173:00 - 173:30 function called split. What does it do? It splits the column into a list, and then we use indexing on top of it to fetch the desired element. In our scenario we just need the first one, so first we split the data on the hyphen and then pick the very first element,
            • 173:30 - 174:00 which is also referred with zero are you excited to to to do that let me show you how you can do that okay so we need to transform our columns this time instead of creating new one so now you will also learn how to transform the column so it is very easy simply write your DF product right equals to DF Pro dot with
            • 174:00 - 174:30 column okay with column and then I will just keep the column name same so that it will just modify the column product SKU now here comes the function called split so how we can use this function in Split we need to just Define the column first and then we need to define the separator so in our scenario the column is product
            • 174:30 - 175:00 SKU and the separator is comma then once it creates a list because split function creates a list based on the separator I just want index number zero simple this is the transformation just focus on this part because you need to do indexing as well because otherwise it will just create a list and we do not want a list we need a
            • 175:00 - 175:30 index zero index so let me just create the second column as well the same way which is our product name okay and I will simply write here product name and here as well product name and this time my separator is space space yeah simple space see space and then again I want zero index like zero zeroe index
            • 175:30 - 176:00 Now let me run this... a syntax error — I think a bracket is missing... actually no, it was fine on a rerun, so just ignore these kinds
            • 176:00 - 176:30 of silly errors. Now let's look at the data frame result: I should see just the first words from the split function, and yes, it has returned the first word of each product name. But we made a small mistake — for ProductSKU we put a comma instead of a hyphen, and I don't know why
            • 176:30 - 177:00 I typed a comma. Let me fix it and rerun... now it gives the desired result. To recap: we just needed the right separator, which is a hyphen, not a comma — I'd been talking about comma-separated CSV files earlier, so a comma slipped in. Just ignore that. (A sketch of the corrected transformations follows below.)
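            Here is a minimal sketch of the two split-and-index transformations, assuming the data frame and column names used above (df_pro, ProductSKU, ProductName):
```python
# Split ProductSKU on the hyphen and ProductName on the space,
# keeping only the element at index zero in each case.
from pyspark.sql.functions import split, col

df_pro = df_pro.withColumn("ProductSKU", split(col("ProductSKU"), "-")[0])
df_pro = df_pro.withColumn("ProductName", split(col("ProductName"), " ")[0])

df_pro.display()  # Databricks display(); use df_pro.show() outside Databricks
```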
            • 177:00 - 177:30 So our products data frame is modified, and now it's time to write it. By now you should feel more confident about splitting a column into a list and indexing on top of it — well done, because it isn't an easy thing to grasp at first. We write df_pro to a folder called
            • 177:30 - 178:00 Products. Next is the returns file: let's read it and see what we have in returns.
            • 178:00 - 178:30 Let me display it with df_ret.display() — oh, I misspelled display; let me fix that. Okay, this is a small data set, so no transformation is needed here either; we'll just write it as-is, because there isn't much to transform in this data frame.
            • 178:30 - 179:00 Don't worry, I have something special to add to your transformation game. Want a hint? When we write the sales data (2015, 2016, 2017) I'll show you something extra: a bit of analysis. We won't do a lot of analytics — just two or three things — and I'll show you how to build
            • 179:00 - 179:30 charts in Databricks without writing any code: you just click a few things and you can create bar charts, line charts, and so on. So there's a special gift coming. This is the second-to-last data frame; then we'll work with the sales data set, do three or four transformations, and build charts so you can play with your data frames and
            • 179:30 - 180:00 leverage that in your own projects — you can showcase that you built some analysis too, because every organization wants analysis on big data, and you'll have done it in Databricks. So it's time to write this data; I've already copied the write code.
            • 180:00 - 180:30 So df_ret, writing to a folder called Returns. Now we have two data frames left, sales and territories, and we'll cover sales after territories because sales is the special gift. First let's quickly transform the territories data, push it to silver, and then come to the gift.
            • 180:30 - 181:00 Next up is territories. What do we have in territories? Run display... "df is not defined" — hmm, we did define it. Let me redefine it; maybe I wrote the reading code but never ran it. No
            • 181:00 - 181:30 worries. Where's my read cell for this data frame? Here it is; let me run it — it was simply skipped. That's the biggest advantage of having the notebook: everything is in sequence, so you can refer back to it and learn in a better way. Run this — another basic, small
            • 181:30 - 182:00 data set, so again there isn't much to do in this data frame. Let's focus on the gift part and cover that data frame in a bit more detail. So let's quickly push this data set as well. I was genuinely impressed when I first saw those transformations and charts — the visualizations in Databricks — because I didn't imagine we could build that kind of analysis
            • 182:00 - 182:30 in Databricks with no code: no Python, no matplotlib, nothing. You just click a few boxes and the charts are built. So let's write this one quickly: df_ter, to a folder called Territories.
            • 182:30 - 183:00 Now all of those tables are pushed to the silver layer, and it's time to pick up the sales data frame — I'm really excited about this one. Let me read the data and display it with df_sales.display(). We already have three years of data in it, which was a smart
            • 183:00 - 183:30 move, because we merged the yearly files into one data frame earlier. There's some nice material for analysis here — an order date, an order number, customer information, and more — so we can perform plenty of aggregations, and I'll show you how. Let's start with a few transformations and then move on to the analysis. What transformations? Let me
            • 183:30 - 184:00 show you. First, one more date function: converting a date into a timestamp. We'll transform the StockDate column — that's requirement number one. Second, replace: I want to cover that function too, because it's genuinely important. We'll replace the letter S
            • 184:00 - 184:30 with the letter T — an arbitrary requirement, but knowing how to replace characters is very handy. That's requirement number two. The third requirement is some simple math,
            • 184:30 - 185:00 because we haven't covered that yet: we'll multiply OrderLineItem by OrderQuantity. I know it doesn't make business sense, but as a developer you should know how to multiply columns — it comes up constantly when you work with numerical data. So, three requirements; let me do them quickly and then we'll
            • 185:00 - 185:30 perform some aggregations, so you learn the aggregation functions as well. The key one is groupBy, one of the most important functions in any data tool, because we usually want aggregated data rather than transactional data — as data people we build reports, build pipelines, and analyze the
            • 185:30 - 186:00 data stored in data warehouses, so aggregates are what matter. Let's get started. The first requirement is simple: convert the StockDate column from a date into a timestamp. To do that, write
            • 186:00 - 186:30 df_sales = df_sales.withColumn(...), keeping the same column name, StockDate, because we're changing the column in place instead of creating a new one. The transformation to apply is to_timestamp — you can see the function pop up in autocomplete — and the column we
            • 186:30 - 187:00 pass to it is StockDate itself. Run it... done. Next, to replace S with T, the second transformation: df_sales = df_sales.withColumn(...), and this time the column
            • 187:00 - 187:30 is OrderNumber. We use a function called regexp_replace, which replaces characters with other characters: first pass the column we're applying the replacement to, OrderNumber,
            • 187:30 - 188:00 then the pattern 'S' and the replacement 'T'. Apply it — perfect. Now the third transformation, multiplying OrderLineItem by OrderQuantity. We'll create a new column for this one, since it's only an illustrative column; we
            • 188:00 - 188:30 just need the multiplication operation: df_sales = df_sales.withColumn('Multiply', ...)
            • 188:30 - 189:00 passing col('OrderLineItem') * col('OrderQuantity'). All three transformations are done — a consolidated sketch of them follows below.
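            A minimal sketch of the three sales transformations, assuming the df_sales data frame and the StockDate, OrderNumber, OrderLineItem, and OrderQuantity column names mentioned above:
```python
from pyspark.sql.functions import col, to_timestamp, regexp_replace

# Requirement 1: convert StockDate from a date into a timestamp.
df_sales = df_sales.withColumn("StockDate", to_timestamp(col("StockDate")))

# Requirement 2: replace every "S" in OrderNumber with a "T".
df_sales = df_sales.withColumn("OrderNumber", regexp_replace(col("OrderNumber"), "S", "T"))

# Requirement 3: a new column multiplying OrderLineItem by OrderQuantity.
df_sales = df_sales.withColumn("Multiply", col("OrderLineItem") * col("OrderQuantity"))
```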
            • 189:00 - 189:30 Now let me look at the data frame. I can see the transformations: StockDate is converted to a timestamp, every S in OrderNumber is replaced with a T, and there's a new column called Multiply where the two values are multiplied — 2 × 2 = 4. Now comes the sales analysis: let's add another heading, Sales Analysis, because here we'll perform a grouped
            • 189:30 - 190:00 aggregation. The requirement: aggregate the sales by day — how many orders did we receive each day? So we apply a groupBy on OrderDate and count the order numbers per day. Let's do it.
            • 190:00 - 190:30 df_sales.groupBy('OrderDate') — let me confirm the column name; yes, it's OrderDate — then the agg function, which stands for aggregation and says: the data is grouped, now which
            • 190:30 - 191:00 aggregation do you want? We apply count on OrderNumber, and the resulting column is named TotalOrders using the alias function — so now you know alias as well. That's the aggregation; let me run it (a sketch follows below).
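            A minimal sketch of that daily-orders aggregation, assuming the same column names:
```python
from pyspark.sql.functions import count

# Group the sales by day and count the orders received on each day.
df_daily_orders = (
    df_sales.groupBy("OrderDate")
    .agg(count("OrderNumber").alias("TotalOrders"))
)
df_daily_orders.display()
```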
            • 191:00 - 191:30 And let me display it too. This is analysis number one: how many orders we received each day. Now the best part, the part you've been waiting for — charts. Right above the results you can see the Table view; let's
            • 191:30 - 192:00 say I want to show my manager how many orders we received each day. A table makes it hard to see the trend, and the trend is what I care about, so I click the + button next to the results and choose Visualization. Now I can see total orders by day, and the trend is
            • 192:00 - 192:30 very clear: it was fairly flat earlier and then jumped from 1 July 2016 onwards. That's a genuinely useful insight, and look how easily the chart was built. You can modify it too — open the settings and choose, say, an area chart — all without writing any code.
            • 192:30 - 193:00 There are plenty of options: change the colors, choose a group-by column, even change the Y column — the aggregation the chart uses. And if you want to keep it in the notebook, just click Save;
            • 193:00 - 193:30 that's it — now you can share the notebook with your manager. And if an interviewer asks whether you can create visualizations, you can confidently say yes. Let me show you one more. This time I want to use the category data, because
            • 193:30 - 194:00 categories give us a natural grouping, and I want a pie chart. That was analysis number one; here's number two. First let me display the product categories data frame — yes, that's the one. Now, to see a
            • 194:00 - 194:30 visualization of it — a pie chart showing the composition, i.e. which category performs well and what percentage each one holds — I could use a bar chart, but I want a pie chart, so I just click Pie chart. Very cool.
            • 194:30 - 195:00 That's two. As promised we'll do three, so for the third let's pick the territories data, since we didn't apply many transformations to it — let's use it in the analysis. For that, df_ter.display(), and I know you're going to play with this a lot, because I
            • 195:00 - 195:30 played with it a lot too — it's a really nice feature. Now I'll build a visualization around the country column. At first it says "no data" because it doesn't know what the X and Y columns should be. No worries: under General, set the X column to Region,
            • 195:30 - 196:00 add a Y column, and leave the scale on automatic — just experiment. For example: how many regions do we have in each country? That's the kind of aggregation you can set up. I can see one region per country, but
            • 196:00 - 196:30 suddenly five in the US — a useful bit of analysis if you want to see how many regions contribute within each country. I could also pick, say, Continent and TerritoryKey, and instead of count use min, median, or average — just play with it; I'll leave that up
            • 196:30 - 197:00 to you. Let's save this one too — that's the third analysis — and keep experimenting, because it really is a nice feature. After playing with it, it's time to write the data: I haven't written the sales data yet, so let's do that now. I'll put the write command above the Sales Analysis heading so that when you refer to the notebook you'll
            • 197:00 - 197:30 know that all the data was written before the analysis starts. To write it, take df_sales — let me copy the write code to save time — and write it to a folder called Sales. Run
            • 197:30 - 198:00 it... perfect. All the data is now pushed to the silver layer, and the analysis is done too. Let me quickly check that the data actually landed in the silver container, since we didn't validate it for the other files... yes, it's all there. That means
            • 198:00 - 198:30 we're done with phase two of the project: we successfully pushed our data from the bronze layer to the silver layer using this powerful tool, Databricks, worked with the Parquet file format, did analysis on top of our data, and built visualizations as well.
            • 198:30 - 199:00 We did a lot in this phase. The next phase is the serving layer, which we'll create in Synapse Analytics. So phases one and two are complete — you've learned a lot, so pat yourself on the back. Congratulations. Now it's
            • 199:00 - 199:30 time to work with our third resource, Azure Synapse Analytics — phase three of the project. You know what comes next: we'll create the Synapse Analytics resource. Synapse Analytics is essentially a data warehouse solution in Azure; it has been extremely popular and still is
            • 199:30 - 200:00 today. Yes, Microsoft Fabric is the next step after Synapse, but Azure Synapse is still very much in use and isn't going anywhere, so don't believe the myths — just learn the skill. Without wasting time, click Create a resource and search for
            • 200:00 - 200:30 Synapse Analytics; it takes us to the marketplace, where we click the Microsoft offering. Do you know what Synapse Analytics actually is? Hold on — let's configure it first, and I'll explain it properly once we're inside the workspace; that will make for a better
            • 200:30 - 201:00 experience. First we pick the resource group. Synapse, like Databricks, also gets a managed resource group, so I'll name that something like rg-managed-awproject-synapse. Azure takes care of it and keeps Synapse's own
            • 201:00 - 201:30 resources there, so don't worry about it — just name it and move on. Next we name the workspace; I'll call mine awproject-synapse. Then we need a storage account: whenever we create a Synapse workspace we must attach a default storage account, because Synapse uses it internally. So it's time to
            • 201:30 - 202:00 set that up. If you already have a storage account you could assign it as the default, but ideally you create a new one and don't touch it, since our project data lives in a separate data lake — this one is only for Synapse. Click Create new, and let's try naming it
            • 202:00 - 202:30 defaultsynapsestorage... already taken... a shorter variation... okay, that one works. Then we need a file system name; none is available, so create a new one — I'll call it defaultfilesystem. Perfect.
            • 202:30 - 203:00 This is just the default storage: whenever Synapse needs to put something somewhere, it uses this account. It's up to us whether we rely on it or on our own external accounts, and in most scenarios we use our own external data lake, so create it and otherwise ignore it. Then you could click Review + create, or step through
            • 203:00 - 203:30 Next: Security, where you configure the SQL Server credentials — the admin credentials for your dedicated SQL pool (don't worry, I'll explain what a dedicated SQL pool is later). Set a password of your choice; you'll need it whenever you connect to the SQL pool, so it's
            • 203:30 - 204:00 required. I'll set the admin username and a password. Then keep the default setting that allows the workspace to access data using its workspace (managed) identity,
            • 204:00 - 204:30 and workspace encryption is fine as it is — we don't need extra encryption. Double-check everything, then on Networking leave the managed virtual network disabled, because we're not using VNets here. A VNet is simply a virtual network where you create your own private network — think private endpoints — used for restricted and sensitive data in industry, so we don't need it. Then Tags, Next, Review
            • 204:30 - 205:00 + create, and finally Create. It will provision the Synapse Analytics resource for us. Once it's done I'll give you a brief overview of Synapse Analytics — why it's in demand and what exactly it offers — so let's wait a few moments and see what it gives us.
            • 205:00 - 205:30 This is our serving layer — the serving zone where data becomes available for data analysts, data scientists, reporting analysts, and BI developers; all of them will consume data from this layer. The workspace is still being created, so hold on a few more seconds — it should be
            • 205:30 - 206:00 done shortly. By the way, while it's deploying: how are you finding this video? If it's insightful, tell me in the comments, and I'm open to any suggestions or feedback
            • 206:00 - 206:30 so I can keep improving the content. And if you'd rather just send some love, type anything that keeps me motivated to create more content — content you can practice with, learn from, and use to succeed in your data journey. That's the whole intent of this channel: to share my knowledge, guide you, and be part of your success, even if it's just one
            • 206:30 - 207:00 person — I'll be happy. The deployment is still in progress... let's fast-forward. Finally it's deployed, and now it's time to explore Synapse Analytics. You're going to love this tool, trust me, for the reason
            • 207:00 - 207:30 I'm about to show you. Click Go to resource group, and there you can see your Synapse Analytics workspace plus — focus on this part — the default Synapse storage account. As I said, this is the default storage account Synapse uses; you could put your data in it as well,
            • 207:30 - 208:00 and you wouldn't hit any access issues, because Synapse pulls from it automatically as its default storage. But ideally we don't use it — we use our external data lake — so don't worry about it; you just need to know it exists, which is why I pointed it out. Now click on the Synapse Analytics workspace, click Open, and it launches the Synapse
            • 208:00 - 208:30 workspace (Synapse Studio). This is Azure Synapse Analytics, and yes, it looks a lot like Azure Data Factory. Time to explore it, and you'll see why I called it such a lovely tool. Let me give you an overview. First, there's an area with
            • 208:30 - 209:00 features similar to what we had in Azure Data Factory. So what is Synapse Analytics? It's a unified platform: inside Synapse we essentially get ADF (Azure Data Factory) plus Spark — not exactly Databricks, but a Spark cluster, so let's call it Spark (PySpark) —
            • 209:00 - 209:30 plus data warehousing. That's why it's called a unified platform: within Synapse Analytics we get Data Factory-style pipelines embedded in the workspace, we get Spark, and we get a SQL data
            • 209:30 - 210:00 warehouse. It's a genuinely powerful tool. See this icon — this is where you build data pipelines with the exact same user interface and functionality; it's called Integrate — only the name differs, everything else is the same. Click the + button, choose Pipeline, and you'll
            • 210:00 - 210:30 see what looks exactly like Azure Data Factory opening up. It isn't the Data Factory resource itself — it's Synapse Analytics — but it exposes the same functionality and user interface that Azure Data Factory gives us. That's the power of Synapse: if you want to build pipelines, you can do it in either Data Factory or Synapse Analytics; the choice is
            • 210:30 - 211:00 yours. You might ask whether people still use Azure Data Factory. Yes — for instance, teams working with Azure Databricks often use Data Factory for orchestration. And everything carries over: all the things we built, like dynamic pipelines, functions, and parameters, work exactly the same here. So that covers the Data Factory part.
            • 211:00 - 211:30 What about Spark — where do we get that? Go to the Develop tab (I call it the scripts area). Here you can build your scripts: SQL scripts, plus what look like Databricks notebooks — but there's a catch: we can't call them Databricks, because Databricks is a separate product with its own
            • 211:30 - 212:00 management layer for Spark. Synapse Analytics has its own Spark cluster management layer, called a Spark pool. If I click the + sign I see options for a SQL script, a KQL script (Kusto Query Language — forget about that for now), and a Notebook. If I click Notebook, you can see
            • 212:00 - 212:30 the exact same notebook UI we had in Databricks — but this is not Databricks; it's a Spark pool managed by Synapse Analytics. Databricks and Spark pools are different entities, but both manage Spark clusters. You can still use all your PySpark functions and PySpark code, because Spark is open
            • 212:30 - 213:00 source and runs anywhere; I'm only talking about the management layer, which comes either as Databricks or as a Spark pool. That covers the Spark side of Synapse Analytics. You can also see we can pick from all the supported languages and use the same features — headings, magic commands, and so on. Third: where is the
            • 213:00 - 213:30 data warehousing? That's the Data tab. Click it and you see Workspace, and when you click the + button you get the option to create a SQL database. This is the data warehousing side, where you create your tables — your facts and
            • 213:30 - 214:00 dimensions — and build your data warehouse. What about the Lake database option? As mentioned, Synapse also gives us Spark, and databases created through Spark are called lake databases — if we create our tables with Spark code it's a Lake database in Synapse, which is what Databricks would call a lakehouse. In this project we'll be
            • 214:00 - 214:30 creating SQL databases, so don't worry about that. So those are the three main layers of Synapse Analytics: ADF-style pipelines, Spark pools for PySpark code, and data warehousing. The remaining two tabs, Monitor and Manage, are similar to the ones in Azure Data Factory, so they're straightforward. So, what are
            • 214:30 - 215:00 your views on the Synapse Analytics workspace? I like it a lot: I can build pipelines, run Spark code, create a data warehouse, and then report on that warehouse with Power BI — a massive, unified solution with everything in one place. This is the kind of tool worth using nowadays, and Azure data engineers and the companies
            • 215:00 - 215:30 relying on Azure use Synapse Analytics heavily; it's a personal favorite of mine too. Now it's time to create the data warehouse. How? Let me click the Home button first — let's forget everything for a moment and start from scratch.
            • 215:30 - 216:00 You need to repeat one more step, but it's much easier this time. Let me show you: here is our Synapse workspace, and here is our data lake — the external storage account. The question is the same as before: how will this Synapse workspace access that data
            • 216:00 - 216:30 lake? Quick revision: we need to grant access. But here's the good news — Synapse Analytics and Azure Data Lake are both Azure products, so we don't need a third-party application or any app registration at all. Synapse can access the data lake directly
            • 216:30 - 217:00 once we grant our workspace the permission, because by default Synapse Analytics has its own credential, and we just need to assign a role to that credential. That's the best part of using Azure services together. So, without wasting time, let's allow our Synapse workspace to access the data stored in the data lake. Again, this is about
            • 217:00 - 217:30 ownership of the data: as a data engineer you need to take care of things like data access. We'll grant access using the identity — the credential — that our Synapse workspace (or any Azure resource) gets by default. This can be an interview question: what is that identity called? It's called a managed identity,
            • 217:30 - 218:00 or system-assigned managed identity. Keep that in your head, because it can save you in interviews. Now let's use it: I'll jump to the Azure portal, go to my resource group, and click on my external data lake. Then it's the same steps as before: go
            • 218:00 - 218:30 to Access control (IAM). This is a revision exercise for you — try to do the following steps on your own without looking at the screen, and don't worry if you get stuck; I'm here to help. After opening Access control, click Add. Why do it yourself? Because revising
            • 218:30 - 219:00 things is how you make them stick. So click Add, then Add role assignment, and the role to assign is Storage Blob Data Contributor — click it, then click Next. Here's the catch: this time we're not using a service principal
            • 219:00 - 219:30 or a user group; we're using a managed identity. I'm explaining this again because it's really important: a managed identity is like an ID card that every Azure resource gets by default. Say we've created Azure Synapse Analytics — let me draw it so the picture stays in your head and you're not silent when an interviewer asks this
            • 219:30 - 220:00 question. When we create an Azure Synapse Analytics workspace, by default it gets this credential — a little ID card — known as a managed identity, or system-assigned managed identity. It comes automatically, like a free combo: create a resource, get a free ID card with it. Now we just need to assign a role to that ID
            • 220:00 - 220:30 card. Choose Managed identity, then click Select members. Which member do we pick? The ID card of our Synapse workspace — so choose Synapse workspace as the type, search for your workspace name, and select it; mine is awproject-synapse,
            • 220:30 - 221:00 and yours will have a different name, so don't worry. Click it, click Select — again, this is the managed identity — then Review + assign. It shows the role as added, but it can take 10 to 15 minutes to propagate, so if you still can't access the data, relax — have a coffee or tea
            • 221:00 - 221:30 (or a protein shake for the gym lovers) and wait those 10 to 15 minutes. Now let's jump back to the Synapse workspace: it has access, and we can use the data stored in our data lake — specifically the data in the silver container. That was all about granting access to our Synapse
            • 221:30 - 222:00 workspace. Let me close this pipeline tab — discard changes, because we don't want to save it; we're starting from scratch. Go to the Develop tab, which I call the scripts area because that's where we create scripts. Click the + sign, and it asks
            • 222:00 - 222:30 what you want to create: choose SQL script. This is our SQL query editor, where we'll write SQL scripts and query our data — and everyone loves SQL; it's the bread and butter of every data professional. So, first question: what's the first thing we need if we want to query or
            • 222:30 - 223:00 store data in SQL? A database — inside it we'll create the tables and views we query. So let's create the database first. You'll see there's already a master database, the default one, but we'll create our own. You could use the command CREATE DATABASE followed by the database name,
            • 223:00 - 223:30 but we'll use the UI, since we're learning Synapse Analytics and it's good to see the workspace rather than only writing code. If you open the database drop-down you'll see nothing but master for now. To create a database, go to the Data tab, click the + button, and choose the option to create a SQL database (if you'd rather use the command, a small sketch follows below).
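            If you do want the command route instead of the UI, a minimal sketch — the database name matches the one created a little later in this walkthrough:
```sql
-- Create the serverless SQL database and switch to it.
CREATE DATABASE awdatabase;
GO
USE awdatabase;
```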
            • 223:30 - 224:00 Select SQL database — and now is the time to discuss this: in Synapse Analytics we get two options, serverless and dedicated. What's the difference between a serverless SQL pool and a dedicated SQL pool? Let me explain it in
            • 224:00 - 224:30 plain language. A dedicated SQL pool is the traditional way of storing data, where the data actually resides inside the database — like what you've used with MySQL Workbench, PostgreSQL, or Microsoft SQL Server. The difference is that a dedicated pool uses distributions, compute nodes, and a control node (don't worry about those details now), so
            • 224:30 - 225:00 it's essentially a traditional database, but in the cloud and optimized for query reads, big data, and data warehousing — you can configure distributions (hash, replicated, round-robin) and partitions. So what is serverless, then? You've probably heard the word lakehouse, where we don't actually
            • 225:00 - 225:30 store the data in the database. Let me draw it, because to understand serverless you first need the lakehouse concept. Say this is your data lake, with files — your data — inside it, and you want to create a database on top, a serverless database. What does it
            • 225:30 - 226:00 do? Say the data is in CSV format — let me write CSV here (excuse my handwriting; I used to get scolded for it at school every single day). So we have CSV data in our data lake, and I want a database through which I can query that
            • 226:00 - 226:30 data — I want to run SELECT * against that file — but I do not want to physically load and store the data inside the database. I want the data to stay in the data lake while I still run SELECT statements over it. That's where the lakehouse concept comes in:
            • 226:30 - 227:00 the data keeps residing in the data lake, but on top of it we create an abstraction layer — a metadata layer that stores the metadata: the columns, headers, all the information about the data. So whenever I query this data — let's say this stick figure is me (I'm a little smarter than the drawing) —
            • 227:00 - 227:30 when I query this database, I write SELECT * FROM my table, and the engine applies that stored metadata (the columns) to the data sitting in the data lake. It feels like I'm querying a traditional database, but really I'm querying the metadata, and the
            • 227:30 - 228:00 serverless SQL pool does all the work behind the scenes: it pulls the data, applies the metadata layer, and returns the result to me. It looks like magic, but it's the lakehouse concept: we keep using the data lake while making it behave like a data warehouse. That's why, when we
            • 228:00 - 228:30 combine a data warehouse with a data lake, we call it a lakehouse. Understood? If not, rewatch this part, because it really matters. To summarize: our data resides in the data lake. Why do we want that? To save cost — storing data in a data lake is very cheap compared to storing it in
            • 228:30 - 229:00 databases, and data is growing exponentially, so we don't want to spend heavily on database storage. That's the big shift in data engineering, and it's why every company is trying to implement the lakehouse concept — but it's genuinely hard to implement well, which is why I'm sharing these insights, so you can become
            • 229:00 - 229:30 an efficient developer and help organizations build efficient lakehouse solutions. In upcoming videos I'll share much more about lakehouses; this is just an introduction. So that's the lakehouse concept and the serverless pool: it's serverless, so it scales up and down automatically, it supports the lakehouse approach, and it does not store data physically in the database. I hope you
            • 229:30 - 230:00 understand the concept now, so let's finally pick serverless. I wanted to explain this in detail because this channel isn't just about dragging and dropping things — it's about real education and real knowledge, so you can become an outlier in the crowd, an efficient developer, and succeed in your career, because
            • 230:00 - 230:30 everyone knows the basics — everyone knows what Synapse Analytics is and how to create a table — but this kind of knowledge is limited to far fewer people, and I want you to be able to answer all the questions in this area. That was the lakehouse; now let's pick serverless and finally create the database. I hope
            • 230:30 - 231:00 you liked the concept — let's start. Oops, I need to click + and SQL database again: choose Serverless, and I'll name the database awdatabase. Click Create, and it creates the database for us. Now if
            • 231:00 - 231:30 you open the Use database drop-down you should see awdatabase — there it is — so select it. From now on, whatever we create will live in this database. Next it's time to actually read the data stored in the silver layer, and for that we have a powerful function called OPENROWSET.
            • 231:30 - 232:00 OPENROWSET lets us apply that abstraction layer to data residing in the data lake. Let me show you how to pull the data stored in the silver layer; then we'll create views on top of it and use that data in Power BI. So now it's time to actually
            • 232:00 - 232:30 write the OPENROWSET query in our SQL script — but there's one small piece of prep first. Remember that we assigned the Storage Blob Data Contributor role to our Synapse workspace; now we need to assign one more role, not to the workspace but to ourselves. Whenever we query data residing in the data lake, we as users also need permission to
            • 232:30 - 233:00 access that data. It's the same steps as before — you've already done this twice, so it should only take a few seconds. Let's assign ourselves the role and then use the OPENROWSET function. Go to the portal Home tab, open your storage account, click Access Control (IAM), and you'll see Add:
            • 233:00 - 233:30 click it, then Add role assignment, and search for Storage Blob Data Contributor — Contributor, not Reader. This time we don't pick Managed identity, because there isn't one involved — we are users — so pick the user option, click Select members, select your own email ID, and that's it: Review +
            • 233:30 - 234:00 assign, and the role is added. Again, allow up to 10 to 15 minutes, because role assignments can take time to apply, whether for a user or a managed identity. If you see an error saying the data cannot be listed, it just means you need to wait and refresh. Now let's jump back to our Synapse workspace — we're ready to
            • 234:00 - 234:30 actually use the OPENROWSET function. Here's how it works: write SELECT * FROM OPENROWSET(...), and inside it pass two parameters. The first is the location to read the data from,
            • 234:30 - 235:00 called BULK, followed by the URL in single quotes. How do we get the URL? Easy: go to your storage account, open the containers, and let's read the very first table, Calendar — open that folder.
            • 235:00 - 235:30 You'll see a file with a long name starting with part-0 — that's the default name Parquet chooses, so don't worry about it, and the other files are just bookkeeping; the main file is that one. Click its three dots, choose Properties, and there's the URL — copy it and paste it into the query. Here's the catch: when we read data with
            • 235:30 - 236:00 Synapse we don't need the full URL, meaning we don't need the file name — that's exactly why we created a dedicated folder per table — so remove the file name and keep the path only up to the Calendar folder, close the single quote, add a comma, and define the FORMAT, which as we know is
            • 236:00 - 236:30 Parquet. One more catch: the copied location uses 'blob' in the hostname, but we created a data lake, not plain blob storage — the portal just shows the blob endpoint by default — so replace 'blob' with 'dfs' in lower case.
            • 236:30 - 237:00 Remember that — replace blob with dfs (it stands for Distributed File System); I'm repeating it so you don't slip up. Finally, OPENROWSET needs an alias — call it anything, query1, query2, whatever. When I run this I'm actually querying data residing in the data lake, but I get it back
            • 237:00 - 237:30 in tabular form, exactly as if I'd queried a SQL table — that's the magic of this powerful OPENROWSET feature. Let me hit Run... and there's the data. Perfect — a cleaned-up version of this query follows below.
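            A minimal sketch of that query; the storage account name is a placeholder and the folder name assumes a Calendar folder in the silver container:
```sql
-- Query the Calendar folder in the silver layer straight from the data lake.
-- Note the dfs endpoint (not blob) and the folder path without a file name.
SELECT
    *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/silver/AdventureWorks_Calendar/',
        FORMAT = 'PARQUET'
     ) AS query1;
```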
            • 237:30 - 238:00 So, as I said, you can query data residing in the data lake and get the result back in tabular format — that's the abstraction layer this function creates. This is the lakehouse; some call it a logical data warehouse — there are several names, but lakehouse is the popular one. Notice that you can also export
            • 238:00 - 238:30 the result, because managers always ask for Excel files: export it as CSV, JSON, or XML, as required. One more thing: just like the charts in Databricks, charts are available here too — this particular one isn't meaningful, but you can edit it and choose a line chart, bar chart, whatever you need. So you can build charts
            • 238:30 - 239:00 in Synapse Analytics as well. I know you're falling for Synapse now — it really is cool. This is what makes the approach so compelling, and it's why we prefer keeping data in the data lake: we can still query it, so why spend money storing the same data in databases? So that's
            • 239:00 - 239:30 the Calendar data handled. Now, how do we report this data in Power BI? Here's the plan: this is one data set, and we'll create views on top of it. You'll know views from SQL — a view simply stores the query, and whenever you query the view it runs that query against the underlying data. We'll build views on
            • 239:30 - 240:00 top of this query, keep those views in the gold layer, and that gold layer is what Power BI will use. First, let me set up the gold layer: I have a database, and I want to keep all my views
            • 240:00 - 240:30 in a gold schema, so I'll create one. I'll rename this SQL script to Create Schema, then write the code: CREATE SCHEMA gold, ending with a semicolon. Run it — done. Now let's create
            • 240:30 - 241:00 another script: click the three dots, New SQL script, and call it Create Views Gold, since these are the gold-layer views. The syntax is exactly the same as in standard SQL: CREATE VIEW, then the name prefixed with the schema, so gold.calendar,
            • 241:00 - 241:30 then AS — oops, I hit Shift+Enter by mistake, ignore that. So: CREATE VIEW gold.calendar AS, followed by our subquery, which was the SELECT * FROM OPENROWSET — let me rewrite it:
            • 241:30 - 242:00 SELECT * FROM OPENROWSET(...), repeating the same pieces: BULK with the location (copy-pasted, with the file name removed) and FORMAT =
            • 242:00 - 242:30 'PARQUET'. And we need the alias again — I'll call it query1. When I run this it creates a view inside my awdatabase, in the gold schema, and then I can query that view with a SELECT statement or, if I want
            • 242:30 - 243:00 to build reports on top of it, connect it to Power BI — which we'll see in a few minutes. Just remember one thing: whenever you run queries, double-check you're in the right database, because the default is master and you need to pick your own, awdatabase. Run it... perfect, our first view is done. Let me add a comment, since
            • 243:00 - 243:30 I'm planning to upload this script so you can refer to it. Comments start with a double dash: -- Create View Calendar, followed by a line of dashes as a separator. So our first view is complete; a consolidated sketch of the schema and view scripts follows below.
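            Putting those pieces together, a minimal sketch of the schema script and the first view — the storage account and folder names are placeholders:
```sql
-- Script 1: create the gold schema in awdatabase.
CREATE SCHEMA gold;
GO

-- Create View Calendar
-------------------------------------------------------------------
CREATE VIEW gold.calendar
AS
SELECT
    *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/silver/AdventureWorks_Calendar/',
        FORMAT = 'PARQUET'
     ) AS query1;
GO

-- Once the view exists it can be queried like a table:
-- SELECT * FROM gold.calendar;
```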
            • 243:30 - 244:00 Similarly, we need to create the other views, one for every table already sitting in the silver layer. The process is identical — we only change the location, which means changing the last folder name, since the rest of the path stays the same — plus the view name. So create all the other views along with me and let's finish them together.
            • 244:00 - 244:30 So I have finally created all the
            • 244:30 - 245:00 views, and I know you have too. I didn't do much: I copy-pasted the code, changed the last folder name in the location, and
            • 245:00 - 245:30 changed each view name accordingly. So all the views are created. Before anything else, save your work — I don't want you redoing anything — so click Publish all; it publishes and saves everything. Now let me show you what you've achieved: once it's published, click the three dots and create a new script. If we want to query
            • 245:30 - 246:00 the data now, we don't need OPENROWSET or any location. Pick the right database, then just run SELECT * FROM gold.customers — gold is the schema, customers is one of our views. Run it, and there's all the data, this time without any
            • 246:00 - 246:30 OPENROWSET or location. Why? Because the views live in the gold layer, so now you — and your managers, stakeholders, and data analysts — can access the data directly. So the gold-layer views are done, and next it's time to create external tables, because we normally create tables inside databases, and
            • 246:30 - 247:00 the difference between external tables and managed tables is very simple, let me quickly tell you. Say this is your external table and this is your managed table. With an external table, we keep the data ourselves, in our own storage; with a managed table, we do not keep the data files ourselves, the engine that owns the table does. So let's
            • 247:00 - 247:30 say we create a managed table in Databricks or any other environment: that environment stores the data for us. That is not the scenario here; in Synapse serverless we just create external tables, so in our case forget about managed tables, we will be creating external tables, simple. A quick illustration of the managed-versus-external distinction is sketched below.
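            As a side illustration only (this is Spark SQL as you might write it in Databricks, not part of this Synapse project), the contrast looks roughly like this; the table names and path are hypothetical:

            ```sql
            -- Managed table: the engine owns the files; dropping the table deletes the data.
            CREATE TABLE sales_managed (order_id INT, amount DOUBLE);

            -- External table: it only points at files in storage you own;
            -- dropping the table removes the metadata, the data files stay in the lake.
            CREATE TABLE sales_external (order_id INT, amount DOUBLE)
            USING PARQUET
            LOCATION 'abfss://silver@<your-storage-account>.dfs.core.windows.net/sales/';
            ```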
            • 247:30 - 248:00 To create external tables in Synapse Analytics we have to follow three steps: first we create a credential, then an external data source, then an external file format. Just three steps. Basically, with the credential we tell Synapse Analytics to pick up the data using managed identity; as we mentioned, we are allowing Synapse to read the data stored in the data lake via its managed identity. That is one kind of credential, because there are so
            • 248:00 - 248:30 many ways to authenticate, like SAS tokens, access keys, and managed identity, so we simply declare: pick the data using managed identity, simple. Now what is the external data source? You just saw that whenever we want to read the data we have to spell out the full URL. When we do not want to repeat that URL again and again, we create an external data source, and in that case we usually keep the URL up to the container level. Let's say I create an
            • 248:30 - 249:00 external data source for my silver container: I keep the URL up to the silver container, and the rest of the path goes in the table's location. That saves a lot of time and effort. Then what is the external file format? We have many file formats available, like JSON, CSV, and Parquet, so we declare that our data is stored in Parquet, CSV, or JSON file format,
            • 249:00 - 249:30 and for that purpose we create an external file format. Again, you do not need to memorize the code because it is all in the documentation, and honestly, once you practice enough you will just remember it; otherwise, do not worry, copy it from the documentation. I will show you both ways, because I remember the code now and I can show you the documentation as well. So let's quickly create a new script, just go here on the plus sign
            • 249:30 - 250:00 and click on plus, and we will name this script "external table". We will create just one external table, because we only want to show you how it is done, and then we will build some visualizations on top of this external table in Power BI. So it will be end-to-end learning for you: how to establish the connection and then build a few visuals. We will cover Power BI only briefly, just enough to see how to build the connection using SQL endpoints. What is
            • 250:00 - 250:30 a SQL endpoint? Just hold on, I'm coming to that. So first of all we will write CREATE EXTERNAL TABLE, okay, simple. As I just mentioned, we first need to create the credential, but there is a prerequisite: as you can see I am using the AW database, so first we need to create a master key for this database,
            • 250:30 - 251:00 and if you are familiar with SQL, if you have already worked with MS SQL Server, you will know about this. You just write CREATE MASTER KEY and then PASSWORD equals some password you pick; just make sure it includes a capital letter and a special character. You can see how to get
            • 251:00 - 251:30 the code for the master key by going on Google and typing "create master key SQL"; you will get the Microsoft documentation page with the whole syntax. Just copy it and paste it here; you do not need to remember it, because that is usually an admin task, but as a data engineer you should know it too. So remove the square brackets, and
            • 251:30 - 252:00 in the password, do not reuse the example password, create your own. Once your password is set we are good to go: select the statement and click Run. I have already run this because I have set my password and cannot show it to you, so just create your own password, add special characters, and make it reasonably complicated. You do not need to remember it later, it will not be asked for again; it is just the database master key. A sketch of the statement follows.
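            A minimal sketch of that prerequisite, run once in the AW database; the password shown is a placeholder:

            ```sql
            -- Required once per database before creating a database scoped credential.
            CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Your$trongPassw0rd>';
            ```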
            • 252:00 - 252:30 So let's remove that statement from the script, and once that is done we are good to go with our credential, external data source, and file format; we'll be building everything from scratch, do not worry at all. As per the steps, we first need to create the credential, so we will write CREATE, not EXTERNAL, it's called DATABASE SCOPED CREDENTIAL, database scoped
            • 252:30 - 253:00 credential. What is it? Basically we create a credential because, when Synapse Analytics reads data from or writes data to the data lake, it needs some kind of credential, and there are several kinds available; one of them is managed identity, which is what we set up earlier, so we just need to tell it that we are using managed identity, simple. Let me quickly create one and
            • 253:00 - 253:30 give it a short name, simple, and then we just need to write WITH IDENTITY equals 'Managed Identity', simple. Now you will ask me where you can get this code from. Don't worry, just go on Google and type "credential" and
            • 253:30 - 254:00 "synapse", click the link, and you will see the documentation. Actually, let me search "database scoped credential" instead; click the first link and here you go, you can see the syntax, so you can copy it from there and paste it here, it's up to you, and obviously once you use this kind of code again and again you will just remember it, so no need to worry about that. We simply
            • 254:00 - 254:30 need to run it. Okay, this is done, we have successfully created the database scoped credential. A sketch of it follows.
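            A sketch of that credential, assuming the Synapse workspace's managed identity has already been granted access to the storage account; the name cred_mi is a placeholder, so use whatever you named yours:

            ```sql
            -- Tells serverless SQL to authenticate to the data lake with the workspace's managed identity.
            CREATE DATABASE SCOPED CREDENTIAL cred_mi
            WITH IDENTITY = 'Managed Identity';
            ```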
            • 254:30 - 255:00 Now the second step is creating the external data source. As I just mentioned, the external data source stands in for the URL: we point it at a container, and we create a new data source
            • 255:00 - 255:30 for each container, so we have to create two external data sources, one for silver, because we will read the data from silver, and one for gold, because we need to push the data to gold, simple. And the code is really simple, let me show you: just write CREATE EXTERNAL DATA SOURCE, and I will call it source_silver, and then I can
            • 255:30 - 256:00 say WITH LOCATION. Now I just need the location up to the container level, so I will go to my view script, because I already have the URL there, and copy it only up to the silver container, that's it, and provide it here. So now we are pointing our data source, which is called source_silver,
            • 256:00 - 256:30 at this particular container, and we do not need to write this URL again and again. You can relate it to the file-system mounts in Databricks, or think of it as a variable that will save us a lot of time; I love using data sources. Okay, we have assigned the URL, so now this data source will be going to the silver container, but it
            • 256:30 - 257:00 needs some credential, which is why we created one, and now we tell source_silver: carry this credential with you whenever you go to fetch data from the data lake. We hand the credential over to the external data source; that is a real-life analogy, and I love explaining concepts with real-life examples because you remember them for a longer period of time. So we simply say CREDENTIAL equals the credential we just created, and it is done; we simply run it. Okay, it is successful. I will copy this code because I need one for gold as well, so I will call it source_gold and just change the container
            • 257:00 - 257:30 name; the credential stays the same, because the one managed identity supports the whole Synapse workspace. So simply run this as well, perfect. Both data sources are sketched below.
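            A sketch of both external data sources; the storage-account name is a placeholder, and cred_mi is the credential name assumed in the sketch above:

            ```sql
            -- Points at the container level so later objects only need to supply the folder name.
            CREATE EXTERNAL DATA SOURCE source_silver
            WITH (
                LOCATION = 'https://<your-storage-account>.dfs.core.windows.net/silver',
                CREDENTIAL = cred_mi
            );

            CREATE EXTERNAL DATA SOURCE source_gold
            WITH (
                LOCATION = 'https://<your-storage-account>.dfs.core.windows.net/gold',
                CREDENTIAL = cred_mi
            );
            ```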
            • 257:30 - 258:00 Now it's time to create the external file format. Why do we create it? As we just mentioned, there are many file formats, so we simply declare that the data stored in the folder is in Parquet format. So we will write CREATE EXTERNAL FILE FORMAT, and I will call it format_parquet, simple, then again WITH, and then FORMAT_TYPE equals PARQUET, and we also add a data compression, because adding compression gives you better read performance. So
            • 258:00 - 258:30 where do we get that code? Simply go on Google, same thing, and type "external file format", click the very first link, and you will see the documentation; scroll down until you find the section
            • 258:30 - 259:00 on compression. Here we have data compression, and it lists the codes for all the supported compressions. For Parquet we have two options, but I personally like this one because it is recommended by Microsoft as well: the Snappy codec. So we just put DATA_COMPRESSION equals that value, and once that is done we are going to run this code as
            • 259:00 - 259:30 well. Oh, I didn't close the parenthesis, sorry; just run it, okay, perfect. The full statement is sketched below.
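            A sketch of that file format; the name format_parquet matches the naming assumed above, and the Snappy codec value is the one listed in the Microsoft documentation:

            ```sql
            -- Declares how the files backing our external tables are encoded.
            CREATE EXTERNAL FILE FORMAT format_parquet
            WITH (
                FORMAT_TYPE = PARQUET,
                DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
            );
            ```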
            • 259:30 - 260:00 Now, finally, after doing all three steps we are ready to create the external table, but before that I would like to show you what happens behind the scenes, because this can come up in interview questions, and it is the kind of deeper knowledge that not everyone has, so it can present you as an outlier in the crowd, and that's what we want; the world is competitive and you need to be competitive enough too. So the thing is, let's say we have this silver container and this gold container, and we have data
            • 260:00 - 260:30 stored in the silver container, and we have already created a view on top of it, yes, we know. So what happens with the external table? First of all, we will be using CETAS; don't worry, I will tell you what that is, it is a really important concept in the world of data and in Synapse especially. What we will do is use this view, yes, and we
            • 260:30 - 261:00 will create an external table that pushes this data into the gold layer. How? I will show you the code: it pushes the data into the gold layer using that statement and then creates an external table on top of that data. That is the architecture, and we achieve it using something called
            • 261:00 - 261:30 CETAS. What is CETAS? The full form is Create External Table As Select. You may already be familiar with the AS SELECT part, because it exists in regular, traditional SQL databases too, the way we use subqueries. So we will build a table from a SELECT statement, and since we already have the view built, we do not
            • 261:30 - 262:00 need to write out the subquery; we can simply write SELECT * FROM the view. Now you can relate it to the SQL concepts you already know; that's why we say SQL is the backbone for any data professional. And do not worry, I will show you everything from scratch, so if you cannot relate it now, you will in just a few minutes, let me show you. First I will create a nice clean comment for you so that you can refer to this script, because I'm
            • 262:00 - 262:30 planning to upload the script for your reference, it's just for your help. So write a comment like "create external table ext_sales", simple. The code: we simply write CREATE EXTERNAL TABLE, and the table
            • 262:30 - 263:00 name will be extsales, and I will save it in the gold schema, simple. Then we write WITH, and we need to give three things. The first is the location, because, as we just saw in the architecture, the table needs data stored at a location, so it needs the location URL; but we have created the external data source, so we do
            • 263:00 - 263:30 not need the whole URL, we just put the folder name, and the folder name will be extsales, simple. Then, since we haven't put the complete URL in the location, we need to add the external data source; now you can see why we did those three steps before completing the external table, because they are required as parameters here. The external data source
            • 263:30 - 264:00 name was source_gold, right, so source_gold, then we need the file format, and the file format was format_parquet. So this is my external table. Is the work done? No, because this only declares where the table will live; but from what data? What is the query, what is the AS SELECT statement? Let me tell you: the AS statement
            • 264:00 - 264:30 is SELECT * FROM gold.sales. What is that? It is a view that we created, remember, in the create-views script under the gold schema you will see sales, here, see. So what exactly are we doing? We are reusing that query, this
            • 264:30 - 265:00 one, see, and instead of repeating the query we already have the view, so we can use the view directly. That is the power of views, and that's why I always say create views, because you never know where you will reuse them; as a developer you should understand why we are doing this. Just run this command and it will create the external table for you. You do not need to write any column
            • 265:00 - 265:30 definitions or data types; they are picked up automatically, because everything is inferred from the SELECT over the view built on OPENROWSET. Just click Run, it should take only a few seconds, and it will be done. Yes, you have successfully completed your external table; the full statement is sketched below.
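            A sketch of the full CETAS statement, reusing the data source, file format, and view from the sketches above (all of those names, and the extsales folder, are assumptions, so match them to your own objects):

            ```sql
            -- CETAS: writes the query result as Parquet into the gold container,
            -- then registers an external table over those files.
            CREATE EXTERNAL TABLE gold.extsales
            WITH (
                LOCATION = 'extsales',          -- folder created inside the gold container
                DATA_SOURCE = source_gold,
                FILE_FORMAT = format_parquet
            )
            AS
            SELECT * FROM gold.sales;           -- the gold-layer view over the silver data
            ```

            After it runs, a plain SELECT against gold.extsales returns the same rows, and the Parquet files it wrote should be visible under that folder in the gold container.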
            • 265:30 - 266:00 Now let me query it, SELECT * FROM gold.extsales, and I should see the data. Yes, we have all the data. Now you might ask: I was seeing the same output from the view, and I'm seeing the same output from the table, so what's the difference, and why did we create the external table? Because with a view we didn't store the data, it was just a query, and we know
            • 266:00 - 266:30 that when we create a view we store the query, not the result, not the data. But with the external table we have the data. Where? In our data lake, in our gold layer, in our folder, so we can refer to it later; we have the power to retain the data. That is the difference between an external table and a view. Now I will take you to my storage account and show you that the data is there in
            • 266:30 - 267:00 Parquet format. This is my storage account; simply go to your gold container, click on it, and see the magic: you have the data. It has written the data into the gold layer using CETAS, and CETAS is a really powerful command. When I open it you will see the data stored in Parquet format, this is our data file. You did it, you finally did it, and
            • 267:00 - 267:30 now it's time to cover one more important step, which is establishing a connection between Synapse and Power BI. Most of you will be thinking, why do we need to cover Power BI? We are not really covering Power BI, only one aspect of it, because when you deliver this data you are the owner of the data and responsible for distributing it, and you are responsible for establishing the connection between the Synapse
            • 267:30 - 268:00 workspace and Power BI. Obviously a data analyst can also set it up, but you should assist them if they need help, because you own this data and need to distribute it efficiently. Being a data engineer is not just about building pipelines, it's about serving the stakeholders, your downstream consumers, your manager, all the entities using that data. So we will cover Power BI only from that perspective, establishing a connection, that's
            • 268:00 - 268:30 it. We will not be building relationships or elaborate visualizations; I will just show you one or two visuals, nothing more than that. So now it's time for Power BI: go on Google and type "Power BI download", click the first link, click the download button, and then
            • 268:30 - 269:00 tick the boxes and download it. Once it is installed you will land in the Power BI Desktop application. So without wasting any time, let's build that connectivity. Which technology will we use? I will show you once we land in Power BI Desktop: SQL endpoints. Do not worry, we will cover what a SQL endpoint is and where to get it; I will tell you everything in just a few seconds, let's
            • 269:00 - 269:30 see. Welcome to Power BI; this is Power BI Desktop, the home page, a blank report. It's time to complete the final phase of the project, establishing the connectivity so that we can deliver our data. The thing is, let's say this is our Synapse workspace and this is our
            • 269:30 - 270:00 Power BI. If we want to establish a connection between Synapse and Power BI, we need something called a SQL endpoint. What is that? Basically, for the SQL database we have within the Synapse workspace, there is something called a SQL endpoint, and I will show you
            • 270:00 - 270:30 where to get it. We need to copy it and use it to establish the connection from Power BI, and once we plug in that endpoint we can see our database, our external tables, our views, everything, and then we can pull that data into Power BI Desktop and build the reports, simple. Let me show you how to get it: go to your Azure portal and click on your Synapse
            • 270:30 - 271:00 workspace, and once you open it you will see this overview area. If you look closely you will see something called the serverless SQL endpoint; there is also a dedicated SQL endpoint, but we should not use that, because we did not go for a dedicated SQL pool. So copy the serverless SQL endpoint, since we are using serverless SQL; I will just go here and
            • 271:00 - 271:30 click on copy to clipboard. Once it is copied, go back to your Power BI Desktop. Now it's time to actually pull the data. How? In the ribbon, click Get Data, then click the More button at the bottom, because we need to find the connector that is specific to Azure. As you can see, we have Azure connectors: click on that, then pick Azure Synapse Analytics,
            • 271:30 - 272:00 because we want to connect to our Synapse Analytics workspace. Click Connect, and then you just need to paste the SQL endpoint into the Server box; the database is optional. Click OK and wait a few seconds, and here you go, you are seeing your database. Okay, let me just mention that sometimes it asks for credentials as
            • 272:00 - 272:30 well. If you are logging in to Power BI for the first time, when you click OK it will ask you to provide credentials; on the left-hand side it will look like this, let me just draw it for you, and on that side you will have some options like Windows, Database, and so on. Pick the Database credentials. Now you will ask me what to put in the username and password: just go back to when you created your Synapse Analytics workspace and entered an
            • 272:30 - 273:00 admin username and password. I entered my admin name and my password, so you just need to enter your own admin and password, that's it. Once you sign in, you will land on the page where you can see your database, the AW database, and here you can see we have views and we have external tables. So let me click on the table that we created, gold.extsales, and click
            • 273:00 - 273:30 Load, because I want to load only this external table. Now it will load, create the connection, and run everything behind the scenes; we do not need to do anything. Congratulations, I'm not kidding, this was not easy. Now you can see that our gold extsales table is here in front of us, we have all the columns here, and this data is not residing on our machine, no, it has
            • 273:30 - 274:00 pulled the data directly from Synapse, from Azure, from the data lake. As we just mentioned, we will not be covering the visualizations in depth, because obviously you will build your own dashboards based on the requirements, the KPIs, the business areas, your manager's asks; all of that matters when building visualizations. So we will not go deep, but just for your experience I
            • 274:00 - 274:30 would love to draw one or two visuals. Let's say I want to see a trend: I can pick this line chart, I just clicked on the canvas and then on the line-chart button, then increase its size. If we want to add data to the axes, you can see the field wells here: x-axis,
            • 274:30 - 275:00 y-axis, secondary y-axis, legend. I will put order date on my x-axis, then order number on my y-axis, simple. So this is a trend I can see right now, and I am getting this trend directly on top of the data residing in the data lake. That is visualization one, where I just want to see the
            • 275:00 - 275:30 trend. Now let's create another one. Power BI is one visualization tool you can use; people also prefer Tableau, and I think there is Google Data Studio as well, I'm not sure because I have mostly used Power BI, and these are among the top tools available in the market right now. So let's say this
            • 275:30 - 276:00 time I want a simple KPI, so I can pick the visualization called a card, click on it, and it gives you a card. Let's say I want a KPI like a total count: I will click on it, and in the fields I will simply drop in the customer key and calculate the total number of customers. See, we have about 56,000 customer IDs, and we can
            • 276:00 - 276:30 obviously modify it. If you want to modify it you will see a little brush icon here, this one; click on it and you will see the options you can leverage for filters and formatting, and then you have format options here for size, style,
            • 276:30 - 277:00 and title. Let's say I want to give it a title: I can turn it on and add any title, say "Total Customers", and you can bold it, center it, increase the size, change the font, you can do a lot of things. So that is another visual, and let me create just one more, that's it, and then we are good to complete this course. So let
            • 277:00 - 277:30 me create an area chart, I think this is the area chart, yeah, this one, and this time I want to show the data with stock date on the x-axis and my customers on the y-axis. So this is it; this is the kind of dashboard we have created, do not judge us, because we have
            • 277:30 - 278:00 built it in just a few seconds, just drag and drop. Obviously, if you want to play with Power BI, go play with it; there are so many options, you can change the colors, change the area fill, add legends, everything; you can do wonders with Power BI, so it's up to you. Now it's time to quickly review the progress that we have made so far.
            • 278:00 - 278:30 Okay, congratulations to everyone who has completed this project. Let me give you a brief recap of what we have done. Phase one was loading the data into the bronze layer. Phase two was picking the data up from the bronze layer, applying the transformations and analysis, and pushing it to the silver layer. Then we pulled the data from silver, pushed it to gold, and established the connection with Power BI as well. That's all for this project, so just
            • 278:30 - 279:00 tell me in the comments how it was, and if you genuinely loved it, tell me that too, because I'll be creating more and more end-to-end data engineering projects for you, and you can expect great stuff in the future. Let me give you a hint: most of those projects are already lined up; they are already uploaded to YouTube and scheduled, coming your way in the near future. Lots of tutorials are on the way, so hit the subscribe button to motivate me, and I
            • 279:00 - 279:30 will help you achieve your goals and your dreams, and obviously for free, because I want to give you the best quality of education and data engineering skills at no cost; you do not need to pay anything. See you soon, bye-bye.