Hands-On Healthcare Data Engineering with Azure

End to End Azure Data Engineering Project by Sumit Sir

Estimated read time: 1:20


    Summary

    In this engaging video tutorial, Sumit Mittal presents an end-to-end Azure Data Engineering project. He covers practical use cases set in the healthcare domain, outlining how Azure's data engineering stack is employed. This project is uniquely designed to give viewers a comprehensive understanding of handling real-world data engineering tasks in Azure, which is beneficial for job interviews and day-to-day work in the industry. The project involves working with healthcare revenue cycle management and aims to create a pipeline involving EMR data, claims data, NPI and ICD codes, and more, ensuring a robust understanding of both technology and domain-specific knowledge.
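
    The summary mentions pulling NPI (provider) and ICD (diagnosis-code) reference data from public APIs. As a rough illustration only, the sketch below shows what such a pull could look like in Python; the endpoint, parameters, and response fields are assumptions modeled on the public CMS NPI Registry API, not the exact call used in the video.

```python
import requests

# Hypothetical illustration: query the public NPI Registry API for provider records.
# The endpoint and parameters are assumptions for this sketch, not the video's exact call.
NPI_API = "https://npiregistry.cms.hhs.gov/api/"

def fetch_providers(state: str, city: str, limit: int = 10) -> list[dict]:
    """Return a small list of provider records (NPI number, name, organization)."""
    # The registry generally expects more than one search criterion, so pass state + city.
    params = {"version": "2.1", "state": state, "city": city, "limit": limit}
    resp = requests.get(NPI_API, params=params, timeout=30)
    resp.raise_for_status()
    providers = []
    for item in resp.json().get("results", []):
        basic = item.get("basic", {})
        providers.append({
            "npi_id": item.get("number"),
            "first_name": basic.get("first_name"),
            "last_name": basic.get("last_name"),
            "organization_name": basic.get("organization_name"),
            "last_updated": basic.get("last_updated"),
        })
    return providers

if __name__ == "__main__":
    for provider in fetch_providers(state="CA", city="Los Angeles"):
        print(provider)
```

    An ICD-code lookup would follow the same requests pattern against whichever public ICD API the project uses, with the responses landed in the data lake alongside the EMR and claims data.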

      Highlights

      • The project is set in a healthcare context, enhancing relevance for industry professionals. 🩺
      • Focuses on EMR data and healthcare revenue cycle management. 💼
      • Includes practical implementations using Azure SQL DB, Data Factory, and Databricks. 🚀
      • Learnings are applicable for job interviews and daily work scenarios in data engineering. 📈
      • Emphasizes creating scalable, efficient, and secure data solutions in Azure. 🔍

      Key Takeaways

      • Get hands-on with Azure data engineering in a real-world scenario! 🌟
      • Learn to build end-to-end data pipelines in the trending healthcare domain. 🏥
      • Master the art of creating metadata-driven pipelines in Azure (see the sketch after this list). 💡
      • Dive into SCD Type II and common data models to handle data changes efficiently. 🔄
      • Discover best practices with Azure Key Vault and data catalogs for secure and efficient data management. 🔐
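
      The metadata-driven takeaway above deserves a concrete picture. The configuration format used in the video is not shown in this excerpt, so the snippet below is only a minimal sketch, assuming a config list that names each source database, table, and load type and drives the ingestion loop instead of hard-coding every table.

```python
# Minimal sketch of metadata-driven ingestion; all names and fields here are
# illustrative assumptions, not the project's actual config file.
ingest_config = [
    {"database": "trendytech-hospital-a", "table": "dbo.patients",     "load_type": "full"},
    {"database": "trendytech-hospital-a", "table": "dbo.transactions", "load_type": "incremental",
     "watermark_column": "ModifiedDate"},
    {"database": "trendytech-hospital-b", "table": "dbo.patients",     "load_type": "full"},
]

def build_extract_query(entry: dict, last_load_date: str) -> str:
    """Build the source query for one config entry (full vs. incremental load)."""
    if entry["load_type"] == "incremental":
        return (f"SELECT * FROM {entry['table']} "
                f"WHERE {entry['watermark_column']} > '{last_load_date}'")
    return f"SELECT * FROM {entry['table']}"

for entry in ingest_config:
    print(entry["database"], "->", build_extract_query(entry, last_load_date="2024-01-01"))
```

      In the actual pipeline this loop would likely be an Azure Data Factory ForEach over the config rather than plain Python, but the idea is the same: adding a new table means adding one config entry, not building a new pipeline.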

      Overview

      Hello and welcome to an insightful session on Azure Data Engineering with Sumit Mittal. Diving into the essentials of working within Azure, this video leads you through constructing a full-fledged data pipeline set in the lively and ever-resilient healthcare industry. Learn not only the technical skills but also gain domain knowledge crucial for data engineering roles.

        This video takes you through setting up and handling EMR data, which is pivotal in managing healthcare revenue cycles. You'll explore how the Azure ecosystem, including SQL DB, Data Factory, and Databricks, is leveraged to create a comprehensive and robust data pipeline. This session is tailored to equip you with practical skills for interviews and real-world applications.
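
        The overview names Azure SQL DB, Data Factory, and Databricks as the core pieces. As a hedged sketch only (the video's own ingestion is expected to go through Data Factory, and the server, database, and secret names below are placeholders), this is roughly how one EMR table could be read into a Databricks notebook over JDBC:

```python
# Assumes a Databricks notebook, where `spark` and `dbutils` are predefined.
# Server, database, and secret-scope names are placeholders for this sketch.
jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "database=trendytech-hospital-a;encrypt=true"
)

patients_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.patients")
    .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
    .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
    .load()
)

patients_df.show(5)   # quick sanity check of the EMR patients table
```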

          Sumit also delves into advanced concepts such as slowly changing dimensions and common data models, which help in effectively managing evolving data scenarios. With a keen eye for detail and industry-best practices, this tutorial will refine your capabilities in managing data securely and efficiently with Azure.
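
          Since the overview highlights slowly changing dimensions, here is a minimal SCD Type 2 sketch in Delta Lake, assuming a Databricks environment (where `spark` is predefined) and illustrative table and column names such as silver.patients, row_hash, and is_current; the video's actual merge logic is not shown in this excerpt.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hedged SCD Type 2 sketch: expire changed rows, then append new current versions.
target = DeltaTable.forName(spark, "silver.patients")
updates = spark.table("bronze.patients_incoming")   # hypothetical staging table

# Step 1: close out current rows whose tracked attributes have changed.
(target.alias("t")
    .merge(updates.alias("s"), "t.patient_id = s.patient_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.row_hash <> s.row_hash",
        set={"is_current": "false", "end_date": "current_timestamp()"})
    .execute())

# Step 2: append brand-new or changed records as the new current version.
current = spark.table("silver.patients").where("is_current = true").alias("t")
new_versions = (updates.alias("s")
    .join(current, F.col("s.patient_id") == F.col("t.patient_id"), "left")
    .where(F.col("t.patient_id").isNull() | (F.col("t.row_hash") != F.col("s.row_hash")))
    .select("s.*")
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_timestamp())
    .withColumn("end_date", F.lit(None).cast("timestamp")))

new_versions.write.format("delta").mode("append").saveAsTable("silver.patients")
```

          The row_hash column is assumed to be a hash of the tracked attributes, so only genuine changes open a new dimension version.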

            Chapters

            • 00:00 - 02:00: Introduction to Azure Data Engineering Project The chapter introduces an end-to-end data engineering project using Azure, developed in response to multiple requests. It aims to provide listeners with a comprehensive understanding of real-world project management, beneficial for both job interviews and day-to-day work in Azure.
            • 02:00 - 05:00: Project Overview and Transcript Introduction The chapter introduces Sumit, a professional with a master's from NIT and experience at companies like Cisco and VMware, specializing in Big Data. Sumit has been focused on training candidates in data engineering for the past five years. He offers an extensive 32-week Ultimate Data Engineering Program.
            • 05:00 - 19:30: RCM: Healthcare Revenue Cycle Management The chapter discusses a project based on Azure data engineering stack, highlighting its significance in the trending field of technology. It mentions the positive impact on various careers and lives, encouraging readers to check out the program through a provided link.
            • 19:30 - 24:00: Financial Aspects in RCM In this chapter titled 'Financial Aspects in RCM', the speaker discusses the increasing demand for developing projects around Azure data engineering. He emphasizes the importance of creating an end-to-end project using real use cases, highlighting its value in preparing for interviews as well as practical applications in the workplace. The project is intended to provide significant insights and aid in understanding the financial aspects pertinent to Revenue Cycle Management (RCM) within a healthcare organization.
            • 24:00 - 28:30: Overview of Accounts Receivable & Accounts Payable The chapter introduces the domain of Healthcare, emphasizing its current trendiness and relevance. It assures readers that even those not familiar with healthcare will gain a comprehensive understanding of the end-to-end pipeline concepts, which are also gaining popularity. Those already knowledgeable in healthcare will find the chapter relatable, as it discusses pertinent datasets.
            • 28:30 - 40:00: Metrics and KPIs in RCM The chapter introduces the concept of Healthcare Revenue Cycle Management (RCM), emphasizing its importance in the healthcare domain. The speaker, Sumit, addresses students or professionals, focusing on those working within the RCM domain. A worked KPI example appears after this chapter list.
            • 40:00 - 50:00: Azure Data Engineering Project Objective The chapter introduces the rising popularity of a comprehensive data engineering program, described as 'the ultimate data engineering program.' The speaker encourages readers to explore a link provided in the description for more details. The chapter promises to delve into the specifics of 'RCM' and offers a brief context on the domain to help with understanding the program's objectives.
            • 50:00 - 66:00: Dataset Description for Azure Data Engineering Project The chapter begins by introducing the project, which involves working on a problem statement related to the financial management processes that hospitals use, known as RCM (Revenue Cycle Management). The instructor emphasizes the importance of understanding the domain, particularly for those who are not familiar with it, and insists on explaining everything clearly and in simple terms to ensure comprehension.
            • 66:00 - 79:00: Project Architecture and Tools The chapter discusses the concept of Revenue Cycle Management (RCM) used by hospitals to handle financial aspects. It begins with the patient's journey from scheduling an appointment, exemplified by a scenario where someone with a viral fever visits the hospital.
            • 79:00 - 110:00: Pipeline Implementation with Azure Data Factory In this chapter, we delve into the implementation of pipelines using Azure Data Factory within a healthcare context. We discuss the lifecycle of a patient's appointment process - starting from booking an appointment, continuing through the patient's interaction with healthcare services, and concluding when the healthcare provider receives remuneration. We explore how Azure Data Factory aids in managing this flow, focusing on efficiency and streamlined operations to enhance the overall patient-provider engagement and financial processes.
            • 110:00 - 121:00: Using Azure Data Bricks for Processing This chapter covers the financial processes within healthcare where patients pay hospitals, and hospitals manage salaries and other financial aspects. It emphasizes understanding the end-to-end financial transactions between patients and hospitals, ensuring hospitals have a sufficient revenue supply to manage their operations.
            • 121:00 - 126:00: Working with Azure Storage Account The chapter titled 'Working with Azure Storage Account' focuses on providing a simplified breakdown of the process involved. It starts with the initial stage of a patient visit to a hospital.
            • 126:00 - 132:00: Setting up the Project in Azure The chapter titled 'Setting up the Project in Azure' discusses the process of collecting patient details in hospitals or clinics, focusing on the importance of collecting insurance details alongside other necessary information. These details are essential for setting up healthcare projects in Azure.
            • 132:00 - 310:00: End to End Pipeline Testing The chapter titled 'End to End Pipeline Testing' discusses the concept of a provider in the context of healthcare, noting that a provider typically refers to a hospital but can also mean a doctor in certain contexts. It emphasizes the importance of ensuring that healthcare providers know who is responsible for paying for services rendered. The terminology around patients and providers is aligned with domain-specific language common in the healthcare industry.
            • 310:00 - 314:00: Conclusion of Part One The chapter discusses the payment system in the healthcare sector, focusing on hospitals and doctors. It explains that the service provider should be informed about the payment responsibility, whether it's the insurance company, the patient, or both. It also mentions the possibility of shared payment, where part of the cost is covered by insurance and the remaining by the patient.
            • 314:00 - 320:00: Introduction to Azure Data Engineering Project Part Two The chapter "Introduction to Azure Data Engineering Project Part Two" discusses a hypothetical scenario related to healthcare costs. It uses the example of surgery costs in India, mentioning an estimated cost of three lakh. Then, it transitions to discussing this in the context of a U.S.-based project, considering the project as related to a U.S. hospital and mentions the cost implication of surgery in the U.S. The chapter sets the stage for understanding the financial aspects of the project in a healthcare setting.
            • 320:00 - 333:20: Part Two: Project Architecture and Overview This chapter delves into the financial dynamics between patients and insurance companies concerning medical payments. It outlines different scenarios where either the insurance provider or the patient might be responsible for payment. The chapter discusses potential variations in who pays based on specific insurance policies, highlighting a scenario where the total cost is 20,000 USD. In this example, the insurance company might cover 15,000 USD while the patient pays 5,000 USD, though other arrangements are possible depending on the policy.
            • 333:20 - 450:00: Bronze Layer Implementation This chapter discusses the initial steps of the healthcare process, beginning with the collection of patient information, including insurance details, during a patient visit. This data is gathered by the hospital or healthcare provider. The chapter then outlines how the hospital provides various medical services, with doctors administering treatment. All procedures and treatments performed are meticulously recorded by the hospital to maintain comprehensive patient records. This documentation serves as a foundational layer in healthcare management.
            • 450:00 - 495:00: Silver Layer Implementation The chapter titled 'Silver Layer Implementation' focuses on the testing of procedures and medications. It discusses how both doctors and hospitals provide services, with an emphasis on the billing process that follows. The hospital is responsible for creating a bill for the services provided.
            • 495:00 - 520:00: Gold Layer Implementation The chapter discusses the 'Gold Layer Implementation' focusing on the intricacies of billing and payment processes in medical services. It explains how services provided are recorded and sent to the insurance company to request payment, creating a bill either for the patient or the insurer. The chapter outlines the next steps after billing, emphasizing the review of claims to ensure accuracy and compliance with insurance policies.
            • 520:00 - 540:00: Enhancements and Best Practices The chapter titled 'Enhancements and Best Practices' discusses the process undertaken by insurance providers when reviewing claims. The insurance company examines the bill or invoice submitted to ensure it complies with their guidelines. They have established norms that dictate the amount they can pay and the type of surgery involved, among other factors. This ensures that the claim is processed according to their policies.
            • 540:00 - 550:00: Project Integration with Azure Key Vault In this chapter, the discussion revolves around the rules and regulations for integrating projects with Azure Key Vault. There is a focus on the payment approval process, where payments can be either approved, rejected, or partially paid based on compliance with set rules and the presence of errors or uncovered services. The importance of adhering to regulations is underscored to ensure that payments are processed accurately, whether in full or partial amounts.
            • 550:00 - 565:00: End to End Pipeline Execution and Linked Service The chapter discusses the process involved in payments and follow-ups in an end-to-end pipeline execution, particularly focusing on insurance policies. It explains how, after a decision is made regarding coverage based on existing insurance policies, the subsequent steps involve handling payments and managing follow-ups. Once the insurance company has paid its share of the expenses, the hospital may bill the patient for any remaining balance. This signifies that the insurance may cover a portion of the costs, but the patient might still have outstanding expenses to settle.
            • 565:00 - 571:00: Conclusion of Part Two The chapter discusses the process of payment responsibilities between insurance companies and patients within a hospital setting. It explains how a certain amount is paid by insurance companies while the remaining balance is to be covered by the patient. If payments are delayed or denied, the hospital will follow up with the patients to address these issues.
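
            As flagged in the metrics chapter above, the two accounts-receivable KPIs that the transcript later walks through (AR aged over 90 days and Days in AR) reduce to simple arithmetic. The worked example below re-computes them with the same round numbers quoted in the video.

```python
# Worked example of the two accounts-receivable KPIs discussed in the video,
# using the same round numbers it quotes.

# KPI 1: AR > 90 days -- share of total receivables sitting in the 90+ day bucket.
total_ar_usd = 1_000_000          # total accounts receivable
ar_over_90_days_usd = 100_000     # portion that is older than 90 days
ar_over_90_pct = ar_over_90_days_usd / total_ar_usd * 100
print(f"AR > 90 days: {ar_over_90_pct:.0f}%")           # 10%, compared against a benchmark such as 10%

# KPI 2: Days in AR -- how many days' worth of billing is still uncollected.
billed_usd = 1_000_000            # charges billed over the period
period_days = 100                 # length of the billing period in days
outstanding_ar_usd = 400_000      # receivables still pending
avg_daily_charges = billed_usd / period_days            # 10,000 USD per day
days_in_ar = outstanding_ar_usd / avg_daily_charges     # 40 days
print(f"Days in AR: {days_in_ar:.0f} days (e.g. benchmark: under 45 days)")
```

            A net collection rate works the same way: amount collected divided by the amount expected, so collecting 0.99 million USD out of 1 million USD billed is a 99% net collection rate.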

            End to End Azure Data Engineering Project by Sumit Sir Transcription

            • 00:00 - 00:30 hello everyone based on several requests that I was receiving over the past couple of months I thought of coming up with an end-to-end data engineering project and that too on the Azure side right so in this project you will get a complete idea on how a project is handled in the real world in the industry right and this should be really helpful for your interviews or even if you're working in Azure this should help you in your day-to-day work now I'm sure
            • 00:30 - 01:00 most of you would be aware of me my name is Sumit I completed my masters from NIT and then I've worked for companies like Cisco and VMware I majorly worked in the Big Data area for the last five years I have been training candidates in the data engineering space and I offer an ultimate data engineering program which is a quite extensive program a 32 weeks program and that is a program which has
            • 01:00 - 01:30 impacted several careers and has changed several lives right so you can check that program the link to the program is in the description without any delay let's get started with the project so what is this project all about so first thing that I will talk is the technology on which this project is based this project will be based on Azure data engineering stack right which is a very trending
            • 01:30 - 02:00 stack you know right so I was I was getting a lot of demand that okay sir can you develop a project around Azure data engineering so I thought of developing an end-to-end project taking a real use case and applying whatever I could right so this will give you a very very good idea when you are going for interviews and even if you do not think from an interview perspective when you are working in your companies this should be helping you a lot
            • 02:00 - 02:30 now what domain we will be working on right there are multiple domains Healthcare banking Retail Finance insurance right so this will be around Healthcare domain which is a very trending domain now even if you are not from this domain it's okay because you will get a complete idea about the end-to-end pipeline which is trending right but for people who are from Healthcare then you can relate with the data sets and
            • 02:30 - 03:00 all also so this project is around Healthcare revenue cycle management right commonly known known as RCM Healthcare revenue cycle management RCM right and I'm sure if some of you are my students working in healthcare domain many of you would be working on RCM domain by the way if you do not know about me my name is Sumit and I offer uh
            • 03:00 - 03:30 data engineering programs right now a very trending program is the ultimate data engineering program and in case if you want to know more about it you can check under the comment section I mean you can check under the description of this right where you will get the link to my website so anyways let's get started so what is this RCM and what exactly we have to do so let's understand about this domain a little bit so that you get a context on what we
            • 03:30 - 04:00 have to work on or what problem statement we are going to solve okay so RCM is the process RCM is the process uh that hospitals use to manage the financial aspect so I will first write it and then I will explain because I want to go little slow on the domain side because someone who is not from this domain should not get lost right so you should be able to understand it very well and I will try to talk in layman
            • 04:00 - 04:30 terms so RCM is the process that hospital hospitals uh used to manage the financial aspects Financial aspects right and the financial aspects start from the moment a patient schedules an appointment let's say if someone is going for a let's say someone is getting a viral fever someone visits
            • 04:30 - 05:00 a hospital from that time onwards till the time the doctor actually gets paid right so the end-to-end cycle right from the time the patient the patient schedules uh an appointment till the time till the time the provider
            • 05:00 - 05:30 gets paid provider in the sense you can think of a hospital or a doctor that means end to end that means first a patient pays to the hospital and the hospital internally will pay salaries and all and all of that stuff in between right so there is a lot I'll will talk about it but it's mainly to deal with this financial aspects about the patient paying the hospital and then uh hospital has enough of I mean revenue Supply so that they can manage their things
            • 05:30 - 06:00 so I I want to give a simplified breakdown for this so here is a simplified breakdown and I am writing this note so that these notes become handy for you right so how how this starts it starts with a patient visit it starts with a patient visit right the process starts with a patient visit so when a patient visits a hospital it can
            • 06:00 - 06:30 be a big hospital or it can be a small Clinic whatever is that their details the patient details like the insurance coverage or whatever details are there are collected so patient details are collected by the hospital or clinic now why these details are collected so mainly I'm talking about the insurance details here it will be other details also but insurance details is what is very very important and these details are
            • 06:30 - 07:00 collected now this ensures that the provider provider in the sense here the hospital so this ensures provider knows who will pay for the services and I'm trying to use the terms which are used in this particular domain so provider in this context will mean a hospital but in some context it will mean the doctor also so patient is a patient as you know but provider is
            • 07:00 - 07:30 something that you might not understand so I'm giving you a idea that context wise it will be either hospital or a doctor the one who is providing the services so this ensures providers knows who will pay for the services so who will pay either it will be insurance company who will pay or the patient will pay or both will pay that means some part the insurance will pay and some part you will pay because uh
            • 07:30 - 08:00 for example you might let's say go through or someone goes through a surgery and the bill for that in India let's say it is three lakh right or let's say if you talk about us even though that the project that we are building is a us- based project think that it's a us- based hospital that we are referring to so let's say that uh the cost of this surgery is uh let's say uh
            • 08:00 - 08:30 20,000 USD right now out of this let's say 15,000 USD U the insurance company will pay insurance provider will pay and 5,000 USD uh the patient has to pay right or otherwise might be insurance company will pay the entire amount or the patient will pay the entire amount it can be various scenarios based on the insurance policies and all that's why
            • 08:30 - 09:00 it's important to know the details of the patient that what insurance that person hold and all of that so it starts with a patient visit where the patient details were collected by the hospital or the we we could say the provider okay now second step is what hospital will provide the services services are provided right so the doctor provides treatment and the hospital keeps the record of everything done that is the
            • 09:00 - 09:30 test the procedures the medications so that means all the services are provided by the doctors and hospitals okay that's good third thing what happens the billing happens the billing happens okay in the billing what exactly will happen the hospital create a bill I'll write the hospital will create a bill right uh for the
            • 09:30 - 10:00 services provided and send it to the patient insurance company to request for the payment so they will create a bill and will send it to the insurance company if they are about to pay or if there is no insurance then the patient will pay whatever so they will create a bill and will give it to the patient or the insurance company fourth what will happen after this in this thing process claims are reviewed so if
            • 10:00 - 10:30 this goes to the insurance provider the claims are reviewed right the claims are reviewed that means the insurance company review the bill insurance company review the bill or invoice whatever it is to make sure it follows their rules because insurance company also has certain Norms that how much they can pay what surgery it was and they have lot of
            • 10:30 - 11:00 rules and regulations so it will see that whether it's following the rules or not so sometimes they approve the payment but other times they might reject or deny it due to whatever errors or uncovered services so it they might pay full so they might uh uh accept it or that means pay in full or partial
            • 11:00 - 11:30 or decline whatever based on whatever policies they have or whatever the insurance policy person holds right now after this what happens the payments and follow-ups will happen payments and followups that means once the insurance uh pays its share the hospital May Bill the patient for remaining amount that means it might happen that insurance company pay a
            • 11:30 - 12:00 certain amount and the patient pays for a certain amount so that will happen that once the insurance company pay it share the remaining will be asked from the patient itself and if payments are delayed or denied the hospital follow up for the issues right the hospital will follow up to the patient for the payment right so I hope you can understand that uh if the uh if
            • 12:00 - 12:30 partial payment is done by insurance insurance company or let's say no payment or partial payment is done by insurance company then some portion or full portion will be given by the patient then some portion or complete thing whatever is the case is given by the patient and the hospitals will follow and the
            • 12:30 - 13:00 hospital that means technically we say the providers will follow up will follow up for the payment that's what the followup means and then what will happen after this tracking and Improvement tracking and Improvement what does that mean so hospitals constantly monitor the process
            • 13:00 - 13:30 to ensure they are collecting payments efficiently and reducing mistakes like declined claims and all right so they do not want to lose business so they want to make sure they collect all the amount it's not that some uh patient become defaulter they are not paying and so on right so they want to track it properly follow up properly and want to improve the process in case if they feel they are losing money and all right so this is how the process is and people from
            • 13:30 - 14:00 Healthcare background would be aware but I want to give you a context because some of you will not be from Healthcare background right so in a nutshell if I have to say RCM ensures RCM ensures the hospital the hospital can provide quality Care Quality Care while also staying financially healthy
            • 14:00 - 14:30 financially healthy right that means financially healthy in the sense it should not happen that they mismanage things and the patient has to pay but that patient is not paying and all right they're not following up so RCM ensur that I mean the hospital stays financially healthy so that they can pay to their doctors they can pay for instruments and all of that right that's the whole purpose about RCM so I hope this is clear so it's about
            • 14:30 - 15:00 making sure everyone patients insurers right and the providers that means the doctors and all gets paid on time or whatever payment they have to do that's collected on time so basically as part of RCM as part of RCM uh we have two main aspects right we have two main aspects number one we have accounts receivable
            • 15:00 - 15:30 that means the payment that hospitals have to take accounts receivable right also called as AR that means the money that they have to collect from the patient or the insurance company whatever and then second is Accounts Payable that means whatever they have to pay to their doctors their staff and the instrument whatever they have to pay right but most important component among this is
            • 15:30 - 16:00 accounts receivable and whatever we will be doing in this project will be towards account receivable so that the hospital remains financially healthy right there are two parts always right getting money and paying money so as a hospital whatever money we are getting they are getting is accounts receivable whatever the money they are paying is Accounts Payable so we are more concerned about accounts receivable in this case so AR is the most important
            • 16:00 - 16:30 and a key Focus area and the project will be inclined more towards it now in this entire process you have to understand one thing that what's the risk factor what does Hospital feel is the risk factor is insurance company paying a risk factor no those companies that's dealing a kind of business to business and that will happen properly but the risk factor comes when patient has to pay from their pocket so
            • 16:30 - 17:00 patient paying paying uh is often a risk right that's a risk that means if patient has to pay the complete amount or a partial amount then that's a risk and if it the money does not get collected then it's kind of a lost business right so I'll talk about scenarios when patient has to pay scenarios when patient has to pay see a
            • 17:00 - 17:30 case when insurance company pays the entire amount that's the best for the hospital hospitals can even make more money put higher bills and all and the money will get paid on time but the main issue lies when either customer I should not say customer rather a patient has to pay in full or partial amount so let's understand the scenarios when patient has to pay so
            • 17:30 - 18:00 what generally happens these days this insurance uh providers right they offer a very fancy schemes saying oh now you can get an insurance in this much less amount but when we talk about low insurance right it looks fancy to the patients when they are getting that insurance but there are a lot of hidden terms and conditions because when you go with low insurance right these people shift most
            • 18:00 - 18:30 of the burden these insurance providers uh uh put most of the burden on patients itself that means in their Clauses it will be written that oh for this surgery half of the amount patient will only pay or there will be certain things which will be deducted right all of that so sometimes we do not have a look and then we think oh we are getting insurance at the low cost and let's take so nowadays because of those
            • 18:30 - 19:00 kind of insurance patients have to pay from their pocket and then that becomes a risk to the hospital right I hope this is clear now these low insurance plans are attractive at first but many consumers don't fully grasp the implications of their deductible until they get the bill simple thing I hope this makes sense to you now there can be some private
            • 19:00 - 19:30 clinics private clinics where I mean insurance they won't accept let's say right or some dental treatment dental treatment right Dental treatments are also very costly where insurance claims cannot happen let's say right there will be certain insurance policies where deductibles are there where some things like uh uh surgery material and all gloves and all the
            • 19:30 - 20:00 insurance companies do not pay right so all of that so such things are the times when um patient has to pay from their pocket right so this is the thing just have a look of what I have written so always the thing is hospitals want to have a healthy accounts receivable here that's the whole thing right they want to they want to make sure they bring the cash that means
            • 20:00 - 20:30 collect the money from the customer or the patient and two objectives for account receivable two objectives for account receivable right that is first of all the patient should not default the money right that means they should pay and moreover it's not just about paying but they should pay on time right so also minimize the collection collection period it should not happen
            • 20:30 - 21:00 that the patient is paying after five months because in five months let's say the money was supposed to be 10,000 USD after five months as per the infl inflation this 10,000 USD will be seen as less right so that's a loss too I mean that's a loss which you will not directly see but that's a loss right so that's the thing I hope this makes sense now uh as per whatever analysis that has been done by
            • 21:00 - 21:30 some organizations what they feel is the probability the probability of collecting your full amount full amount decreases with time that means if you are able to collect early that's better because if you're not able to collect in 12 months then there is no guarantee whether the patient will pay or not right so as per this whatever stats 93% of
            • 21:30 - 22:00 money right 93% of money due 30 days old that means that means whatever is under 30 days old you will be able to collect 93% of that 85% of money due 60 days old that means a hospital will be able to recover a approximately 85% of the money which is
            • 22:00 - 22:30 60 days old and as you go beyond 3 months let's say 73% of money due 90 days old so this clearly shows that as in when we the the patient kind of delays the process the chances of them paying becomes less right so I mean we can always calculate certain metrics in order to see
            • 22:30 - 23:00 for example the metrics or the key performance indicators KPIs to measure to measure uh account receivable and set Benchmark and set Benchmark I I will show you certain things right so there are certain key performance indicators or things that you can calculate to understand is are you having a healthy uh account receivable or not right so I will just
            • 23:00 - 23:30 tell three of those but there are many such kpis right account receivable greater than uh 90 days right I I'll tell what exactly that means I have a link actually I have a link I'll just show that to you uh this link I'll open so I have opened this link you can also check I will mention it in the
            • 23:30 - 24:00 description and I have taken all of this whatever I am talking from these uh places you can see even these numbers right whatever I was mentioning right the probability of them paying will become less right you see this 30 days still you can recover 60 days less 90 days even lesser right and account receivable greater than 90 days what does that mean accounts that have aged over 90 days are at higher risk for going uncollected you know that right that means someone who has not paid for
            • 24:00 - 24:30 90 days there is a chance that they will default right therefore it's very important to keep track of how much money is moving into this aged category right month on month so greater than 90 days equal to account receivable greater than 90 days that means how much money is there which is sitting in this bucket which is older than 90 days versus how much total account receivable that that means let's say let's say you have to
            • 24:30 - 25:00 collect you have to collect uh let's say a million 1 million USD right total account receivable for all accounts is 1 million USD just an example and out of that out of that there is uh 100,000 100,000 USD which is older than older than uh older than I would say 90
            • 25:00 - 25:30 days right then what does that mean that means 100K divided by 1 million into 100 right that's the percentage which is 10% 10% of your total money which is spending is older than 90 days right so your account receivable greater than 90 days become 10% you can see this I hope this you will understand you can check a lot right so there are a
            • 25:30 - 26:00 lot of such key metrics like this I hope this is clear and there in account receivable let's say I'll talk about another metric there in account receivable so uh consider uh consider let's say uh so I mean here let's say you set 10% Benchmark and now you see that let's say your 200K USD is older than 90 days that
            • 26:00 - 26:30 means if 200k that means 20% but your benchmark set was 10% so you will check with your team what's happening why it's going wrong so this kind of metrics that's why it's very very important so that account receivable is proper healthy now days in account receivable what does it mean let's say uh you have your hospital has collected or or raised bills of around uh I would say let's say uh 1 million USD in 100 days last
            • 26:30 - 27:00 100 days right so 1 million USD divided by 100 that means per day per day the collection is 10,000 USD right per day the collection will be 10,000 USD if I see that on an average right now if you see that your total account receivable the money which you have to get is let's say uh uh I would say uh let's
            • 27:00 - 27:30 say 40 or 400,000 USD that means if 400,000 USD is account receivable the money which you are about to get that means it's a money which you have generated in 40 days right so it is 40 that is days in AR is 40 days right days in AR because per day collection is 10,000 and 400,000 USD right 400,000 USD is
            • 27:30 - 28:00 pending just have a look 400,000 sorry 400,000 USD is pending so it is something which you your hospital would earn in 40 days so 40 days worth of money is pending right so you can set a benchmark let's say if this days in AR is under 45 days then it's fine but if it goes more then we will just check our processes are that fine or not right so this way there are lot of kpis right
            • 28:00 - 28:30 you can check that out if you see this days in AR net collection rate right you can see multiple things uh that okay you have to collect let's say uh uh one lakh or 1 million USD and you are able to collect uh almost 0.99 million USD that means around 1% went in a way that people are not paying it it's gone as bad debt right so
            • 28:30 - 29:00 that means your account receivable or net collection rate is 99% so that is good so you can have set your benchmarks here and there are many other metrics which you can see so you you can check this document very very interesting document and this will give you a clear idea on what benchmarks you can set what kind of kpis you can have in the end in order to make this process strong and there is another link also uh I will also keep this as part of your description of the video uh these
            • 29:00 - 29:30 two are very very informative documents I have collected all this information from here so you can see all these kpis are mentioned you can have a look and you will get more about the domain so that's about the entire domain which I would have to say right I hope this makes sense I'll close these documents now I'll close these documents now you should be good and you would
            • 29:30 - 30:00 have understood about the domain clearly right now as a data engineer what do we have to do what do we have to do what do we have to do as a data engineer we said we will be building a data engineering project so as a data engineer what role are we going to play in this that's what we have to understand so you know we will have we will have
            • 30:00 - 30:30 data in various sources and what data I will show you shortly I will show you that data shortly but we will have data in various sources now as a data engineer we need to create a pipeline you know data Engineers create a pipeline right so as a data engineer we would be creating a pipeline what that pipeline will do this pipeline we will create so that we end
            • 30:30 - 31:00 up creating facts and dimension table at the end the the result of this pipeline I'm not talking about what we will do in this pipeline but the result end result of this data engineering pipeline will be fact tables and dimension tables fact tables and dimension tables and these facts and dimensions will help the reporting
            • 31:00 - 31:30 team people who is doing the analysis and Reporting working on visualization those people can take these fact tables and can create reports and generate this kpis right will help the reporting team to generate the kpis just like I have mentioned uh you you remember some of the kpis account receivable which are more than 90 days pending that means you will say 20% of my money which is
            • 31:30 - 32:00 pending for the hospital is more than 90 days old right so you would have set up certain Benchmark but how do you calculate that 20% using this whatever tables we will generate in the end as part of this entire pipeline right so all these kpis can be calculated from this final tables that we will have so basically as a data engineer we need to enable the reporting team so that they can take these tables and do whatever stuff in order to get the kpis
            • 32:00 - 32:30 right right so let me now talk about the data sets that we will have so we have EMR data right so I will talk about the kinds of data right EMR data claims data and then we have the uh NPI uh NPI data and then we have ICD codes right ICD uh
            • 32:30 - 33:00 data I would say right I will show all of that to you so EMR data is very very important which is what is EMR I mean in case if you are aware of the AWS stack do not think that it is elastic map reduce right so this is a domain related thing and it stands for electronic electronic medical records right so nowadays everything is digitalized so whenever you go to a
            • 33:00 - 33:30 hospital right for the first time your details will be stored in a patient's table right doctor's detail will be stored in a so I'll write it so in EMR data you have multiple tables so we have let's say patients table where the patient details is are stored then we have providers table always remember providers is the doctor right so
            • 33:30 - 34:00 doctors' data will be stored in this that what doctor and in which this doctor is specialized all all the doctor related details are there in provider table and then we have department so every hospital will have certain departments like Ortho or apart from that Dermatology whatever right and then we have transaction table transactions transactions is I
            • 34:00 - 34:30 would say so there are two things encounter and transaction I'll explain you let's say you go to you are having a viral fever and you go to a hospital let's say right then it's a encounter they will Mark right that means the first first time you visit the hospital for that particular thing for this new thing that has happened that's an encounter and they will associate you with a doctor and all so that's the
            • 34:30 - 35:00 encounter now for that encounter right for that encounter there can be multiple transactions that you pay the doctor fee you pay for the medicines right so there can be multiple transactions related to that encounter right and that means whenever you go to hospital they give you a file right for that particular thing and then they keep on I mean adding to that file let's say you show to a doctor today tomorrow also you are showing then the
            • 35:00 - 35:30 file is the same so that that think it as a one encounter right even though you go for three continuous days it's just one encounter but there can be multiple transactions Associated you might be paying for let's say physiotherapy for some medicines for consultation there can be various transactions involved right so encounter is one for that particular case transactions can be many and other tables you understood patient provider Department right I I
            • 35:30 - 36:00 will show all of these tables also to you shortly so all of this EMR data is stored in a database all of this EMR data will be stored in a database in our case we will assume that this is stored in Azure SQL DB Azure SQL DB right and I will show you these tables now before we proceed so I will go to my
            • 36:00 - 36:30 Azure account uh you can create a free trial account or paid account if you use it wisely you will not spend much I have created uh database trendytech-hospital-a Azure SQL DB right SQL database now let me connect to this I will say query editor and I will just connect to
            • 36:30 - 37:00 this so this will be connected let's wait until we are connected to this database I want to show you what tables we have so basically this is a EMR database right where electronic medical records are stored and these five tables so data will be stored in form of these five tables patient related data doctor related data transactions encounter and department so now I mean before it loads
            • 37:00 - 37:30 uh think of it which tables will be bigger which will be smaller uh patients table there will be new patient so this this can still be a decent size providers will be a very small table because there will be hardly few doctors in a hospital department will be a very small table because there will be few departments only transactions will be a huge table encounter will also be a huge table table so the biggest table among this will be transaction sely and the second
            • 37:30 - 38:00 biggest would be encounter I would say then a patient and of course this provider and departments are smaller right so one patient over the year can have many encounters right uh that means this person might have fallen sick two times so two new files would be created for that person right but in each time that person would have done multiple transactions whatever right for
            • 38:00 - 38:30 this medicine for this physiotherapy consultation whatever it is okay so we are now logged in I already have these tables right think that this is at the hospital end and we have uh five table Department encounter patient provider and transaction now I will show you the data I'll show you the data so let me show you for Department it's
            • 38:30 - 39:00 very simple department will not be much Department ID and the department name emergency Cardiology neurology whatever right um encounters uh so before encounter I will show you patient the smaller tables relatively so we have patient ID first name last name this is a sample generated data by the way right uh middle name and uh SSN number just like we have Aadhaar and all
            • 39:00 - 39:30 right uh same way a unique number for a patient uh phone number gender date of birth and the address of the person and when this record was modified might be the address changes and all right so this record will be modified then then we have provider that means the doctors a provider id the doctor ID it means first name last name the specialization of the doctor the department for which this doctor works and the NPI so NPI is basically
            • 39:30 - 40:00 National provider identifier which is a 10 digit number right that uniquely identifies healthcare providers in the US right so this is a unique identifier for each doctor right and we can get it through a public API also so NPI is a number which uniquely identifies each doctor okay then let me show you encounter encounter ID patient ID right of course
            • 40:00 - 40:30 because for which patient this file is file number is right encounter date inpatient so there are various encounter types inpatient outpatient so basically one of them is when the patient has to stay within the hospital for night right uh like they take a bed and all and one is when they visit and come they come and visit on the same day right so inpatient outpatient and routine checkup telemedicine means a phone call right routine checkup means person comes
            • 40:30 - 41:00 who comes regularly just randomly they are coming they do not have any issues but they are just coming emergency you know right all of that you can see different encounter types that why this person has come provider id that means to which doctor they have been that means first time when you let's say have a fever and you visit a hospital which doctor I mean they have suggested and to which doctor you are going to show that's the provider id Department ID that means to which department you are
            • 41:00 - 41:30 showing right let's say uh General physician right and Doctor name and the procedure What will what will happen and the date on which This Record is inserted that's encounter now transaction so transaction ID transactions is related to encounter so for one encounter there can be multiple transactions right and of course the encounter is related to a patient provider id uh Department ID the doctor
            • 41:30 - 42:00 involved Department the visit date right so let's say you visit today uh right and the service can be the same day or next day see these dates are messed up because the data was uh created and these dates are mixed up ideally the service date should be equal to visit date or after right it cannot be before in this case you see before so that's wrong data generated but think it that service dates has to be equal to visit
            • 42:00 - 42:30 date or after visit date and the paid date also is same visit date or after that right on which date the it this transaction is paid the visit type was it a routine visit or emergency visit followup visit whatever amount right total amount for the bill the paid amount how much they have paid right amount type right uh so whether insurance company is paying copay means insurance is also paying and you the patient is also paying right both are
            • 42:30 - 43:00 paying some some amount right partial amount Medicare and Medicaid right there are two different things these mean so just like in India we have uh government some yojana right where they say five lakh scheme from government same in the US also it is Medicare and Medicaid one of them through the central agency one of them through state agency right so that's what these are different through government central government or state
            • 43:00 - 43:30 government right um and the claim ID payer ID procedure code ICD code right so I mean this ICD code basically it can be mapped to the description of it that means this code would mean some of the code let's say it means that okay this is let's say typhoid right so later this ICD code we can pull the data from an API and then this can be mapped to a particular description of
            • 43:30 - 44:00 disease that okay what exactly is the disease type of disease right I hope this is clear so ICD codes are a standardized system used by Healthcare Providers to classify all of this right so based on this code we will understand that what disease it is so we will see that how we will map to the actual disease right uh line of business Medicaid ID Medicare ID right if person is holding any of this uh right government
            • 44:00 - 44:30 IDs then that and the time when this record came so I've given you some idea about these tables so that you understand what kind of data we have as part of EMR now interesting thing there are two databases so EMR data let's say if you are from India who is watching this video then you understand that if you're staying in Bangalore um there is uh a chain of hospitals
            • 44:30 - 45:00 called as Manipal Hospital right hospital now let's say they initially have one hospital here in Bangalore then they would have a second hospital then they would have third then they would acquire let's say something else I know recently they acquired a hospital called as Columbia Asia right so that will also come under them now what if let's say this Manipal Hospital have slightly different table structures these five tables they have
            • 45:00 - 45:30 but they have different column names and all structures and now when they acquired Columbia Asia their patients the patient details from Columbia Asia will also be now associated with Manipal hospital now there can be chance that this patient table has different number of columns or the column names are different or more interestingly what can even happen let's say if I show you this if I show you the data for
            • 45:30 - 46:00 patients right let's say we have this uh patient ID now consider somehow the patient IDs for Columbia Asia and Manipal Hospital were same like it was 101 now how you will identify what is that so how you will try to build a common data model later so that you try to you should not face these difficulties
            • 46:00 - 46:30 that how do you merge these two schemas together because they might be using different column names or if they are using same column names might they are using the same patient IDs then how will you know which patient ID is coming from hospital a and which coming from hospital B if the patient ID is the same right so we would that's where I taking two hospitals to Showcase a common data model later so we will be taking Hospital
            • 46:30 - 47:00 a and Hospital B Hospital B think that there are two different branches of a hospital in Bangalore right but they have slightly different schemas for some of the tables and uh some of them have like same patient IDs then we would Implement a surrogate key later so that we are not dependent on that right so we would Implement a common data model later that's that's important
            • 47:00 - 47:30 thing that's why I have taken this scenario so that's where if you see uh if I go back now um if I go back if I go to home you see I have two different databases trendy trendy Tech Hospital a and trendy Tech Hospital B and this Hospital B has same number of tables but I will slightly modify the
            • 47:30 - 48:00 schema so that later I will show that how we merge these two so that we do not face any difficulties later right so this also has same number of tables I will just show you quickly uh I'll quickly connect this is going to be really really interesting right because that's how things work in Industry I mean if I give you a dummy idea then it will not suffice your purpose so uh let's say this all these
            • 48:00 - 48:30 tables are there and some data is there some data is there you can see I hope you can understand that so we are good with both these uh we are good with this EMR data and we understand there are two different databases so Hospital a data is stored in uh a different database right which is trendy Tech Hospital a trendy Tech
            • 48:30 - 49:00 Hospital a and this will be in Trendy Tech Hospital B two different databases are there that means two different instances of Azure SQL DB right both of them have five five tables so that's my EMR data which is very very important data now this will be in the form of tables as you understand now other kind
            • 49:00 - 49:30 of data will be claims data which your insurance team will send insurance company or the payer we say them as payers generally the payer will be sending us this data and they will upload in the form of flat files they will upload in the form of flat files and they will put it let's say in our specified container container or specified folder in our data
            • 49:30 - 50:00 link right so what we have done is we have created a data Lake we have created a storage account trendy Tech ADLs Dev you see a normal storage account you can create I'm not creating in things here because if I create it will take too long these are simple things right I expect that you know some level of uh Azure Cloud so that you can create this instances at least right so TT ADLs
            • 50:00 - 50:30 Dev now I go to Containers I go to containers and here there will be a folder called as Landing right do not worry about other folders so there will be a folder called as Landing in this these files will be uploaded by the insurance company and for each hospital they will upload right I will just show how it looks like a CSV file flat file that's why you say as a
            • 50:30 - 51:00 flat file it's a CSV so the claim right claim ID transaction ID patient ID encounter ID provider id Department service date claim date payer right and claim amount paid amount claim status right I'll talk more about this but you you understand that this is related to the insurance insurance that have they approved if approved how much amount and
            • 51:00 - 51:30 so on right so we have this data which will come in the form of flat files and this you think that they will send it monthly once monthly once so till now we have seen two different data sets one is our EMR data which is in our database we have two different databases that means two different instances of EMR are running and we have claims data which insurance company will be sending in the form of flat files they will upload it in a folder
            • 51:30 - 52:00 right what folder I will again show later but to be at a very high level landing folder right okay now there are two other kind of data that we will be using and that is NPI data right so claims data you understood flat files which will come from insurance that is providers now NPI data what is that National provider data or national
            • 52:00 - 52:30 provider identifier identifier so you remember that I mentioned each doctor is given a 10 digit unique code or number and National provider identifier right a unique number that identifies each doctor and there is a public API available for this public API we can call that and get all the list of doctors from there along
            • 52:30 - 53:00 with the NPI this thing details so complete details of doctors that means the providers is get we will get from NPI this thing API public API I'll show that API also later you can see this NPI data how how does this look like right NPI ID uh first name I have some sample data first name last name position of this person organization
            • 53:00 - 53:30 name for which this person is working last updated when this record was last updated and uh is current flag true right like that so this is more about the providers provider data right I hope this is clear this is about NPI now in case if I am see I am not from Healthcare domain what whatever I could understand right I'm trying to explain and if you feel somewhere I am
            • 53:30 - 54:00 incorrect then do mention in comments right so because let's take it in a very constructive manner so that we all learn right so in case if you feel somewhere I am not clear with the domain like NPI and all you can always feel free to mention in the comments so that everyone can learn right but as per my understanding I am talking about these things and can be a chance that I'm sure I would be 98% correct but 2% if there is any issues you can always mention in
            • 54:00 - 54:30 the comment so that everyone learns from that right so NPI data we have seen and we have ICD data what is that so we have ICD codes which is nothing but a standardized system used by Healthcare to classify I mean uh that means you have a quote like this for example this is a code and then we have the description that means Cola du to whatever right for example uh this code you see if
            • 54:30 - 55:00 someone's diagnosis is this code that means it is typhoid fever so we have a API which gives us this kind of mapping that means the ICD code and description so we will run these two apis also to get uh this data right NPI data and ICD data so I will write ICD codes are a standardized uh
            • 55:00 - 55:30 system used by Healthcare Providers to classify and to basically map diagnosis codes and diagnosis code and uh description right a mapping kind of that you will get from this ICD data so there is a API for that there is a API that we
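            As a rough illustration of the API-based data sets, here is a minimal Python sketch of pulling provider details from a public NPI-style registry API. The endpoint, parameters and response shape are assumptions for illustration only (the video shows the actual APIs later), so check the real API documentation before relying on them.

                # Minimal sketch of calling a public NPI-style API and flattening the result.
                # URL, parameters and response keys are assumed for illustration.
                import requests

                NPI_API_URL = "https://npiregistry.cms.hhs.gov/api/"  # assumed public endpoint

                def fetch_providers(city, state, limit=20):
                    """Call the public registry and return a simplified list of providers."""
                    resp = requests.get(
                        NPI_API_URL,
                        params={"version": "2.1", "city": city, "state": state, "limit": limit},
                        timeout=30,
                    )
                    resp.raise_for_status()
                    providers = []
                    for r in resp.json().get("results", []):
                        basic = r.get("basic", {})
                        providers.append(
                            {
                                "npi_id": r.get("number"),
                                "first_name": basic.get("first_name"),
                                "last_name": basic.get("last_name"),
                                "position": basic.get("credential"),
                                "last_updated": basic.get("last_updated"),
                            }
                        )
                    return providers

            In the project, this kind of result would then be written to the bronze layer from a Databricks notebook, for example via spark.createDataFrame(...).write.parquet(...).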
            • 55:30 - 56:00 As I said, this project has two parts; this is part one, and in it I will mainly work with the EMR data. So you now understand that we have EMR data sitting in databases, claims data arriving as flat files (CSV), and data coming through APIs, which is
            • 56:00 - 56:30 the NPI data and the ICD codes. These are the data sets we have at a high level, so I am sure you are now clear on them. With that, you understand the data sets
            • 56:30 - 57:00 properly. Now I will talk about the architecture we will follow - the solution architecture. We will follow a medallion architecture. I hope you know what that is; in case you do not, I will give you an idea. What happens in a medallion architecture is
            • 57:00 - 57:30 that you have a landing zone or landing folder, just like in our storage account. By the way, when we create the storage account, if we enable the hierarchical namespace option it becomes an ADLS Gen2 account; otherwise it is normal blob storage. It is recommended to turn
            • 57:30 - 58:00 that option on for big data analytics workloads, since ADLS Gen2 is much more performant for them. So I have created a storage account, and if I go to Containers you will see multiple containers: bronze (ignore configs for now), gold, landing, silver. So I will write landing, bronze,
            • 58:00 - 58:30 silver and gold. If I draw the lineage, it flows like this: some or all of your data starts in the landing folder, and then you keep moving it to the right - from landing to bronze, from bronze to silver, from silver to gold - continuously improving
            • 58:30 - 59:00 the data in terms of cleaning and quality as it moves; I will elaborate as we go. Now, I mentioned the data sets we have: the EMR data, the claims data, and the codes. The EMR data is in Azure SQL DB, the claims data is flat
            • 59:00 - 59:30 files, and the codes we can also get as files by calling the APIs - possibly in some format other than flat files; let's assume we can get that data in Parquet format via the APIs (I have to check, but let's assume so). Now, in the landing
            • 59:30 - 60:00 folder, the insurance provider will be placing their files; that is fine for the flat files. But what about the EMR data? The EMR data is in a database, and when we load it we will bring it directly to
            • 60:00 - 60:30 the bronze layer; the landing zone will not come into play for it. The landing zone is only for data that arrives as flat files: those we keep in landing initially, then in the bronze layer I keep everything as Parquet files, and in the silver layer I
            • 60:30 - 61:00 will create Delta tables - I will explain more - and in the gold layer there will also be Delta tables, Databricks Delta tables. So not everything necessarily passes through
            • 61:00 - 61:30 landing. The insurance company loads flat files (CSV) into landing and we take those to bronze, then silver, then gold. The EMR data, however, we ingest through Azure Data Factory and bring directly into the bronze layer as Parquet files. So the bronze layer holds everything as Parquet, and when we fetch
            • 61:30 - 62:00 the codes by calling the APIs we can land them directly in bronze as well. So in our case the landing zone is only for the insurance data; the EMR tables and the codes (ICD and NPI) go straight into the bronze layer as Parquet files.
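            To make the landing-to-bronze hop concrete, here is a minimal PySpark sketch of converting a payer CSV drop into bronze Parquet. The container names mirror the ones discussed (landing, bronze), but the storage account name and folder layout are assumptions for illustration; in the project this hop may equally be done with an ADF copy activity, as discussed later.

                # Minimal sketch: landing (CSV) -> bronze (Parquet) for the claims files.
                from pyspark.sql import SparkSession

                spark = SparkSession.builder.getOrCreate()

                STORAGE = "ttadlsdev"  # assumed storage account name
                landing_path = f"abfss://landing@{STORAGE}.dfs.core.windows.net/claims/hospital_a/"
                bronze_path = f"abfss://bronze@{STORAGE}.dfs.core.windows.net/claims/hospital_a/"

                # Read the monthly CSV drop from the payer (flat files with a header row)
                claims_df = (
                    spark.read.option("header", True).option("inferSchema", True).csv(landing_path)
                )

                # Bronze keeps the raw data, only converted to Parquet, as the source of truth
                claims_df.write.mode("overwrite").parquet(bronze_path)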
            • 62:00 - 62:30 From bronze we then move on to silver. Bronze is in Parquet format, and you can think of it as the source of truth: if at some point you are confused about what is in silver or gold and want to go back to the actual data, you go back to the bronze layer. Now, what changes from bronze to silver? Basically we do data
            • 62:30 - 63:00 cleaning before loading into silver - cleaning of nulls, data quality checks - and we might enrich the data by joining it with something (I am speaking in a general sense here). In our case we will also implement a common data model, because we have data from two different hospitals, and we will assume the patients tables have different column names and that the patient
            • 63:00 - 63:30 IDs might collide. So we will implement a surrogate key so that we can merge the two databases properly without conflicts; in other words, before loading to silver we implement a common data model (CDM).
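            Here is a minimal PySpark sketch of the common-data-model idea: align the two hospitals' patient tables to one schema and add a surrogate key so colliding patient IDs from Hospital A and Hospital B never clash. The column names, paths and the concatenation-based key are assumptions for illustration, not the exact notebook code from the project.

                # Minimal CDM sketch: common columns + surrogate key across both hospitals.
                from pyspark.sql import SparkSession, functions as F

                spark = SparkSession.builder.getOrCreate()
                STORAGE = "ttadlsdev"  # assumed storage account name

                hosp_a = spark.read.parquet(f"abfss://bronze@{STORAGE}.dfs.core.windows.net/hosa/patients/")
                hosp_b = spark.read.parquet(f"abfss://bronze@{STORAGE}.dfs.core.windows.net/hosb/patients/")

                # Rename to a common set of columns (assumed names) and tag the source system
                common_a = hosp_a.selectExpr(
                    "PatientID as src_patient_id", "FirstName as first_name", "LastName as last_name"
                ).withColumn("datasource", F.lit("hos-a"))
                common_b = hosp_b.selectExpr(
                    "ID as src_patient_id", "F_Name as first_name", "L_Name as last_name"
                ).withColumn("datasource", F.lit("hos-b"))

                # Surrogate key = source system + source ID, unique across both hospitals
                patients_cdm = common_a.unionByName(common_b).withColumn(
                    "patient_key", F.concat_ws("-", "datasource", "src_patient_id")
                )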
            • 63:30 - 64:00 We also know that patient details can change - the address can change, the doctor details can change, and so on - so we will implement SCD Type 2 in order to maintain the history too. You know there are different kinds of slowly changing dimensions; in our case the patient table, the provider table and so on are dimension tables, so things can change, and we end-date the previous record while the new record comes in. That is how we maintain history with SCD Type 2.
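            The following is a minimal sketch of the SCD Type 2 pattern on a Delta table, continuing from the patients_cdm DataFrame in the previous sketch: when an incoming record differs from the current one, the old row is expired and a new current row is inserted. Table and column names are assumptions for illustration, not the project's exact merge logic.

                # Minimal SCD Type 2 sketch on a Databricks Delta table (two-pass merge).
                from delta.tables import DeltaTable
                from pyspark.sql import functions as F

                target = DeltaTable.forName(spark, "silver.patients")
                updates = patients_cdm.withColumn("inserted_date", F.current_timestamp())

                # Pass 1: expire current rows whose attributes have changed
                (
                    target.alias("t")
                    .merge(updates.alias("s"), "t.patient_key = s.patient_key AND t.is_current = true")
                    .whenMatchedUpdate(
                        condition="t.first_name <> s.first_name OR t.last_name <> s.last_name",
                        set={"is_current": "false", "modified_date": "current_timestamp()"},
                    )
                    .execute()
                )

                # Pass 2: insert new versions (and brand-new patients) as current rows
                new_rows = updates.withColumn("is_current", F.lit(True))
                (
                    target.alias("t")
                    .merge(new_rows.alias("s"), "t.patient_key = s.patient_key AND t.is_current = true")
                    .whenNotMatchedInsertAll()
                    .execute()
                )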
            • 64:00 - 64:30 That is our silver layer: once the data is in silver it is cleaned and enriched, it follows a common data model, and it carries history via SCD Type 2. Then the final layer is the gold layer. In the gold layer, if there is any aggregation to perform,
            • 64:30 - 65:00 we can do it - in our case it is not aggregation as such, I am speaking generally - but what we ultimately want is the data in the form of facts and dimensions, fact tables and dimension tables, so that the reporting team can build their KPIs and understand their metrics. So I need to build the right data model of facts and dimensions, and I will show you how
            • 65:00 - 65:30 exactly we build that; the reporting team will take it and build whatever reports they need on top of it. I hope this is clear. You can see that the further you go to the right, the better the data gets, and each of these layers serves a different persona. If you ask me who the end user of gold is, I would say business users, who have to take business
            • 65:30 - 66:00 decisions. Who will use data from the silver layer? It is cleaner data, so a data science team - data scientists, machine learning folks - or even data analysts, because they still need granular data: not aggregated as in gold, but granular and clean. So each layer serves
            • 66:00 - 66:30 different personas. And the bronze layer is for whom? Bronze holds the raw data as the source of truth, so I would say data engineers, who can build whatever they need on top of it. Each layer serves a different persona in the medallion architecture. I have described three layers; some people call bronze the raw layer, or use four or five layers, but each layer should serve a different purpose - there is no hard and fast rule
            • 66:30 - 67:00 that there can only be three. In our case, anything in the landing folder is flat files; in bronze I have Parquet files (still no tables there); and from the silver layer onwards I create Databricks Delta tables. I hope we are good with this.
            • 67:00 - 67:30 Now let me show how the high-level architecture looks. You can see we have Hospital A on Azure SQL DB and Hospital B on Azure SQL DB - two databases - and we have a container in our ADLS Gen2 data lake where we get
            • 67:30 - 68:00 the files from the payer (the payer is simply the insurance company), plus the public APIs for the ICD codes and the NPI data. Now, how do we get the EMR data out of the databases? Through Azure Data Factory - ADF is our ingestion tool - and that data lands directly in the bronze layer
            • 68:00 - 68:30 as Parquet files. We take the data from the databases and put it into the bronze container as Parquet; it never sits in landing. For the payer files, we have given the insurance company access to the landing folder and they dump flat files in CSV format every month; we take that CSV data from landing, put it into bronze, and
            • 68:30 - 69:00 convert it to Parquet. We can do that through Databricks or Azure Data Factory - we will decide later; in part one I am only interested in the EMR data. The public APIs we call through Databricks notebooks and again land the results in bronze. So anything we can obtain directly as Parquet goes straight into bronze, and we
            • 69:00 - 69:30 skip the landing layer for it. In bronze, everything is in Parquet format. Next, I take the data from the bronze container, clean it, implement the common data model, implement the slowly changing dimension, and put it into the silver layer - no longer as files but as
            • 69:30 - 70:00 Delta tables, where we even get ACID transactions and so on (more on that later). Then in the gold layer we create facts and dimensions. That is the flow, and the major technologies are Azure Data Factory and Azure Databricks. There will be
            • 70:00 - 70:30 heavy use of Databricks notebooks to move from bronze to silver and from silver to gold, so most of the logic lives in Databricks. People often ask me, "Sumit sir, what are the most important technologies to know in the Azure cloud?" I would say Azure Databricks and Azure Data Factory, and those two are the heaviest portions of this project. But we are using more technologies as well: we are
            • 70:30 - 71:00 using Azure SQL DB, the Azure storage account (ADLS Gen2), and the APIs, so various pieces come together. And what is our end goal as data engineers? Not visualization or reporting; our end goal is to create the facts and dimensions so that the reporting team can take over, and that
            • 71:00 - 71:30 lives in the gold layer. So in the gold layer we have to generate facts and dimensions, and I want to show you the ER diagram for that. At a high level (there might be slight changes) this is what we will have: a lot of dimensions - patient is a dimension, diagnosis is a dimension. And what is a fact? A fact is essentially a transaction. For example, say my name is Sumit and I purchased a laptop for
            • 71:30 - 72:00 $1,000 - a MacBook - from a store in Bangalore, say Indiranagar. The actual fact here is the amount: I purchased something for $1,000, and I purchased two units. Those numbers are the facts; a fact is numeric, like amounts and quantities purchased. The supporting details - who purchased, when, from which store,
            • 72:00 - 72:30 who the seller was - are additional information about the transaction, and those are the dimensions. Dimensions may change slowly; let me write this down. At a very high level: a fact is something that has happened, a truth, captured as a numeric
            • 72:30 - 73:00 value. Now, do not say, "Sumit sir, employee ID is also numeric, so it should be a fact" - no. Ideally a numeric value like total amount or total quantity is a good candidate for a fact; a measure with a numeric or decimal value. A dimension is all the supporting context. A fact never changes, but a dimension can change, and that is why we
            • 73:00 - 73:30 implement SCD Type 2 to maintain history. There are different SCD types: Type 1, where we do not keep history and simply overwrite, and Type 2, where we do keep history - and Type 2 is the common industry practice. Anyway, let me close this diagram; we have understood it. We have a transaction fact (fact_transaction), which is the fact, and it has a lot of
            • 73:30 - 74:00 references to the dimensions: patient ID pointing to the patients dimension, department ID to the department dimension, provider ID to the provider (doctor) dimension, the ICD code to the diagnosis-code dimension, plus encounter ID and claim ID. So there are many references, and then we have payer paid amount, adjustment amount,
            • 74:00 - 74:30 patient paid amount and so on - those are the actual facts. You can also see that the dimensions themselves have links; for example dim_department has a link to dim_provider because a doctor belongs to a certain department. We could have combined those two, but keeping them separate makes it more of a snowflake style, where dimensions also have to be joined to each other. In a plain
            • 74:30 - 75:00 star schema of facts and dimensions, a fact connects to dimensions but a dimension does not relate to another dimension; when a dimension relates to another dimension, that is a snowflake-style schema. This is what we want to build at a high level (there may be a few changes): facts and dimensions, after doing all the cleaning, implementing the slowly changing dimension, and implementing the common data model.
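            For a concrete picture of the gold-layer star schema described here, the sketch below creates a fact table and one dimension as Databricks Delta tables with Spark SQL. The table and column names follow the ER diagram as narrated (fact transaction plus patient/provider/department/diagnosis references) but should be treated as illustrative, not the project's exact DDL.

                # Minimal gold-layer star schema sketch (names are assumptions).
                from pyspark.sql import SparkSession

                spark = SparkSession.builder.getOrCreate()

                spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

                spark.sql("""
                CREATE TABLE IF NOT EXISTS gold.dim_patient (
                    patient_key STRING,      -- surrogate key from the common data model
                    first_name  STRING,
                    last_name   STRING,
                    datasource  STRING       -- hos-a / hos-b
                ) USING DELTA
                """)

                spark.sql("""
                CREATE TABLE IF NOT EXISTS gold.fact_transaction (
                    transaction_id      STRING,
                    patient_key         STRING,        -- reference to gold.dim_patient
                    provider_id         STRING,        -- reference to the provider (doctor) dimension
                    department_id       STRING,        -- reference to the department dimension
                    icd_code            STRING,        -- reference to the diagnosis-code dimension
                    paid_amount         DECIMAL(18,2),
                    adjustment_amount   DECIMAL(18,2),
                    patient_paid_amount DECIMAL(18,2)
                ) USING DELTA
                """)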
            • 75:00 - 75:30 In the end, this is our result, so that the reporting team can take it and build their reports on top of it. I hope we are good; you should now have a lot of clarity on exactly what we are aiming for. Now we have to implement the solution. In this part I will cover the EMR data only: how to bring the EMR data to the
            • 75:30 - 76:00 bronze layer. You might think that is a very easy thing. It can be, but we will solve it in an industry-style fashion by creating a generic pipeline - a very important idea - and you will see how nicely things flow. So I want to bring this EMR data to my bronze layer, and that is what I will
            • 76:00 - 76:30 show you now. The technologies we will use: Azure Data Factory - and what do we use ADF for? Basically ingestion, because we will pull the EMR data through ADF and put it into our data lake, which is ADLS Gen2 (more on that later). That is the
            • 76:30 - 77:00 main part. Then in this project we will use Azure Databricks; ADF and Databricks will both be heavily used. We are using Azure SQL DB, we will be using the Azure storage account, and we will be using Key Vault as well - not now, but later - for storing our passwords, credentials and keys, which is good practice. Our raw files and Parquet
            • 77:00 - 77:30 files - all of that data - will be stored in the storage account, Key Vault will hold the credentials so they do not get leaked, and Azure SQL DB is where the EMR data lives; we will also create one or two more tables there for our own use. Azure
            • 77:30 - 78:00 Databricks is for our data processing, that is the ETL: moving from bronze to silver, the cleaning, implementing the CDM and SCD Type 2 - all of that will be done through Azure Databricks. We might use more technologies, but at a high level this is the stack, and Data Factory and Databricks are the two most important components.
            • 78:00 - 78:30 Now, the first thing we have to do is create an Azure storage account. In our case we have created one named TT ADLS Dev, and we have enabled the hierarchical namespace so that it becomes an
            • 78:30 - 79:00 ADLS Gen2 account - preferred for data analytics and heavier workloads because it is optimized for them. If I open the account you can see "hierarchical namespace": when creating it you just tick that option and it becomes an ADLS Gen2
            • 79:00 - 79:30 account; otherwise it stays a normal blob storage account. Inside it we have to create containers - you click here and create them; I am not doing it now because it would take a while - and the containers you need are (I will just list the hierarchy so you can do the same): landing, bronze,
            • 79:30 - 80:00 silver and gold, to implement the medallion architecture. So you can see those four containers, and apart from them I have one more called configs, which has nothing to do with the medallion layers - landing, bronze, silver and gold are the medallion ones, and configs is separate, for keeping configuration files.
            • 80:00 - 80:30 I want to keep some configuration so that I can implement a generalized, generic pipeline - a metadata-driven pipeline - which is something we should be doing. That is why I have this configs container, and inside it I will create a folder called
            • 80:30 - 81:00 EMR, because I want a generic pipeline for the EMR data. So I go to configs, click "Add directory" and create the EMR directory; under configs I now have EMR. Configs is the container, EMR is a directory inside it, and inside that folder sits my config file, load_config.csv.
            • 81:00 - 81:30 Let me show you that file and what it contains. You will understand why each of these columns exists, but at a high level: the first column is the database,
            • 81:30 - 82:00 that is, whether the row refers to Trendy Tech Hospital A or Trendy Tech Hospital B - our two databases in Azure SQL DB. The second column is the data source: whether the source is Hospital A or Hospital B (we will use this later, which is why the column is there). Then comes the table name: we have five tables in database A and five tables
            • 82:00 - 82:30 in Hospital B, so the first five rows are for database A and the next five for database B. You will see dbo in the names - dbo is just the schema; like an app schema, dbo is the default schema in Azure SQL DB - so dbo.encounters,
            • 82:30 - 83:00 patients, transactions, providers, departments: five tables. Then we specify how each table should be loaded, so load type is another column: incremental or full. If it says incremental, the pipeline will not redo work it has already done - if 100 records were loaded previously and two new records have arrived,
            • 83:00 - 83:30 only those two are processed. Any transactional table, like transactions or encounters, should definitely be incremental because it carries a lot of data - these are more like fact tables. Even patients will be a bigger table, so we keep it incremental. Only providers, which is
            • 83:30 - 84:00 maybe 20-30 doctors, and departments, with five or six rows, can be full loads, because they hold very little data. So load type is incremental or full. Next is the watermark: if a table is loaded incrementally, we must define a watermark column, i.e. the column on which the incremental load is based. The watermark column
            • 84:00 - 84:30 applies to incremental tables only; here it is the modified date, meaning any record with a modified date after the stored value is considered new data. We store the last modified date and pick up every record after it - that is our watermark. For a full load there is no need for a watermark column, since we are not loading incrementally.
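            To show how the watermark drives an incremental extract, here is a minimal Python sketch: look up the last load time recorded for the table, then build a query that pulls only rows modified after it. In the project this logic sits inside the ADF pipeline; the audit table and column names below follow the later narration but are assumptions for illustration.

                # Minimal watermark-driven incremental query sketch.
                from pyspark.sql import SparkSession

                spark = SparkSession.builder.getOrCreate()

                table_name = "dbo.encounters"
                watermark_col = "ModifiedDate"

                # Last successful load time for this table, taken from the audit table
                last_load = spark.sql(f"""
                    SELECT COALESCE(MAX(loaddate), CAST('1900-01-01' AS TIMESTAMP)) AS last_load
                    FROM audit.load_logs
                    WHERE tablename = '{table_name}'
                """).first()["last_load"]

                # Query that the copy step would run against Azure SQL DB
                incremental_query = f"SELECT * FROM {table_name} WHERE {watermark_col} >= '{last_load}'"
                print(incremental_query)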
            • 84:30 - 85:00 So the watermark column comes into play only when we perform an incremental load; it is what drives the incremental strategy. Next is the is-active flag: if we set it to zero, we are saying that when the pipeline is invoked, this particular table should not be ingested - it is inactive, so the pipeline should not run for it. Zero means
            • 85:00 - 85:30 inactive, one means active, and only records with one should be picked up. In part one I have not implemented that, but we will do it in part two. Sometimes you simply do not want the pipeline to run for a specific table for some reason, and then you mark it as zero, i.e. inactive. This is what makes it a metadata-driven architecture, or pipeline.
            • 85:30 - 86:00 Finally there is the target path: you will see why it is there, because we will write to different folders for Hospital A and Hospital B. So target path and data source will make sense shortly, but the point is that these configurations exist so that after reading the file the system knows what to do and how to do it. We have this file, load_config.csv, stored
            • 86:00 - 86:30 under the configs container in the EMR folder. I hope this is clear. So bronze, silver, gold, landing and configs - I believe everything is clear now.
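            As a reference, here is a minimal sketch of what the metadata file might look like and how a notebook could read it for testing. The exact column names, order and sample values are assumptions reconstructed from the narration, not the real project file; in ADF itself a Lookup activity reads this file.

                # Minimal sketch of the metadata-driven config file, roughly:
                #   database,datasource,tablename,loadtype,watermark,is_active,targetpath
                #   trendytech-hospital-a,hos-a,dbo.encounters,Incremental,ModifiedDate,1,hosa
                #   trendytech-hospital-a,hos-a,dbo.providers,Full,,1,hosa
                #   trendytech-hospital-b,hos-b,dbo.transactions,Incremental,ModifiedDate,1,hosb
                from pyspark.sql import SparkSession

                spark = SparkSession.builder.getOrCreate()
                STORAGE = "ttadlsdev"  # assumed storage account name

                config_df = (
                    spark.read.option("header", True)
                    .csv(f"abfss://configs@{STORAGE}.dfs.core.windows.net/emr/load_config.csv")
                )

                # Only rows flagged active are meant to be ingested when the flag is honoured
                active_tables = config_df.filter("is_active = 1")
                active_tables.show(truncate=False)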
            • 86:30 - 87:00 Okay, now let's come to the most important part, where we implement the pipeline. What do we want the pipeline to do? We want to take the data from the EMR source, our Azure SQL DB, and load it into ADLS Gen2 - that is, bring it into the data lake - into
            • 87:00 - 87:30 the bronze folder in Parquet format. I hope you can acknowledge this: we take the data from the database and land it in the bronze folder in Parquet format. For that we have to
            • 87:30 - 88:00 create an Azure Data Factory (ADF) pipeline. When creating and configuring an ADF pipeline to ingest data, there are a few components, in case you know them: linked services, then datasets, then certain
            • 88:00 - 88:30 activities, and then we build the pipeline itself. I will explain all of these. What is a linked service? Say you have to get data from a SQL database - you need connection parameters in order to get the data from there; not just anyone can randomly connect to
            • 88:30 - 89:00 my Azure SQL database. There must be a connection string or connection parameters. A linked service defines how you connect to your source and how you connect to your target, so we need a linked service for the source and one for the target. In this case I want to connect to my Azure SQL DB, so I need a linked service for it,
            • 89:00 - 89:30 and I need another linked service to write my data to ADLS Gen2, the target. One for the source, one for the target: the connection parameters that say how to connect to Azure SQL Database and how to connect to ADLS Gen2. These two are
            • 89:30 - 90:00 definitely required at a basic level (there will be more linked services later). We will also create an audit table as part of this metadata-driven pipeline: whenever the pipeline runs, we will insert an entry
            • 90:00 - 90:30 saying the pipeline finished at such-and-such time. We record the time so that the next run can pull all data after that point - for our incremental load, that last modified/load time is what helps us - and that is why we need the audit table. I will show what it looks like, but remember: as soon as the pipeline finishes we write an entry to the audit table, and tomorrow we can check whether the pipeline succeeded,
            • 90:30 - 91:00 when it last ran, and so on. This audit table will be a Delta table in Databricks. And if I want to create a Delta table and write entries into it, I need another linked service - one that knows how to connect to Delta
            • 91:00 - 91:30 Lake, i.e. to Delta tables. And going forward, when I put my passwords and credentials into Key Vault, I will also need to know how to connect to Key Vault, so later I will create a Key Vault linked service too. In this part I
            • 91:30 - 92:00 will not create that one, but I will create these three: one to connect to the Azure SQL database to get the data, one to connect to ADLS Gen2 to write the data, and one to connect to Delta Lake in order to write entries into the audit table (which is a Databricks Delta table). The Key Vault linked service I will create in the next part, when I implement
            • 92:00 - 92:30 Key Vault and tidy everything up properly. So a linked service is basically a connection definition that tells ADF how to connect to a source or a target. The other piece: suppose I want to read something from ADLS Gen2, my storage account - I create a linked service, but now I also have to say
            • 92:30 - 93:00 what the file name is, what the file path is, and what the file format is. How do I specify those? Using datasets. A dataset says: what is the file path, the file name, the file format, and where and in which format do I have to read or write the
            • 93:00 - 93:30 file. I hope you can follow this. In the case of Azure SQL DB: once you have connected to the server, our job is to tell it the database name, the table name, the schema name, and so on. That is where we define a dataset: I
            • 93:30 - 94:00 know how to connect to the SQL DB (the linked service), but which database, which table, which schema - that we define as part of the dataset. I hope you have the idea; most of you will be aware, but I am explaining it so that even without knowing Azure Data Factory you can follow in layman's terms.
            • 94:00 - 94:30 So you understood why we create datasets. What do we need a dataset for? In the case of Azure SQL DB we have to define the database name, table name and schema name; in the case of ADLS Gen2 we have to define the container name, file name, file format and so on - I will just note those in
            • 94:30 - 95:00 brackets to give you a rough sense of why. Now, when working with ADLS Gen2, suppose the format is delimited text - flat files, i.e. CSV - then you create a dataset for
            • 95:00 - 95:30 delimited text, which is again tied to that ADLS Gen2 linked service. We are saying: this dataset handles delimited text and it relates to ADLS Gen2, whether we read the data or write it. Then, if we want to write data in Parquet format, we create a Parquet dataset, which is
            • 95:30 - 96:00 also related to the ADLS Gen2 linked service. So you always associate a linked service with a dataset: when defining an Azure SQL DB dataset we associate the Azure SQL DB linked service; when defining delimited text we associate the ADLS Gen2 linked service (source or target, it doesn't matter); the same for Parquet;
            • 96:00 - 96:30 and for Delta Lake we define an Azure Databricks Delta Lake dataset tied to the Delta Lake linked service. I will show all of this, do not worry - it will become clearer. So we need to create the datasets; let me show you
            • 96:30 - 97:00 these things. I will go to my Azure Data Factory: I go to Home, open TT Healthcare ADF Dev - that is the ADF instance we are in - and launch the Studio. Now
            • 97:00 - 97:30 I will go to Manage to show you the linked services I have created. We have to get the data from the Azure SQL DB, so we should have a linked service for that, and here it is. The naming convention I will improve in part two - right now the names can look odd, but that is fine. You can see this is a linked service to connect to the
            • 97:30 - 98:00 database, and it is good practice to end the name with _ls, so that just by seeing it you know it is a linked service (naming conventions I will cover in part two). I have given the fully qualified domain name of the server; if I go to Home I
            • 98:00 - 98:30 can show you the server name I have used. Now, we have two different databases. We could easily have created two linked services, one per database, but I want to build things generically: can I create just one linked service that can connect
            • 98:30 - 99:00 to both EMR databases, Hospital A and Hospital B? That is how I have built it - we will generalize it, and I will show you how (ignore the rest for now). The authentication type is SQL authentication, the username is sqladmin, and the password is whatever I set at creation time; when I implement Key Vault I will move the password there too, but for now it is a plain password
            • 99:00 - 99:30 that I entered back then. And here I have defined a parameter, db_name, because we have two different databases, Hospital A and Hospital B - two instances of Azure SQL DB. db_name is of type string, and I will pass it
            • 99:30 - 100:00 dynamically. You can see that for the database name I am referencing linkedService().db_name, which means I am not hard-coding a value; the database name will be whatever value is supplied for this parameter, and we are not supplying it here. If we had given a value, the database name would have been fixed to it.
            • 100:00 - 100:30 I could have hard-coded my database name there, but instead I am saying: no, this will be the value of the parameter, and I am not setting the parameter because I will pass it later. From where? The wrapper around this is the dataset: as I mentioned, we connect a linked service to a dataset, so when defining the dataset
            • 100:30 - 101:00 we reference the linked service, and we have the option to supply this database name at the dataset level. If we do not supply it in the dataset either, we can supply it when defining the pipeline, because the hierarchy is: the pipeline at the top, the dataset below it, and the linked service below that. So if we do not set a
            • 101:00 - 101:30 parameter at the linked service level, we can pass it in the dataset; if we do not pass it there, we can pass it from the pipeline - whatever we pass at the top flows down. So the dataset sits on top of the linked service, and the pipeline sits on top of the datasets.
            • 101:30 - 102:00 Right now I have not set the database name because I want a generic linked service; otherwise I would have had to create two separate linked services. And if tomorrow there are five databases, would I create five linked services? No - it is always better to pass this dynamically from the pipeline, and we will see that. Key Vault we will implement later, so I will just cancel out of here.
            • 102:00 - 102:30 So you understood how to implement this particular linked service; it acts as our source for getting the data, so we need to know how to connect to it. We have given the server, the username and the password, but which database to connect to we will determine dynamically. That was our source. Now, ADLS Gen2 is our
            • 102:30 - 103:00 target, where we want to put the data, so we need a linked service for it as well. Source or target does not matter - we simply need a way to connect - so a linked service is required. Let's look at the one for ADLS Gen2. Ideally I should have given it a better name ending
            • 103:00 - 103:30 with _ls, but we will rename things properly in the next part, where we cover best practices. For authentication: if I go to my storage account and open Access keys, I can copy the key
            • 103:30 - 104:00 from there - that is my access key - and I have put that same storage account key into the linked service. Before that I have to give the account's URL. Where do you get that URL? Go to the storage account's
            • 104:00 - 104:30 overview, then Endpoints, and you will find the Data Lake Storage endpoint; that is the URL you provide here. And the storage account key, again: go to Security + networking, Access keys, click Show, and copy the key (I am not showing mine because I want to keep using it). Paste it in, and then you can test the
            • 104:30 - 105:00 connection; it should succeed - only an authorized person can connect. You can see "connection successful" because we supplied the credentials. Later we will store this key in Key Vault so the credentials stay secure. No parameters are needed here, so we are good: we have created these two linked services, for Azure SQL DB and ADLS Gen2.
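            As an optional sanity check outside ADF, the sketch below verifies that the account URL and access key actually work by listing a container with the Python SDK. The account name and container are assumptions for illustration, and in real use the key should come from Key Vault rather than being pasted into code.

                # Minimal connectivity check with the azure-storage-file-datalake SDK.
                from azure.storage.filedatalake import DataLakeServiceClient

                ACCOUNT_URL = "https://ttadlsdev.dfs.core.windows.net"  # assumed endpoint
                ACCOUNT_KEY = "<storage-account-access-key>"            # never hard-code in real use

                service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
                bronze_fs = service.get_file_system_client("bronze")

                # List the top-level paths in the bronze container to confirm access works
                for path in bronze_fs.get_paths(recursive=False):
                    print(path.name)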
            • 105:00 - 105:30 Once a linked service is created, we can use it as either source or target - once the credentials are set, it works for both. Now we also need one for Delta Lake, because that is where we will store the audit table. Azure Databricks Delta Lake is the type
            • 105:30 - 106:00 to select, as you can see. For comparison, the earlier one was of type Azure SQL Database: when creating a new linked service you click New, choose Azure SQL Database, and start configuring. For ADLS Gen2 you do the same but choose
            • 106:00 - 106:30 Azure Data Lake Storage Gen2 and configure that. And here I am showing the Azure Databricks Delta Lake one: give it a good name, and note that to set it up your Databricks cluster needs to be running,
            • 106:30 - 107:00 because you can only create a Delta table when the cluster is up - and this audit table will be a Delta table. So let me go to my Databricks workspace: I have created a test workspace (again, not a great name, but it is just for testing) and I launch it.
            • 107:00 - 107:30 With the workspace launched, I can go to Workspace to see my notebooks and to Compute to see whether any cluster is running. I created a single-user, single-node cluster on an F4 instance, which gives four cores and 8 GB of memory, set to terminate after 10 minutes of inactivity. It
            • 107:30 - 108:00 consumes only about 0.5 DBU per hour - the least I can spend - but do make sure you set a termination time ("Terminate after"). Let me start the cluster so that once it is up I can show you how to get the cluster ID and so on. Let's wait for it to come up.
            • 108:00 - 108:30 Okay, the cluster is up; it took around five to six minutes. Since the cluster is on, I can go to my workspace - I could have viewed the notebooks before, but I could not have attached them to a cluster to run. Now, about the audit table - I just want to give you an idea of what I mean. We have created this audit
            • 108:30 - 109:00 table: first we create a schema called audit. In Databricks, a schema is essentially the same thing as a database - just as we might create a retail_db database - so consider that we are creating audit as a database,
            • 109:00 - 109:30 and inside it we create a table named load_logs. It has an id column, which auto-increments, and a data_source column recording where the data came from. If you recall the storage account,
            • 109:30 - 110:00 the containers and the configs file, you can relate to this: data_source will hold values like Hospital A or Hospital B; tablename records which table the pipeline just finished, such as encounters or patients; numberofrowscopied records how many rows were copied, say 50;
            • 110:00 - 110:30 watermarkcolumnname is in most cases the last modified date; and loaddate records when the record was last loaded. This audit table helps a lot: if you want to know whether yesterday's pipeline ran successfully, you go to the audit table and check these logs.
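            Here is a minimal sketch of that audit schema and load_logs table as a Databricks Delta table. The column names follow the narration (id, data_source, tablename, numberofrowscopied, watermarkcolumnname, loaddate), but the exact DDL used in the project may differ slightly.

                # Minimal audit table sketch as a Delta table in Databricks.
                from pyspark.sql import SparkSession

                spark = SparkSession.builder.getOrCreate()

                spark.sql("CREATE SCHEMA IF NOT EXISTS audit")

                spark.sql("""
                CREATE TABLE IF NOT EXISTS audit.load_logs (
                    id BIGINT GENERATED ALWAYS AS IDENTITY,  -- auto-incrementing entry id
                    data_source STRING,                      -- e.g. hos-a / hos-b
                    tablename STRING,                        -- table the pipeline just loaded
                    numberofrowscopied INT,                  -- rows copied in this run
                    watermarkcolumnname STRING,              -- set only for incremental loads
                    loaddate TIMESTAMP                       -- when this load finished
                ) USING DELTA
                """)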
            • 110:30 - 111:00 So we have created this audit.load_logs table, and we will write log entries into it after the pipeline succeeds. To start with, the table exists but I will truncate it - done - so it
            • 111:00 - 111:30 should show nothing yet; once populated it will show the auto-incremented id, the data source (Hospital A or Hospital B), which table, how many rows were copied, the watermark column name (only for incremental loads; for full loads it will be empty), and the load time. I will talk more about it later, but right now a select returns nothing because we truncated it just
            • 111:30 - 112:00 now. Okay. Now we want to create a linked service to this - that is why we came here; a random person cannot connect, there has to be a way in. So to create the linked service I click New, type "Azure Databricks Delta Lake", select it, and configure it. What have I provided? An
            • 112:00 - 112:30 access token. How do you find the access token? In Databricks, go to Settings, then Developer, then Access tokens; you cannot view a token again later, but you can manage them and generate a new one and paste it into the linked service. I am not generating one now because I have already done it, but that is where you take the access token
            • 112:30 - 113:00 from and paste it in. And what are the domain and cluster ID? Go to the Databricks workspace and look at the URL - something like adb-<workspace-id>.azuredatabricks.net - and
            • 113:00 - 113:30 that URL is what you provide as the domain. The cluster ID we can get from the cluster itself: when you open the cluster you can see it under its properties or directly in its URL.
            • 113:30 - 114:00 So go to Compute, click on your cluster, and you will find it there - there are more ways to get it, but that is one of the easiest. You enter that cluster ID here. So two things, domain and cluster ID, plus the access token, and that is it. Test connection - it should succeed, and yes, the connection is successful. I will cancel out. So I have created a linked service to connect to my Azure
            • 114:00 - 114:30 SQL DB, a linked service to connect to ADLS Gen2, and a linked service to connect to Azure Databricks Delta Lake - those three. Key Vault, as I said, we will cover later. So the three main linked services are done, and next we will create the
            • 114:30 - 115:00 datasets. From the Azure SQL DB we have to get the tables, so let's create the datasets now. I click on the Author tab; to create a dataset you click New dataset and select
            • 115:00 - 115:30 what it is for - are you creating it for ADLS Gen2 or for SQL DB? Say I choose ADLS Gen2; then I choose what kind of file the dataset handles - delimited text, Parquet, or whatever. So it is first the linked-service side and then the dataset type
            • 115:30 - 116:00 when you create it. Let me first look at the dataset for Azure SQL DB. You can see what we have done: we created it for Azure SQL DB and selected our Azure SQL DB linked service - in a dataset we have to state which linked service it relates to. So
            • 116:00 - 116:30 we have given our credentials for Azure SQL DB, but which table, which database? That is what the dataset is for. The linked service is selected, and then there is db_name: remember that when we created the linked service we were supposed to give the DB name but did not, and here too we are not providing the database name. This property is inherited - you can see it under the linked service properties - and
            • 116:30 - 117:00 although it asks for a value, we will still not give one here; we will supply it from the pipeline, which is the third level. First comes the linked service, then the dataset, then the pipeline - so let the value come from the pipeline, to keep things dynamic. Actually, this is a dummy one; I should have opened the proper one we made for Azure SQL DB, the generic SQL dataset, because again our intention is to be
            • 117:00 - 117:30 generic - I will delete the dummy later. The delimited-text one and the Azure SQL table one here are of no use; we will clean those up in part two. The one I want to showcase is this: for the Azure SQL DB linked service we created a generic SQL dataset. In it we selected the
            • 117:30 - 118:00 linked service we have for Azure SQL DB, and for db_name we are still not passing a value - you can see it references dataset().db_name. So we still have not passed it. You can see there are three parameters: if you want to connect
            • 118:00 - 118:30 to an Azure SQL database, the credentials are set, but we still have to tell it the database name (which, strictly speaking, the linked service was supposed to carry), and then the schema name and the table name, because if we want to read a specific
            • 118:30 - 119:00 table we have to specify both. So we have two more parameters, schema_name and table_name, both of type string, with no values defined. And if I go to the Connection tab, db_name is something that comes from the linked service itself - we did not create it anew - while the schema name and table name are the ones we
            • 119:00 - 119:30 defined here as dataset parameters. We are not passing any values; we will get them from the pipeline itself, because all of these things come from the configuration file we uploaded to ADLS Gen2 - that is why we call it a metadata-driven architecture. None of these three parameters is hard-coded; everything stays generic. We reference dataset().db_name,
            • 119:30 - 120:00 schema_name and table_name without supplying values - no hard-coding; we will set them later, meaning from the pipeline. So this dataset is done: SQL DB dataset number one.
            • 120:00 - 120:30 In terms of datasets, Azure SQL DB is done, and we have not passed any of these parameters: db_name comes through from the linked service (still without a value), and the two new parameters we created also have no values - they will come from the pipeline. Now for the next dataset: we also want to
            • 120:30 - 121:00 read our configuration file. Where is the config file that implements the metadata-driven architecture kept? It is in ADLS Gen2, and it is a flat file - a delimited, comma-separated
            • 121:00 - 121:30 file. So we attach the ADLS Gen2 linked service and create a new delimited-text dataset so we can read this configuration file. I will show it: a generic ADLS flat-file dataset, meaning we can read any flat file with it, just as the generic SQL dataset can read any SQL table, because we have not hard-coded the table name or schema name. Likewise, any flat file we need to read from ADLS
            • 121:30 - 122:00 Gen2 will go through this generic ADLS flat-file dataset. And which linked service have we associated? The Azure Data Lake Storage Gen2 one we created. Now, in this dataset we have to say which file to read:
            • 122:00 - 122:30 what we have to give is the file path, the file name, and in addition the container name - to read the file we need all of these. The linked service knows the credentials for ADLS Gen2, but now we give more detail: the file format is delimited text, CSV, and we
            • 122:30 - 123:00 specify in the dataset which container to read from, the file path, and the file name. All of these we define as dataset parameters - container, file_path and file_name, all strings - and then compose the full path from them. From what
            • 123:00 - 123:30 file are we reading? Let me check quickly: the container is configs (not landing, sorry), inside it we created an EMR folder - that is the path - and then the file name, load_config.csv. So configs is
            • 123:30 - 124:00 the container name, EMR is the file path (the folder), and load_config.csv is the file name.
            • 124:00 - 124:30 So we have defined all three parameters - container, file_path, file_name - and the path is composed as dataset().container / dataset().file_path / dataset().file_name; that is the full location it reads from. These values, again, I will pass from the pipeline itself - I am not hard-coding them here, because I named it generic, meaning I should be able to use it for any flat file
            • 124:30 - 125:00 later. Then the remaining details about the file: no compression, comma-separated, and "first row as header" is ticked. So we have told it that it is a comma-separated file with a header row, plus how to obtain the container, folder and file names. I hope this is clear.
            • 125:00 - 127:00 So this dataset can read any flat file, and the SQL one can read any table — both generic. What else do we need? We take the data from Azure SQL DB through Data Factory and load it into ADLS Gen2 in Parquet format, so we also need a Parquet dataset on the ADLS Gen2 linked service. I created "generic ADLS parquet", connected to the same storage-account linked service, with the same three parameters — container, file path and file name — exactly like the delimited one, except the file type selected is Parquet. If you were creating it yourself, you would pick ADLS Gen2, then select Parquet as the format and continue. The only differences are the Parquet format and the Snappy compression type, and again container, file path and file name come from the pipeline because it is generic; I am not hard-coding those values. So we now have a generic dataset for Azure SQL DB, a generic dataset for delimited (flat) files, and a generic dataset for Parquet files.
            • 127:00 - 129:00 The linked services are associated as well; those are the three datasets we created (the two dummy ones I will delete later). Now, for writing to the audit table — which lives in Databricks Delta Lake — I also have to create a dataset, so I created an Azure Databricks Delta Lake dataset and associated that linked service. Its parameters are the two things I need when reading from or writing to it: schema name and table name, both of string type, again with no values passed. In Databricks, schema and database are synonymous, so the schema name is effectively the database name; both it and the table name will come from the pipeline at run time. With that we have the four datasets we want: the linked services are done, and the datasets are created on top of them.
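Since the audit table itself is only described verbally here, the sketch below shows roughly what such a table could look like as a Delta table in Databricks. The schema and column names (audit.load_logs, data_source, tablename, numberofrowscopied, watermarkcolumnname, loaddate) are assumptions pieced together from the narration, and the identity column reflects the auto-increment key mentioned later as the reason the ForEach loop runs sequentially.

```python
# Hedged sketch of the audit table described in the walkthrough (names are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE SCHEMA IF NOT EXISTS audit")

spark.sql("""
    CREATE TABLE IF NOT EXISTS audit.load_logs (
        id                  BIGINT GENERATED ALWAYS AS IDENTITY,  -- auto-increment key (revisited in part two)
        data_source         STRING,     -- e.g. 'hos-a' or 'hos-b'
        tablename           STRING,     -- e.g. 'dbo.encounters'
        numberofrowscopied  INT,
        watermarkcolumnname STRING,     -- only meaningful for incremental loads
        loaddate            TIMESTAMP   -- UTC time when the copy finished
    ) USING DELTA
""")
```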
            • 129:00 - 131:30 Now comes the pipeline itself — how to create it: the plus sign, then Pipeline. A pipeline is made up of activities, and the first activity we need is a Lookup. Search for "lookup" in the activities pane and drag it onto the canvas. What it does is look up (read) a file: in our case the Lookup activity will read the config file, which sits in the configs container, inside the EMR folder — that is the container name, the file path and the file name we want. If I go to Settings, I select the dataset here: when creating datasets we attached the linked service, and now, when creating the pipeline, we attach the datasets. Since I have to connect to my config file, that is the flat-file dataset — generic ADLS flat file. Then the three things I have to supply: the container name is configs, the file path is EMR (the directory), and the file name is load_config. We had been avoiding these details until now, but at the pipeline level we finally provide them. With that it forms the complete file path ("file path in dataset") and reads the file.
            • 131:30 - 133:00 I hope this makes sense — you can check it yourself, and if you want you can preview the data, though that will spin up a small cluster to show it. When you do preview it, you can see the same config file we built: which database, data source, table name, load type (incremental or full), watermark column — the watermark column applies only when the load type is incremental — an is_active flag (I have not implemented it yet, so even if it shows zero the pipeline will still run for that entry), and the target path, hospital A or B. So that file gets loaded.
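To make the shape of that file concrete, here is an illustrative reconstruction of a few load_config.csv rows based on the columns just listed; the exact header names and values in the real file may differ.

```python
# Illustrative config rows (column names and values are assumptions, not the real file).
import csv
import io

sample = """database,datasource,tablename,loadtype,watermark,is_active,targetpath
trendytech-hospital-a,hos-a,dbo.encounters,Incremental,ModifiedDate,1,hosa
trendytech-hospital-a,hos-a,dbo.providers,Full,,1,hosa
trendytech-hospital-b,hos-b,dbo.transactions,Incremental,ModifiedDate,1,hosb
"""

for row in csv.DictReader(io.StringIO(sample)):
    print(row["tablename"], row["loadtype"], row["targetpath"])
```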
            • 133:00 - 136:30 Now, for each entry in this file — and we have 10 entries — we use a ForEach activity (just search for "for each"). The ForEach will run in a loop 10 times, and for each entry it performs the steps inside it. Let me click in to see what happens for one entry: say the first line it reads is the encounters row for hospital A. The first step is a Get Metadata activity (you can search for "get metadata" as well). What Get Metadata does here is check whether a file with that name — encounters, since the first row is encounters — already exists in the bronze folder. It checks for Parquet files, because if the pipeline has run earlier it would have created Parquet files; you will see why that matters in a moment. So the dataset selected here is generic ADLS parquet DS, since we are now looking for Parquet files. The container is bronze; for example, if I open the bronze container and go inside the hospital A folder, is there a file named encounters.parquet? (If you try to read a Parquet file directly it shows up in a raw-looking format — there is a lot of data in it.) So we want to see whether a Parquet file with that name is present inside the bronze container, under the hospital A or hospital B folder.
            • 136:30 - 139:30 If you look at this activity's parameters: the container is hard-coded as bronze, while the file path and file name are dynamic. When the Lookup ran, it gave us the config rows, and inside the ForEach each row is available as item(). For the first line, item().targetpath is hosa (hospital A), so the file path resolves dynamically to hosa. For the file name we use item().tablename — which is dbo.encounters — and split it on the dot: index 0 is the dbo schema prefix, index 1 is encounters, so by taking index 1 we drop the dbo part and the file name becomes encounters. Put together, the activity is checking whether a file named bronze/hosa/encounters.parquet exists — and since we selected the Parquet dataset, it automatically checks for a Parquet file.
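A plain-Python mirror of those dynamic expressions may help; in the real pipeline this is written with ADF expressions such as @item().targetpath and split(), and the field names below are the assumed config column names.

```python
# Mirrors the Get Metadata path logic for one config row (illustrative only).
item = {"tablename": "dbo.encounters", "targetpath": "hosa"}  # one row from the Lookup output

container = "bronze"                                       # hard-coded dataset parameter
file_path = item["targetpath"]                             # 'hosa' for hospital A
file_name = item["tablename"].split(".")[1] + ".parquet"   # drop the 'dbo.' schema prefix

print(f"{container}/{file_path}/{file_name}")              # -> bronze/hosa/encounters.parquet
```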
            • 139:30 - 144:00 And that is actually the case here: we have run the pipeline before, so the bronze folder already has all of these files. When we run it the next time, it detects that the file is already there and, before loading it again, moves it to an archive location — that is the plan. The "exists" output of Get Metadata feeds an If Condition activity: if it is true (the file exists), we move the file to archive; the copy activity inside the true branch does that. The source is the existing Parquet file (container bronze, file path from item().targetpath, file name encounters), and the sink uses the same generic Parquet dataset: the container is still bronze, but the file path is built as item().targetpath, then /archive/, then a hierarchy of year, month and day taken from the current timestamp — utcnow() gives the current timestamp, from which we extract the year, the month and the day. The file name is again item().tablename split on the dot, which gives encounters. So the archived file lands under something like bronze/hosa/archive/<year>/<month>/<day>/encounters.parquet — a path very similar to the original, just with the archive/year/month/day hierarchy added; if I browse the bronze container, under hosa you can see the archive folder.
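The same archive path can be sketched in plain Python; in ADF it is built with utcnow() and date-formatting expressions, and the folder layout below follows what is described in the narration.

```python
# Builds the archive destination for an existing bronze Parquet file (illustrative).
from datetime import datetime, timezone

now = datetime.now(timezone.utc)          # utcnow() in the ADF expression
target_path = "hosa"                      # item().targetpath
file_name = "dbo.encounters".split(".")[1] + ".parquet"

archive_path = f"{target_path}/archive/{now:%Y}/{now:%m}/{now:%d}"
print(f"bronze/{archive_path}/{file_name}")   # e.g. bronze/hosa/archive/2024/11/23/encounters.parquet
```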
            • 144:00 - 146:30 I hope that is clear. So if the Parquet file was already present, we move it to archive so that the new run does not collide with it; if it was not present, there is nothing to do — the archiving only runs when "exists" is true, and I have shown you the expressions for the archive path. Once that is handled we carry on with the rest of the activities. To recap the flow so far: first read the config file; then iterate over each entry one by one — in our case there are 10 entries — using the ForEach; and for each entry the pipeline checks whether, say, encounters.parquet is already present in the bronze folder. If it is, move it to archive and then follow the remaining steps; if it is not, skip the archiving and follow the remaining steps.
            • 146:30 - 148:30 What are the remaining steps? Now we want to take the data from Azure SQL DB and put it into the bronze container, in Parquet format. That means the source side will use the dataset we created for Azure SQL DB (the generic SQL dataset), and the sink side will use the generic Parquet dataset — I'll show you that. The next step is to check whether this entry is a full load or an incremental load. How do we know? The config row gives us item().loadtype, and we have a separate strategy for each case. So there is an If Condition activity whose expression is equals(item().loadtype, 'Full'): if that is true we take the full-load branch, and if it is false we take the incremental branch below it.
            • 148:30 - 153:00 If the load type is full, we simply take all the data from the Azure SQL DB table — essentially a select * from the table — and write it to our Parquet file; there is no watermark-column logic involved, so it is a very simple case. For the source I use the generic SQL dataset. The connection details were given in the linked service, but we still have to supply three things we never passed before: the database name, the schema name and the table name. The database name comes from the config row — item().database — which for the first entry is the hospital A database. The schema name is obtained by taking item().tablename and splitting it on the dot: index 0 gives the schema (dbo), and index 1 gives the table (encounters). So item().database, item().tablename split index 0, and split index 1 give us everything — nothing is hard-coded; it all flows from the configuration file, which is exactly the metadata-driven architecture. For the query, we do not just select all the columns of the table: we add one more column, item().datasource, which will be hos-a or hos-b, so that the Parquet file carries a reference to which source the data came from. So the query is a select * from item().tablename plus that extra datasource column — if the table has 10 columns, we bring all 10 plus one more telling us whether it came from hospital A or hospital B.
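The full-load source query can be sketched like this; the config values are illustrative and the exact column alias used in the video may differ.

```python
# Builds the full-load source query: every column plus a lineage column (sketch).
item = {"datasource": "hos-a", "tablename": "dbo.encounters"}   # illustrative config row

query = f"SELECT *, '{item['datasource']}' AS datasource FROM {item['tablename']}"
print(query)   # SELECT *, 'hos-a' AS datasource FROM dbo.encounters
```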
            • 153:00 - 156:00 Once the data is copied, we want to add an entry to the audit table we are maintaining, which is stored in Delta Lake. In the settings of that activity you can see the Azure Databricks Delta Lake dataset; the schema and table parameters there can be anything, because we are providing a query — there are two ways to do it: either select the schema name and table name, or give a query, in which case those parameters do not come into the picture. The query inserts into audit.load_logs (audit is the schema, load_logs the table), storing the data source, the table name, the number of rows copied, the watermark column and the load date. The values come from the same config row and from the copy activity: item().datasource and item().tablename from the config; the full-load copy activity's output tells us how many rows were copied, so activity('<copy activity name>').output.rowsCopied gives the row count; item().watermark gives the watermark column (which does not really matter for a full load); and utcnow() records the time this run finished for this table. So after each copy finishes, an entry goes in. (Let me just check that my Databricks cluster is turned off — yes, that's fine.) So that is the full-load flow: read the config file, and for each entry — say the first one — if the Parquet file is already present, archive it; then check whether it is a full or incremental load; if full, take all the columns plus the extra datasource column, write a Parquet file, and finally put an entry into the audit table, which is a Delta table.
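A hedged sketch of the audit entry written after a successful full load; in the pipeline the literal values below come from @item() and from the Copy activity output (rowsCopied), and the table and column names are the same assumptions as before.

```python
# Inserts one audit row after a full load finishes (values are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    INSERT INTO audit.load_logs
        (data_source, tablename, numberofrowscopied, watermarkcolumnname, loaddate)
    VALUES
        ('hos-a', 'dbo.encounters', 10000, 'ModifiedDate', current_timestamp())
""")
```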
            • 156:00 - 157:00 Now the other case: when it is not a full load — when equals(item().loadtype, 'Full') is false — we go into the false branch, which is our incremental pipeline, and this is where the watermark column becomes handy. Whenever we handle an incremental load, we first have to check the audit Delta table for the last time the pipeline ran for this table, and then take only the data after that time. And whenever we want to read something from a file or a table we use a Lookup activity, so there is a Lookup here whose settings point at the Delta Lake dataset.
            • 157:00 - 160:00 The schema and table values there are again just dummies — not required, because I am giving a query, which overrides them. So for the incremental path the first step is to check the audit table: what do we want from it? The load time — till what time the data has already been loaded — so the query selects the max load date from the audit table. If we are running the pipeline for the very first time there will be no entry at all, which is why we wrap it in a coalesce: if the max is null, give me a very old back-date instead; otherwise we get the actual max load time as last_fetched_date, meaning we have already fetched the data up to that point. That max load date is a timestamp — a date along with a time — but we cast it to a date, so if it is, say, yesterday's date with some time attached, it truncates to just the date and drops the time part. The where clause filters on data_source = item().datasource and tablename = item().tablename, which for the first row means hos-a and dbo.encounters, so we get the right row out of the audit table.
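The fetch-logs query could look roughly like this; the 1990 fallback date, the last_fetched_date alias and the filter values are assumptions based on the narration.

```python
# Sketch of the lookup query that finds the last successful load date for one table.
fetch_logs_query = """
    SELECT COALESCE(CAST(MAX(loaddate) AS DATE), '1990-01-01') AS last_fetched_date
    FROM audit.load_logs
    WHERE data_source = 'hos-a'            -- item().datasource in the pipeline
      AND tablename   = 'dbo.encounters'   -- item().tablename
"""
print(fetch_logs_query)
```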
            • 160:00 - 163:00 Remember that the same table can come from two data sources, hospital A or hospital B, so I have to check which hospital I am referring to, look at that particular row in the audit table, and read its last load time. Say today is 23rd November and the last load time comes back as 21st November 2024 at around 8:54 pm (don't worry about the exact format). After casting to a date we keep just the 21st, and we understand the data has been handled up to that point. When we load next time, we take that date and beyond — greater than or equal to that date — so any overlap on that date is covered. That does mean records already processed up to 8:54 pm will be reprocessed, and that is fine: I want the safer approach, because EMR source systems are generally not that reliable and can have these kinds of issues. My pipeline is idempotent — re-running or reloading will not create duplicate records, because the data gets merged — so the results are the same no matter how many times I run it, and considering that date again plus everything beyond it is perfectly safe.
            • 163:00 - 165:00 So from the lookup I know when I last read this table, and with that date I can implement my Copy Data activity. Which datasets does it use? The source is the generic SQL dataset, because I have to read from the SQL DB — and the database name, schema name and table name are derived exactly the same way I showed for the full load, no change. The sink is the Parquet dataset, because we load into the bronze container: inside it the file path is the target path, which is the hospital A or hospital B folder, and the file name, as before, is the table name split on the dot, giving encounters. Now let's look at the query used to load the data.
            • 165:00 - 167:00 To load the data I am selecting select *, plus the one additional datasource column for reference, from item().tablename, where the watermark column is greater than or equal to that particular date. How do I get the date? From the fetch-logs Lookup activity that ran just before: I take its output's first row (there is only one row) and the last_fetched_date column from it — the expression is of the form activity('<fetch logs activity>').output.firstRow.last_fetched_date. So we are saying: give me only the records on or after that date. Because of the greater-than-or-equal, some records you already processed may come again, but since you will later implement the slowly changing dimension it will merge properly — you do not have to worry. That is our incremental load, and this is exactly where the watermark column helps us.
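Putting the pieces together, the incremental source query is the full-load query plus a watermark filter; the names and the date below are illustrative stand-ins for @item() and the fetch-logs output.

```python
# Builds the incremental source query using the watermark column (sketch).
item = {"datasource": "hos-a", "tablename": "dbo.encounters", "watermark": "ModifiedDate"}
last_fetched_date = "2024-11-21"   # activity('<fetch logs>').output.firstRow.last_fetched_date

query = (
    f"SELECT *, '{item['datasource']}' AS datasource "
    f"FROM {item['tablename']} "
    f"WHERE {item['watermark']} >= '{last_fetched_date}'"
)
print(query)
```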
            • 167:00 - 168:30 Once all of this is done, we of course have to go back to our Delta audit table and record the current timestamp, to say that everything has been taken care of up to this point. To write to that table we again use a Lookup activity with the Delta Lake dataset (the schema and table there are again not required, since we are giving a query). The query inserts one entry into audit.load_logs with the same columns as before: item().datasource and item().tablename from the config details, activity('<incremental load copy>').output.rowsCopied for the number of rows copied, the watermark column, and utcnow() as the load date — meaning all the data has been loaded and taken care of up to the current timestamp. That is how the entry gets added to the audit table.
            • 168:30 - 170:30 So let me step back and walk the whole pipeline flow once more. Whenever you have to read a file or a table, you use a Lookup activity; whenever you need a true/false decision, an If Condition; and whenever you need a loop, a ForEach. So: a Lookup reads the config file stored in ADLS Gen2; a ForEach iterates over each entry; for each entry a Get Metadata checks what is there — if the Parquet file is already present we archive it — and then, using the config details, we check whether it is a full or an incremental load. If it is a full load we take all the data, dump it into a Parquet file and update the audit table; if it is incremental we first fetch the time up to which we loaded last time, take all the data on and after that date, and then again update the audit table. That is how the pipeline is designed. Now, to run it — the audit table is clean right now, so I can definitely run it — I go to the pipeline, choose Add trigger, then Trigger now, and run it.
            • 170:30 - 172:30 The run starts, and I can go to the Monitor tab to watch it. One challenge with the current pipeline: there are 10 entries, which means 10 tables get loaded one by one. There were two options — sequential or parallel (parallel needs more resources, but that is a separate matter) — and right now our pipeline is sequential. Why sequential? Because our audit table has an ID column that is auto-increment, and we cannot append to it from parallel iterations; it only works sequentially. In part two we will solve this and make it parallel — it really should not be sequential, since one table does not depend on another — so treat this as a limitation of the current version. And how is it made sequential? If you go to the ForEach settings you can see the "Sequential" box is checked; if you uncheck it the loop runs in parallel, but it will fail, because each thread tries to write against that auto-increment column. We will fix that in the next part; for now the pipeline is sequential.
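As a preview of the fix planned for part two — dropping the auto-increment key so that parallel ForEach iterations can all append to the audit table — the identity-free version of the earlier sketch would simply omit that column (again, the names are assumptions):

```python
# Part-two preview: no identity column, so concurrent appends to the audit table are possible.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE audit.load_logs (
        data_source         STRING,
        tablename           STRING,
        numberofrowscopied  INT,
        watermarkcolumnname STRING,
        loaddate            TIMESTAMP
    ) USING DELTA
""")
```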
            • 172:30 - 175:00 If I go to the Monitor tab the pipeline will be running — although it might also fail, because our Databricks cluster may not be on. Let me quickly turn it on: under Compute you can see it is already trying to start automatically, because the pipeline is trying to reach it. Once the cluster is up it will be able to read and write the Delta audit table; that is why everything is waiting right now — the fetch-logs step cannot run until the cluster is on and the table is visible. So let's wait for it. By the way, if you have been watching this video and enjoying it, please do subscribe to the channel. Now, since this is the first run, the audit table is empty: any full load will simply proceed normally, and the incremental loads will take 1990 as the last run date, because of the coalesce function we used. Once the cluster is created things will start moving — you can see the activities are in progress — and then I will go to the workspace, open the audit table and run a select star.
            • 175:00 - 178:30 Earlier we saw nothing in the audit table; right now there may still be nothing, but the pipeline is working towards it — as soon as any table finishes ingesting, its entry will appear here. You can see the fetch-logs step has worked, because the cluster is created and it can read the Delta table. An incremental load is running now, because the very first table, encounters, follows the incremental approach; the incremental-load copy activity is going, it has around 10,000 records and we have very minimal resources, so it takes time — and since the pipeline is sequential the whole run will take a while (we will make it parallel later). It looks like one attempt hit a query timeout and is retrying. Let me refresh: one incremental load has finished, so there should be something now. And yes — you can see the flow: the If Condition saw it was an incremental load, it fetched the logs to find when it last ran, did the incremental load based on that time, and once the load finished it inserted one entry into the audit table. There it is: hospital A, dbo.patients, 5,000 rows copied, the modified-date watermark and the load date. As for the first row, encounters — we have not set any retries, and the cluster took a long time to start, so it would have failed because of that; we can check why it failed, but everything else is working and will go through one by one. If anything else fails we will look at whether it can connect to the database, but so far things only failed while the cluster was not started; if anything fails now, we will have a look.
            • 178:30 - 182:00 Ideally the cluster should already have been running — it was off when we started the pipeline — but let it run. We have covered a lot, and I am sure you have learned a lot from it. Now, here is what I want to show in part two, which will be coming soon and will be announced once it is out. We will implement the silver layer: take the data from bronze, clean it based on certain data-quality checks, implement a common data model, implement slowly changing dimension Type 2, and store the result in Delta tables — no longer just plain Parquet files. Then we will implement the gold layer, where we create facts and dimensions. We will also implement Key Vault to store our credentials as part of best practices, improve our naming conventions to match industry standards, and make our ADF pipeline parallel. Apart from that, we will see how to get data from APIs — remember we said we fetch the doctor details (NPI) and the ICD codes — and we will work with the claims data, which we have not touched yet: the flat files that the insurance provider puts in. We also have the is_active flag (zero or one) that we have not implemented yet, so that has to be done. And we can implement Unity Catalog: right now our catalog is a Hive metastore, which is local to a single Databricks workspace — not recommended, because other workspaces cannot see what tables you have there. That is why it is always better to have Unity Catalog as a centralized metadata repository, so it acts as a central store that other workspaces can also interact with.
            • 182:00 - 184:30 Let's see what's happening — I hope more tables are done by now. OK, four tables are done: two of them were full loads (provider and department) and two were incremental (patient and transaction), which is why those have the watermark column, modified date, and you can see the load time for each. There are 10 tables in total, so it will take another five to seven minutes, but you get the gist. These are the things I will cover in part two; if you feel I should include more, do mention it in the comments, because I want this to be a very good project. By now you can judge the quality of this — you generally will not get this kind of project even in paid content. It is not a dummy scenario: the dataset is mimicked, but the design is a real-world project, and if you are from healthcare you will relate to it very closely. Coming back to the Hive metastore I mentioned: if I go to the Catalog you can see the audit schema — which is like a database — and the load_logs table with its columns. This is a Hive metastore, which you cannot see from any other Databricks workspace; the better approach would have been to create this audit table in Unity Catalog so that it is discoverable outside as well. That is the good practice we should have followed, and we will certainly implement it later, along with anything else where we missed best practices. Let me check how much is done now.
            • 184:30 - 187:30 Six tables are done, four more are yet to be done, and they will finish — but you understand what is happening. The earlier data has also been moved to archive; the timestamp shows the 22nd of November because of the time zone, which is fine. If I look at the container, the data has been taken from the Azure SQL database into the hospital A and hospital B folders — you can see these files arriving on the 23rd; it is actually 4:42 a.m. now, almost morning, because I wanted to get this released. Department, patient, provider and transactions have just come in; encounters was a little later because it failed while the cluster was not up, and the same set will be there for hospital B as well. Almost done — eight of the ten tables are finished. So with this we have implemented a very nice solution to bring the EMR data into our bronze layer. Before I end, please make sure you clean up your resources: go to your compute and stop the cluster. The ADLS account is fine — it costs very little — and a Data Factory that has been created but is not running is also fine; it is the compute resources you have to turn off, so make sure you do that and everything will be fine. With this we are done. I hope you truly enjoyed the session — do mention in the comments how you liked it and subscribe to the channel. I will come up with part two very soon, covering all of these things plus whatever extras you mention in the comments, because I want this to be the top project on YouTube for Azure data engineering. Hope you liked it — thanks a lot.
            • 187:30 - 189:00 So let's get started with part two of the Azure data engineering project. By the way, I'm pretty sure you are aware that my name is Sumit Mittal and I offer an ultimate data engineering program — you can check the link in the description; it is a program that has changed several lives. Without any further delay, let's get started with part two. I have a very clear assumption here that you are watching this only after seeing part one; if you have not seen part one, please have a look at it first — I will provide the link in the description — because this is very much a continuation of it. OK, now let me show you the architecture diagram we were talking about; this is the one we referred to. Do you remember which datasets we talked about?
            • 189:00 - 191:30 We have EMR data — electronic medical records — coming from two different hospitals, that is, from two databases. The datasets, if I recall, are: patients; providers (providers are nothing but the doctors — a patient is someone visiting the hospital, the provider is the doctor); department; transactions; and encounters. We talked about all of these, so if you cannot relate, please check part one. So we have five datasets coming from the hospital A SQL DB and five from hospital B. Then we have the claims data, which the insurance company uploads into the landing folder. And where was the EMR data? It was in Azure SQL DB, and we have already created a Data Factory pipeline to bring it from Azure SQL DB to our bronze layer in Parquet format — so the EMR data from both databases is already sitting in bronze.
            • 191:30 - 195:30 We created an ADF pipeline for that, and you will remember it was very generic — a metadata-driven pipeline that takes its parameters from a config file — so those five datasets per hospital are already sitting in the bronze layer. In their case the landing layer is skipped, because the landing layer (or landing zone) is for when some third party dumps data there. That is exactly what happens with the payer files: the payer is essentially the insurance provider, and the insurance provider dumps the claims data into the landing folder as a CSV flat file. Our conventions are consistent: whatever sits in the landing zone is flat files (CSV in our case — yours may differ), whatever sits in bronze is Parquet, and silver and gold hold Delta tables. So when is the landing zone skipped? When we pull directly from a database, and also when we call a public API ourselves — in that case we can bring the data straight into bronze. So: the EMR datasets live in Azure SQL DB; the claims data is a flat file in the landing zone; and the NPI data for the doctors and the ICD codes come from public APIs that we call and land directly in bronze — no role for the landing zone there. There is one more type of code, the CPT code, which I will talk about shortly. We could have called an API for it too, but let's assume it is provided by a third-party vendor as a flat CSV file and, again, placed in the landing zone. The architecture diagram does not show it, but assume the CPT codes are dumped into landing as a CSV and then flow forward just like the claims data. So that is the picture of the datasets. What will we be doing? Our first intention is to set up the bronze layer — to make sure bronze is complete.
            • 195:30 - 197:30 Right now the bronze layer holds the data we loaded in part one with that generic pipeline. What is left to bring in: the claims data, which an insurance provider has put in landing, goes from landing to bronze; the NPI and ICD data we fetch by calling the APIs and put directly into bronze; and the CPT data again goes from landing to bronze, following the same flow as claims. Note that everything we keep in bronze is in Parquet format, which is quite optimized: Parquet is a column-based file format, columnar formats give really good compression, and it is the best format to use along with Apache Spark — just so you know.
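As a minimal sketch of that landing-to-bronze hop (paths, container names and options here are assumptions), reading a claims CSV from the landing container and writing it to bronze as Parquet with PySpark could look like this:

```python
# Landing (CSV) -> bronze (Parquet) sketch; replace <storage-account> with your account.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

claims_df = (
    spark.read
         .option("header", "true")
         .csv("abfss://landing@<storage-account>.dfs.core.windows.net/claims/")
)

(
    claims_df.write
             .mode("overwrite")
             .parquet("abfss://bronze@<storage-account>.dfs.core.windows.net/claims/")
)
```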
            • 197:30 - 199:30 Now, in case you want to know the difference between CPT codes and ICD codes, I have a link here you can go through. At a high level, an ICD code — ICD stands for International Classification of Diseases — refers to the specific condition that is being treated; say someone has attention deficit hyperactivity disorder (ADHD), that condition gets an ICD code. CPT codes, on the other hand, describe the procedure: what the healthcare provider did during the interaction. You can read more about it — I do not want to go too deep into the domain — but do check the link for the difference between ICD and CPT codes and how frequently they are updated. So we are good. Now we will set up our bronze layer: it will be complete once we bring in these remaining datasets, on top of the EMR data we already loaded from both databases. Once bronze is set it can act as our source of truth — whenever we feel there is a problem in the silver or gold layer, we can always go back to bronze and treat it as the source of truth.
            • 199:30 - 201:30 Again, as I said, this is a medallion architecture — bronze, silver, gold — and each layer serves a different persona: bronze is mainly for data engineers, silver holds clean data for data scientists and machine-learning people, and gold is for business people who run reports. Once the bronze layer is set, our next task in this project is to set up the silver layer — to get the data from bronze to silver (I am giving you the high-level idea of what is left). While doing that we will clean the data: if key fields are null, we mark those records as bad data. We will implement a common data model: say the two hospitals each have a patient table but some column names or the schema differ — then we have to bring them under one common schema, and I will show that practically, don't worry. We will implement SCD Type 2, so that when a record changes we do not lose the history; we maintain it using a slowly changing dimension, which is standard industry practice, and I will show how to implement it. And finally, in the silver layer we want to keep the data in Delta tables.
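For orientation, one common way to express SCD Type 2 on a Delta table is a MERGE that closes the current version of a changed row and inserts brand-new keys; the table, view and column names below are hypothetical, changed rows would still need their new version inserted in a second pass, and the actual part-two implementation may differ.

```python
# Simplified SCD Type 2 sketch on a silver Delta table (names are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'bronze_updates' is assumed to be a staged view of the latest bronze snapshot.
spark.sql("""
    MERGE INTO silver.patients AS tgt
    USING bronze_updates AS src
    ON tgt.patient_key = src.patient_key AND tgt.is_current = true
    WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
      UPDATE SET tgt.is_current = false, tgt.end_date = current_timestamp()
    WHEN NOT MATCHED THEN
      INSERT (patient_key, row_hash, is_current, start_date, end_date)
      VALUES (src.patient_key, src.row_hash, true, current_timestamp(), NULL)
""")
```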
            • 201:30 - 202:00 underneath Delta tables is par right with some uh blcks on top of it so that we can perform transactions updates and all right so this is even a better version of Park you can think of so internally the format is Delta when we talk about Delta table and Delta is nothing but a refined version of par right par plus something on top of that is Delta right and what is that something something is that extra logs
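To make that concrete, here is a minimal sketch of the difference; the DataFrame and the /mnt/... paths are illustrative, not the project's actual tables:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook
df = spark.createDataFrame([(1, "abc"), (2, None)], ["claim_id", "payer_id"])

# Plain Parquet: columnar data files only, no transaction log
df.write.format("parquet").mode("overwrite").save("/mnt/bronze/claims_demo")

# Delta: the same Parquet data files plus a _delta_log directory that records every commit,
# which is what enables updates, MERGE, and ACID guarantees
df.write.format("delta").mode("overwrite").save("/mnt/silver/claims_demo")

# Because of that log, a Delta table can be updated in place
spark.sql("UPDATE delta.`/mnt/silver/claims_demo` SET payer_id = 'UNKNOWN' WHERE payer_id IS NULL")
```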
            • 202:00 - 202:30 Okay. Now, once our silver layer is set, we bring the data from silver to gold — we implement the gold layer, where we will create facts and
            • 202:30 - 203:00 dimensions. If that sounds confusing, I have already shown how it will look: a fact transaction table plus some dimension tables, which can be joined to build various reports, answer various queries, and calculate various KPIs —
            • 203:00 - 203:30 for example, how many claims were not processed within three months. You saw many such KPIs in part one. So in the diagram: this part is done, this part is left (including the CPT codes), and then we take all of it forward to silver and gold. That is what we have to do. I will close this diagram for the time
            • 203:30 - 204:00 being. So after that: set up bronze, then silver, then gold, and then, as I mentioned, follow certain best practices — or rather, enhancements to what we did in part one. Some things in part one were not done the best way, so we have to
            • 204:00 - 204:30 correct them; I highlighted those points toward the end of part one. Right now we are keeping our credentials in the open — we should not do that; instead we implement Azure Key Vault so the credentials are secured and not visible to everyone. I will show how to implement Key Vault. We also have to improve the naming conventions. And we have to make
            • 204:30 - 205:00 our ADF pipeline parallel: remember that last time, when we were bringing those ten tables (five from one database, five from the other), everything ran sequentially. The reason was that the audit table had an auto-increment key, and with an auto-increment key people cannot insert in parallel — that is why we kept it sequential, and the pipeline was taking a long time to run. Now we realize the auto-increment key is not required
            • 205:00 - 205:30 at all, so we remove it, and with that the pipeline can run in parallel and complete much faster. We also never implemented the is_active flag, even though we were getting it in our configuration file: if is_active is 0, that entry should not run; if it is 1, the pipeline should run for it —
            • 205:30 - 206:00 that we will implement. And last time we were using the Hive metastore for storing our metadata, but that is not recommended — it is effectively deprecated and will likely go away. If you want the metastore shared across multiple workspaces so everyone can see it, it is better to use Unity Catalog, so
            • 206:00 - 206:30 implementing Unity Catalog is something we will do. And of course, as a best practice, we should add retries to our Azure Data Factory pipeline, so that if an activity fails due to a network issue or anything similar it tries again. That is it, I believe, at a high level. Now, before we continue and start
            • 206:30 - 207:00 implementing these things — in bronze all of this is left, then we take the data from bronze to silver and from silver to gold, implementing all of the above along the way — a few things to note. I remember that in part one some people mentioned there were data
            • 207:00 - 207:30 discrepancies, and that can happen. I have tried to correct it to some extent, but discrepancies can still occur, because all the data provided to you was generated using the Faker module. If I show you my Azure Databricks
            • 207:30 - 208:00 workspace — one second, I will launch it — I will give you the notebook where I generated the datasets; you can see the data generator notebook using the Faker module. The code for how to do it is there, and I can
            • 208:00 - 208:30 even pass the code on so you can try it out, but I do not want to go into it here. So all the data is generated with Faker, and there can be some discrepancies; I have tried to minimize them, but assume some will remain. That is the first thing, because some people were mentioning data issues. In any case, you have to understand the logic and the complete flow even if the data has discrepancies — that is
            • 208:30 - 209:00 fine. Because of the Faker-generated data, some join conditions may not produce matches, and that is totally fine too; I just wanted to highlight this. Okay, with that we are
            • 209:00 - 209:30 good, and we are now in a position to proceed with the implementation. Just to let you know, we have also improved our naming conventions and created a proper folder structure to organize the code well.
            • 209:30 - 210:00 Let me show you that. If I go back to my Databricks, I created a workspace (the same one or a new one) named TT-HC-ADB-WS, which stands for Trendy Tech Healthcare Azure Databricks Workspace,
            • 210:00 - 210:30 so the name itself indicates what the resource is — the workspace name is given accordingly. Inside it, the code is synced with GitHub: we created a new feature branch and checked the code in there (I will talk at a high level about how to do that). You can see the code is now organized well: there is a Trendy Tech Azure project with a setup folder —
            • 210:30 - 211:00 the setup code — an API extracts folder where we call the APIs, a silver folder holding the bronze-to-silver code, and a gold folder holding the silver-to-gold code. So the code is organized better and the naming convention is definitely better than before. Now let me show you the setup part. In setup we
            • 211:00 - 211:30 have the audit DDL — remember the audit table we were creating? That has been moved to the setup folder — and the ADLS mount notebook; mounting and the audit table are the two pieces of setup. If I show you the audit DDL: CREATE SCHEMA IF NOT EXISTS tt_hc_adb_ws.audit, meaning this particular
            • 211:30 - 212:00 schema is created under our Unity Catalog. If I open the Catalog in a new tab, you can see tt_hc_adb_ws and, under it, audit. So if you look at the fully qualified name,
            • 212:00 - 212:30 the first part is the catalog name, the second is the database name, and the third is the table name — catalog, database, table — and this catalog lives inside Unity Catalog, not in our Hive metastore. That is another change I have made; you can check it out. CREATE SCHEMA IF NOT EXISTS means: if this particular schema is not
            • 212:30 - 213:00 there, create it. The hierarchy in Unity Catalog is: catalog name, then schema name inside the catalog (schema and database mean the same thing here), then table name. So what we did is: in Unity Catalog you can click to add a
            • 213:00 - 213:30 catalog — go to Catalog, click the plus, Add a catalog — and that is how you create a new one. Once the catalog is created, CREATE SCHEMA IF NOT EXISTS inside that catalog creates the audit schema — call it
            • 213:30 - 214:00 a schema or a database, whichever you prefer — if it does not already exist. The audit schema is there now; if I run this again it will not throw an error, because IF NOT EXISTS skips it when it already exists. Then CREATE TABLE IF NOT EXISTS: if the table is there, no error; if it is not there,
            • 214:00 - 214:30 it gets created. So inside this catalog, inside this schema, I am creating load_logs. You can see the column names: data source, table name, number of rows copied, watermark column name, and load date. The watermark column name helps with our incremental load, and so does the load date, because we know up to what
            • 214:30 - 215:00 point the data was loaded and take care of things from that point onward; then the number of rows copied, the table name, and the data source. Notice that we have removed the auto-increment field — the first column used to be an auto-increment key, and it was stopping us from running activities in parallel, so it is
            • 215:00 - 215:30 gone. There is also a TRUNCATE command and a SELECT in the notebook in case we ever need to reset or inspect the table; the cluster is not running right now, so I am not executing it. So that is the audit DDL — the only real change is that the auto-increment column has been removed.
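Sketched out in code, such an audit DDL could look like the following; the catalog, schema, and column names are written out from the description above, so treat them as illustrative rather than the exact project DDL:

```python
# Audit schema and table inside Unity Catalog: <catalog>.<schema>.<table>.
# Note: no auto-increment / identity column, so parallel pipeline runs can insert freely.
spark.sql("CREATE SCHEMA IF NOT EXISTS tt_hc_adb_ws.audit")

spark.sql("""
    CREATE TABLE IF NOT EXISTS tt_hc_adb_ws.audit.load_logs (
        data_source          STRING,
        tablename            STRING,
        numberofrowscopied   INT,
        watermarkcolumnname  STRING,
        loaddate             TIMESTAMP
    )
""")

# Reset the audit table if ever needed:
# spark.sql("TRUNCATE TABLE tt_hc_adb_ws.audit.load_logs")
```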
            • 215:30 - 216:00 Now, the ADLS mount. Previously we were mounting only the landing zone, but now we want mount points for landing, bronze, and silver as well, so instead of a single mount we define an array of containers and create multiple mount points: /mnt/landing, /mnt/bronze, /mnt/silver, and so on. I hope that is clear.
            • 216:00 - 216:30 If you look here, everything else is the same and we iterate over a for loop so the mount points are created one by one. One interesting thing to notice: we are not giving the storage credentials here. Earlier we hardcoded the storage account access key directly; now we do not hardcode it — we go through Key Vault. We created a Key Vault and
            • 216:30 - 217:00 we are using it here. Let me note it down: we first created a Key Vault (I will show how), and then we use it — from this secret scope we fetch the authentication token, i.e. the ADLS access key: whatever the Azure Data Lake Storage account
            • 217:00 - 217:30 access key is, it is stored in the Key Vault and retrieved from there. So this is the scope, and from that scope I want the value for this key, and that value is substituted in place of the account access key. This way I am not disclosing my credentials — they sit in a locker, so to speak, and
            • 217:30 - 218:00 no one can access them. This is the secure way; in industry you will work like this and never hardcode credentials. So dbutils.secrets.get means: from this scope, give me this particular credential. Many different credentials are stored there — one for the ADLS account, one for the SQL DB, one for something else — and here I am asking for the access key of our ADLS Gen2 storage account.
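Roughly, that mount loop looks like the sketch below; the scope name, secret name, storage account, and container list are placeholders for whatever the project actually uses, and dbutils is the object available inside a Databricks notebook:

```python
# Create one mount point per layer, pulling the storage account access key
# from the Key Vault-backed secret scope instead of hardcoding it.
storage_account = "tthcadlsdev"                      # placeholder storage account name
containers = ["landing", "bronze", "silver", "gold"]

access_key = dbutils.secrets.get(scope="tt-hc-kv", key="tt-adls-access-key-dev")

for container in containers:
    mount_point = f"/mnt/{container}"
    # Skip containers that are already mounted so reruns do not fail
    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
            mount_point=mount_point,
            extra_configs={
                f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
            },
        )
```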
            • 218:00 - 218:30 Now, I will show you how to configure this — only if the Key Vault and the secret scope are configured can we call it this way; otherwise we cannot. So let me show how to create and set up the Key Vault.
            • 218:30 - 219:00 I come to the Azure portal home. We have created a Key Vault named TT-Healthcare-KV — let me write that down — that is the Key Vault we created. Once you create the
            • 219:00 - 219:30 Key Vault you get Secrets — you can see them. After the Key Vault is created you add key/value pairs: the key is a name like these and the value is essentially the password. So we have added the password for the SQL DB and the access key for the ADLS account,
            • 219:30 - 220:00 and we have added the Azure Databricks access token as well — multiple things. (Ideally we use only one of the Databricks entries even though two are listed; that is fine, I will show it.) To add one: click Generate/Import, choose Manual, give a name that tells you later what
            • 220:00 - 220:30 kind of credential it is, enter the secret value, and, if you want, set an expiration date. For example, when I add the Azure SQL DB password you can see it being added — afterwards you cannot see the secret value, only the name, which was the SQL DB
            • 220:30 - 221:00 PWD one, with the secret value stored behind it. So think of the Key Vault as a locker where you can keep multiple secrets. In our case we have stored four: the password for the SQL database, the access key for the storage account, and the credentials to connect to
            • 221:00 - 221:30 Databricks. The SQL password is whatever password we set when creating the SQL database. For the storage account — remember — if I go to the storage account and open Access keys, this is the key I am
            • 221:30 - 222:00 talking about; that is what I stored in the vault, so think of it like a password. And for Databricks: go to Settings, then Developer, then Access tokens, and generate a token — you see it only once, so copy it right away — and that access token is what I put in the Key Vault. So back in the Key Vault
            • 222:00 - 222:30 there are four secrets (two of them are effectively duplicates) covering the SQL DB, the storage account, and Databricks. That is all set — that is how you create the
            • 222:30 - 223:00 secrets. Now, one more thing about accessing them. If I go to Workspace, setup, ADLS Mount, you see TT-HC-KV — but our Key Vault name was different, TT-Healthcare-KV. So what is this? This is the secret scope we created in Databricks.
            • 223:00 - 223:30 There is a way to create a scope in Databricks. First you create the Key Vault in the Azure portal — we created it with that name and entered certain secrets. Then, in Databricks, you take the workspace URL up to .net, append #secrets/createScope, and
            • 223:30 - 224:00 open that page to create a scope in Databricks. You give the scope a name — we gave it the one I was showing you, TT-HC-KV — choose Creator or All workspace users, and fill in the DNS name and resource ID. How do you get the
            • 224:00 - 224:30 DNS name and resource ID? Go to the properties of the Key Vault and you will see the Vault URI and the Resource ID there — give those two values — and that is how the
            • 224:30 - 225:00 secret scope is created in Databricks, and now we can use it. So if you look here, that scope name is used together with a secret we have already set in the Key Vault — if you go to the Key Vault you will find it added as one of the secrets: the TT ADLS access key Dev one. So I am
            • 225:00 - 225:30 saying: give me the secret value stored under that name, rather than hardcoding the account key. I hope this part is clear. One more thing: if you want to access this
            • 225:30 - 226:00 Key Vault from different services, you basically have to create an app registration for each of them. These are basic Azure things you may already know; if not, at least I am highlighting them to some extent. Under All applications you can
            • 226:00 - 226:30 see we have already created some apps: one app for Azure Data Factory (there is a small spelling mistake in its name, but fine), one for Azure Databricks, and
            • 226:30 - 227:00 one for storage. So for each service that wants to interact with the Key Vault you do an app registration — Azure Data Factory, Azure Databricks, and so on. To create the app: New registration, then create, and you can see exactly how it has
            • 227:00 - 227:30 to be done — if I click on one, you can see the app has been created. Once the app is created, go to the Key Vault: to grant access, you first create an app under App registrations for the
            • 227:30 - 228:00 service in question — Azure Data Factory or Azure Databricks, whatever it is — and then go to the Key Vault and open Access
            • 228:00 - 228:30 Policies. There you click Create and choose the permissions you want to grant — say Get and List, just as an example — then click Next and enter the app name, whatever it was; for instance I have something related to ADF, TT-something — you can see the app
            • 228:30 - 229:00 names I have. You grant the permission to that app, and the app is bound to its service, say Data Factory or Databricks. This way, from the portal, you give Azure Databricks (or Data Factory) access to interact with the Key Vault, with whatever permissions you chose, say Get and List. That is how it has to be done.
            • 229:00 - 229:30 If you find it confusing, let me know and I can write up a document, but these are basic things and I do not want to go too deep: you create an app registration (with contributor access and so on), then go to the Key Vault's Access Policies, select that app, and grant it some permissions — that is how you get
            • 229:30 - 230:00 access; otherwise you will get access-denied errors. That is the high-level picture, but it is a basic thing — do not get too hung up on it within this project, because you can learn it separately and I cannot cover it in much detail here. Okay, so the Key Vault is set. And that is where, if you go to our
            • 230:00 - 230:30 Data Factory — remember we had hardcoded the tokens and credentials there — we will now use the Key Vault instead. Linked services are where we hardcode connection details and
            • 230:30 - 231:00 provide the tokens, so let me show the linked services. Here is one where, you can see, we are using the access
            • 231:00 - 231:30 key directly. One more important thing: remember which linked services we created earlier? We created a linked service to connect to Azure SQL DB, and a linked service to connect to ADLS Gen2 —
            • 231:30 - 232:00 one using a password, the other an access key — and then one for Delta Lake, for the audit table. We also said we should create a linked service for the Key Vault, but in part one we had not created a Key Vault, so that linked service did not exist. In this session we will create a linked service for the Key Vault — a new one for this
            • 232:00 - 232:30 session. Also, our silver and gold code is written in Databricks notebooks, so Data Factory must be able to connect to Databricks to execute those notebooks, and for that a linked service is required too. And Data Factory must be able
            • 232:30 - 233:00 to connect to the Key Vault to fetch the access tokens, so a linked service is needed for that as well. So these are the two new linked services to create — let us create them first: one for the Key Vault, one for Databricks. For the Key Vault one, look at the name:
            • 233:00 - 233:30 TT-HC stands for Healthcare, KV for Key Vault, LS for linked service — a good naming convention — and here we enter the base URI of the Key Vault: if you go to the Key Vault you will see this base URI, and it is entered here. Then you can test the
            • 233:30 - 234:00 connection — it works — which means this linked service is created and can access things from the Key Vault. The SQL database one was done earlier, Delta Lake earlier; this other one is not in use (zero references), so ignore it — this is the one that is used — and the ADLS storage account one was
            • 234:00 - 234:30 done earlier too. Databricks is the one we still have to do, so we created TT-HC-ADB-LS — ADB stands for Azure Databricks — because my Data Factory needs to execute my notebooks when we complete the pipeline. By the way, what is this AutoResolve
            • 234:30 - 235:00 integration runtime? If you want more resources for your Azure Data Factory you can create your own integration runtime, but I am going with the default one for now. I enter the Databricks workspace URL — you get it from here, this is the Databricks URL. And to connect to
            • 235:00 - 235:30 Databricks, not everyone can connect — I have to provide a token. Which token? As I mentioned: go to Settings, Developer, Access tokens, and generate a new token; you see it only once, so copy it, otherwise you will not get it again. I could have pasted that same token directly into the Access token field, but I do not want to hardcode it where others can see it — that is why it sits in the Azure Key Vault
            • 235:30 - 236:00 as one of the secrets. And what is the secret name here? You can see it: the TT-HC-ADB-WS one — that is the secret under which I stored that Databricks access token. I hope this makes sense, and
            • 236:00 - 236:30 that is exactly what I am using here: I specify which linked service and which secret name, and since a secret can have multiple versions, I take the latest one. And whenever you connect to Databricks you have to say whether to spin up a new job cluster or use an existing interactive cluster — I am saying use the existing
            • 236:30 - 237:00 one. I hope this is clear. I will cancel and discard my changes here, because everything is already set up properly. So we have created the Key Vault and Databricks linked services. Now, where we were connecting to Delta Lake or ADLS Gen2, we would
            • 237:00 - 237:30 also have those credentials come from the Key Vault. You see this ADLS linked service — I think the Key Vault is not wired in here yet, but now we can choose Azure Key Vault, select the linked service, and select the secret name. I will not save it, because I do not want to break anything, but you can see the secret there: the ADLS access key Dev one. I will cancel it for
            • 237:30 - 238:00 now. So this is how I can use the details from the Key Vault rather than hardcoding them. With this we have created two new linked services, and in total we now have five. In terms of datasets, remember what we created earlier: a dataset
            • 238:00 - 238:30 for Azure SQL, one for delimited text (a flat file, essentially), one for Parquet, and one for Databricks Delta Lake. All of that was done earlier and we do not need any more datasets — that part is done. And
            • 238:30 - 239:00 remember the flow: a pipeline refers to a dataset, and the dataset internally refers to a linked service — I explained that flow in detail in part one. So we are good: we have implemented the Key Vault and the
            • 239:00 - 239:30 additional linked services. Now let me show you how to implement the active/inactive flag. I go to my Azure Data Factory and launch
            • 239:30 - 240:00 Studio, then go to Pipelines. Earlier we had just one pipeline — source to landing — and now we have broken it down; I will explain, but let me take a look first.
            • 240:00 - 240:30 Earlier we had this pipeline with a Lookup activity: we read the config file, and for each entry we checked whether the file exists — if it exists, move it to archive; if it does not, just move ahead. After that we had
            • 240:30 - 241:00 an If Condition checking whether it is a full load: if the config file says full load, do a full load, else an incremental load. You can see we have now disabled that activity. Why? Because before executing it we want one more check: whether the
            • 241:00 - 241:30 is_active flag is 0 or 1 — if it is 0, do nothing; if it is 1, only then proceed. So the old If Condition is disabled (it is still in the pipeline just to show what was there before), and after it we added a new If Condition on item().is_active being 1, meaning the configuration file marks this entry as
            • 241:30 - 242:00 active. If it is active — the condition is true — we run an Execute Pipeline activity for pipeline 1; otherwise we do nothing. And if I click into Execute Pipeline 1, it is basically the same logic we disabled:
            • 242:00 - 242:30 check whether the load type is full — if so, do a full load, else an incremental load. So whatever we disabled there, we wrapped behind an If Condition on the active flag: 0 means do nothing, 1 means check full load versus incremental and branch
            • 242:30 - 243:00 accordingly. The only thing we added is that one extra check: disable the old activity, add the active/inactive If Condition, and when it is true execute pipeline 1, which is the same logic as before. I hope this makes sense. That is
            • 243:00 - 243:30 one change. We also want to make this pipeline parallel. If you look at the ForEach activity, earlier the Sequential box was checked; now we have unchecked it, so the main body of the ForEach will run in parallel. And you see
            • 243:30 - 244:00 the batch count is five, meaning five activities run in parallel per batch. Since we no longer have the auto-increment column in the audit table, parallel execution is supported — with the auto-increment there it would not have worked, as we mentioned. So, what have we done so far? At a high level,
            • 244:00 - 244:30 we implemented the active/inactive flag — if the flag in the config is 0, that entry is not executed right now; if it is 1, it is executed — and we implemented the Key
            • 244:30 - 245:00 Vault. I have given some highlights on Key Vault; if you do not know it, you can learn it separately. Apart from that, we made the pipeline parallel instead of sequential — by removing the auto-increment from the audit table and by unchecking the Sequential option
            • 245:00 - 245:30 on the ForEach in the pipeline. We also created a linked service for Azure Key Vault and another for Databricks. We have not used the Databricks one yet — we created it so that
            • 245:30 - 246:00 my Data Factory can connect to Databricks and execute the notebooks, which we have not done so far. I hope that makes sense. We also improved the naming convention and the folder structure. That is what we have done up to now. What we will look at
            • 246:00 - 246:30 next is how to bring the remaining datasets into bronze. So far we have only brought the EMR data to bronze, but we still have the claims and the code sets — which codes? NPI, ICD, and CPT:
            • 246:30 - 247:00 NPI, ICD, CPT. These have to be brought into our bronze layer, and we will see that now. As I mentioned, claims and CPT follow almost the same style, because they sit in landing and the flow is landing to
            • 247:00 - 247:30 bronze — CSV in landing, Parquet in bronze; that is the difference. NPI and ICD codes, on the other hand, come from an API: you call the API and bring the data straight into bronze, so there is no landing involved. Let us first look at these
            • 247:30 - 248:00 two API calls — NPI and ICD codes — and how to bring them into bronze in Parquet format. In my Databricks workspace I have a folder called API extracts with notebooks for ICD codes and NPI. NPI holds doctor-related codes, and ICD you already understand — NPI gives me doctor-related details. I will first show the ICD
            • 248:00 - 248:30 codes notebook, and only at a very high level — I could not write this full code from scratch myself either; you can always take help from ChatGPT and similar sources on how to call these APIs and understand them better. The requests module lets us call the API, the Spark session is there because we write the result out with Spark, and we import some Spark functions as
            • 248:30 - 249:00 well. This is a public API: we registered a client and got a client ID and client secret, and here is the base URL. datetime.now() gives us the current date. Then the auth URL — the way it works is that with the client ID and client secret we first obtain an authentication token, and to get that token we call
            • 249:00 - 249:30 this auth URL, the connect/token endpoint. We pass the client ID and client secret, state that these are client credentials, and ask for the token. That call returns the token, and from the JSON response we collect the access token. So what we have done is pass in
            • 249:30 - 250:00 the client credentials and get back an access token, and that access token is what we use to call the actual API: as part of authentication we send the access token, set API version v2, language English, and an Accept-Language header — boilerplate details
            • 250:00 - 250:30 you can look up. Then we have an extract-codes function, and from it we call the fetch-ICD-codes function; from there,
            • 250:30 - 251:00 internally, we check: if 'child' is in the data — meaning we are still at a parent level — we keep iterating down until we reach the leaf level, so that we get the actual codes. We keep doing that and extract various fields, such as
            • 251:00 - 251:30 ICD code, ICD code type, code description, inserted date, updated date, and an is_current flag. I could simply have done an overwrite, in which case inserted date, updated date, and is_current would not be needed; they are there in case you want to implement SCD Type 2, which we have implemented for the EMR data. If you want the same here, keep these columns and use append instead of overwrite — but then, if you end up rerunning this twice, your data
            • 251:30 - 252:00 must not get messed up, and you need a way to handle new updates. So either do an overwrite, or, if you append, make sure you implement that kind of logic — the same logic we implemented for the EMR data; you can refer to that. So these are the fields we extract, and then we call
            • 252:00 - 252:30 this function with a very specific root URL — release 10, 2019 — asking only for the codes from A00 to A09. Otherwise there are so many codes it would take a long time to run, and I just want to give you an idea of how it works; that is why we fetch only this range. Next time you can run it
            • 252:30 - 253:00 for, say, A10 to A20, and so on. Once we have the data we want to put it into a structure, which is why we define a schema using StructType: everything comes back as strings, and we want to assign proper data types. Then we create a Spark DataFrame:
            • 253:00 - 253:30 spark.createDataFrame with the ICD codes we collected and the schema imposed on top — so it is no longer all strings; dates are dates, and so on. You supply the data along with the schema you want to impose, and then write the DataFrame in Parquet format in append mode. It is simpler to write in overwrite mode so you have nothing extra to deal with, but if you expect
            • 253:30 - 254:00 changes and want to maintain history, keep it as append. And where do we save it? Under /mnt/bronze — we created the mount points for landing, bronze, and silver earlier — in the folder ICD codes.
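Condensed into a sketch, the flow is: client-credentials token, walk the hierarchy down to leaf codes, impose a schema, write Parquet to bronze. The endpoint URLs, JSON field names, and credentials below are assumptions for illustration only — the actual notebook differs in its details:

```python
import requests
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType, BooleanType

spark = SparkSession.builder.getOrCreate()

# 1) Client-credentials flow: exchange client id/secret for an access token (assumed endpoint)
token_resp = requests.post(
    "https://icdaccessmanagement.who.int/connect/token",
    data={"client_id": "<client-id>", "client_secret": "<client-secret>",
          "scope": "icdapi_access", "grant_type": "client_credentials"},
)
access_token = token_resp.json()["access_token"]
headers = {"Authorization": f"Bearer {access_token}", "API-Version": "v2",
           "Accept-Language": "en", "Accept": "application/json"}

# 2) Walk the hierarchy until we reach leaf-level codes
def fetch_icd_codes(url, out):
    data = requests.get(url, headers=headers).json()
    if "child" in data:                      # parent level: recurse into the children
        for child_url in data["child"]:
            fetch_icd_codes(child_url, out)
    else:                                    # leaf level: collect the code details
        out.append((data.get("code"), "ICD-10", data.get("title", {}).get("@value"),
                    datetime.now().date(), datetime.now().date(), True))

codes = []
fetch_icd_codes("https://id.who.int/icd/release/10/2019/A00-A09", codes)  # limited range only

# 3) Impose a schema and land the result in bronze as Parquet
schema = StructType([
    StructField("icd_code", StringType()), StructField("icd_code_type", StringType()),
    StructField("code_description", StringType()), StructField("inserted_date", DateType()),
    StructField("updated_date", DateType()), StructField("is_current_flag", BooleanType()),
])
spark.createDataFrame(codes, schema).write.format("parquet").mode("append").save("/mnt/bronze/icd_codes")
```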
            • 254:00 - 254:30 If you want to see the data, here it is — this is working code and I will give it to you: the ICD code, the ICD code type (ICD-10), code descriptions such as "Cholera due to Vibrio…" — very technical healthcare language. I do not want to go too deep into the API, but at least you get the idea: we pass the client credentials to get a token so that we can
            • 254:30 - 255:00 call the actual API; in the actual API we request only a subset so we get a limited number of records; and because the response starts at the parent level rather than the actual details, we keep going down until we reach the leaf-level entries. You can explore more — I will give you this code — and the output lands in our
            • 255:00 - 255:30 ADLS Gen2: in the storage browser, under the bronze blob container, you will find the ICD codes folder we created, with the data in Parquet format. Internally you will see part files, because it is all parallelism: with a parallelism level of four, four part files are created, and so
            • 255:30 - 256:00 on — if you know Spark you know why those part files are there. So the ICD codes should be clear. Now I will show NPI, which is doctor-related details — the full form is National Provider Identifier. Again we import the
            • 256:00 - 256:30 modules, get today's date, create a Spark session, and set the base URL — this is the URL, version 2.1. Again I request a limited set so I do not get stuck on the API: state California, city Los Angeles, limit 20 — I am restricting it to just that
            • 256:30 - 257:00 much. requests.get with the base URL and those parameters returns a response, and that response gives me the NPI numbers — the 10-digit identifiers — not the full details, just an NPI list of numbers. Once we have those numbers, we want more details for each person,
            • 257:00 - 257:30 so I iterate over the list of 20 NPIs and request a detailed response for each one: first name, last name, position, organization name, and a lot of other details. We take all of that and again write it in Parquet format,
            • 257:30 - 258:00 in overwrite mode, to /mnt/bronze under the NPI extract folder. So the flow is: first get the NPI numbers, then get the detailed information for those numbers, and write it as Parquet into the NPI extract folder.
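A rough sketch of that two-step NPI pull, assuming the public NPPES registry endpoint and illustrative field and path names (not the notebook's exact code):

```python
import requests
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

BASE_URL = "https://npiregistry.cms.hhs.gov/api/"   # assumed public NPI registry endpoint

# 1) Get a small batch of NPI numbers (10-digit identifiers) for a restricted search
params = {"version": "2.1", "state": "CA", "city": "Los Angeles", "limit": 20}
results = requests.get(BASE_URL, params=params).json().get("results", [])
npi_numbers = [r["number"] for r in results]

# 2) For each NPI number, fetch the detailed record and keep a few fields
rows = []
for npi in npi_numbers:
    detail = requests.get(BASE_URL, params={"version": "2.1", "number": npi}).json()
    basic = detail["results"][0]["basic"]
    rows.append((str(npi), basic.get("first_name"), basic.get("last_name"),
                 basic.get("organization_name"), str(date.today())))

# 3) Land the details in bronze as Parquet, full refresh each run
cols = ["npi_id", "first_name", "last_name", "organization_name", "refreshed_at"]
spark.createDataFrame(rows, cols).write.format("parquet").mode("overwrite").save("/mnt/bronze/npi_extract")
```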
            • 258:00 - 258:30 Let me show you that — the NPI extracts folder. Again this was a Spark job, so there are multiple part files. I am not going into too much detail; you can explore, and I will give you this code — if you are from healthcare you understand these APIs much better than I do. At a high level you can see the NPI ID, which is a unique 10-digit identifier, plus first name, last
            • 258:30 - 259:00 name, position, organization, and last-updated / refreshed fields — those are the details, and we call this for 20 such doctors, from California, Los Angeles, limit 20. So we have this data, and I just wanted to
            • 259:00 - 259:30 give you a highlight of it. One thing to note: our own data — the patients data and so on — also contains NPI codes, but there we used dummy codes, because we should not use real ones that point to actual doctors. A real NPI generally starts with 1 — you can see that it starts with 1 here — but
            • 259:30 - 260:00 in our case it may not start with 1, because we generated dummy values wherever NPI codes appear in the patients data. What comes from the API is real, so those NPI codes start with 1, while in our datasets they are different, because we cannot — and should not — use the real ones. They are still 10 digits
            • 260:00 - 260:30 wherever we use them, but they are dummy datasets, and we deliberately do not take the originals. Another thing to understand: the data extracted from the APIs is very limited, because we made very restricted calls. So when you do joins,
            • 260:30 - 261:00 you may see a lot of nulls, and that is fine — the point is to understand the logic, because these are public APIs and pulling everything would take a long time. If you can get a full dump from somewhere, that is fine too. So keep in mind that this API extract is for demo purposes: if you want full data, change the URL parameters and extract the complete data, but it may take a lot of
            • 261:00 - 261:30 time — make a note of that. Now let us talk about claims and CPT: we have covered NPI and ICD, so how do we get claims and CPT from landing to bronze? They should already be sitting in landing. If I go to
            • 261:30 - 262:00 landing, a vendor or third party would have dropped the CPT codes and the claims there, like this. You see the claims: one file is from hospital 1, one from hospital 2 — you can download and inspect them. And the other one is CPT codes — if you found the right API you
            • 262:00 - 262:30 could have pulled CPT from an API as well, but I just could not find one, so I am showing CPT codes as part of this CSV file. So we have to pick these up from the landing folder and put them into bronze in Parquet format. We will do that as part of a notebook. Just as we created the
            • 262:30 - 263:00 silver folder, we could have created a bronze folder and put the landing-to-bronze code there, but I did it inside the silver notebooks. You can take that segment out and move it under a bronze folder if you want — for claims and CPT codes it is a very small piece of logic — but it is also fine to keep it in silver. So even though I am in the silver
            • 263:00 - 263:30 folder, I will show these two notebooks, including the partial code that populates the bronze path. Let me first show claims. In this Databricks notebook: spark.read.csv on /mnt/landing/claims — if you
            • 263:30 - 264:00 look, under claims there are CSV files (*.csv), so even if there are two files both are picked up, with header set to true. Then claims_df.withColumn: we want to add a new column to this DataFrame, named data source, because inside the file the
            • 264:00 - 264:30 data source is not part of the data — if I download the file (same file here as well), you see there is no data-source column — so we want to create that column, and we can derive it from the file name
            • 264:30 - 265:00 itself, since hospital 1 or hospital 2 appears in the file name. So we add data source as a new column, and its value is: when the input file name contains hospital 1, then hosa; when it contains hospital 2, then hosb; otherwise none. That is
            • 265:00 - 265:30 what you see populated here — data source is hosa or hosb depending on the file name — and we display the claims DataFrame. So only one change, one extra column, derived from the file name. And then
            • 265:30 - 266:00 we write this DataFrame out: claims_df.write.format("parquet").mode("overwrite").save(...), saving it into bronze under the claims folder — /mnt/bronze/claims — in Parquet format.
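Roughly, that claims notebook cell looks like the following; the exact file-name patterns and values are as described above, so treat the literal strings as illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, when

spark = SparkSession.builder.getOrCreate()

# Read every claims CSV dropped into landing by the vendor (both hospitals)
claims_df = spark.read.csv("/mnt/landing/claims/*.csv", header=True)

# Derive a data_source column from the file name, since the files themselves
# do not say which hospital they came from
claims_df = claims_df.withColumn(
    "datasource",
    when(input_file_name().contains("hospital1"), "hosa")
    .when(input_file_name().contains("hospital2"), "hosb")
    .otherwise(None),
)

# Land the combined claims data in bronze as Parquet (full refresh)
claims_df.write.format("parquet").mode("overwrite").save("/mnt/bronze/claims")
```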
            • 266:00 - 266:30 Those are the only two steps. If you want, you can create a bronze folder just like the silver one and put this there, but it is fine here too: I read the data from landing, added one extra column, and wrote it as Parquet into bronze; the rest belongs to silver, which I will explain later — first I want to get the bronze layer itself set. Next are the CPT codes, handled the same way, so let me open the CPT
            • 266:30 - 267:00 codes notebook — there is only one file here. cpt_codes_df = spark.read.csv reads it (however many files there are), and then we rename the columns: we lowercase them and, wherever a column name contains a space, replace it with an
            • 267:00 - 267:30 underscore. How? for column in cpt_codes_df.columns gives the list of all columns — if there are five columns, the loop iterates five times — and for each column it replaces spaces with underscores and upper case with lower case,
            • 267:30 - 268:00 renaming the column in the dataset via the withColumnRenamed transformation, which takes the old column name and the new one and replaces the old with the new. Then cpt_codes_df.createOrReplaceTempView — not strictly required, just to showcase the data: procedure code, CPT code, procedure code description, and so
            • 268:00 - 268:30 on. Finally we write this out in Parquet: cpt_codes_df.write.format("parquet").mode("overwrite"), saving it under /mnt/bronze/cpt_codes.
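A small sketch of that column clean-up; the landing path and column names are examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cpt_codes_df = spark.read.csv("/mnt/landing/cptcodes/*.csv", header=True)

# Normalize every column name: spaces -> underscores, upper case -> lower case,
# e.g. "Procedure Code Descriptions" becomes "procedure_code_descriptions"
for column in cpt_codes_df.columns:
    new_name = column.replace(" ", "_").lower()
    cpt_codes_df = cpt_codes_df.withColumnRenamed(column, new_name)

# Optional: register a temp view just to inspect the data with SQL
cpt_codes_df.createOrReplaceTempView("cptcodes")

# Land the cleaned CPT codes in bronze as Parquet
cpt_codes_df.write.format("parquet").mode("overwrite").save("/mnt/bronze/cpt_codes")
```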
            • 268:30 - 269:00 If I go to my bronze layer now, there is one folder for CPT codes and one for claims, with the data in Parquet
            • 269:00 - 269:30 format. So hosa and hosb hold the EMR data, claims holds the claims data, the CPT codes folder holds the CPT codes, and the ICD codes and NPI folders came directly from the APIs — those two were never in landing. This way all my data now sits in bronze, and the bronze layer can act as the source of truth: if my data engineering team wants to look at
            • 269:30 - 270:00 this data and take it forward, they can. It is not clean data as such — not much cleaning has been done, so I would call it raw data — but it is sitting in the right file format, taking less storage, with compression in place. So the bronze layer is set. Since everything is now under bronze, the next step
            • 270:00 - 270:30 is to take it forward to the silver layer by doing some cleaning, implementing change data capture — SCD Type 2 — and implementing a common data model. So now we will see how to implement our silver layer: bronze to silver. Bronze is all
            • 270:30 - 271:00 set; we take it forward to silver by implementing a few things. We have several datasets — let me pull the names from what we wrote earlier: the EMR data, then
            • 271:00 - 271:30 claims, then CPT, and apart from that the NPI and ICD codes. Those are the datasets we have. Within the EMR data, the small tables are providers and departments — there are hardly a handful of doctors in a hospital and hardly a few
            • 271:30 - 272:00 departments — so for those two we do a full load: we simply overwrite, without implementing SCD, just as we were already doing a full load for them, so no SCD is needed there. Everywhere else, where we were loading incrementally, we will implement
            • 272:00 - 272:30 SCD Type 2. Once you understand it in one place you can apply it anywhere, so I will show one full-load (complete-refresh) scenario and one table where we implement SCD Type 2, and then you should be clear about the approach.
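For reference, one common way to express SCD Type 2 on a Delta table is a MERGE that expires the changed "current" rows and then appends fresh current versions. This is only a sketch under assumed table and column names (silver.patients, patient_key, is_current, and source_df as the incoming bronze snapshot) — the notebooks shown later implement their own variant:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp, lit

# source_df: the latest snapshot coming from bronze, keyed by `patient_key` (assumed names);
# `spark` is the notebook's SparkSession.
target = DeltaTable.forName(spark, "silver.patients")

# Step 1: expire the existing current row whenever a tracked attribute has changed
(target.alias("t")
    .merge(source_df.alias("s"), "t.patient_key = s.patient_key AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.name <> s.name OR t.address <> s.address",
        set={"is_current": lit(False), "modified_date": current_timestamp()},
    )
    .execute())

# Step 2: append new current versions for brand-new keys and for the rows expired above
still_current = spark.table("silver.patients").filter("is_current = true").select("patient_key")
(source_df.join(still_current, on="patient_key", how="left_anti")
    .withColumn("is_current", lit(True))
    .withColumn("inserted_date", current_timestamp())
    .withColumn("modified_date", current_timestamp())
    .write.format("delta").mode("append").saveAsTable("silver.patients"))
```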
            • 272:30 - 273:00 Let us start with the full load, the complete refresh, which is very easy — ideally everything is easy, but a complete refresh really has nothing special to it. I go to the silver folder, which contains the notebooks for moving data from bronze to silver (though, as you know, the claims and CPT notebooks also carry a little landing-to-bronze code). Claims itself is SCD Type 2, so I will start with the ones marked F:
            • 273:00 - 273:30 Departments_F and Providers_F — those are the two places where we do a full refresh; everywhere else we implement SCD Type 2 so that history is maintained if something changes. Let me show departments. What do we do? Read the data in Parquet format from the bronze layer: spark.read.parquet on /mnt/bronze, and
            • 273:30 - 274:00 since departments are part of the EMR data, they sit under hosa and hosb. So we read hospital A's departments and, in the same way, hospital B's, creating two DataFrames, and then we union them. Ideally both have the same structure, so we use unionByName. How does unionByName
            • 274:00 - 274:30 work? Say the first table has columns a, b, c and the second has b, c, d. After the union we get columns a, b, c, and d: the second table has no column a, so its rows get null for a, and the first table has no column d, so its rows get null for d; wherever
            • 274:30 - 275:00 b and c exist on both sides, they are never null for that reason. So the union is done by matching column names: the hospital A DataFrame's unionByName with the hospital B DataFrame combines the two datasets — five rows from the first plus five from the second gives ten rows — assuming both share the same schema.
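A tiny illustration of that behavior. Note that filling missing columns with nulls needs allowMissingColumns=True (Spark 3.1+); in the departments case, where both schemas are identical, plain unionByName is enough:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "x", 10)], ["a", "b", "c"])
df2 = spark.createDataFrame([("y", 20, True)], ["b", "c", "d"])

# Columns are matched by name, not by position; a column missing on one side becomes null
merged = df1.unionByName(df2, allowMissingColumns=True)
merged.show()
# Result has columns a, b, c, d:
#   (1,    "x", 10, null)
#   (null, "y", 20, true)
```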
            • 275:00 - 275:30 Now, what are we doing after this? Let me see if I have the datasets to show you. If I open one departments file, we have a department ID and a name, that's it, and let me see what
            • 275:30 - 276:00 the other one has: again a department ID and a name. Right now both hospitals have the same departments, but it can happen that one hospital has an extra department that is not there in the other, or that department 12 is pathology in one and
            • 276:00 - 276:30 surgery in the other. These things can happen, so we need to make sure that even if a department exists in one hospital and not in the other, or the same ID means different things, it does not disturb our logic. So here, if you see, with withColumn we take whatever is in the department ID and create one
            • 276:30 - 277:00 more column named source department ID, with the same values copied over. So we have an extra column, source department ID, holding the same data as the original department ID, because that is what came from the source, which can be Hospital A or Hospital B. Then we generate one more column, Dept_Id, and the idea is that if we can club, say, department 001 together with the
            • 277:00 - 277:30 source it came from, Hospital A or Hospital B, then there will never be any confusion. So we concatenate the department ID, a hyphen, and the data source. If we do that, we will never run into such clashes. Remember that when we moved the data to Parquet
            • 277:30 - 278:00 in bronze, we added the datasource column; it was not there in landing, but it is there in bronze. So after clubbing, the key becomes something like department 012-hosb, meaning this department belongs to Hospital B. Then we drop the original department ID column, because whatever was there now lives under source department ID, as you
            • 278:00 - 278:30 can see. I hope this is clear. So now we have source department ID, which keeps the original value; Dept_Id, which looks like department 001-hosa, department 002-hosb, and so on; and the datasource column. A sketch of this transformation follows.
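A minimal sketch of that surrogate-key step, reusing df_merged from above; DeptID and datasource are assumed column names:

```python
from pyspark.sql import functions as F

df_depts = (
    df_merged
    # keep the original key under a new name so nothing from the source is lost
    .withColumn("SRC_Dept_Id", F.col("DeptID"))
    # build a collision-proof key by appending the data source (hosa / hosb)
    .withColumn("Dept_Id", F.concat(F.col("DeptID"), F.lit("-"), F.col("datasource")))
    # the original column now lives under SRC_Dept_Id, so drop it
    .drop("DeptID")
)

# register a view for the SQL steps that follow
df_depts.createOrReplaceTempView("departments")
```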
            • 278:30 - 279:00 Now we want to move this data to a silver table, so first we create it: CREATE TABLE IF NOT EXISTS silver.departments. Silver here is just a database, a schema inside my Unity Catalog, so silver is the schema name and departments is the table name. The columns are Dept_Id, the one we
            • 279:00 - 279:30 newly created, then source department ID, name, datasource, and is_quarantined, which I will explain. is_quarantined means: if some record fails a data quality check, we quarantine that record by setting the flag to true. Quarantine means keep it isolated;
            • 279:30 - 280:00 it should not travel along with the good data. Any bad record should be quarantined, any good record should not. So we apply some data quality checks, and is_quarantined becomes true for all the bad data. This table is created in Delta format. So the table structure exists. Now, since this is a Python
            • 280:00 - 280:30 notebook, if we want to run SQL in it we use the %sql magic. I hope that is clear. And if there is already data in this silver table, in case of a rerun, I truncate it, because it is
            • 280:30 - 281:00 always a full load, a complete refresh: TRUNCATE TABLE silver.departments. Initially there is no data, but from the next run onwards the truncate makes sense. Now we insert into the silver table: the definition is created, we truncated it, so INSERT INTO silver.departments and SELECT the newly generated Dept_Id, the source department ID that came from the source
            • 281:00 - 281:30 file, the name, and the datasource, which we already had in the Parquet files in the bronze layer. Then there is a CASE statement: CASE WHEN source department ID IS NULL OR name IS NULL THEN TRUE, meaning if either of those is coming in as null we mark the record as quarantined, ELSE FALSE,
            • 281:30 - 282:00 END AS is_quarantined. So is_quarantined true means bad data and false means correct data, selected FROM departments, which is the view over the DataFrame we built above. I hope this is clear; this gives us the full-refresh pattern, sketched below.
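A minimal sketch of that full-refresh load, assuming the departments temp view from the previous step and illustrative column names:

```python
# Create the silver table once, in Delta format, so reruns do not fail
spark.sql("""
CREATE TABLE IF NOT EXISTS silver.departments (
  Dept_Id        STRING,
  SRC_Dept_Id    STRING,
  Name           STRING,
  datasource     STRING,
  is_quarantined BOOLEAN
) USING DELTA
""")

# Full refresh: wipe whatever an earlier run loaded
spark.sql("TRUNCATE TABLE silver.departments")

# Reload everything, flagging rows with missing key fields as quarantined
spark.sql("""
INSERT INTO silver.departments
SELECT Dept_Id,
       SRC_Dept_Id,
       Name,
       datasource,
       CASE WHEN SRC_Dept_Id IS NULL OR Name IS NULL THEN TRUE ELSE FALSE END AS is_quarantined
FROM departments
""")
```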
            • 282:00 - 282:30 It is a full refresh every time because there are hardly a few records. And we have applied a common data model to some extent as well: if the same department exists in both hospitals we are not dependent on that, and if hospital 2 brings a new department that is not in hospital 1, that does not change our logic either. Our logic is independent of the hospital, so nothing can mess it
            • 282:30 - 283:00 up. So that is the full load. Let me go back. Providers work the same way; providers are nothing but your doctors. I'll quickly show you: when we read from bronze it is always Parquet, so spark.read.parquet from the bronze Hospital A providers folder and the bronze Hospital B providers folder, and we again do a
            • 283:00 - 283:30 unionByName and create a temp view named providers. The datasource column is already coming as part of the Parquet, as you remember from bronze. Then, as the next step, we create
            • 283:30 - 284:00 a silver table, again in Delta format, with provider ID, first name, last name, specialization, and so on, plus is_quarantined for the data quality checks, where true means bad data and false means good data. We truncate the table, and then we insert: INSERT INTO silver.providers SELECT DISTINCT provider ID, which means we also handle
            • 284:00 - 284:30 any duplicate records; then first name, last name, specialization, and so on, and CAST(NPI AS INT), because by default it comes in as a string, I believe, so we cast it to an integer; then the datasource, and CASE WHEN provider ID IS NULL OR department ID IS NULL,
            • 284:30 - 285:00 meaning either of those being null makes it bad data, THEN TRUE ELSE FALSE END AS is_quarantined, FROM providers. Once you understand one full-load, complete-refresh scenario, the rest is easy to identify. This should give you clarity; a sketch follows.
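A rough sketch of the providers full refresh as described, assuming the merged bronze data is registered as a providers temp view; the column names are illustrative:

```python
spark.sql("""
CREATE TABLE IF NOT EXISTS silver.providers (
  ProviderID     STRING,
  FirstName      STRING,
  LastName       STRING,
  Specialization STRING,
  DeptID         STRING,
  NPI            INT,
  datasource     STRING,
  is_quarantined BOOLEAN
) USING DELTA
""")

spark.sql("TRUNCATE TABLE silver.providers")

spark.sql("""
INSERT INTO silver.providers
SELECT DISTINCT                        -- drop exact duplicate rows
       ProviderID,
       FirstName,
       LastName,
       Specialization,
       DeptID,
       CAST(NPI AS INT) AS NPI,        -- NPI arrives as a string, store it as an integer
       datasource,
       CASE WHEN ProviderID IS NULL OR DeptID IS NULL THEN TRUE ELSE FALSE END AS is_quarantined
FROM providers
""")
```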
            • 285:00 - 285:30 Now what do we do? We go back. We have seen both the full loads; now let's look at one SCD Type 2 case, where the data changes and we also have to maintain the history. Let me take patient data for that, because patients will be
            • 285:30 - 286:00 a larger dataset: patients keep coming, patient details can change, and so on. Again we read Parquet from bronze Hospital A patients and bronze Hospital B patients and create temp views. If I do a select * from the Hospital A patients, we have patient ID, first name, last name, middle name, SSN, phone number, gender, date of birth,
            • 286:00 - 286:30 address, and a few more things like modified date and datasource. So we are reading this from bronze. Now, a patient's address can change, and that is where SCD Type 2 is required: we want to maintain the history. The same way, we look at the data for Hospital B,
            • 286:30 - 287:00 hos-b. For patients we want to do two things: implement SCD Type 2, and also implement a common data model, CDM, because there
            • 287:00 - 287:30 can be different column names across the sources. So let me just look at the patient data once. Here we have ID, first name, last name; let's see the differences. In one file the column is called patient ID and in the other it is just ID; one has first name, the other has f_name; one has last name, the other has l_name;
            • 287:30 - 288:00 one has middle name, the other m_name. So some of the column names are different. How do we bring this under a common schema? That is exactly what a common data model is. So we do CREATE OR REPLACE TEMP VIEW cdm_patients, where CDM stands for common data model, AS SELECT
            • 288:00 - 288:30 a concatenation. Now, it can also happen that the patient IDs themselves clash: maybe both hospitals number their patients the same way, with IDs like 001, and then how will you
            • 288:30 - 289:00 differentiate whether a row is from Hospital A or Hospital B? If you just union the two, you will not know which hospital a record came from. That is why we again club the patient ID with a hyphen and the data source. Let me show you: CREATE OR REPLACE TEMP VIEW cdm_patients
            • 289:00 - 289:30 AS SELECT CONCAT of the source patient ID, a hyphen, and the data source. Source patient ID, let me show you: right now one file calls it patient ID and the other calls it just ID, and
            • 289:30 - 290:00 we derive a common source patient ID from those. So it is CONCAT(source patient ID, '-', datasource) AS patient_key, comma, star, meaning I want all the columns plus this one
            • 290:00 - 290:30 extra column derived from them. The inner query is what builds the common table: SELECT patient ID AS source patient ID (from the first hospital we were getting patient ID, from the second one just ID, so we rename both to source patient ID), then first name, last name, middle name and the rest kept as they are, FROM patients Hospital A, UNION ALL, and now for Hospital B: SELECT ID,
            • 290:30 - 291:00 because in Hospital B it was just ID, again AS source patient ID, f_name AS first name, l_name AS last name, m_name AS middle name, so I am giving them the same names, all of this FROM patients Hospital B. This is how I bring both under a common umbrella: the schema lives in cdm_patients, with one extra column, the patient key, which
            • 291:00 - 291:30 is the ID column plus a hyphen plus the data source, hos-a or hos-b. Because if both hospitals happen to have patient ID 1, how else would you differentiate them? That is why we attach the hospital it is coming from using the datasource. A sketch of this CDM view is below.
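A minimal sketch of the common-data-model view, assuming the two bronze reads are registered as patients_hosa and patients_hosb temp views; all column names here are illustrative, not the exact project ones:

```python
spark.sql("""
CREATE OR REPLACE TEMP VIEW cdm_patients AS
SELECT CONCAT(SRC_PatientID, '-', datasource) AS Patient_Key,   -- collision-proof key
       *
FROM (
    -- Hospital A already uses PatientID / FirstName / LastName / MiddleName
    SELECT PatientID AS SRC_PatientID, FirstName, LastName, MiddleName,
           SSN, PhoneNumber, Gender, DOB, Address, ModifiedDate, datasource
    FROM patients_hosa
    UNION ALL
    -- Hospital B uses ID / F_Name / L_Name / M_Name, so rename to the common schema
    SELECT ID AS SRC_PatientID, F_Name AS FirstName, L_Name AS LastName, M_Name AS MiddleName,
           SSN, PhoneNumber, Gender, DOB, Address, ModifiedDate, datasource
    FROM patients_hosb
)
""")
```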
            • 291:30 - 292:00 Okay, now if we do a select * from cdm_patients, you see the newly generated patient key, something like hos-a-001: whatever key was there, plus the hos-a or hos-b we appended. The source patient ID is untouched, so we still have what was in the actual source. First name, last name, middle name, which earlier had different column names, are now brought under one common set of names,
            • 292:00 - 292:30 along with SSN, phone number, gender, address, modified date, and datasource. So now we have one table that has everything, in a way that cannot get messed up, and we can easily identify which hospital a record came from. Different schemas merged into one: the CDM is
            • 292:30 - 293:00 implemented. After the CDM come the quality checks. So: CREATE OR REPLACE TEMP VIEW quality_checks AS SELECT the patient key and the rest, plus a CASE statement. The extra piece here is CASE WHEN source patient ID IS NULL
            • 293:00 - 293:30 OR date of birth IS NULL OR first name IS NULL, or the first name, lowercased, is literally the string 'null', THEN TRUE ELSE FALSE END, and the name of this column is is_quarantined: true means a bad record, false means a good record, FROM cdm_patients. So we are adding one
            • 293:30 - 294:00 more column, is_quarantined, which is true when the data quality check fails and false otherwise, and you can add more checks here if you want. Now if I run SELECT * FROM quality_checks ORDER BY is_quarantined DESC, you see a record whose first name is null has landed in the quarantine bucket, with is_quarantined
            • 294:00 - 294:30 set to true. I hope this is clear. The bad records are still there, I have not eliminated them, but we can identify them. So the quality check is done with a CASE statement and we quarantine the bad rows; quarantine is the term used. A sketch of this view is below, and after that we load the data into the silver table.
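A minimal sketch of that quality-check view over the CDM output (the specific checks and column names are illustrative):

```python
# Flag rows with missing critical fields instead of dropping them
spark.sql("""
CREATE OR REPLACE TEMP VIEW quality_checks AS
SELECT *,
       CASE WHEN SRC_PatientID IS NULL
              OR DOB IS NULL
              OR FirstName IS NULL
              OR LOWER(FirstName) = 'null'      -- the literal string 'null' is also bad data
            THEN TRUE ELSE FALSE END AS is_quarantined
FROM cdm_patients
""")

# Eyeball the flagged rows before loading silver
spark.sql("SELECT * FROM quality_checks ORDER BY is_quarantined DESC").show()
```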
            • 294:30 - 295:00 First we create the silver table structure: CREATE TABLE IF NOT EXISTS silver.patients, with all the columns we have plus a few extra ones. You can see the inserted date, and is_quarantined, which is the quality-check
            • 295:00 - 295:30 column; the CDM already brought everything under the same column names. Now, if we want to implement SCD Type 2, we require these extra columns. SCD Type 2 is basically this: say my name is Sumit and I am
            • 295:30 - 296:00 staying in Hyderabad, in Himayat Nagar, so my address is Himayat Nagar. The first time this data comes in, say on 11th November 2022, the record is inserted, so at the moment it is
            • 296:00 - 296:30 inserted the inserted date will be that first date, 11th November 2022. I keep the modified date the same at insert time, because this is the first record for this person, and is_current will be true, because it is the first record that came, so of course it is the current record.
            • 296:30 - 297:00 Now let's say my address changes: the same person, identified by the same patient ID, moves from Himayat Nagar to, say, Tarnaka, which is another area in Hyderabad,
            • 297:00 - 297:30 and this change comes in on 1st January 2024. So the new record arrives on 1st January 2024. What do we do when it comes?
            • 297:30 - 298:00 First, I need to end-date the previous record: I set its modified date to 1st January 2024 and I set is_current to false. I am talking about the previous record, the Himayat Nagar one:
            • 298:00 - 298:30 its inserted date stays 11th November 2022 and its modified date becomes 1st January 2024, which means that record was valid from 11th November 2022 to 1st January 2024 and is no longer the current one. Then I insert the new record with inserted date 1st January 2024, modified date 1st January 2024, and is_current
            • 298:30 - 299:00 true. This way I know that Sumit stayed in Himayat Nagar from 11th November 2022 until 1st January 2024 and no longer lives there, and that he currently lives in Tarnaka, where he has been since 1st January 2024. That is how we maintain the history; the two resulting rows are sketched below.
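Just to make the example concrete, here is how the two history rows for this patient would end up looking after the address change; the dates come from the example above and the layout is purely illustrative:

```python
# (address, inserted_date, modified_date, is_current)
scd2_history = [
    ("Himayat Nagar", "2022-11-11", "2024-01-01", False),  # old version, end-dated on 1st Jan 2024
    ("Tarnaka",       "2024-01-01", "2024-01-01", True),   # new version, currently active
]
```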
            • 299:00 - 299:30 That is SCD Type 2, and that is why we require the inserted date, the modified date, and the is_current flag. I hope you understand this. So what we do now is MERGE INTO silver.patients, meaning we put the data into the silver table.
            • 299:30 - 300:00 The patients table in the silver database is my target. Whenever we do this, we have to be clear about what the target table is and what the source table is, and we merge data from the source into the target. The target is silver.patients, and the source is the view I
            • 300:00 - 300:30 just created above: quality_checks, because the quality checks have already been performed on it. So I take the data from the source, quality_checks, and merge it into the target, silver.patients, implementing SCD Type 2 with the logic I just described.
            • 300:30 - 301:00 So we created the silver table, and after that comes the merge. You'll see %sql here because this is a Python notebook, so %sql means we are writing SQL: MERGE INTO silver.patients AS target USING quality_checks AS source. Now, what should the key column be for a
            • 301:00 - 301:30 patient? To uniquely identify a person it is the patient key, not the raw patient ID, so the condition is ON target.patient_key = source.patient_key. If these two keys match, it means a record already exists in the silver table and a new record is also coming, so some
            • 301:30 - 302:00 update is coming for that person. So the condition is that the keys match and is_current is true, meaning the existing record is the active one. For example, say my patient key is 101, for Sumit, and I have a record with the Himayat Nagar
            • 302:00 - 302:30 address and is_current set to true, so it is the active record. Now one more record comes in for 101, so I understand there was already an active record for this 101, which means I am in a merge scenario: there is a clash, some update is coming. So you see
            • 302:30 - 303:00 this: WHEN MATCHED, with the keys matching and is_current true, what do we need to check? Whether any column has actually changed. So target source patient ID not equal to the incoming value, or a change in the first name, or in the last name, or the middle name; we check whether any column is changing, and
            • 303:00 - 303:30 if any of these columns has changed, then we do the update I mentioned: a new record is coming for this person, an old record already exists, so I have to end-date the old one, which means setting its modified date. You
            • 303:30 - 304:00 see here that I set the modified date to the current timestamp. Take the earlier example: assume the existing row is as before, and now one more record has come and one of the columns has changed, because that is exactly what I am checking for. Then
            • 304:00 - 304:30 the immediate thing I need to do, say the new record is coming on 1st January, is to change the modified date to 1st January, when the new record came, and set is_current to false. With these two changes we have end-dated the previous record, and
            • 304:30 - 305:00 updates like this are possible in a Delta table; that is exactly why we use Delta, because with the Delta log it supports updates. I hope this is clear: we have end-dated the previous record. So if it is matched, that is what we do; and if it is not matched, what does that
            • 305:00 - 305:30 mean? Remember one thing: in the matched case we have end-dated the previous record, but we still have not inserted the new record that came in. Ideally I should insert a fresh record for 101, Sumit, now with the Tarnaka address, mark it as current,
            • 305:30 - 306:00 set is_current to true, and set the inserted date and modified date to 1st January 2024. We have not done that yet; so far we have only end-dated the previous record, and that insert is still to be
            • 306:00 - 306:30 done. And when not matched means the record has not matched at all, for example the first time a record is coming; then it simply does an insert, marking is_current as true and setting both the modified date and the inserted
            • 306:30 - 307:00 date to the current timestamp for a first-time record. But remember what is still pending: for the matched case we have end-dated the previous record and are yet to insert its new version. So let's go down in the notebook, where there is a second merge: MERGE INTO silver.patients AS target USING quality_checks, ON the patient key
            • 307:00 - 307:30 matching AND target.is_current = true. The WHEN NOT MATCHED in this second pass covers exactly that scenario: we have already flipped the old row to is_current = false, so the updated version no longer matches and a new record has to be inserted. So WHEN NOT MATCHED, we insert.
            • 307:30 - 308:00 For the insert we take source.is_quarantined and the other columns, set both dates to the current timestamp, and set is_current to true. So the row goes in with inserted date 1st January 2024 and modified date 1st January 2024, assuming today is 1st January 2024. So we have handled two cases: when it is a brand-new
            • 308:00 - 308:30 record, nothing matches and it is a simple insert; and when it is an update scenario, where a record already existed and a new one has come, we did two things: end-dated the previous version, marking it inactive, and then inserted the
            • 308:30 - 309:00 new one. The way a new record and the updated record get inserted is actually the same. That is how you implement SCD Type 2; I hope it makes sense, and a sketch of the two merge passes follows.
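A minimal sketch of the two merge passes described above, assuming a Delta table silver.patients, the quality_checks temp view, and illustrative column names (only a few tracked columns are shown):

```python
# Pass 1: end-date the currently active row when any tracked column has changed
spark.sql("""
MERGE INTO silver.patients AS target
USING quality_checks AS source
ON target.Patient_Key = source.Patient_Key AND target.is_current = TRUE
WHEN MATCHED AND (
     target.SRC_PatientID <> source.SRC_PatientID
  OR target.FirstName     <> source.FirstName
  OR target.LastName      <> source.LastName
  OR target.Address       <> source.Address
) THEN UPDATE SET
  modified_date = current_timestamp(),   -- when this version stopped being valid
  is_current    = FALSE                  -- it is no longer the active version
""")

# Pass 2: insert the incoming version as the new current row; this catches both
# brand-new patients and those whose old version was just end-dated above
spark.sql("""
MERGE INTO silver.patients AS target
USING quality_checks AS source
ON target.Patient_Key = source.Patient_Key AND target.is_current = TRUE
WHEN NOT MATCHED THEN INSERT (
  Patient_Key, SRC_PatientID, FirstName, LastName, Address,
  datasource, is_quarantined, inserted_date, modified_date, is_current
) VALUES (
  source.Patient_Key, source.SRC_PatientID, source.FirstName, source.LastName, source.Address,
  source.datasource, source.is_quarantined, current_timestamp(), current_timestamp(), TRUE
)
""")
```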
            • 309:00 - 309:30 Now you can check whatever you have loaded: run some basic queries to see that what you have done is correct. So that is SCD Type 2 implemented. I'll quickly show one more scd2 implementation, and the same pattern applies to the others. Take transactions: we read both Parquet files from the bronze layer, did a
            • 309:30 - 310:00 unionByName, and created a quality_checks view where we quarantine nulls, the same thing. Then we created the silver table definition, so if the table already exists it is not recreated; it is a Delta table so that updates are supported. The quality_checks view we created
            • 310:00 - 310:30 is our source and the silver table is the target: MERGE INTO silver.transactions AS target USING quality_checks AS source, the criteria being the transaction ID and target.is_current = true. Same thing WHEN MATCHED: when those conditions match and any of the columns has changed, you end-date the previous record, update the modified date to the current timestamp, and set is_current to false, which says the previous record is no longer valid
            • 310:30 - 311:00 and became invalid on this date. Then, whether a brand-new record is coming or it is the new version of an updated record, you have to insert, and that happens in the second merge: MERGE INTO silver.transactions, everything the same, WHEN NOT MATCHED. Both
            • 311:00 - 311:30 cases land here: either it is a totally new record, the first time this particular transaction is coming, or it is an update whose target row just had is_current set to false, so it no longer matches and also gets inserted. The insert works the same way: we put the current timestamp as both the modified date and the inserted date, and the is_current flag
            • 311:30 - 312:00 becomes true. That is how you implement SCD Type 2, and you can do the same for any other dataset. So basically what we are doing here is implementing quality checks; in some cases implementing a common data model, meaning if we have different schemas we bring them under one common
            • 312:00 - 312:30 schema, and if the IDs can clash across the two sources we build a surrogate key by appending the hospital to the ID; that is what the CDM is. Quality checks means checking for nulls and other issues and quarantining the bad records. And on top of that we implement SCD Type 2, where we maintain history by
            • 312:30 - 313:00 end-dating the previous record, and so on. All of this happens here, and the structure is the same across all of these silver notebooks. So silver is done; now we will see how to take it forward. But before that, remember that for claims and CPT codes there is a bit of extra code to bring them to bronze as well, and after
            • 313:00 - 313:30 that, bronze to silver starts. The pattern is the same: create a quality-check view, create a silver table, move the data from the quality checks to silver, and implement SCD Type 2. I will give you all of this code; it would be overkill to keep explaining every other silver notebook.
            • 313:30 - 314:00 Now we need to look at the gold layer, silver to gold, which is pretty straightforward, very easy. Let's quickly look: I go back to my Databricks workspace and open the gold
            • 314:00 - 314:30 layer notebooks to show you exactly what we are doing and what the thought process is. Let's take patients. Of course I have to create a schema in the gold layer, so gold is the database or schema name and dim_patient is the table name; the dim_ prefix stands for dimension. If you look at
            • 314:30 - 315:00 the tables, some of them are dimensions and some are facts. In our case we have just one fact; ideally we could have multiple facts, but this is just to demonstrate. So transactions is a fact and the remaining tables are dimensions. dim_patient, being a dimension, we
            • 315:00 - 315:30 truncate, and then we insert into it from the silver layer; that is how we follow the medallion architecture. But we do not want to bring the inactive records, meaning the history; I am only interested in the latest record, so in the gold layer we filter on is_current = true. Say a patient's details have changed five times:
            • 315:30 - 316:00 then there are six records in total in silver, but in gold I only want the latest, active one, so WHERE is_current = true. I also do not want to bring the bad data, so is_quarantined = false: anything that was quarantined should not be carried forward. Let it stay in quarantine; it should not come out of the room, the same way it worked during the COVID days, when a person in quarantine would not come out. Same idea:
            • 316:00 - 316:30 the quarantined rows stay in that layer, and with is_quarantined = false only the good records move forward to gold. So two filters, is_current = true and is_quarantined = false, so that we bring only quality data and only the latest data. A sketch of this load is below.
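A minimal sketch of one silver-to-gold dimension load, assuming gold.dim_patient already exists and keeping only illustrative columns:

```python
# Full refresh of the dimension
spark.sql("TRUNCATE TABLE gold.dim_patient")

# Carry forward only the latest, non-quarantined version of each patient
spark.sql("""
INSERT INTO gold.dim_patient
SELECT Patient_Key, SRC_PatientID, FirstName, LastName, Address, datasource
FROM silver.patients
WHERE is_current = TRUE
  AND is_quarantined = FALSE
""")
```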
            • 316:30 - 317:00 That is really the only thing we have done, and you can see it showcased here; this is all working code, and I will give you the notebooks so you can try it. If I re-ran everything now it would take a lot of time, so I ran it before this session so that you can see the outputs. Okay, that's about this. Now if I show you, say, dim_
            • 317:00 - 317:30 patient, it is exactly what we saw: only is_current = true and is_quarantined = false, and based on that we pull the data. It is very straightforward, nothing much to discuss. Let me see if there is anything else to highlight; I'll quickly open the other gold notebooks just to check: create the gold table, truncate it, bring the data from the
            • 317:30 - 318:00 silver table, but bring only the good records and the latest data. That is all that has to be done, and the gold layer is set. So the silver layer has more data, and the gold layer has less data, only the proper data, and that too in the form of facts and dimensions. Our fact and dimension tables are created; what comes next?
            • 318:00 - 318:30 Ideally, the gold layer is what we use for the final business or end-user queries, so there are some gold queries based on the KPIs. These queries are written in a SQL warehouse; you can start a serverless 2X-Small warehouse, which also takes 4 DBUs per hour,
            • 318:30 - 319:00 which is quite a lot by the way, so make sure you use it wisely. The queries are things like total charge amount per provider by department, or total charge amount per provider by department for each month of 2024; these are the end-user queries. Let's look at one: we use the transactions
            • 319:00 - 319:30 fact along with the provider dimension and the department dimension, join the three, and find the total charge amount per provider per department; it can happen that a provider, a doctor, is associated with multiple departments. So we have the provider name and the department name, and then we group by. How GROUP BY
            • 319:30 - 320:00 ALL works is that any non-aggregated column in the select list, in this case these two, is automatically used for grouping; otherwise we would have to manually write GROUP BY provider name, department name, which is unnecessary. So it is a cleaner syntax, and it will group by provider name and department name, that is, whichever non-aggregated columns appear in the
            • 320:00 - 320:30 query. That gives us the per-provider, per-department numbers, and we can do the same per month: provider name, department name, and the year-month, where again GROUP BY ALL is a very neat syntax, since otherwise you would have to list those three columns explicitly. So you see, we are building the end-user business queries; a sketch of one such query follows.
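A rough sketch of such a KPI query, run here from a notebook for convenience; the fact/dimension and column names are assumptions. GROUP BY ALL simply groups by every non-aggregated column in the select list:

```python
spark.sql("""
SELECT p.ProviderName,
       d.Name            AS DeptName,
       SUM(t.Amount)     AS total_charge_amount,
       SUM(t.PaidAmount) AS total_amount_paid
FROM gold.fact_transactions t
JOIN gold.dim_provider   p ON t.ProviderID = p.ProviderID
JOIN gold.dim_department d ON t.DeptID     = d.Dept_Id
GROUP BY ALL                 -- equivalent to GROUP BY p.ProviderName, d.Name
""").show()
```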
            • 320:30 - 321:00 For example: provider name and department for the month of January 2024, with total charge amount and total amount paid. But for the claims-related and encounter-related data we did not take it forward into facts; we could also have created fact tables for those and calculated KPIs on them, but we have not done so as part of this demo, though it could
            • 321:00 - 321:30 certainly be done. So you can see that reporting can be done on this gold layer. That covers the gold queries. Now, the final pipeline: I have shown you bits and pieces, but everything has to be integrated through Data Factory. So let me go to my home page and open Data Factory, ADF Dev.
            • 321:30 - 322:00 We have one end-to-end pipeline, and when we run it, it invokes the notebooks as well. If someone has to run this pipeline, what needs to be done is to truncate the audit table; then you can run it from scratch. Why do we truncate? Because we do not have newer data; if we had new data we would not need to truncate, the incremental load would work, but since the
            • 322:00 - 322:30 same data is there, it would not do anything in a second iteration. So it is better to truncate and then run. If you look at the pipeline, the PL end-to-end healthcare pipeline is the one. What it does first is execute the PL EMR source-to-landing pipeline,
            • 322:30 - 323:00 which is what we saw in the previous session: we check the config file, see if the file is there, move it to archive, check whether the is_active flag is zero or one, and accordingly bring the EMR data into the bronze layer. That is
            • 323:00 - 323:30 that piece; I hope it is clear. Then, once the data is in the bronze layer, we have to execute the notebooks to take it to silver and then to gold; that is the next activity. If I click here, you will
            • 323:30 - 324:00 see which notebook it is: it uses the Databricks linked service, because otherwise Data Factory would not be able to call these notebooks, and it specifies the notebook path. If I take this path and show you the notebook, it is the transactions notebook under the silver folder,
            • 324:00 - 324:30 and this one is the transactions notebook under the gold folder. So the first is executed and then the second, one by one. The flow of the Data Factory pipeline is: ADF gets the data to the bronze layer, and in part one itself we
            • 324:30 - 325:00 saw how to do that with a metadata-driven pipeline driven by a config file. As part of it we have the EMR data, which comes directly to bronze; then the claims data, which, as you remember from the silver layer discussion, also comes to bronze; and then the various code sets. We have seen all of
            • 325:00 - 325:30 that. So bronze was handled by the previous pipeline, the one we have looked at many times, and once that is executed, the silver-to-gold part runs, and it is all connected end to end: this part moves the data to bronze, and this part moves it to silver and then to
            • 325:30 - 326:00 gold, which is why the linked service for Databricks was required. One more thing you will notice is that we have implemented retries: in many places we have set the retry count to 2 with a retry interval of 30 seconds. It is not done everywhere, but configuring retries is
            • 326:00 - 326:30 part of good practices. So this is our pipeline, and if I want to run it, I need to truncate my audit table. Let me go back to the workspace; as part of this setup I have the audit DDL notebook, and I will start the small cluster I
            • 326:30 - 327:00 had: a very small cluster that consumes about 0.75 DBUs per hour, with 4 cores and 14 GB of memory on a Standard_DS3_v2. I am turning it on just for this; I could also have used my serverless starter SQL warehouse, but fine, I'll turn
            • 327:00 - 327:30 on the cluster because it is needed now. I will truncate the audit table and then run the end-to-end pipeline. That pipeline will take care of all the activities: taking the data from our Azure SQL DB, bringing it to the bronze layer in Parquet format, then calling the notebooks to take it to silver and to gold. And it is a metadata-driven pipeline
            • 327:30 - 328:00 again. So let's wait for the cluster to start, and in the meantime I'll show you one thing: claims and encounters exist only in silver, not in gold, which means right now we are not creating gold tables for them, so
            • 328:00 - 328:30 claims and encounters are not part of the facts in the gold layer. If we wanted to serve more KPIs we should create those as well, but for this demo I don't think it is required, because you are still getting the end-to-end flow. Let's wait for the cluster; it should be up very quickly. My cluster is up, so I will quickly truncate
            • 328:30 - 329:00 the audit table, because there are some entries inside it, and then I will show you how to run the pipeline, which runs in parallel, that is the best part, with a batch size of five. Again, this is the catalog name in Unity Catalog, then the database name, and then the table name. Done, truncated. Now I
            • 329:00 - 329:30 will go to my end-to-end pipeline and run it. Let us see: I can watch it here, or I can go to the Monitor tab and see it there. So let's open Monitor, and there is a debug
            • 329:30 - 330:00 mode as well. I have run it earlier too, so I can see those runs; I'll filter on a custom range, say the last half an hour. Anyway, these are the new runs, so let's see how it is
            • 330:00 - 330:30 going. You can see it executing now. As part of this end-to-end pipeline we have "EMR source to landing", which takes the EMR data and brings it to bronze; the name says landing, but ideally it should say bronze, so don't get confused, because the EMR data is brought directly to bronze. So EMR source to bronze, effectively, is
            • 330:30 - 331:00 what is executing now. If I click on it, you see multiple things happening in parallel: it is not processing one table at a time, it is acting on all the tables together. You can see
            • 331:00 - 331:30 it running in parallel, and it should complete very quickly, it is not going to take much time. One thing is that while it runs you can see what parameters were passed and what it is running for, like the copy-from-EMR pipeline: full load, which database name, which table name, all of
            • 331:30 - 332:00 the parameters that we passed through our config file. You can see them here to understand what it is running for and how. I hope you can follow all of this well. This should be done soon, and it won't take as much time as it took
            • 332:00 - 332:30 earlier. And by the way, if you have a bigger integration runtime with more resources, things will finish more quickly; we have acquired a very limited set of resources, so if you acquire more, it will be much faster. By mistake I had put different timings earlier, but you can see a lot
            • 332:30 - 333:00 of it is already completed, and for each run you can see the parameters being passed: full load, which hospital it is for, the table name, and the data source. A few things are still running, and they should also complete very quickly. This pipeline internally has all of these activities running underneath, like copy-from-EMR, and there
            • 333:00 - 333:30 are a lot of them. Silver to gold is happening now; you can see it running, which is where we call those notebooks, and it should be done very quickly. How cool is it that with just this one pipeline everything is integrated: such a good end-to-end pipeline, where we got the data from various sources, put it into the bronze layer, then the silver layer,
            • 333:30 - 334:00 implemented the CDM, implemented data quality checks, implemented SCD Type 2, brought it to the gold layer, performed some end-user reporting queries, and integrated all of it in a single pipeline, using best practices like Key Vault and naming conventions. Of course, I have to show one more thing: we have also used Unity Catalog. If I go to Catalog, you see
            • 334:00 - 334:30 this is the catalog name that we have given (tt-hc-adb-ws), and inside it the audit schema is there to keep the audit table; a database and a schema are the same thing in Unity Catalog. Then we have the gold schema, where all the dims and facts are,
            • 334:30 - 335:00 and the silver schema, where all the silver tables are. One more interesting thing: if I take, for example, the patients table and click on Lineage, it shows me what the downstream is. The downstream here is the gold layer, dim_patient, meaning from this table the data goes to gold.dim_
            • 335:00 - 335:30 patient. If I click on Lineage for encounters, we are not pushing encounters to gold anyway, so take patients instead: you can see the downstream, where the data goes from this table, which is gold.dim_patient, and also which notebooks are
            • 335:30 - 336:00 referring to this table. I hope this gives you a clear idea. It is very interesting, and the role of the catalog is very valuable: it acts as a central metastore, so if you have more Databricks workspaces, those workspaces can also refer to it. So I hope you have got quite
            • 336:00 - 336:30 a good amount of clarity on this. Now, the last thing: we can sync this code with GitHub, create a new branch, and so on. I'll just give you an idea; if you do not know about Git and GitHub, you can check out my videos on YouTube, I have a complete four-hour video on that, and I will provide the link in the description. To sync the code, we basically go to the
            • 336:30 - 337:00 GitHub account. I have a GitHub account, so I go to Settings, then Developer settings, then Personal access tokens (classic), and generate a token. I have already done that earlier, so I will not create a new one; when you generate a token you have to capture it right away, because you cannot see it later. I generated mine for this Azure healthcare
            • 337:00 - 337:30 project and gave it certain permissions. Once you have the token, you also need the username of your account; for example, mine is Big Data By Sumit. So the two things you get from GitHub are the username and the token, and then you go to your Databricks workspace, click on your profile, and
            • 337:30 - 338:00 go to Settings, then Linked accounts, where you will see GitHub among multiple options. If I had another workspace I could show this from scratch, but here it is already integrated; you can see the feature branch with the code. Let me just check whether I have another
            • 338:00 - 338:30 workspace to show. Okay, this one is not related to us, but it shows what I am trying to convey: I come here, go to Settings, click on Linked accounts, select GitHub personal access token, and provide the username and
            • 338:30 - 339:00 the token, and then it shows up as linked. That is how you link it. It is already linked in our case, you can see the feature branch; you can create a new branch or whatever you would like and work with it. For now I do not want to go too deep into this; that can be a separate topic
            • 339:00 - 339:30 in itself. But you can see how neatly we have written our code, segregated into different notebooks, and we have covered a lot. This should give you a very good idea; I don't think there is any project available on YouTube at this level right now. In case you want to check out my SQL videos or my Git and
            • 339:30 - 340:00 GitHub videos, I have separate playlists on my YouTube channel, and I will provide the links in the description. And to support this initiative, if you like my videos and appreciate what I am doing, like this video, post on LinkedIn about what you have learned, and share it with your friends so that others get to know about this quality content; you usually do not get this kind of material for free, and I am
            • 340:00 - 340:30 sure you are aware of that. In case you do not know, I offer an Ultimate Data Engineering Program, and I will provide the link in the comments, in case you want to master data engineering to a level that very few people reach. If you want to get there, consider going for the program, which has impacted several careers. And if you have not yet subscribed to my channel, do subscribe now. I am
            • 340:30 - 341:00 sure you will have truly appreciated all the effort put into this project and learned a lot. With this, let me wrap up. Thanks a lot, and if you have watched all of it, you are definitely much ahead of others. Thank you.