Understanding the Crucial Role of Data Governance in Big Data
CS5229 - Lecture 12 - Big Data Quality and Governance (2023)
Estimated read time: 1:20
Summary
In Lecture 12 of CS5229, Nisansa de Silva wraps up the series by focusing on Big Data quality and governance from a business perspective. The lecture delves into the challenges companies face in claiming to be 'data-driven,' the reality of data chaos, and the necessity of managing data quality and governance effectively. This includes discussion on the need for transparent data ownership, data quality dimensions, data governance roles, and privacy preservation. Additionally, the lecture differentiates between the traditional hierarchical data governance model and the more modern networked approach, addressing the complexity of managing data trust and security.
Highlights
- Companies often use terms like Big Data and AI as buzzwords to attract attention. 🎯
- Mislabeled data leads to confusion in business meetings, highlighting the need for data governance. 🤯
- The lecture discusses the roles involved in data governance, including the Chief Data Officer (CDO). 🧑‍💼
- Big Data governance involves understanding data's volume, velocity, and variety, known as the three V's. 📏
- Modern governance models emphasize open, collaborative data use across an organization. 📡
- Privacy-related issues are discussed, emphasizing the protection of personal data. 🛡️
- The transition to networked governance allows for faster, more flexible data use. ⚡
- Nisansa de Silva emphasizes the importance of both technical and managerial skills in data governance. 💡
Key Takeaways
- Many companies claim to be data-driven but lack actual Big Data analytics capabilities. 📊
- Big Data quality is crucial for effective decision-making in businesses. 🏢
- Data chaos leads to mistrust and confusion, requiring structured governance to manage. 🔄
- Data quality involves cleaning, standardization, and profiling to ensure consistency and accuracy. ✅
- Data governance includes managing data architecture, storage, and security. 🔐
- The transition from hierarchical to networked data governance improves flexibility and democratization. 🚀
- Privacy preservation is vital, requiring techniques like anonymization to protect sensitive information. 🔒
- Innovative governance models balance control with engagement, improving data utility and trust. ⚖️
Overview
The lecture begins by addressing the common misconception that companies are data-driven despite lacking true analytic capabilities. With buzzwords like AI and Big Data thrown around, the reality is that many businesses are still grappling with data chaos. This disarray often leads to confusion and mistrust during business decisions, highlighting the need for robust data governance systems.
Nisansa de Silva elaborates on the technical and managerial aspects necessary for effective data governance. On the technical side, data quality management involves data cleaning, standardization, and profiling to ensure accuracy and consistency. Meanwhile, the managerial side requires organized data management, clear ownership, and the evaluation of data trustworthiness and security.
As the lecture progresses, de Silva contrasts traditional hierarchical data governance models with modern network-based approaches. The newer models, characterized by their flexibility and democratization, allow for more dynamic data use and foster a culture of collaboration. Moreover, privacy preservation becomes crucial, ensuring sensitive information is protected amidst these advancements in data governance.
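The technical side described in this overview, profiling a data set against quality dimensions, could be sketched roughly as follows. This is a minimal illustration and not the lecturer's code: it scores two of the dimensions covered in the lecture, completeness (required fields only, since optional fields may be empty) and uniqueness (distinct records as a share of all records). The field names and sample records are assumptions made up for the example.

```python
# Hypothetical set of fields the business use case actually needs.
REQUIRED_FIELDS = ("first_name", "last_name", "country")


def completeness(records, required=REQUIRED_FIELDS):
    """Share of records whose required fields are all non-empty."""
    records = list(records)
    if not records:
        return 1.0
    usable = sum(
        all(str(rec.get(field) or "").strip() for field in required)
        for rec in records
    )
    return usable / len(records)


def uniqueness(records, key_fields):
    """Distinct key combinations as a share of all records."""
    records = list(records)
    if not records:
        return 1.0
    distinct = {tuple(rec.get(field) for field in key_fields) for rec in records}
    return len(distinct) / len(records)


# Made-up sample data for illustration only.
customers = [
    {"first_name": "Ann", "middle_name": "", "last_name": "Lee", "country": "US"},
    {"first_name": "Ann", "middle_name": "", "last_name": "Lee", "country": "US"},  # duplicate
    {"first_name": "Raj", "middle_name": "K", "last_name": "", "country": "IN"},    # missing required field
    {"first_name": "Mei", "middle_name": None, "last_name": "Chen", "country": "CN"},
]

print(completeness(customers))                                        # 0.75
print(uniqueness(customers, ("first_name", "last_name", "country")))  # 0.75
```

Note that the empty optional `middle_name` values do not lower the completeness score, which matches the lecture's point that completeness is about usability rather than filling every slot.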
Chapters
- 00:00 - 06:00: Introduction to Big Data Quality and Governance This chapter introduces Big Data quality and governance. It is Lecture 12, the final lecture in a series on Big Data Analysis Technologies, and it approaches the subject from a business perspective. The lecture is short but covers a great deal of important material.
- 06:00 - 12:00: Challenges in Big Data Utilization The chapter emphasizes the importance of understanding the concepts related to Big Data and the challenges faced when utilizing it.
- 12:00 - 18:00: Data Quality Management The chapter discusses the common claim among companies about being data-driven and utilizing Big Data, while in reality, only a small percentage actually employ Big Data analytics effectively.
- 18:00 - 24:00: Understanding Data Governance The chapter addresses the common use of buzzwords like 'big data' and 'machine learning' by companies as a strategy to appeal to employees and clients. It suggests these terms often misrepresent reality, and it sets out to expose the gap between the enticing language companies use and what actually happens in practice.
- 24:00 - 30:00: The Role of Data Governance in Business The chapter discusses the challenges faced in business meetings when data is collected manually from various sources such as Finance and IT. It highlights the frequent disagreements that arise when presenting analyzed data, where individuals dispute the data's accuracy, leading to conflicts and differing conclusions. Robust data governance, which ensures consistent, reliable data that aligns stakeholders, is emphasized as the solution to these challenges.
- 30:00 - 36:00: Strategies for Data Quality Assurance Data quality assurance requires a unified approach to data transfer, ensuring a common understanding and completeness across the dataset.
- 36:00 - 42:00: Privacy in Data Governance Data chaos can lead to fear, uncertainty, and doubt (FUD) within an organization. Good quality data and a proper structure for managing data are necessary to avoid these issues.
- 42:00 - 48:00: Comparison of Data Governance Systems The chapter explores the topic of data governance systems and begins with a common scenario faced by data scientists and engineers. It highlights the typical concerns expressed by stakeholders such as lack of trust in data numbers, and questions regarding data analytical processes. The chapter sets the stage for discussing how robust data governance can address these challenges by ensuring data accuracy and reliability.
- 48:00 - 50:00: Conclusion and References The chapter 'Conclusion and References' discusses the reactions and queries that may arise from peers and superiors when presented with a report containing complex conclusions and data. It touches on the confusion and curiosity about whether such information can be shared or used, and seeks to address the comprehension gap between technical details and the understanding of peers, bosses, and subordinates.
CS5229 - Lecture 12 - Big Data Quality and Governance (2023) Transcription
- 00:00 - 00:30 Welcome to the last lecture of Big Data Analysis Technologies. This is Lecture 12. We will be talking about Big Data quality and governance. This is very much a business-perspective topic. It's a rather short one, but there are a lot of things that are
- 00:30 - 01:00 to be remembered, and I hope that you have learned some other concepts related to Big Data by now, and you know the challenges we have with Big Data, right? So let's get on with it. So
- 01:00 - 01:30 the first thing is this: everybody likes to say that they are data-driven. A lot of companies say that they are data-driven, that they are doing Big Data, but in reality only a very small percentage of them actually use Big Data, analytics, and so forth, right? So
- 01:30 - 02:00 Big Data and machine learning: these are buzzwords that these companies use to attract both employees and clients, and most of the time these are lies. So let's see how this happens and what actually happens. Typically you spend
- 02:00 - 02:30 hours and hours preparing for a meeting. You collect the data manually from Finance, IT, and so forth, you do the analysis, you prepare the insights, and you go and present. But most of the time it just results in people fighting: "No, that is not what my data says," "That is not what the other person's data said," "Those are not the conclusions that I draw," and so on and so forth. So what
- 02:30 - 03:00 you need is unified trust in the data, a common understanding, complete traceability behind the decisions that you are taking, and transparent data ownership. Who owns this data is an important question that needs to be answered, even within a single organization, to avoid conflicts. This is where data governance and data quality come into
- 03:00 - 03:30 play. You have to have good-quality data, and you need to have a proper structure for managing that data. Data chaos results in something we call data FUD. FUD is just a fancy way of saying fear, uncertainty, and doubt. So basically, if you have data, it is very
- 03:30 - 04:00 common. If you are already working as a data scientist or a data engineer at your business, you might have heard these comments. "I don't trust these numbers": that is what one of your bosses would say. "I don't trust this. How did you even come to this conclusion?" Or your subordinates or your
- 04:00 - 04:30 peers would say, "Why are you showing me this? Am I even allowed to see this? Am I allowed to use this? Can I use these conclusions?" And again, your bosses, who may not even be computer science people, would be saying, "I don't understand this report. What are all these complex numbers?" And then again your peers, your subordinates, and even your superiors might be interested in where you got
- 04:30 - 05:00 this data and how they can also get it. And the same people would also say, "How can you help me understand this data?" But you have your own job, right? You can't waste time teaching people how to understand data, so maybe there should be somebody else. All of this is what we call data chaos.
- 05:00 - 05:30 Now, data infrastructure is about the exploding volume, velocity, and variety of data. We have seen this many, many times over the presentations, and the solution for that is managing data complexity. On the other hand, the data consumers, the business, have an increased reliance on analytics and
- 05:30 - 06:00 regulatory reporting. They believe what the data says. If they are taking business decisions based on this data, we do need to have trusted data as a business dependency. All right, so altogether what we need is data collaboration, understanding, discovery, and
- 06:00 - 06:30 trust. So the two things that we are talking about today are data quality and data governance. Data quality works on data cleaning, standardization, matching, and profiling; this is the more technical side of it. Data governance, on the other hand, has to do with people, process, and technology, and this
- 06:30 - 07:00 is the more management side of the topic. You have already gone through data quality, data cleaning, standardization, and so on in your lectures, so we are specifically only looking at things that are targeted at data quality itself. Therefore the weight of data quality
- 07:00 - 07:30 would be a little less, and the weight on data governance would be a little more. So let's look at data quality first. What is data quality? Data quality is how well the data fits the data-driven decision making that you are supposed to do, and regardless of whether it fits or not, you then need to look at how
- 07:30 - 08:00 it could meet the expectations of your consumers and how well it represents the objects, events, and concepts that it is intended to represent, right? So on one hand it has to fit well with what you are doing in your business, and on the other hand it
- 08:00 - 08:30 should be actually useful for the people you are doing the business for, and it should actually be relevant to your business. So this is an amalgamation of three concepts that should come together. Also, measuring the data quality level can help the organization identify data errors
- 08:30 - 09:00 that need to be resolved and assess whether the data in the IT systems is fit to serve its intended purpose. Okay, so what are data quality issues? You collect data from different sources, and when they come together in a single system there might be issues.
- 09:00 - 09:30 The item might be wrong. There might be indecisiveness in the data, as in you don't know what this is, so you can't even derive it from the data. Take Samir from Hyderabad: Hyderabad is a city in both of these countries, so you can't even derive the country here. And sometimes you have no idea, because there are
- 09:30 - 10:00 Londons: there is a London in England, and there are many Londons in the United States. If you didn't know, now you know. So you can't even derive these values, and you have an issue. Then again, states: the state is a thing in the United States, it's in the name "United States", so California is a state. India, being a federal country, also has
- 10:00 - 10:30 states. But if you take something like Sri Lanka or China, we would have provinces, right? Or Canada, where we would have provinces. I think so, yes. I might be wrong about Canada, but for China I'm pretty sure. Actually, like I am
- 10:30 - 11:00 [unclear]. So how are we going to measure? Just like we mentioned earlier, the purpose of data quality is to check whether the data fits the business purpose of your organization, but to understand this you need to measure it. What we use to measure it are data quality dimensions, the criteria
- 11:00 - 11:30 that we discussed, and data profiling. Data profiling is like learning metadata: you are trying to learn about the data set. You can extract analyses and metrics based on the data set that you have, and if it is huge you can split it into small chunks. Not randomly: you can maybe divide by a
- 11:30 - 12:00 certain attribute and analyze. Maybe by geographical information, or maybe you can divide temporally: Monday's data might be different from Tuesday's, who knows? It depends on your application. Or maybe Colombo's data is different from Kandy's, who knows? Again, it depends on your data. You are the person who is handling the data, you are the person who is familiar with
- 12:00 - 12:30 your business process, so you are the person who should decide how you do this classification, how you do this clustering into chunks when you are trying to analyze this data. And the objective, as always, is to analyze and evaluate the data set against the business process.
- 12:30 - 13:00 All right, so what are the data quality dimensions we mentioned earlier? We have completeness. Data can be complete even if some of the optional data is missing. So just because the word says completeness, do not think that it means everything should be there, okay? And
- 13:00 - 13:30 let's say you have something like first name, middle name, and last name, and the full name is not provided. You can definitely derive it from those variables. So the question of completeness is not about filling all the slots; it is about whether this data is usable,
- 13:30 - 14:00 whether it has all the necessary information, the necessary attributes, so that we can use it in our business use case. All right, then we have uniqueness: the number of things in the real world compared to the number of records. So if you have duplicates
- 14:00 - 14:30 you should be getting rid of them, and you can get the uniqueness as a percentage of the things that exist out of all the data points that you have. Next you have timeliness:
- 14:30 - 15:00 is this the reality at this point in time? As in, if you are going to take certain data, let's say it is medical data, right? So medical data from these years,
- 15:00 - 15:30 2019 to 2022, these three or so years, would not be so good for predicting the medical situation in the world in, let's say, 2055, because hopefully there won't be this coronavirus pandemic by then, right? And also, let's say, statistics of human life expectancy
- 15:30 - 16:00 from 1938 to 1945 might not be very accurate for any other time period, because that is the time of World War II and a lot of people were dying because of that. So the relevancy, the timeliness, of data is very important for quality. It is not saying that the data is
- 16:00 - 16:30 stale and no longer relevant. It is very valid data for that time period, but not for the time period that you are going to predict; it is not appropriate for that. And then we have validity: is the data valid? By data length, data types, the range, minimums, maximums,
- 16:30 - 17:00 does it conform to the syntax that we are looking for? Remember that previous example of the parts of the name. What if our system does need the surname, first name, middle name format and cannot work with the full name? In that case
- 17:00 - 17:30 you do need to make sure that your data does contain information about those features that are expected. All right, next we have accuracy: the degree to which the data is close to the real-world data that you will be seeing. So how much this data
- 17:30 - 18:00 reflects the... I don't want to use the words training data and test data, but you can think about it from that perspective: how much would the data that we have for taking decisions reflect the examples that we would see in the real world?
- 18:00 - 18:30 So let's say you have data collected in the United States and your model is going to be running in the United States. You should not format your dates in this data in the way that you would use in the Commonwealth, right? So
- 18:30 - 19:00 they use the month-day-year system and we use the day-month-year system. So we should not do day-month-year formatting in dates if we are going to both collect data and run our system in the United States. You should make sure that it works with their system, because it is something
- 19:00 - 19:30 that can cause problems, right? Because as long as the day is 12 or less, it can be processed either way. So you can have 1/6/2022, and in one case it is January 6th and in the other case it is June 1st. So a very dangerous scenario.
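The date ambiguity described here can be detected mechanically. The following is a small sketch, assuming slash-separated numeric dates like the `1/6/2022` example from the lecture; the function names are made up for illustration. It parses the same string under both conventions and flags it when both readings are valid but disagree.

```python
from datetime import datetime


def parse_both_ways(text):
    """Return (month-first reading, day-first reading); None where a reading is invalid."""
    readings = []
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            readings.append(datetime.strptime(text, fmt).date())
        except ValueError:
            readings.append(None)
    return tuple(readings)


def is_ambiguous(text):
    """True when both conventions parse the string but produce different dates."""
    month_first, day_first = parse_both_ways(text)
    return (
        month_first is not None
        and day_first is not None
        and month_first != day_first
    )


print(parse_both_ways("1/6/2022"))  # (datetime.date(2022, 1, 6), datetime.date(2022, 6, 1))
print(is_ambiguous("1/6/2022"))     # True: January 6th or June 1st
print(is_ambiguous("13/6/2022"))    # False: 13 cannot be a month
```

A check like this only detects the problem; deciding which convention applies still has to come from knowledge of where the data was collected.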
- 19:30 - 20:00 A similar thing happened with one of the NASA probes that was sent to Mars: some people used meters and another group used inches, and it crashed. All right, the last part is consistency. You are very familiar with the word consistency because we talked about this in
- 20:00 - 20:30 [unclear] and so forth, but here we are looking at it from a different perspective. Consistency is the absence of difference when comparing two or more representations of a thing against a definition. So if they are going to behave in the same manner, they should have the same outcome for the same movement or the same
- 20:30 - 21:00 situation in our data. All right, so here is a whole mess of data, but you can see how each of these data quality aspects comes into question
- 21:00 - 21:30 in this data set. So let's look at completeness first: this data is missing. Every time there is a [unclear], it means completeness is suffering, right? Conformity: this is not conforming to
- 21:30 - 22:00 the [unclear], right? Or this is not conforming to the ranges of the values. And yes, consistency: euros used in the United States? No. Great Britain pounds used in the United
- 22:00 - 22:30 States? Again, no. You can see another one. Here you might be able to reconstruct this from this; that is what is shown here. So for GB, maybe you can say GBP here, right? You can also see issues with Great Britain versus United Kingdom over here,
- 22:30 - 23:00 and here is a problem of duplication. You can see it at the top here, because it's a little large. And there are issues with integrity: you have this data line and this data line. And then there are issues with accuracy. Here you would not know why it is an issue without knowing the application, but you can see these are things that
- 23:00 - 23:30 can occur. These are all data quality issues that you would see when you are doing a project, right? So this is a very quick slide on the life cycle of data quality. You profile the data, you
- 23:30 - 24:00 define targets and metrics, you define and implement data quality rules, you put them in development, you do testing and review, and then continuous monitoring, because then you go back to the start and continue on and on. So it's a six-step process, from the initial profiling to ongoing monitoring and
- 24:00 - 24:30 optimization. So you do the profiling. Profiling is to discover and assess the data structure and identify whether there are any anomalies. It helps identify strengths and weaknesses in the data and helps determine the project plan that you need. You can pinpoint different data errors, for example
- 24:30 - 25:00 inconsistencies and duplicates, using the data. We discussed these earlier as well. Then we have the metrics. With metrics, you can measure the overall data, or each of the data attributes, or each of the data lines. Basically there are lines and columns, you understand that. They should also be based on the data quality dimensions that we talked about
- 25:00 - 25:30 earlier. Then the data quality rule definition: it is very dependent on the business requirement and it is process specific. Alerts, warnings, computes, rejects: these are the feedback that would be fired, the replies and flags that would be raised. So whatever you need to be triggered
- 25:30 - 26:00 when data quality issues come up is defined here. Development is where you apply the rules and decide where they should be applied. Testing is where what was developed is reviewed to see what percentage of the data quality is achieved, right?
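The rule definition, development, and testing steps above might look something like the following in code. This is only a sketch under stated assumptions: the rule names, the `warn` and `reject` action labels, and the sample records are all hypothetical, and real rules would come from the business requirement as the lecture says. Each rule is a predicate with an attached action, and the final pass rate corresponds to the "percentage of data quality achieved" mentioned for the testing step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes
    action: str                    # hypothetical labels: "warn" or "reject"


# Illustrative rules only; real ones depend on the business process.
RULES = [
    Rule("age_in_range",
         lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
         "reject"),
    Rule("email_present", lambda r: bool(r.get("email")), "warn"),
]


def apply_rules(records, rules=RULES):
    """Return (accepted records, raised flags, share of records with no flags)."""
    accepted, flags = [], []
    for index, record in enumerate(records):
        failed = [rule for rule in rules if not rule.check(record)]
        flags.extend((index, rule.name, rule.action) for rule in failed)
        # A record is dropped only if at least one failed rule says "reject".
        if not any(rule.action == "reject" for rule in failed):
            accepted.append(record)
    flagged_rows = {index for index, _, _ in flags}
    pass_rate = (len(records) - len(flagged_rows)) / len(records) if records else 1.0
    return accepted, flags, pass_rate


# Made-up sample data.
sample = [
    {"age": 30, "email": "a@example.org"},
    {"age": -5, "email": "b@example.org"},  # rejected: age out of range
    {"age": 41, "email": ""},               # kept, but raises a warning
    {"age": 200, "email": None},            # rejected, and also warned
]

accepted, flags, pass_rate = apply_rules(sample)
print(len(accepted), len(flags), pass_rate)  # 2 4 0.25
```

Keeping the action separate from the check makes it possible to tighten or relax a rule's consequence without touching its logic.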
- 26:00 - 26:30 Then we come to the different types of data used in profiling. We have the column structure, field lengths, and input data types; these are basically the metadata types that would be captured during data profiling. Nulls and blanks: if you have worked with pandas specifically, you would see that there
- 26:30 - 27:00 are ways to handle both nulls and zeros. Duplication is a very common thing to handle: duplicate data lines, deduplication, and so on. Patterns can be managed with regular expressions, just checking for matching patterns. Frequency means how often something
- 27:00 - 27:30 happens. Sometimes high-frequency data is very useful, sometimes low-frequency data. Let's say you are doing something like data mining on a large amount of data; then low frequency might be very useful. On the other hand, in the same data mining, if you are trying to do something like rule identification, pattern matching, or rule mining, then
- 27:30 - 28:00 maybe higher-frequency things are useful. If you are working with text data, natural language data, the words that are repeated all the time, the stop words, might not be very useful, but on the other hand words that are very rarely used would be quite useful. Then we have the ranges. We have very specific ranges for dates, times, strings,
- 28:00 - 28:30 and numbers. If you have [unclear] and so on, you would know that there are very interesting cases. If you are interested, look at how in the game series named Civilization, Mahatma Gandhi, the leader, is considered the person who is very associated with nuclear weapons. So as homework or a side
- 28:30 - 29:00 thing, look it up and see how not doing data range quality checking has resulted in one of the most hilarious and long-lived bugs in video gaming history. And then we have spaces between words and sentences. These are things that matter if you have strings: whether or not to remove leading
- 29:00 - 29:30 spaces is not a constant, right? If you are processing essays, maybe. If you are processing Python code, don't do that, don't remove these spaces, because then you will lose the nesting information. So most of the things that I say in this lecture, and in most of the other lectures as well, will be dependent on the situation. Then we have casing and character sets. Again,
- 29:30 - 30:00 this is for string data. You have uppercase and lowercase letters and the numerals. Do you want to keep it only alphanumeric, that means A to Z, then 1 to 9 and 0? Do you only need to keep the non-Unicode characters? That means: do you want to handle Unicode, or don't you want to handle
- 30:00 - 30:30 it? Do you want to convert everything to ASCII, or do you have a need for Unicode? If you are handling things in Sinhala, I mean, definitely you need to keep it as Unicode, right? Then the statistics that you derive from the data: minimums, maximums, the range of values itself. So you have the domain on one side, and then you can get the ranges, and you can calculate things like standard deviation, and then you can
- 30:30 - 31:00 normalize the data based on that. If you have a certain column, it is possible to take a z-value and normalize it. In some cases that might be important; in some cases it might be very important to remove the bias between columns, right? Next we will look at different types of data profiling. So here we have
- 31:00 - 31:30 something like single-column profiling. Single column would be assuming everything is the same type: the number of distinct values, max value, mean value, primary keys, foreign keys, that sort of thing. That is the cardinality part. Then you have patterns and data types, maybe string
- 31:30 - 32:00 versus number versus [unclear], and value distribution. You can think of address, phone, email, first name. On the multi-column side: uniqueness, where they have to be unique, and distinct counts; inclusion, where all values in A are also present in B, if you are looking at
- 32:00 - 32:30 multiple columns. So with single column, you are only considering one column when you are deciding, and here multi-column means you are looking at more than one column at a time. So inclusion, and then: whenever two records have the same X values, do they have the same A value as well? So let's say
- 32:30 - 33:00 if the country column has the United States, that forces the currency to be the dollar, right? So that is a multi-column operation. And the exchange rate is another example. So you can have the
- 33:00 - 33:30 US dollar to Great Britain pound rate, and then you have it flipped the other way, and they should have a relationship of x and one over x, right? Because otherwise it wouldn't work. Next we are going to look at the roles in data quality.
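The two multi-column checks mentioned above, the dependency check (records with the same X value should have the same A value) and the reciprocal exchange-rate check, can be sketched as follows. The column names, the tolerance, and the rate values are illustrative assumptions, not real market data.

```python
import math
from collections import defaultdict


def dependency_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


def reciprocal_violations(rates, rel_tol=1e-3):
    """rates maps (from, to) -> rate. Return pairs where rate * inverse is not close to 1."""
    bad = []
    for (src, dst), rate in rates.items():
        inverse = rates.get((dst, src))
        # src < dst makes sure each currency pair is reported only once.
        if inverse is not None and src < dst:
            if not math.isclose(rate * inverse, 1.0, rel_tol=rel_tol):
                bad.append((src, dst))
    return bad


# Made-up rows: the second one breaks the country -> currency dependency.
rows = [
    {"country": "US", "currency": "USD"},
    {"country": "US", "currency": "EUR"},
    {"country": "GB", "currency": "GBP"},
]

# Made-up rates: USD/GBP is consistent (0.8 * 1.25), USD/EUR is not (0.9 * 1.2).
rates = {
    ("USD", "GBP"): 0.8, ("GBP", "USD"): 1.25,
    ("USD", "EUR"): 0.9, ("EUR", "USD"): 1.2,
}

print(sorted(dependency_violations(rows, "country", "currency")))  # ['US']
print(reciprocal_violations(rates))                                # [('EUR', 'USD')]
```

A relative tolerance is used for the reciprocal check because stored rates are usually rounded, so an exact equality test would flag nearly every pair.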
- 33:30 - 34:00 You have the chief data officer, data stewards, subject matter experts, the data quality leader, capability analysts, technical specialists, process improvement specialists, and the data quality team. So remember, when I started this lecture I said this is going to be a very management-type lecture, and you can see why, right? This is the management structure of the people
- 34:00 - 34:30 in the data quality pipeline. All right, what is specific to Big Data? In Big Data we should care about the V's. We have been repeating this all over this lecture series, so data quality should be explained in
- 34:30 - 35:00 terms of volume, velocity, variety, value, and whatever other V's have been added subsequently to the question of Big Data. All right, so we are now
- 35:00 - 35:30 moving on. You saw who fills the data quality roles, and you saw that there are people there who have domain knowledge, not just data quality knowledge; this is why these two topics are interconnected. With that, we are now going to look at who is going to control the data, who is going to trust the data, and who is going to control the trust in the data. That falls under data governance and
- 35:30 - 36:00 stewardship. You can see data governance and stewardship here. We looked at data stewards under data quality as well, so hopefully you can see the connection. All right, so we have the data infrastructure, managed by the chief infrastructure officer, with information managers, data architects, and data modelers, who will be using Hadoop, databases, data
- 36:00 - 36:30 integration, and all the other technologies that we looked at: MLlib, Azure, and let's say AWS, whatever, all of it.
- 36:30 - 37:00 These people would be utilizing things like visualization and self-service business intelligence. So you already have that knowledge as an asset, because regardless of whether you would be using
- 37:00 - 37:30 it or not, you will be the ones who are building it for Big Data. So the goal of a chief data officer, CDO, is data collaboration here. There would be a data governance manager and data stewards under them, and they would be working with a data stewardship platform. So you can see
- 37:30 - 38:00 how the two ends of the things that you have learned so far, the very technical end of doing the analysis, storing, and processing, and the very human part of visualization and so forth, can finally be bridged together in this lecture with data governance and stewardship. As usual, it is the managers who are doing that, so thus we are discussing
- 38:00 - 38:30 these managerial ideas. All right, then let's look at the 10 elements of Big Data governance. First we have the data architecture. We need to care about what the structure of the data is, how we present it, how we store it, and how we handle it.
- 38:30 - 39:00 Then, how do we model it for our business needs? Data can be anything; it can be modeled into anything, it's like Lego blocks. So data governance should care about how the modeling is done. Next we have to think about where the data is stored. Are we doing RAID? Are we doing duplication, deduplication? Do we have sharding?
- 39:00 - 39:30 What type of data store are we managing, and what do we do in a failure? Who is responsible for the storage itself and for its security? And as we were just saying, we need to care about data security. We need to make sure that nobody hacks into our Big Data system and
- 39:30 - 40:00 steals all of our nice, juicy data. Then we need to care about integration and interoperability: whether the data and the insights that are derived from our data can be plugged into the overall business process that we have. Then we have
- 40:00 - 40:30 document and content. It goes two ways. One way is that we need to have documentation on the metadata and on what the data is, and on the other hand we need to know where the data is going, where the data is being handled, and so forth. So this is mostly the documentation aspect, and also the data descriptions,
- 40:30 - 41:00 if you have them. Then reference and master data. This might not be relevant to all cases; it is relevant when there is a hierarchy in how the data is handled and when some data is dependent, by function, integrity, or value, on
- 41:00 - 41:30 they are dependent on some sort of higher level uh data so for example when we were talking about exchange rates and uh monitoring variables and also uh states it is very dependent on the variables and data on the actual geopolitical information right so um it would be some master data which would contain uh these are the states of
- 41:30 - 42:00 United States so these are the uh monetary units used by various countries Etc so ik very simple and generic example that you can understand but uh so those are like worldwide constants right so uh do not be misguided that that is the only type of masturbator that exists there can be other Master data which is only relevant between your organization
- 42:00 - 42:30 um but the examples that I gave is understandable by everybody in any any case so um but what you call Master data the the level of specific City or the that definition where you draw the line uh it's very dependent on your particular application and especially your particular company right so then we have the policies of data warehousing and business
- 42:30 - 43:00 intelligence. It goes two ways; sometimes we even call the extraction of information from a data warehouse "data warehousing". So it pertains to where you put the data and how you put large amounts of data, because you don't need data warehousing to store 100 lines; it is only relevant when it is Big Data. How do you put it, and at what level: do you put raw data, do you put
- 43:00 - 43:30 processed data? Are you going to keep them, are you going to keep them in product files or not, and so forth. Business intelligence is the derivations that you are going to derive from them. If you do something like association rule mining, as I mentioned earlier, how are you going to handle that? Is it going to be part of it? Is it going to be open to all of the business processes that are there? Because
- 43:30 - 44:00 even that would sometimes be a derivation of the patterns in the data. Then we have the metadata. We mentioned this earlier, even under document and content: what type of metadata do we have? It plugs into almost all of the previous points that we have mentioned, and I don't think I need to take too much time explaining what
- 44:00 - 44:30 metadata is to you. And finally, data quality. As you have seen, data quality is an important part of data governance; the person who is in charge of data governance should make sure that the data quality is maintained, as we discussed earlier. All right, so now we get to
- 44:30 - 45:00 something that you may have already predicted: V's. If you are looking at Big Data governance, there are V's; there are always a number of V's in Big Data. Okay, let's see, we have the volume. [inaudible] here, which I have left as it is, but let's look at
- 45:00 - 45:30 volume: can you find the information that you are looking for? Then we have the value: can you find it when you most need it, or does it have value when you find it? So it goes both ways. Telling an Iron Age person how a nuclear reactor works
- 45:30 - 46:00 might not be very useful, but telling them how to make a certain projectile weapon might be useful. So even very useful information might not be useful, depending on the time at which you find it. It is an extreme example that I gave, but you get the idea. Then the veracity: are you sure that this is real? Are you
- 46:00 - 46:30 sure that the information that you're handling is fact? We actually looked at this earlier as well, when we were talking about the other V's. Sometimes veracity is even added to the first three V's [inaudible], given that we now live in the age of disinformation. Then visualization: can you make sense of it at a glance, if you want? Is it
- 46:30 - 47:00 possible? We looked at that. Variety, one of the good old three V's. [Music] As the example says, an image would be better than trying to write it down in various languages. Anyhow, velocity, again something that we looked at earlier: information has momentum. We have our
- 47:00 - 47:30 social media and so forth, which build up data very fast [inaudible]. Then we have new V's. Viscosity: does it stick with you? Does it call for action? Is this data mundane, something
- 47:30 - 48:00 that you expect, that happens all the time, falling into the pattern? Or does it give you an idea and an opportunity for you to change your business process, to change your ideas and how your work is organized, or at least sometimes how your data is processed? Then we have virality:
- 48:00 - 48:30 does it convey a message that can be put into a presentation or onto Instagram [inaudible]? Does the information that you are getting from this big data have novelty in the marketplace? These are all business problems, unless you are in academia. But even in academia you have to be able to sell
- 48:30 - 49:00 your idea to the reviewers who are reviewing your academic paper. If it doesn't have a punch, if it doesn't offer novelty, it is going to get rejected. Similarly, and even more so, in the business world: does it have a punch? Would it catch attention? Would the consumers care? That matters. Okay, pause this video
- 49:00 - 49:30 and do a little exercise: list any issues that are faced in handling data quality and governance for Big Data in terms of these eight V's.
- 49:30 - 50:00 So I'm going to continue; I hope you paused and did that little task. So what are the challenges? The diversity of data sources, as usual: [inaudible] bring various data types and complex data structures, and this increases the
- 50:00 - 50:30 difficulty of data integration. Your data might be talking about the same or similar ideas, but you would have a hard time putting them together. Then we have problems of volume. Remember, the first one is about variety and the second one is about volume, even
- 50:30 - 51:00 though the word is not mentioned explicitly up front; you see how it goes. It is difficult to judge the data quality within a reasonable time. Then we have velocity, where data changes very fast and the timeliness of data is very short. Things become irrelevant very fast. Let's say it is something like stock trading: either you use it and do
- 51:00 - 51:30 your stock trade or you lose it. It is use it or lose it in most of these business applications, and that requires you to process things very fast. All right, so in the same vein, we are going to look at the roles of the chief data officer as published by [inaudible] in the 2014
- 51:30 - 52:00 paper. There are three dimensions of the chief data officer that they looked at. There is the collaboration dimension, which goes inwards and outwards; then the data space dimension, traditional data and Big Data; and there is the value impact dimension of service and strategy.
- 52:00 - 52:30 So the CDO has the roles that are mentioned in the diagram: coordinator, reporter, architect, ambassador, analyst, marketer, developer, and experimenter. These are all business terms. And here there is the reporting: who would they report to? Who would the people in this particular
- 52:30 - 53:00 division of a company report to: the CDO, the COO, and so on. The CDO is the chief data officer, the COO is the chief operating officer, and the CFO is the chief financial officer. These are the people that a CDO would work under and report to on various data quality and data governance issues. All right, so
- 53:00 - 53:30 let's look at the data quality dimensions, the consistency-based idea. There is the contextual consistency, which is how the Big Data, the large data sets that we have, are used in the same domain of interest, as in
- 53:30 - 54:00 multiple data sets but in a single domain of interest, regardless of the format, the size, or the velocity of the data. Then we have the temporal consistency, where it is a matter of time and how the data comes in: it should be understood in a consistent
- 54:00 - 54:30 time slot, because, as I mentioned earlier, maybe with very extreme examples, we discussed how data from one time might not be comparable with data from a different time. Then we have operational consistency, which is the operational influence of technology on the production and
- 54:30 - 55:00 use of the data. So you can see how the subdivisions or subdomains of each of the V's that we usually base things on come into play in these consistencies. [Music] You can see that even the word
- 55:00 - 55:30 consistency is used within contextual consistency, because there consistency means consistency in the regular definition, where data should be consistent. So with the speed of the data that we are getting, let's say from the velocity column, the data should be substitutable for each
- 55:30 - 56:00 other; they should be comparable, they should be matchable. And again, with variety, you have to care that even though you have multiple sources and multiple types, they should work with each other. Those are some examples. I'm not going to explain each of the cells and each of the columns, but you get the idea of how it is structured.
- 56:00 - 56:30 So here is an example of a popular tool: Deequ at Amazon. This is how the internal data quality assurance application, the tool at Amazon, works.
- 56:30 - 57:00 It runs on Apache Spark. Starting out, they have the data, then they calculate the metrics that they have defined. The input is the data quality constraints, and those can come from business decisions or they can come
- 57:00 - 57:30 as suggestions from the data itself, so it goes both ways. They define the metrics, the metrics are calculated, and the constraints need to be verified. With the computed metrics they get the data quality reports, which are used by internal systems at Amazon.
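The flow described in this segment (define metrics, compute them, verify constraints, emit a report) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use the tool's actual API, and the function names `completeness`, `uniqueness`, and `verify`, the sample rows, and the thresholds are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not null."""
    if not rows:
        return 0.0
    return sum(row.get(column) is not None for row in rows) / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for value in values if counts[value] == 1) / len(values)


def verify(
    metrics: Dict[str, float],
    constraints: Dict[str, Callable[[float], bool]],
) -> Dict[str, Tuple[float, bool]]:
    """Check each computed metric against its constraint and build a report."""
    return {
        name: (metrics[name], check(metrics[name]))
        for name, check in constraints.items()
    }


# Hypothetical order records: one missing amount, one duplicated order_id.
rows: List[Row] = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 4, "amount": 7.25},
]

# Step 1: compute the metrics that were defined for this data set.
metrics = {
    "completeness(order_id)": completeness(rows, "order_id"),
    "completeness(amount)": completeness(rows, "amount"),
    "uniqueness(order_id)": uniqueness(rows, "order_id"),
}

# Step 2: constraints, here hand-written as if they came from a business decision.
constraints = {
    "completeness(order_id)": lambda value: value == 1.0,
    "completeness(amount)": lambda value: value >= 0.9,
    "uniqueness(order_id)": lambda value: value == 1.0,
}

# Step 3: verify and print a small data quality report.
report = verify(metrics, constraints)
for name, (value, passed) in report.items():
    print(f"{name}: {value:.2f} -> {'ok' if passed else 'VIOLATED'}")
```

Keeping metric computation separate from constraint checking means the same computed metrics can be reused when the constraints change, whether they come from a business rule or are suggested from the data.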
- 57:30 - 58:00 Your system, your organization's Big Data quality system or data governance system, does not have to follow exactly this, but it's a simple and elegant solution, so maybe it's a good starting point for you as well. All right, so there are two main ideas, two main
- 58:00 - 58:30 ways of data governance. The reason why this title has "shift" in it is that generally they started with this hierarchical data governance, which is also called system of record. I'll be using either of these terms in
- 58:30 - 59:00 the titles of the subsequent slides. This used to be the default way data governance was done. You have the CDO as coordinator, inward oriented. It is a very strict hierarchy. It is defensive: when something bad happens, they rush to fix it. And it's also based on scarcity: there are few consumers and even fewer producers.
- 59:00 - 59:30 There is a compromise, because this is the old system, which even predates big data, so we were losing some of the power and the versatility of even digitization. In fact, this has roots in the paper-and-pen days
- 59:30 - 60:00 of business, and it uses digital power to some extent, but not too much. It is not very scalable for Big Data, where we pay a lot of data scientists [inaudible]. So this is why you need a shift [Music]
- 60:00 - 60:30 to networked data governance, which is also called the system of engagement. See where the CDO is: the CDO is now an experimenter. Remember that 3D graph that I showed you: they have moved in the outward direction, the Big Data direction, and the strategy direction. It is offensive, it is value driven;
- 60:30 - 61:00 this is why you might have heard these words in business meetings a lot. "How do we create value?" is a very common question in business discussions. And now we are facing abundance: there are many people at the same level, because it's going to be a grid; you can see how it is arranged in a matrix pattern. There are
- 61:00 - 61:30 many producers. We say the data is democratized; that means each of these can be producers. There are no bread lines. Bread lines are [inaudible] basically queues waiting for a resource [inaudible].
- 61:30 - 62:00 So this is a bread queue; that is what this word is used for. Basically, that is the idea: for a resource, you need to wait until it is available in the previous system, but here you no longer need to do this. So you are customizing the business intelligence, and there is cheap digital power, because you have distributed it among your, basically, nodes, and many nodes would be servicing
- 62:00 - 62:30 many. It is no longer [inaudible], because it's no longer the responsibility of people who are sitting above you to provide you with services; any one of your peers who is capable of providing the service can provide it. And the support to the customer, on the one hand, [inaudible]: the customer doesn't have a single point of failure; they can be sent to different
- 62:30 - 63:00 nodes, many nodes, and this would increase the availability and reduce the time between the opening of an issue and its closing.
- 63:00 - 63:30 [Music] let's say, mostly utilized for access and support. All right, remember I said that I'd be using these terms a lot, so let's look at system of record and system of engagement. This analysis
- 63:30 - 64:00 is from this paper. We have hierarchical data governance, a.k.a. system of record, also called enterprise content management; these are almost interchangeable names. Similarly, networked data governance, a.k.a. system of engagement, and a.k.a. social business system. The hierarchical purpose is to control and regulate, because it is the old system where the power and control sit with the people who are at the top.
- 64:00 - 64:30 Then we have the top-down design. It is organized top-down and around discrete pieces of information that are called records. The decomposition happens in black boxes: people have no idea what is happening in systems that they have no access to,
- 64:30 - 65:00 and this can go up to the very top. And it presumes a big picture; that means people at the bottom level think that the people who are above them have a better understanding of the picture, and that is what they use to guide them. Examples of this are SAP, Workday, ServiceNow, and [inaudible]; I think some of you might be using these.
- 65:00 - 65:30 Next we have networked data governance. It has the purpose of innovativeness, and it is organized bottom-up in a decentralized way. It incorporates technologies which encourage peer interactions, it is leveraged by the cloud, [inaudible], and you have
- 65:30 - 66:00 emergence in this, because it is going to be a network of people and there would be emergent properties coming from that. Examples of this kind of solution are Slack or Confluence. Now let us look at a comparison of the system of record versus the system of engagement.
- 66:00 - 66:30 Here we have the system of record on the left and the system of engagement on the right. If you remember, the system of record is the old system and the system of engagement is the new system. You can see how the system of record is focused on transactions, control, and permanence as its qualities,
- 66:30 - 67:00 while, on the other hand, the system of engagement is more about collaboration, interaction, and ad hoc and open accessibility. Next we look at how trust can be digitized. In hierarchical data governance, so system
- 67:00 - 67:30 of record, it is established centrally by a competence center. So just like the control, the trust as well is going to be centrally managed. It is also possible to have externally appointed trustees with formal roles such as stewards, owners, and actors. On the other hand, for networked data
- 67:30 - 68:00 governance, so system of engagement, trust is more complicated, because everybody is a peer; they are equal. Then you need to know whether the data is a fact or an opinion, what the intention is, who this data belongs to, who is going to vouch for it, whether it is up to date, whether it has been certified through the standard process, and so forth. So while
- 68:00 - 68:30 it seems from the previous slides that it is a very good, free, and democratic approach to data governance, when it comes to trust and security and that sphere of influence you can see that it might be causing certain problems. Next we have hybrid data governance, which, as you may have guessed, is a mix of those two.
- 68:30 - 69:00 So critical assets are controlled top-down, as in hierarchical governance, and for the rest, the things that are not so critical or strongly regulated, you can have a peer-driven architecture to give more serendipity or freedom on a shared platform. So again,
- 69:00 - 69:30 the challenges here are the digitalization of trust, because even though you can certify the critical items from the top, you cannot certify the ones that are controlled by the peers. Obviously, they bring the situation or the arrangement that is
- 69:30 - 70:00 coming from each of their sides. And in the case of analytics, data valuation and metrics have to be married to the two types of governance as well,
- 70:00 - 70:30 and as you can see at the bottom, the problem of leadership and role transition, of who is going to control what, is also an important point here. So let's look at the system of record and the data sets: it is the authoritative source, because it's
- 70:30 - 71:00 going to be top-down. So all activities and information, and the surrounding data, its meaning and its use, work as the foundation. Then we have the process and the metadata. We need to know where the data comes from, so you have the data catalog and the data dictionary. Then you need to know what the data means, so you get the business glossary and reference data. Next you have to know whether the
- 71:00 - 71:30 data is right, so you have the policy manager, meaning the person who is going to be controlling this data. On one hand it is control, and on the other hand it is responsibility. So you have find, understand, and trust: these three aspects of data
- 71:30 - 72:00 governance are controlled and utilized in this structure. Next we have to look at the life cycle of Big Data ingestion. We have a new ingestion of data, which goes through discovery, collection, indexing, query, and
- 72:00 - 72:30 integration. After the integration step there shouldn't be any difference between this new data and the data that existed before. I mean, there would be time values and so forth, but the point is that after that step they should be indistinguishable from the data that already existed in your Big Data store.
- 72:30 - 73:00 Another thing that we have to look at is how the data quality is guaranteed here. We have the data lake, where you have all the data on an equal footing, and then slowly you move it upwards, with the innovation step, with the [inaudible] step, until you arrive
- 73:00 - 73:30 at actionable insight, because these are the things that the managers are interested in. Most of the time they are not interested in numbers and graphs and trends and so forth; you need to give actionable insight. So that is how it is done in value-based data governance: you move step by step until you can talk in the same language as people who are not
- 73:30 - 74:00 as technical as you are. In the case of the data valuation cycle, we have the consumption value. If we start at that point, you have to get the policies for data sharing, data handling, and the data lake. We'll start there
- 74:00 - 74:30 because we always say we are data centered and so forth. From there, if you start with the little pyramid with which we are now familiar, we have the prescriptive value, where you extract insights, derive use cases, and manage the algorithms that are relevant. Then we bring it up
- 74:30 - 75:00 into a more marketable domain, creating competitive value: how do we use these insights to improve the competitive value? And from competitive value you get operational value: how these values can be
- 75:00 - 75:30 applied for business use, or for handling event-driven programs and identifying critical data elements, which then results in the process and risk value. There you can decide on the risk management: how would
- 75:30 - 76:00 you reduce the risk that you have for the company, for your business center, and so forth, and how would you optimize your process? These then result in your data store, your data lake, and you start the process all over again. So this is how you go around the data valuation cycle and add value
- 76:00 - 76:30 to your data. All right, pause here and try to see whether you can suggest a new dimension, a new quality dimension, that I did not talk about so far. If you thought privacy, or privacy, depending on how you pronounce it, you thought the same as many other
- 76:30 - 77:00 people. So why is privacy important? Privacy is going to be an important part of data science forever, because data is something that people do not surrender willy-nilly, and some people are very sensitive about who has access to their data and who
- 77:00 - 77:30 controls it. Some people are okay; some people would post things on Facebook and so forth. But some people are very guarded about their data for various reasons. You can see celebrities have a reason to be worried about their private life, more so than a random person, but even among random people, some are very concerned about putting data in public. So that means
- 77:30 - 78:00 whoever captures this data should be very careful about the privacy of the people who are associated with the data that they collected. There are some rules and laws that govern personal data privacy. If you remember, very recently, well, not very recently, about one and a half years ago, you might have gotten a lot of emails
- 78:00 - 78:30 from websites that are hosted in Europe, because across Europe, the European Union changed some of the policies that they have for data privacy, and that resulted in all of them updating their terms of service about the big data that they collect on your behalf or about you on their servers. So data privacy: privacy preservation, or
- 78:30 - 79:00 privacy preservation, has various levels. You can do it at the data collection level, at the publishing level, or at the data mining level. At data collection time you can simply not collect certain types of data, things that are de-anonymizing. At publishing time you can remove the data columns or features that would be
- 79:00 - 79:30 distinguishing or de-anonymizing before publishing. And at data mining time, sometimes there are possibilities of reverse engineering the anonymized data to de-anonymize it, so you have to be careful at data mining time as well. This goes a little bit into
- 79:30 - 80:00 computational ethics as well. Since we do not have that as a class, I would suggest you look at data ethics for this particular question. All right, so what are the personal data attributes? First we have personal information identifiers, PII. We will be talking about this on several slides.
- 80:00 - 80:30 You can say personal information identifiers or, the other way around, personally identifiable information; either way it is going to be PII. Things like ID, name, and email are definitely going to be a direct culprit of being PII, so most of the time you are not supposed to release that type of information into a data lake. Then there are quasi-identifiers. They
- 80:30 - 81:00 are not directly identifying, but if you mix and match with other data sources you should be able to de-anonymize people. These are things like age, gender, profession, and race. Say you have one data set that gives the age, gender, profession, and race of certain people living in a certain area, and in the other data set you have the same information, the age, gender, profession, and race of the people,
- 81:00 - 81:30 and, let's say, some kind of behavior. Then you can put these data sets together to match who in what area has this particular behavior. Then there are the sensitive attributes, such as salary, relationship status, and diseases. Medical information especially is very heavily guarded in most developed countries. Then there are the
- 81:30 - 82:00 non-sensitive attributes, which is anything that does not fall into the above cases. So how do we manage these? The first thing is to suppress, by replacing the value with a star, a null, or something similar. Generalization is to replace with generalized values: instead of giving people's exact ages and so forth, you just give an average. Then there is swapping: you can swap some of the values with either
- 82:00 - 82:30 hard-coded values or randomized ones, but that could be tampering with the data, though not much in that particular sense. Then we have anonymization, sorry, anatomization. There you separate the quasi-identifiers and the sensitive attributes into different tables so you cannot match one with the other. You can still give the data out, but
- 82:30 - 83:00 not which row contains what value directly. If you break the table apart and do not have any, quote unquote, foreign keys, it would be okay. You definitely have to shuffle the rows as well, otherwise you can just put the tables back together, so
- 83:00 - 83:30 [inaudible], obviously. Then there is perturbation, which is basically replacing the original value of some sensitive attributes with some fake values. Again, if you do something like that, you have to mention it when you release the data set; otherwise other data scientists are going to run their systems on your data and get very bad
- 83:30 - 84:00 results, because these are random and fake values. Now, while we are talking about re-identification and consent: in [inaudible] this is not a big deal, but in European countries and the United States it's a big deal, where ZIP codes are used very much. They have found that using the ZIP code, birth date, and sex, they
- 84:00 - 84:30 would be able to identify about 87 percent of people and link them to their social security numbers. In the United States, let's say, the Social Security number is tied to your tax, your bank, and so forth, and then they can commit fraud and steal your money and so forth. But initially those fields are not considered PII, personally identifiable information, yet
- 84:30 - 85:00 they can be used to derive it. Then another incident happened in 2014: New York City released about 173 million taxi trips. It was just details about taxi trips; even the license plates and the taxi identifiers had been obfuscated as part of the anonymization process. But it was de-anonymized within hours of being released, and
- 85:00 - 85:30 people tracked and found where celebrities were living, where they had gone, who they were with, and so forth, within hours of the release of the data. So that was a huge privacy scandal. Informed consent: with human subjects, if your data is from humans, the humans should know about the experiment and must say that they agree with the experiment, and it should be
- 85:30 - 86:00 voluntary, and they must have the right to revoke consent at any time. That is what happened with Cambridge Analytica: they used the information that you provide to Facebook and the information that is gathered by Facebook, put it together, and decided who should be influenced for political reasons, for political campaigns, and so forth, and it caused a big scandal.
- 86:00 - 86:30 You need to balance the data and privacy. One thing I need to mention as well: in the taxi study, how they found celebrities was by cross-referencing with timestamped pictures of celebrities, taken by paparazzi and so forth, of them entering or exiting taxis, to map them together and find personal addresses, how much they tipped, and so forth.
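The cross-referencing described here is essentially a join on time and place. A minimal sketch, assuming entirely made-up trip and sighting records (the field names, cab codes, locations, and the three-minute tolerance are hypothetical and not taken from the released data set), might look like this:

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical "anonymized" trips: the cab identifier is obfuscated,
# but pickup time and place are published as they are.
trips: List[Dict] = [
    {"cab": "a1f3", "pickup_time": datetime(2014, 1, 5, 21, 14),
     "pickup_loc": "W 52nd St", "dropoff_loc": "Hudson St", "tip": 0.0},
    {"cab": "9c2e", "pickup_time": datetime(2014, 1, 5, 21, 40),
     "pickup_loc": "W 52nd St", "dropoff_loc": "E 9th St", "tip": 2.5},
    {"cab": "77b0", "pickup_time": datetime(2014, 1, 5, 21, 15),
     "pickup_loc": "Canal St", "dropoff_loc": "W 4th St", "tip": 1.0},
]

# Hypothetical auxiliary data: a timestamped photo of someone entering a taxi.
sightings: List[Dict] = [
    {"person": "Celebrity A", "seen_at": datetime(2014, 1, 5, 21, 15),
     "loc": "W 52nd St"},
]


def link(trips: List[Dict], sightings: List[Dict],
         tolerance: timedelta = timedelta(minutes=3)) -> List[Tuple[str, Dict]]:
    """Attach a name to a trip when exactly one trip matches a sighting."""
    matches = []
    for sighting in sightings:
        candidates = [
            trip for trip in trips
            if trip["pickup_loc"] == sighting["loc"]
            and abs(trip["pickup_time"] - sighting["seen_at"]) <= tolerance
        ]
        if len(candidates) == 1:  # an unambiguous match re-identifies the rider
            matches.append((sighting["person"], candidates[0]))
    return matches


matches = link(trips, sightings)
for person, trip in matches:
    print(person, "->", trip["dropoff_loc"], "tip:", trip["tip"])
```

Nothing in the trip table names a person, yet the time and place columns act as quasi-identifiers once an outside source carries the same two values.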
- 86:30 - 87:00 Then, here is an example of a data set with health records. There are names; this is a personal identifier, so this is definitely a dangerous field. Then we have the quasi-identifiers, which can help in finding the identifier. And then there is the sensitive attribute, which people do not want other people to know. So how you
- 87:00 - 87:30 can fix it is: you can generalize this type of data, you can suppress this type of data, and again generalize this type of data. All right, so there are various privacy models in the literature. There is k-anonymity. K-anonymity is
- 87:30 - 88:00 when the information for each individual cannot be differentiated from that of at least k minus one other individuals. So when you go to find or de-anonymize details, if you find at least k similar records, then you can say it has k-anonymity. So if
- 88:00 - 88:30 it has 1 million people and, when I try to de-anonymize your data, your record is one of 10 equivalent records, then it has 10-anonymity. Then we have l-diversity. L-diversity is when every group of tuples that share the same quasi-identifier values has at least l well-represented sensitive values. That means,
- 88:30 - 89:00 if all the quasi-identifiers are the same, there have to be at least l sensitive values in that section of rows, and they should be well represented; that means they should be class balanced. Then we have t-closeness. T-closeness is when the distance between the distribution of the sensitive attribute in this class and the distribution of the attribute in the whole table should
- 89:00 - 89:30 be no more than a threshold t. There are some examples here. This is a k-anonymous data set. Why is it a k-anonymous data set? Because, if the name column were not there,
- 89:30 - 90:00 you cannot distinguish any of these lines from each other we're using the policy identifiers to be unique right so all of these are the same so that means this is a full Anonymous data set there is nothing to be distinguishing so you can see that this is in if for a machine learning
- 90:00 - 90:30 perspective for a data scientist if you are looking at trade-to-day scientist perspective you are you must be thinking this is horrible this means we cannot do good prediction on this data yes it is true so you are sacrificing the power of your model the power of your system to save the privacy of the people about whom you have collected the data that is the whole idea you are making weaker and
- 90:30 - 91:00 weaker systems uh and having bad accuracy because you need to uh save previously but that is the that is a sacrifice that you will be asked to do then we have uh so next we have okay diversity so here it is uh it the same so you can see these are the same
- 91:00 - 91:30 right, these are the same, and you have at least three types of labels. Again, these are the same and there are three types of labels; these are the same and there are three types of labels. So this is a 3-diverse data set.
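The k-anonymity and l-diversity checks described above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the toy table, the column names (`age`, `zip`, `disease`), and the helper function names are all hypothetical, and l-diversity is taken in its simplest "distinct values" form, which does not enforce the class balance mentioned in the lecture.

```python
from collections import defaultdict

def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Group rows into equivalence classes keyed by their quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups

def k_anonymity(rows, quasi_identifiers):
    """Largest k for which the table is k-anonymous: the smallest class size."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Largest l (distinct-values variant): the fewest distinct sensitive
    values found in any equivalence class."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())

# Hypothetical, already-generalized table (names removed, ages and zips coarsened).
rows = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "20-29", "zip": "130**", "disease": "asthma"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "asthma"},
]

print(k_anonymity(rows, ["age", "zip"]))             # smallest class size
print(l_diversity(rows, ["age", "zip"], "disease"))  # fewest distinct labels per class
```

Both values are 3 for this toy table: each quasi-identifier group has three rows and three different sensitive labels, matching the "3-diverse" example in the lecture.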
- 91:30 - 92:00 all right, so privacy preservation should be capable of preserving privacy as well as data mining utility at the same time. That is what I mentioned earlier as well: you are sacrificing the goodness of your data for the sake of privacy, but it shouldn't be too much. You should not degrade the data so much that it becomes useless. So it's a balancing
- 92:00 - 92:30 act that you have to be very careful about. So what types of metrics can you use? You have the privacy metrics and the utility metrics, right, so you need to balance both of them. On one side you have the privacy metrics. We have the confidence level, where you can calculate the privacy confidence;
- 92:30 - 93:00 then we have the conditional entropy. If you are familiar with how entropy is calculated, it gives an idea of how much information is contained, measured against absolute chaos, right? So the differential entropy is the difference between the entropy values. Then hidden failure is the ratio of sensitive patterns that were meant to be hidden by the privacy technique but can still be discovered. So
- 93:00 - 93:30 when you say failure, from the other side it is a success at finding out data, but for the concern of privacy it is a failure. So all of those are there to make sure that privacy is guaranteed. And then on the other hand we have the utility metrics, which try to guarantee that the data set has not become useless
- 93:30 - 94:00 because of our attempts to make it more private. So we would do generalization and suppression counting: counting how many generalization and suppression operations were done. The loss metric gives you an idea, for each attribute, of how much information, how much diversity, you have lost because of what you did. Then there is a penalty for each tuple
- 94:00 - 94:30 based on how many other tuples in the data set are indistinguishable from it. So remember, on the other hand, we were trying to get this l-diversity and k-anonymity to make sure the tuples were the same as each other and indistinguishable, so that privacy is preserved. But here the loss metric, on the other hand, penalizes that indistinguishability.
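To make the per-tuple penalty concrete, here is a minimal sketch under one common formulation (an assumption, since the lecture does not give a formula): each tuple is charged the size of the group of rows it cannot be told apart from, so a group of size n contributes n × n in total. The function name and the toy rows are hypothetical.

```python
from collections import Counter

def indistinguishability_penalty(rows, quasi_identifiers):
    """Sum, over all tuples, of the size of the equivalence class each tuple
    belongs to. A class of size n contributes n * n to the total."""
    class_sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return sum(size * size for size in class_sizes.values())

# Hypothetical data: the same six people before and after generalization.
raw_rows = [{"age": a, "zip": z} for a, z in [
    (21, "13001"), (24, "13002"), (27, "13003"),
    (31, "14801"), (35, "14802"), (38, "14803"),
]]
generalized_rows = [{"age": a, "zip": z} for a, z in [
    ("20-29", "130**"), ("20-29", "130**"), ("20-29", "130**"),
    ("30-39", "148**"), ("30-39", "148**"), ("30-39", "148**"),
]]

print(indistinguishability_penalty(raw_rows, ["age", "zip"]))          # every row unique
print(indistinguishability_penalty(generalized_rows, ["age", "zip"]))  # two classes of three
```

The raw table scores 6 (six classes of size one) and the generalized table scores 18 (two classes of size three). A higher score means more rows have been merged together, which is better for privacy and worse for utility.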
- 94:30 - 95:00 So what you need to do is balance both of them and have decent values for both. Then we have KL divergence. It is not just a method for this; KL divergence is basically a way to see how two probability distributions differ from each other. So it is not only for this privacy business, not only for this utility checking; you can use it in any of your programs, any of your applications. It can even be used as a nice loss function
- 95:00 - 95:30 if you are doing machine learning. Basically, if you are predicting a probability distribution and you expect it to be a certain probability distribution, instead of checking label by label, what you do is predict a probability distribution and see how different the generated probability distribution is from what you expected. So in this particular case what they do is you have the
- 95:30 - 96:00 original table as a probability distribution and the sanitized table as a probability distribution, and then you see how divergent these two distributions are. That is basically the idea there. So finally, a few directions on privacy-preserving data in the case of data governance: privacy can be subjective. It depends on each person and the country and the
- 96:00 - 96:30 company and so forth. Most of the time the user has the ability to decide what type of privacy they want and what level of privacy they want. Sometimes these are overshadowed by the end-user agreement that you sign, the thing that you usually scroll down and say yes to. And on the other hand we have context-aware privacy as a future direction, using IoT. Again, as I mentioned,
- 96:30 - 97:00 these are buzzwords. It is about identifying privacy requirements based on the context, so even if the person does not say something, you should be able to derive the privacy requirements from the context, because you have seen similar incidents, similar comments, similar contexts before. So that is the end of
- 97:00 - 97:30 this section on how privacy has been factored into data governance. So here we have the references for this presentation, and this ends
- 97:30 - 98:00 both this lecture and the lecture series for this course. Thank you.
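As a companion to the KL divergence discussion in the utility-metrics part of the lecture, the comparison between an original table and a sanitized table can be sketched as follows. This is a minimal illustration under stated assumptions: each table is reduced to the empirical distribution of a single column, the sample values are hypothetical, and a small `eps` guards against division by zero when the sanitized table lacks a value that the original contains.

```python
import math
from collections import Counter

def empirical_distribution(values, support):
    """Relative frequency of each value in `support`, in a fixed order."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in support]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping terms where p_i is 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical sensitive column before and after sanitization.
original = ["flu", "flu", "cancer", "asthma"]
sanitized = ["flu", "cancer", "cancer", "asthma"]

support = sorted(set(original) | set(sanitized))
p = empirical_distribution(original, support)
q = empirical_distribution(sanitized, support)

print(kl_divergence(p, p))  # identical distributions
print(kl_divergence(p, q))  # grows as the sanitized table drifts from the original
```

A divergence of zero means the sanitized column has exactly the same distribution as the original; larger values indicate more utility lost. Note that KL divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` can differ in general.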