KIT+ 2 Runde: Data Privacy Workshop Insights

KIT+ 2 Runde Data Privacy and Governance Recording 1

Estimated read time: 1:20


    Summary

    In the second session of the KIT+ 2 Runde Data Privacy and Governance Workshop, hosted by the appliedAI Initiative, participants delved into the details of crafting a data strategy for AI use cases. The workshop emphasized understanding data quality, strategic data management, and the governance principles essential for effective AI adoption. Participants engaged in discussions and exercises, were introduced to advanced data handling techniques and tools for maintaining data quality, and shared their individual expectations for data strategy development. The session offered actionable insights into leveraging data as a strategic asset and preparing organizations for AI-driven transformation.

      Highlights

      • Participants explored the importance of data as a strategic asset and how it drives business success. 🚀
      • The workshop provided practical insights into creating a dynamic data strategy tailored to evolving organizational needs. 📈
      • Real-world examples highlighted the significance of data quality over sheer quantity, crucial for AI model success. ✅
      • Interactive sessions encouraged participants to evaluate their own data management practices and uncover improvement areas. 💡
      • Advanced topics like data lineage, transformation flows, and data quality checks were discussed, aiding robust data governance. 🔍

      Key Takeaways

      • Data quality is more important than quantity: less but better data boosts AI effectiveness! 📊
      • Public data sets can be handy but watch out for copyright issues before using them commercially! 🌍
      • Regularly assess your data governance strategy to adapt to new regulations and technologies. 🔄

      Overview

      The workshop kicked off with a look at the monumental role data plays in crafting successful business strategies, juxtaposing theoretical approaches with real-world applications. Participants introduced themselves, sharing their unique expectations and experiences with data governance, setting the stage for an engaging, collective learning experience.

        The heart of the session revolved around forming a robust data strategy, tackling the essence of data quality over quantity. The discourse illuminated how focusing on better data improves AI outcomes, a view supported by compelling examples where refined data drove superior predictive models.

          Wrapping up, the discussion transitioned to technical demonstrations on Azure Machine Learning Studio, showcasing tools and techniques in data transformation, cataloging, and governance. This concluded with a consensus on the importance of adaptable, well-governed data practices to harness AI's full potential.

            KIT+ 2 Runde Data Privacy and Governance Recording 1 Transcription

            • 00:00 - 00:30 Welcome, everybody, to today's data strategy workshop. This is the second iteration; luckily we communicated it correctly, so nobody from this morning's session is in this workshop as well. Over the next four hours we will talk about data strategy: what it is, what we can do with it, and why it is important in the context of building AI use cases for your companies.
            • 00:30 - 01:00 That will be the main content of today's workshop. With me is my colleague Anish, who will give a technical demonstration later on. Anish, you wanted to introduce yourself? Yes, hi, I'm Anish. I'm also working at appliedAI as part of the trustworthy AI team, and although our focus right now is actually on the AI Act itself, which maybe you've heard of,
            • 01:00 - 01:30 my background is more that of a technical data engineer, MLOps engineer, platform engineer. Looking forward to the workshop today. Very cool, thank you. So, two technical people talking about data strategy today, but we are experienced; we have already given this workshop a couple of times together. Looking forward to it. So, what will
            • 01:30 - 02:00 be the actual content, or rather the goals, of today's workshop? On the one hand, we want to talk about what a data strategy actually is, what kind of building blocks a data strategy has, and why it is important. Then we will learn some best practices in terms of data management and how we can apply data within a
            • 02:00 - 02:30 strategic framework, so to speak. We will also talk a little bit about data governance, and then we would like to apply what we have learned to a particular use case. This somewhat depends on how quickly we progress today and on whether somebody wants to step forward and present their own use case; if not, that is also totally fine. Nevertheless, we will have some exercises
            • 02:30 - 03:00 in between as well, so we encourage active participation in the workshop. This is also reflected in the mode we would like to establish today: in the morning workshop there were already some people who raised good discussion points, and we would like to encourage you to keep this up. If you have any questions, or if you are not quite sure how to apply the things we are talking about to your
            • 03:00 - 03:30 organization in particular, please don't feel shy to raise your hand and ask a question, or request that we go into detail on some aspect we didn't cover in so much depth. We can then dynamically answer your questions or talk about particular aspects of your organization if you want. And because we are only 28 people, I think there should
            • 03:30 - 04:00 be enough room for this, so don't feel shy. In parallel we will of course also work with the Miro board, where we will have a dedicated spot where you can put your questions. At the same time, don't feel shy to use the chat feature of MS Teams or just raise your hand, whatever is more comfortable for you.
            • 04:00 - 04:30 All right. To get everybody out of their post-lunch low, we have prepared a short introduction round on the Miro board. I will just paste the link here so you can access it, and I can quickly switch over. We have prepared a short introduction round here where we put down our image,
            • 04:30 - 05:00 what kind of role we have, and what our personal goals for the workshop are, and we would like to invite you to do the same. Just put down a sticky note with your name on it, your position, the company you are from, and what you would like to learn from today's workshop. I think we can take about five minutes for this, and everybody should be able to put down a
            • 05:00 - 05:30 note. All right, let's start. I hope you all have access to the board. Yes, perfect.
            • 05:30 - 06:00 Ah, okay. Those of you who were
            • 06:00 - 06:30 already in the workshop this morning: this is basically the same workshop, just in a different time slot, so you will not miss anything if you leave. For those of you who haven't been here this morning, welcome.
            • 06:30 - 07:00 This is the same workshop your colleagues may have attended. Of course you are always welcome to stay if you want to talk about things you may have missed in the morning, but if you have already heard the workshop today, there will be no new information for you.
            • 07:00 - 07:30 Maybe, if you see your colleagues on
            • 07:30 - 08:00 the board, you can put your notes next to your colleagues' as well; then we can go through one company at a time
            • 08:00 - 08:30 and introduce ourselves in a minute.
            • 09:30 - 10:00 Yeah, cool, thanks a lot. If you
            • 10:00 - 10:30 are still typing, that's totally fine, you can just finish. Maybe we can start a short introduction round here on the left side and then work all the way to the right side with Max Berger. And because you're on the left side,
            • 10:30 - 11:00 maybe Jennifer you want to start um yeah I'm Jennifer from maxb and I'm an application developer there um and yeah I hope that we learn how we sort data right and how we can um yeah manage them correctly to use it later
            • 11:00 - 11:30 yeah cool thank you do you want to give over to one of your colleagues and I'll take over Hi Felix parties uh AI engineer from Marx ble um the topic of data is very um yeah very pressing at the time
            • 11:30 - 12:00 because we are also um designing our very own data strategy and yeah perfect placement for this uh course today for this Workshop um D it you go on okay hello everyone uh to introduce myself uh I Am David de um I'm uh from Max
            • 12:00 - 12:30 ble. My current status at Max ble is that I'm a trainee at Max ble, and I also study AI, so I'm a trainee AI engineer, and yeah, for the question of data,
            • 12:30 - 13:00 um it's very important uh for me uh how to handle data correctly and uh some kind of uh strategy therefore yes okay then I will close from from our end hello everyone I'm Stefan lutka I'm also from Max ble from the maxb windi and there I'm the head of the lean management for for the wind
            • 13:00 - 13:30 department and um yes data is the base not just for AI but also of all kpis which are very interested or interesting in lean management um from the lean management perspective and for today I hope to learn more about yeah just data strategy awesome thank you and cool to have four of you on board today uh then maybe we can go to the right side and we
            • 13:30 - 14:00 can continue with convo and I think the first person from convo here on the right side is Marcus hello everybody my name is Marcos hell from conam um I'm responsible for Automation and digitalization uh uh projects at conam and currently we are part of the project K transfer BL and we are on the very beginning of the our journey so uh for me it's important to
            • 14:00 - 14:30 know what is a best practice way to get an overview of our data which we have collected currently so we are thinking about different ways but I hope we get some answer today thanks cool you're welcome uh then Matias from conver is also here hello everybody my name is Matias from conm I'm in the planning team from the
            • 14:30 - 15:00 production and I also do my first steps uh with uh Ai and I'm very interested in it and hope to learn as much as as possible yeah thank you yeah hi I'm I'm Stefan D I'm the director operations at convos um yeah for me in uh in the first step it's uh also to get
            • 15:00 - 15:30 knowledge about best practices and uh not only how to collect data but also how to I don't know analyze or sort uh the data in the right way so that it can be used for AI Ki uh applications so I'm as as well as the others it's it's our first steps in this direction and yeah very interested
            • 15:30 - 16:00 in yeah cool thank you and nice to have you on board then uh we can continue with Bela some people from Bela were already present in today's morning workshop but never well nevertheless um I would like to welcome our Bela participants thank you so my name is Alexander Kel um coming from Bela I'm a chemist and uh in technical support for our customers so uh leading
            • 16:00 - 16:30 chemistry products to application of these chemical products at our customers which is based on um polymeric conversion um so for me of Interest would be R&D data so quite complex data something like with pictures and numbers and um well something coming from the machine something coming from regulatory um and uh comb comination of
            • 16:30 - 17:00 these thank you hello my name is Rina I'm a controller I'm also working for Bala on a group level and uh I'm a controller so my day-to-day business is data and I'm very much interested to learn how to improve how we manage our data in general yeah cool welcome welcome on board uh let's see if we can answer some of your questions today uh we can continue with MRA
            • 17:00 - 17:30 game hello my name is Patrick pitel I work at mft game I'm a software engineer and I hope to find out more about how to identify the data that is actually useful that we need to have to train AI uh later on
            • 17:30 - 18:00 all right um hi my name is rugas I'm also um software developer at marov and and I'm very much interested in learning very broadly about best practices regarding data so you cannot train EA model without data so it's uh uh important topic hello everyone I'm uh L robenstein from marra as well and I'm a software developer and I'm interested um how to
            • 18:00 - 18:30 get um yourself to structure data and um how you can use this data to train AI hello everyone I'm Richard I'm an application expert at marra and I'm interested in uh general information about data strategy and management I think the perfect one for this Workshop perfect you're welcome cool um we can just continue with Max
            • 18:30 - 19:00 pman hello I'm Max um I'm the CEO of Max pman uh gamea and we want to have a straight plan for our data and I hope to get a good overview today uh from the management approach of this topic yeah hi my name is Stan K from the max max G I'm the technical person for
            • 19:00 - 19:30 data and digitization, and my boss Max pman and I have the same questions and the same goals, of course, and we are very interested in that. Yeah, cool, welcome. Then we can continue with vanel, who I think already has somebody who is dedicated, like an employee, Jana I think, whom we met in the
            • 19:30 - 20:00 morning who is already yeah the dedicated employe for data strategy yes right so hello together my name is link Dominic I'm working as a Cloud solution architect Advan um we are responsible for the whole asure and uh Microsoft 365 environment and um yeah from today uh I want to explore how we can use or Define strategic data management for AI purposes so for
            • 20:00 - 20:30 example how can we handle different Source systems or data from different systems put them together and uh yeah prepare them for some AI tools or some use cases all right then we move on to kro GMB hello um I'm Michael I'm a product manager at comro that's a local internet
            • 20:30 - 21:00 provider um I'm curious about uh data management and preparation and what kind of possib possibilities it could open up yes that's me and I think there's some one of your colleagues here both I think yeah this is wasle also from the product
            • 21:00 - 21:30 management of Comal, and my colleague already made the point, and Stefan dor had the same points too: first, how to collect the data right, and how to prepare it for use in any AI. That's what we want to know first. Yeah, we have some slides prepared in
            • 21:30 - 22:00 terms of data collection this will be a topic later on uh so I hope we can provide you with some nice information on that all right last but not least um a uh maybe Nico you just take points and then hand over to your colleagues uh yeah I can start um hi I'm Nico gaer I'm from e in alburg um I'm International sales assistant and pretty much why I'm here is because I want to
            • 22:00 - 22:30 find out how I can best introduce AI to my workspace in sales and marketing in general yes then I'll hand over to M hi everyone my name is Matt tolu I'm also working in e as a controller and basically it's just like in my job data is key and so I think I might learn some interesting things about
            • 22:30 - 23:00 it and then I'll hand over to Victor yes hello so I'm also from E like my other two colleagues um I'm an apprentice right now at e and I'm also interested in the use cases of AI in general and what um want to get to know some other um point of views of AIS and what you can do with it yeah cool did I forget
            • 23:00 - 23:30 anyone? So a couple of people also joined while we were doing the introductions. Ah okay, and I also see one hand; yeah, Bruno, go ahead. Hello everybody, you forgot me, but it's no problem. Dominic Link is also from vanel; I'm the process manager for the service department, and normally I work with SAP ECC, C4C and FSM, and we are at the beginning of our AI journey,
            • 23:30 - 24:00 and I hope to get a better understanding of how to use AI and data management for better processes, thank you. Yeah, cool. Is there somebody else who joined late and didn't put their sticky note on the board? Okay, this doesn't seem to be the case, then I will just switch over to our slides
            • 24:00 - 24:30 again, and we will dive into our content directly. What we will talk about today: first off, I want to make data as an asset a little bit attractive to you and talk about why data might be valuable for your company. I think we are all familiar with the phrase that data is the new oil; I wouldn't phrase
            • 24:30 - 25:00 it as strongly as that, but still, we will look at a few statistics and surveys covering why data might be an asset and under what circumstances. Then we will talk about data strategy: what it is and why it makes sense to have one. Then we will talk a little bit about the means of data collection, so how we can actually get to data and how we can store it
            • 25:00 - 25:30 somewhere where we can retrieve it. Then I have one slide about how much data is enough data, which is more about the rules of thumb we can use to think about when enough data has been collected for a use case. Then we will talk about data quality, combined with where we can store data and in what kinds of formats. Then Anish will give a short demonstration of a data lineage
            • 25:30 - 26:00 or data tracking tool, and after that we will talk a little bit about data governance. I will talk until the fourth chapter is finished, and then Anish will take over for the rest of the day. As I said, if you have any questions in the meantime, just feel free to raise your hand and we will try to answer them right away. All right, without further ado, let's
            • 26:00 - 26:30 look at the importance of data. The surveys that we see here on screen I also have in link format, so you can look at them if you like, maybe in the breaks. What all these graphs basically tell you is that the importance of data and of data-driven decision-making has only increased over the last couple of years. Here on the right side you can see
            • 26:30 - 27:00 surveys from 2018 and 2019, so already five years ago, but the trend has continued and the importance of data has only grown. In general we started with the trend of big data in the early 2010s, I would say, where it was more about leveraging very early machine learning models, maybe
            • 27:00 - 27:30 for some automated decision-making, and now we have arrived at the age of AI, where the AI algorithms we can leverage to realize use cases are already very powerful. But of course they also require vast amounts of data, or data of high quality, to make them as powerful as they can be. So much for the importance of data,
            • 27:30 - 28:00 and then on the other side we of course want to link this to business concepts, or business success. Here on the left you can also see the results of a survey where participants were asked how significantly data governance and data strategy paradigms have shifted their strategic position on the market, and the results of this survey
            • 28:00 - 28:30 show that many companies saw a significant improvement in generating a competitive advantage based on what they can derive from the data they have collected over time. The bottom line is that if you tap into the potential of data, you might be able to significantly expand your position on the market,
            • 28:30 - 29:00 depending on what kind of company you are and what the market conditions are, of course. But if you start thinking from a data perspective, you might be able to identify weak points in your processes, develop new products for your customers, or see how your business is developing in general. These are all very powerful perspectives to take, and data is the fundamental
            • 29:00 - 29:30 substance, or resource, that you need in order to take these steps. This is also reflected in this quote from Andrew Ng, one of the leading AI researchers in the world at the moment, who also affirms the results of these surveys:
            • 29:30 - 30:00 data is very important, but at the same time data is basically the entry barrier to shaping a business in the direction of being enabled by data. So this is the very high-level overview of things, but if we now take a look at the tasks that technical personnel do
            • 30:00 - 30:30 in order to achieve this goal, these are typically the data scientists working for your company. If we look at surveys conducted among data scientists and ask them which task they spend the most time on, they say 60% of their time is devoted to cleaning and organizing data. And if you ask them what the least
            • 30:30 - 31:00 enjoyable part of their job is, they say cleaning and organizing data isn't really the fun part of the job, to be honest. So we see there is quite a large need for companies to develop a strategy to alleviate some pressure from the data scientists, so they can shift their focus from just cleaning and pre-processing data towards actually building applications
            • 31:00 - 31:30 and solutions that help your business achieve the goals you want to achieve. For this we first need to look into what data can actually do for us, and therefore we switch over to the chapter on data as an asset. First of all, the question arises, and I think most of you will already be familiar with this, of what constitutes data.
            • 31:30 - 32:00 Here we mainly talk about digital data, of course, so information that is stored in a digital manner, for example in a numeric format, but maybe also in other encodings, for example JPEG for images or MP4 for video files. This information is then stored electronically at some location, for example on the computers of your
            • 32:00 - 32:30 individual employees, at data centers, or maybe even in the cloud, depending for example on what kind of CMS you're using, where the systems are located, and how you access them. Typically, data is generated through IT systems or IT-related systems. Especially if you come from a more core-engineering business, you also handle lots of
            • 32:30 - 33:00 sensor data from your production machines, or maybe from chemical reactors, for example. Of course these sensors are also integrated into an IT system, but the data they generate or collect comes from an underlying physical process, whereas in other organizations most of the data maybe comes in
            • 33:00 - 33:30 an unstructured format, for example emails or transcripts of meetings, where it is more about the language stored in this information. Bottom line: data is some form of digital information, and I think we are all familiar with this. Before we go into how we can organize data or think about data in different ways, I would like to invite us to take
            • 33:30 - 34:00 a step back and do a short exercise on the Miro board, where we think about the attributes of data. What kind of attributes can data have? An example would be the source, so where data comes from, but maybe also the format in which data can be stored, or the type of data we are dealing with, so is it image data or some other type. For this we can go to the Miro board again
            • 34:00 - 34:30 and go to the bottom of the introduction frame. I will scale this up, and maybe we can take seven minutes or so to think about what kind of attributes data might have, and then, as Anish showed me this morning, we can use Miro AI to cluster this and see what comes out of it. All right, I would say seven minutes
            • 34:30 - 35:00 should be enough, and if you cannot think of attributes, just be inspired by what other people write on the board.
            • 39:00 - 39:30 I think... oh, okay, no, people are still writing.
            • 39:30 - 40:00 That's fine.
            • 41:00 - 41:30 All right.
            • 41:30 - 42:00 Well then, Anish, you can take the cards and let the AI do its magic. This usually takes a couple of minutes, or maybe 90 seconds or so; let's see what the AI can come up with in terms of categories. I think there are actually more
            • 42:00 - 42:30 sticky notes now than there were in the morning, lots of different ideas about data. I see that some of you have put down the actual file type, for example JPEG or CSV; this of course plays a role. Then, on a more in-depth level so to speak, what kind
            • 42:30 - 43:00 of data type is being stored within the file, for example a string or an integer. Other people have thought more about data in terms of, for example, legality. We also see an attribute "historic data", so where does the data come from and
            • 43:00 - 43:30 at what point was it generated within the company's data history. We can also see that people thought about whether data is complete or incomplete, whether it is structured or unstructured, and, I think I saw this somewhere, I'm not sure what "locked" means, but maybe accessibility plays a role
            • 43:30 - 44:00 here. By the way, we do have the results; it was instant this time actually. Ah okay, I see, nice. I think it shows quite similar categories to last time. All right, so the AI in its wisdom has said that we have a category "data management"; here it put down some notes saying
            • 44:00 - 44:30 creation date, change date, timestamp, time-based data. I would concur; I think data management, or how to manage your data, plays a role as an attribute of data. Then file formats, I think we've already talked about that; media types and file formats I would just bunch together right now, so in what kind of format has the data been stored. Then data quality: here accuracy
            • 44:30 - 45:00 plays a role, variance, completeness and incompleteness, consistency, relevance. I think relevance is a very good point, way too small a note for its significance; how relevant is the data for a specific use case? We will talk about this a little later as well. But also, which department is managing or generating the data. And then file types, I think we've
            • 45:00 - 45:30 talked about this already. Then characteristics; this also makes sense, though maybe I would call it a little differently, and maybe this also belongs in the legal domain, whether it's personal data that we are dealing with or not. Then data size of course plays a role: in what kind of size is my data being
            • 45:30 - 46:00 stored, is it something I can store on my hard disk or something I actually need to store in the cloud. Then metadata: timestamps, authorship, which we have already seen before, also very important; and timestamps in a video format, for example, also add metadata. In the end it added an uncategorized
            • 46:00 - 46:30 category: origin, incoming channel; I think this is especially relevant if you are streaming data from a device, or from multiple sensors, for example. "Stored on many separate systems" maybe goes in the direction of backups of your data. "Product category" maybe goes more in the direction of sales, for example. And then
            • 46:30 - 47:00 "who is it for"; I think this plays a major role when trying to scope what the use case of our data should be. Cool, thanks a lot. I think we have covered many aspects, and maybe there are some aspects we haven't covered that we still want to talk about today, but you can see that we can view data from many different perspectives and of course incorporate these different
            • 47:00 - 47:30 perspectives into our data strategy as well. In the remainder of today's session we want to take a deeper look into how we can select different aspects of data to incorporate into our consideration of a data strategy, in order to make data inherently more usable within our organization. With this I switch back to the presentation. I think we've already talked about some aspects
            • 47:30 - 48:00 of this, but let me introduce a couple of terms here; in the morning workshop this term already came up a couple of times. When we think about data, one attribute we can ascribe to it is whether or not it falls into the category of master data. This category usually describes
            • 48:00 - 48:30 data that stays constant over a long period of time. This could be, for example, data about our customers, which customers we supply with our product; of course the same goes the other way around for suppliers, describing where we source our raw materials or the
            • 48:30 - 49:00 prefabricated components that we want to use later, maybe ingredients or chemical base products, for example. It could also describe our organizational structure, which stays constant over a period of time. Basically, master data is everything that can give you an anchor: if you are looking at a particular set of data points and
            • 49:00 - 49:30 these data points contain some information, and additionally there is information about, for example, the supplier that some components came from, and the data is related to these components, then you can automatically infer in what time period this data was created. Additionally, if we stay on this
            • 49:30 - 50:00 temporal dimension of data, we can think about transactional data. This, on the contrary, describes data that is changing constantly. An example could be the number of hours worked, or the amount of components we sold to a customer; maybe our gross income changes over a period
            • 50:00 - 50:30 of time. So basically everything that changes from day to day. Last but not least, one term we can use to describe data on this dimension is a data set. This is basically just a collection of data points that we have sourced
            • 50:30 - 51:00 from somewhere the data was stored before. Think of it either as a complete table from a database, for example, or as a data set that is not necessarily comprised of raw data but rather of the relationships between variables within an original data set or data table.
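To make the master/transactional distinction concrete, here is a minimal pandas sketch (an editorial illustration, not part of the workshop materials; the tables, column names, and values are invented): slowly changing customer master data acts as the anchor that transactional order records are joined against.

```python
import pandas as pd

# Master data: stays constant over long periods (customers, suppliers, org structure).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alpha GmbH", "Beta AG", "Gamma KG"],
    "region": ["Bayern", "Hessen", "Sachsen"],
})

# Transactional data: changes constantly (orders, hours worked, revenue).
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 1, 3, 2],
    "amount_eur": [1200.0, 450.0, 980.0, 2100.0],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-02-20", "2024-03-02"]),
})

# A derived data set: joining transactions against master data provides the
# "anchor" described above, e.g. revenue per region over a time period.
dataset = orders.merge(customers, on="customer_id", how="left")
print(dataset.groupby("region")["amount_eur"].sum())
```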
            • 51:00 - 51:30 Two other dimensions to think about when we talk about data are structured data and unstructured data. Structured data describes everything that can basically be put into a table format, so sales data for example; everything that is stored in a database can be, or usually is, in a structured data format. And
            • 51:30 - 52:00 unstructured data, on the contrary, is data that basically doesn't have an inherent structure to it. I think images are a very good example of this: the information captured in an image comes from the structure of the object that is displayed in the
            • 52:00 - 52:30 image, but the image file itself doesn't have an inherent structure. If you want to work with unstructured data, AI is usually, not every time, but usually, a very good tool for working with it.
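A small sketch of the same point in code (illustrative only; the file contents are made up): structured data arrives with an explicit schema, while unstructured data carries its information in the content itself and needs a processing step, often an AI model, to impose structure.

```python
import io
import pandas as pd

# Structured data: rows and columns with a well-defined schema.
csv_bytes = b"machine_id,temperature_c,vibration_mm_s\nM1,71.2,0.8\nM2,65.4,1.3\n"
structured = pd.read_csv(io.BytesIO(csv_bytes))
print(structured.dtypes)  # every column has an explicit type

# Unstructured data: no inherent table-like schema; the information sits in the
# content (free text, pixels, audio samples) and structure must be imposed later.
email_text = "Hello team, machine M2 was vibrating strongly during the night shift."
print(len(email_text.split()), "tokens of free text, no schema attached")
```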
            • 52:30 - 53:00 Then, on the other hand, we can think about how data is transferred into our usable domain. What this means is that data may be transferred in so-called batches; if you have a machine learning or AI background you know the term, but in this case it doesn't mean mini-batching, it just means that you ingest data into your system at regular intervals. This could be that you, for example,
            • 53:00 - 53:30 make a backup of all the emails that have been sent over the last three weeks, or it could mean that you store data on a device in your production facility and then download this data onto a master server or into your cloud system to store it for later. On the other hand of this batched format, you can also stream data. This would be the case, for
            • 53:30 - 54:00 example, if you have a camera surveillance system, which usually streams data and also stores it in some capacity, or especially if you're working with sensor data from production machines: you can stream the signal from your production machines and then make inferences on this streamed data later down in your data processing pipeline.
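The following sketch contrasts the two ingestion modes described above (an illustration under assumptions: pandas is available, the landing directory and sensor values are hypothetical, and the stream is simulated in-process rather than coming from a real message queue or device gateway).

```python
import time
from datetime import datetime
from pathlib import Path

import pandas as pd

def batch_ingest(landing_dir: Path, since: datetime) -> pd.DataFrame:
    """Batch mode: at regular intervals, collect everything that accumulated
    since the last run (e.g. a nightly job over exported CSV files)."""
    frames = [
        pd.read_csv(f)
        for f in landing_dir.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime) >= since
    ]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

def sensor_stream():
    """Streaming mode: records arrive one by one and are processed immediately
    (simulated here; in practice this could be a message broker or device gateway)."""
    for i in range(5):
        yield {"ts": datetime.now().isoformat(), "vibration_mm_s": 0.5 + 0.1 * i}
        time.sleep(0.1)

if __name__ == "__main__":
    # df = batch_ingest(Path("/data/landing"), since=datetime(2024, 3, 1))  # hypothetical path
    for record in sensor_stream():
        if record["vibration_mm_s"] > 0.8:  # simple on-the-fly inference on the stream
            print("alert:", record)
```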
            • 54:00 - 54:30 These are just some dimensions to think about when you're thinking about data, and on the Miro board we've already collected some other aspects, for example the legality aspects or the accessibility aspects. We can use this to extend our thinking to big data. If we are thinking about big
            • 54:30 - 55:00 data, we can think about the five Vs: volume, velocity, variety, veracity, and value. When thinking about big data, I think intuitively we already think about volume, because big data just means lots of data. In this dimension we basically think about how much data is being stored somewhere, and not
            • 55:00 - 55:30 necessarily in terms of file size, but also in how many files we save. For example, if we store a million images somewhere, we can consider this a big data set. But also, when it comes to mixed data environments, where we have
            • 55:30 - 56:00 lots of tables stored somewhere and then lots of images somewhere else, if the volume increases it brings with it some challenges, but also some opportunities to integrate the data into a use case later on. Velocity, on the other hand, means how quickly the data is arriving at your endpoints. This can be in a batched format or a streaming format, but it can of
            • 56:00 - 56:30 course also be in a static format, where something has been put into a table and this table then exists for 20 or 25 years; you can access it directly, but it stays more or less constant. This also corresponds to variety: here we think about whether or not our data is structured, whether it is stored in this table format or whether the information is not inherently
            • 56:30 - 57:00 organized by table rows and columns. You can of course also have semi-structured data, which is partially structured: you have some data model, for example for your backend server, that stores images, or the references to images, in a database, and there you have to handle the meta information that is stored in the database, but then you have the link to the image, which is
            • 57:00 - 57:30 an unstructured data format. Then veracity, and this was also already on the Miro board I think, is all about data quality, the accuracy of the data, and also its trustworthiness. In this aspect we
            • 57:30 - 58:00 think about data from the perspective of how it has been created: are the tags that have been put on the data, the metadata maybe, coming from a source that is trustworthy within my organization, or is this something that has been created on the fly, maybe semi-automatically, where we don't know what kind of quality the data has or whether it can be trusted? This is also
            • 58:00 - 58:30 a dimension of data we can think about. And then, in the end, the value. Here it very much depends on the use case you have; this attribute of data is more dependent on your business case. For example, if you have stored data about the reliability of your production machines, or from
            • 58:30 - 59:00 sensors that detect vibrations in your production machines, then this data is very valuable for a predictive maintenance use case. But if you want to build a retrieval-augmented generation system, like a chatbot that is supposed to interact with customers, then this data is of course not very valuable for you, because you don't want to give it to your customers, and large language models probably won't
            • 59:00 - 59:30 understand the raw data from sensor readings anyway. So when thinking about big data, or about these different data dimensionalities, we can think about them in terms of these five Vs, and they of course have a relationship to one another: the volume determines, and is determined by, the velocity with which the data arrives, which determines the
            • 59:30 - 60:00 variety, which in turn determines the veracity, which ultimately determines the value, of course. Okay, last but not least, we have heard about the five Vs, but I also want to introduce you to the concept of FAIR. This was suggested by the European Commission in 2018. What does it stand for? It stands for findable, accessible, interoperable,
            • 60:00 - 60:30 and reusable. What does this mean? Findable, I think, is very intuitive: is the data actually findable in my data system, and can I identify it? Then accessibility: can my data be accessed, is it possible to retrieve data from a system, but also, do I have the necessary
            • 60:30 - 61:00 authentication procedures in place to detect who is allowed to access my data and who is not, and do I have standardized communication protocols in place to make this data available? When you think about it from a software engineering perspective, this would be, for example,
            • 61:00 - 61:30 organization-internal guidelines on how to build APIs: if you have data stored on a server, how can employees who are not on premise, maybe out in the field trying to repair a machine, access this data via interfaces?
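As a hedged sketch of what such an internal API guideline could translate into (assuming the FastAPI library is available; the endpoint path, header name, token value, and sample payload are invented), the idea is simply a standardized interface plus an authentication step that decides who may retrieve the data.

```python
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def check_token(x_api_token: str = Header(...)) -> str:
    # In a real setup this would validate against your identity provider,
    # i.e. decide who is (and who is not) allowed to access the data.
    if x_api_token != "example-field-service-token":
        raise HTTPException(status_code=401, detail="not authorized")
    return x_api_token

@app.get("/machines/{machine_id}/sensor-readings")
def sensor_readings(machine_id: str, _token: str = Depends(check_token)):
    # Standardized response format so field technicians (or other services)
    # can retrieve the data without knowing where it is physically stored.
    return {"machine_id": machine_id,
            "readings": [{"ts": "2024-03-01T10:00:00", "temperature_c": 71.2}]}
```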
            • 61:30 - 62:00 Then the interoperability part is about how you can integrate data with one another. This applies more to the process: the analysis workflows, storage procedures, and processing steps that can be used. The interoperability dimension basically means that if you have these things in place, you can
            • 62:00 - 62:30 source data from very different sources and then integrate them into a use case, where you can make use of very heterogeneous information in order to derive some value from it. And then, last but not least, the reusability aspect. This goes in the direction of creating meaningful metadata, where the metadata
            • 62:30 - 63:00 describes your data well, or adequately. With this you can replicate experiments that you want to do with your data more easily, you can find data more easily, for example with a metadata search, and you can enrich the information stored within a single data point. So it makes a lot of sense to think about data from this reusability perspective.
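A minimal illustration of what such meaningful metadata could look like (editorial sketch; the fields, data set, and contact details are invented): a small catalog entry that can be searched without opening the raw files.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetMetadata:
    name: str
    description: str
    source_system: str   # where the data was generated
    owner: str           # who to ask about context and quality
    created: str         # ISO date
    restriction: str     # e.g. "internal use only", GDPR relevance
    schema: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)

catalog = [
    DatasetMetadata(
        name="press_line_vibration_2023",
        description="Vibration readings from press line 3, raw and unfiltered",
        source_system="sensor gateway, plant 2",
        owner="maintenance@example.com",
        created="2023-11-01",
        restriction="internal use only",
        schema={"ts": "datetime", "vibration_mm_s": "float"},
        tags=["sensor", "predictive-maintenance"],
    ),
]

def find(tag: str) -> list[dict]:
    """Metadata search: locate candidate data sets without touching the raw data."""
    return [asdict(m) for m in catalog if tag in m.tags]

print(find("predictive-maintenance"))
```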
            • 63:00 - 63:30 A procedure where you don't think about this so much would be that you collect data at some source, you directly want to use it for a use case, you experiment around with it a little bit, and then just throw it in the cloud without any additional information. A couple of years later, whoever looks through those historic data points will not be able to
            • 63:30 - 64:00 understand what this data is about. So, reusability makes a lot of sense. These are the different frameworks or perspectives that we can use to look at our data and get a rough estimate of whether or not the data we have stored somewhere can hold any value for us. If you maybe go into your own
            • 64:00 - 64:30 data management systems, or, as a more technical person, maybe even look into your own databases, you can use these frameworks to evaluate whether the data you have at a certain location may already hold some value for you. These are the very broad guidelines, but of course we want to go a little deeper,
            • 64:30 - 65:00 and for this we will now switch over to an excursus on what a data strategy actually is. But before we do this, I would like to give us the opportunity to ask some questions at this point, so if you have any questions just shoot, and I can drink a sip of
            • 65:00 - 65:30 tea. Dominic, did you have a question? Okay, all right, then I will just move on with a short excursus on what a data strategy actually is. I think this already came up in
            • 65:30 - 66:00 the introductory round: the phrase "without data there is no AI", and this is of course very true. So what do you use a data strategy for? In the short term, you can use a data strategy for planning an AI project, or for starting to plan an AI project, essentially, because even if you think of super cool AI use cases,
            • 66:00 - 66:30 if you don't have the data to support them, they will never be put into productive operation and will never actually help your business achieve its goals. This also comes from the paradigm of "trash in, trash out": if you have data but the data quality is not good, then of course your AI system will not learn
            • 66:30 - 67:00 how to make use of the data, and it cannot provide the value you intended it to. Our conclusion from this is that, in the whole process, data is one of the biggest leverage points for improving your AI algorithm, and with that also the
            • 67:00 - 67:30 performance of your use case. So in the short term, a data strategy can provide you with an overview of the data you already have and of its quality, and you can estimate how much the quality of your data may impact your AI project. In the long term, if you think about establishing a data strategy, you can think about it from
            • 67:30 - 68:00 a more business-oriented perspective, and this may allow you to automate some decision-making processes. But of course the requirement here is that you have a high degree of trust in your data, so that these decision-making processes promise a certain degree of accuracy, or a certain degree of assurance that they don't do something
            • 68:00 - 68:30 you don't intend them to do. There is also a different dimension of a data strategy: one side is more about the business value, and the other is more about the organizational aspect, the question of how you can move data through your systems and organization in order to get it to the points where it is needed. Often, when we think about
            • 68:30 - 69:00 AI applications, we have this picture in mind where we just have one department that develops AI algorithms for us, and this department then produces products for our customers, but maybe also products to be used internally. This could be, for example, a dedicated large language model as a coding assistant: if your organization
            • 69:00 - 69:30 has a particular way of writing code, then this large language model can maybe augment programming personnel to write good and reusable code. But this is not necessarily the case all the time, because sometimes you may also have other departments that are very reliant on some data. One such department could be a controlling department, for example, where
            • 69:30 - 70:00 they want to look at different data sources in order to assess the health of the company and see where processes may take up a long time, or additional resources that have not been accounted for. All of this of course goes more in the direction of business intelligence, or BI, but it also needs to be thought about in the context of artificial intelligence, because sometimes you may want to
            • 70:00 - 70:30 roll out an AI solution in one particular department and give the employees there control over the AI system, and of course the data then also has to reach them in that department. So we can already see that a data strategy has multiple different aspects to it. But for us, I think, in the KIT+ program, the short-term aspect is the more tangible one, where even a very basic
            • 70:30 - 71:00 data strategy already lets us derive a lot of value, because we can start evaluating which use cases make the most sense to implement within a shorter time frame, and we can basically start to evaluate what kinds of steps need to be taken to make data more usable that, in the first iteration, cannot be used for the first use case we want to implement within the KIT+ program,
            • 71:00 - 71:30 but maybe the second use case can then use this data. All right, just to drive this point home again: I think the people in the audience who are AI engineers already know this, or have experienced it; data-related issues in AI projects are actually very
            • 71:30 - 72:00 common, and they come from many different directions. Sometimes, also within KIT+, we have this phenomenon where we talk to our companies and they say, yes, we have super nice, super high-quality data, no concern at all, and then when we go into the development phase they realize, oh no, we don't even know where this data is stored, or, if we know where it is stored, the quality of the data is actually not what we
            • 72:00 - 72:30 anticipated it to be. This is also reflected in these points, which we collected as a short overview of where these problems with data can come from in AI projects. I already mentioned the lack of data quality, but also understanding the context of data; this is especially problematic if you're working in a very specific domain that the AI engineers may not be
            • 72:30 - 73:00 familiar with. I could imagine that for Bela, for example, you reside in the chemical industry domain, where you have to have a lot of knowledge about chemical processes and how things work on the chemical or physical level, and as an AI engineer, maybe you have some natural sciences background, but you don't have a deep understanding of
            • 73:00 - 73:30 chemical processes. Here, if you don't have data that is categorized with the domain knowledge from the get-go, then AI engineers are a little bit lost and can't necessarily work with it. Additionally, of course, you have the phenomenon that data influences the quality of your AI software, and here we have the points of data drift and concept drift. Data drift
            • 73:30 - 74:00 describes the phenomenon that you may have trained an AI algorithm on some data set and put it into production, and the AI algorithm performs quite well for the task you wanted it to do, but over time its performance degrades. This might come from a slow but steady change in the data that is being fed
            • 74:00 - 74:30 into the AI algorithm, because some processes might change over a long period of time, and the AI algorithm, because it was trained only on a specific time frame, didn't learn to adjust for this. So you need to retrain your AI algorithm on more recent data in order to make it as performant as it was before. The same also
            • 74:30 - 75:00 applies to concept drift. Because AI algorithms fundamentally rely on statistical analysis, if the AI algorithm has learned a particular relationship between two variables within your data set, the relationship between those variables may also change over time. For example, two variables
            • 75:00 - 75:30 that were closely related, where the value of one variable was strongly determined by the value of the other, may see this relationship change over time, and then your AI algorithm cannot adequately respond to these changes and you need to retrain it again. So the more your data represents the underlying real world, the higher its quality and the higher the performance you can expect of your AI software.
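A simple way to watch for this kind of drift is to compare the distribution of a feature at training time with the distribution seen in production. The sketch below (illustrative; it assumes NumPy and SciPy are available and uses synthetic data) applies a two-sample Kolmogorov-Smirnov test to a single feature.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature distribution the model was trained on ...
training_feature = rng.normal(loc=50.0, scale=5.0, size=5000)
# ... and the same feature as it arrives in production months later,
# after the underlying process has slowly shifted.
production_feature = rng.normal(loc=53.0, scale=5.0, size=5000)

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"possible data drift (KS={statistic:.3f}, p={p_value:.1e}) - consider retraining")
else:
    print("no significant drift detected")
```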
            • 75:30 - 76:00 Last but not least, the last two points are also important. Data versioning goes into the domain of reproducibility, where you can reproduce certain experiments that you did; an experiment in this context of course means building an AI solution for a particular problem,
            • 76:00 - 76:30 where you run an experiment with a particular AI model, feed your data into it, and see how it performs. With data versioning you can see how your data evolved over time. Let's say, for example, you start with a particular data set and you train your AI algorithms on it, and later on you realize you actually have another source from which you can obtain
            • 76:30 - 77:00 more data points from the same domain; you add these to the already existing data set, retrain your AI model on it, and then you can compare it against the AI models trained on the previous version of the data set.
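Dedicated tools exist for this, but the core idea of data versioning can be sketched with nothing more than a content hash recorded in a manifest (an editorial illustration; the file names are hypothetical): every trained model can then note exactly which data version it was trained on.

```python
import hashlib
import json
from datetime import datetime
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    """Content hash of a data file; identical bytes always yield the same version id."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def register_version(path: Path, manifest: Path = Path("data_versions.json"), note: str = "") -> None:
    """Append the fingerprint to a manifest so experiments can reference an exact data version."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append({
        "file": str(path),
        "sha256": dataset_fingerprint(path),
        "registered_at": datetime.now().isoformat(timespec="seconds"),
        "note": note,
    })
    manifest.write_text(json.dumps(entries, indent=2))

# Example (hypothetical file): register_version(Path("training_data.csv"), note="added second sensor export")
```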
            • 77:00 - 77:30 And then, last but not least, legal risks of course also exist when you try to process certain data. Usually we talk about this in the context of the DSGVO, so personal data, but of course the EU AI Act also plays a role, and we will actually talk about this in another workshop coming up later this year. All of these problems we of course want to mitigate, and for this we need the data strategy. So, having talked a lot about why we need a data strategy, let's now look at what a data strategy actually is. A data
            • 77:30 - 78:00 strategy is more of a dynamic process than something you set up once and never touch again. In this dynamic process you think about all the cornerstones, all the aspects that influence how data is collected, stored, and processed within your organization. Additionally, you
            • 78:00 - 78:30 think about the governing aspects of data, about how you can make the potential of data transparent and usable for your organization, and, if you later use data to evaluate the state of your company, for example, also about how you want to steer your company in a certain
            • 78:30 - 79:00 direction um I think like if we if we think about this from a sales perspective in this particular domain uh this makes a lot of sense if we can see over a long period of time that um some customers tend to buy less and less from us then maybe um we can use this as an indicator to then motivate a strategic shift of our organization into new markets or to retreat from markets that may not hold such a large business um
            • 79:00 - 79:30 potential as um they had before right so um it's a it's a very like holist holistic perspective on uh data so um not only from this technical perspective of like how we want to manipulate it or how we want to use it within an AI algorithm but also um on this very strategic level level um of um how we um yeah want to want to bring data to the
            • 79:30 - 80:00 Departments that that need uh specific um data points um or how we can actually use this data in order to make more strategic decisions um further down the road so um this is a very like high level concept so let's take a look at the building blocks of a data strategy um so here on the right side you can see a nice figure or illustration of it and in the core we can see that uh data
            • 80:00 - 80:30 quality plays a central role within a data strategy. What does data quality mean in this context? It means that you minimize biases within your data, augment existing data with metadata, store it in an easily or at least adequately accessible location, and
            • 80:30 - 81:00 generally keep this quality, this excellence aspect of data, in mind when you set up processes within your company that ensure the quality of your data is high or can be made high. Another aspect, now more in the middle ring of the data
            • 81:00 - 81:30 strategy, is acquisition and processing. These aspects all revolve around how you develop effective prototypes and algorithms around your data. This is usually the domain of AI engineers, or perhaps data scientists, who try to extract as much value from existing data as possible. Another aspect is
            • 81:30 - 82:00 context: understanding in what context the data was created, but also in what context it stands in relation to other data points, makes it more reusable and reliable. Then storage plays a role: in what format are files stored at a very granular level, but also what is your general storage strategy, for example do you have
            • 82:00 - 82:30 a data lake, do you have a data warehouse? This determines the format and the way you store and load data in a use case pipeline, for example. These aspects are covered by the storage dimension. It also influences how easily data can be accessed and what kind of speed you will have: for example,
            • 82:30 - 83:00 if you store it in a cloud, it is easily accessible basically all over the world, whereas if you store it on a local hard drive it might not be very accessible, and you will need somebody who physically goes to the data server, pulls out the hard drive or a physical copy, and provides it to you in some other way. Last but not least in this
            • 83:00 - 83:30 middle circle is provisioning. This is all about optimizing the process of how data can be accessed, and about implementing safeguards. It also goes in the direction of having a very efficient way of authenticating the people who want access to your data; if you have that covered by the provisioning aspect of the data strategy,
            • 83:30 - 84:00 then people can have easy access to the storage and can rapidly prototype new solutions using the data in the acquisition and processing step. At the very end we have the management and security aspect, which basically covers the entire data strategy. This revolves around data security; intellectual property rights, for example, play
            • 84:00 - 84:30 a role here, where you establish guidelines on how employees can work with data that is very sensitive, but also how you want to set up your organization, in the sense of how the data strategy is used within it: is it supposed to play a contributing role or a central role in the strategic decisions of
            • 84:30 - 85:00 your company? These aspects are covered by management and security. All of this together builds the data strategy: at the core lies ensuring that data quality is present or enhanced; the four building blocks in the middle ensure that value can be derived from data and that
            • 85:00 - 85:30 costs are minimized, or that resources are spent where they make the most sense; and the overarching shell around all these aspects ensures that your data is kept secure, that you adhere to certain regulatory standards, and that you use your data within your organization in a way your organization can actually handle.
            • 85:30 - 86:00 All right, then maybe a question to you as the audience: what is your status quo with respect to having already established a data strategy? Is this something you have already actively thought about, is it something you are already implementing, or what is the status there? Maybe someone from the audience
            • 86:00 - 86:30 can update us on their ambitions. Feel free, otherwise I will just pick someone.
            • 86:30 - 87:00 Okay, then I'll start: Max B.? Yeah, awesome, cool, yes please. Currently we don't have a data strategy in its whole, but we are well into development, I'd say, and have many of the points that you
            • 87:00 - 87:30 explained today that we want to work out and get behind. Cool, and maybe a question to you: how did you start with this? How did we start? We started from the application side; we had some
            • 87:30 - 88:00 input from ideas that we wanted to implement, and we got to the point of acknowledging that there wasn't a fitting data infrastructure at the time, and so we started the whole process of getting a data strategy for the whole
            • 88:00 - 88:30 company. Yeah, very good. And this would actually also be the suggestion that we have: to start with an application case, because otherwise it might not be very clear where the value of a data strategy lies. I think this also applies, and I
            • 88:30 - 89:00 always say this, to AI use cases in general: you want to start with a lighthouse project where you can show the company that you as an organization can do AI, that you can derive value from using AI at certain points. And, as was also a point in the big discussion in this morning's session,
            • 89:00 - 89:30 afterwards you start the long process of, for example, breaking open silos in your organization and establishing something like a data strategy task force, once the value has already been proven, once a small-scale project has shown that value can be derived for your company, and then you try to iteratively build on this success with more
            • 89:30 - 90:00 and more projects, or with a finer and finer data strategy. I think the illustration here at the bottom makes this point quite clear: if you approach data in a non-strategic way, you think about
            • 90:00 - 90:30 what you can do to the data, here it's cleaning, validating, controlling, and protecting, but this is very small-scale and doesn't really unlock the full potential of what data can do within your organization. Whereas with a more modern data strategy, if you think proactively about data, you think about data from a
            • 90:30 - 91:00 business perspective: you think about what you want to do. Then you can say, okay, we want to attract new customers; maybe we are in a somewhat volatile market, we don't keep customers over a very long time, and therefore we constantly need to attract new customers anyway, so we will just start from there. What can data do to help us achieve the goal of
            • 91:00 - 91:30 attracting new customers? Then you can go through all of these steps: what kind of data do we have, do we need to pre-process it, what is the quality of the data, and so on. Maybe you arrive at some kind of AI use case later on, you implement it, and then you can see: with all the steps we have taken, we have made the raw data we collect anyway usable for achieving our
            • 91:30 - 92:00 business goals. Cool, it's always nice to hear that everything we talk about in theory is also grounded in practice. All right, I think this is the last aspect of what a data strategy actually is, and it is essentially the thing we have talked about
            • 92:00 - 92:30 already, only taking one step back. We don't think about a data strategy only from a business case which then validates the data strategy; we can also integrate it into our AI use cases. I hope that when you select your use case later on, in the use case ideation workshop, you also already think about what kind of data
            • 92:30 - 93:00 strategy you can experiment with based on the AI use case you want to do. In the first iteration of the KIT+ program, many companies wanted to build retrieval-augmented generation use cases, basically an internal chatbot for the organization, and our experience is that many companies struggle with collecting all the data resources for feeding the retrieval-
            • 93:00 - 93:30 augmented generation model in order to make it performant on some knowledge domain. So maybe this is something you can already think about: perhaps you want to build a retrieval-augmented generation solution anyway because it is your ambition, but you can think about how a data strategy, or an initial data strategy, the first start of your data strategy, can be built up
            • 93:30 - 94:00 based on this use case. I think this makes a lot of sense to do in the context of the KIT+ program. All right, I've been talking for one and a half hours, or we've been in this workshop for one and a half hours. What do you say, shall we take a short break, maybe ten minutes or
            • 94:00 - 94:30 so? Yeah, I see thumbs up, makes sense. Cool, then I would say we meet again in ten minutes, and then I will talk a little bit about the means of data collection. See you in ten minutes.
            • 103:00 - 105:00 All right, I hope you're back. Okay, cool, I hope you had a nice coffee
            • 105:00 - 105:30 break. Okay, let's continue with the means of data collection. First of all, one question we need to tackle right away is whether or not we should collect all the data we generate at our company. Of course there are
            • 105:30 - 106:00 advantages and disadvantages to doing this, so maybe we can start with the disadvantages. Depending on how you would like to store your data, collecting everything incurs costs. On the one hand this is reflected in the actual storage space you need to pay for, be it with a cloud provider or physical hard drives in your server racks, but also
            • 106:00 - 106:30 on the maintenance level, and this might be passive costs through personnel who need to be tasked with it: creating a good data security paradigm, creating backups, building infrastructure, supporting interfaces and APIs throughout your organization, maybe even giving internal workshops, depending on the size of your organization,
            • 106:30 - 107:00 so that the other departments are made aware of what data is being stored and how they can access it. Then of course there are legal risks associated with this, especially when you are storing personal data; this goes in the direction of the DSGVO, for example, where you need to check whether your data handling is actually compliant with all the regulations and rules. And of course there are always risks: the more
            • 107:00 - 107:30 data you store within your storage framework, the greater the risk that somebody with unauthorized access can collect more and more data from you, and maybe build a picture of the financial situation of your company, or of where you are currently vulnerable on the market, things like that. And in the end, if we want to maintain a high data quality standard, the more data we
            • 107:30 - 108:00 collect, the more difficult and costly this becomes, and therefore we might not want to collect all the data that we generate within our organization. On the flip side, on the positive side, there is the phenomenon that storage space becomes cheaper the more data you want to store. Usually, when you get in touch
            • 108:00 - 108:30 with cloud providers, for example, and you want to buy many thousands of terabytes of space, this is usually more cost-efficient per storage unit than if you just want to save, say, two or three terabytes of documents. Then, if you
            • 108:30 - 109:00 store and manage your data well, maybe there are no use cases derivable from it in the short term, but in the long run you may identify new use cases that you can realize right away because you have stored enough data over a longer time frame, and this also makes bigger projects feasible. Last but not least,
            • 109:00 - 109:30 if you store all your data, you have the advantage of a big archive being available, and at some point you may have a very specific use case, or a very specific question from one of your customers, say about an order they placed 20 years ago, and then you can easily go into your data storage and retrieve this information.
            • 109:30 - 110:00 Additionally, and this is not mentioned on the slide, there are of course regulatory requirements to store certain data for at least a year before you can delete it again. All right, what ways do we have to acquire data and to make data sources available to us? We have listed three ways of doing
            • 110:00 - 110:30 this here: one would be the use of free and public data resources; the second would be to develop partnerships, either with external organizations that specialize in selling data or with other companies or entities that provide data for us; and last but not least, creating data on your own. In the first step, looking
            • 110:30 - 111:00 at free data sources, there are of course many different data repositories available on the internet. For example Kaggle, which should be known to everybody in the engineer ramp-up, is a website where many different data sets are stored; you can use these data sets to do some prototyping, for example, and download them easily.
            • 111:00 - 111:30 Then there are also open data sets from Amazon, for example; Google also provides data sets for research purposes; and there are data sets, as well as code for processing data, available on GitHub. We've listed some examples here. The advantage of this is that you have a huge variety of data to choose from, and maybe
            • 111:30 - 112:00 you will find some data that is relevant for your use case. Say you do some image classification: maybe you can find samples online, so you don't need to rely on your own data set only and can augment it with them. And of course, relying on open-source or open data
            • 112:00 - 112:30 allows you to do rapid prototyping of machine learning applications, where you can simply try out how well a machine learning model performs on an open data set, or maybe you don't even need to run your own experiment but can rely on information that is already online to see what the performance of a model looks like. This allows more rapid prototyping without needing to tap into your own data and do
            • 112:30 - 113:00 all the cleaning steps before you do any exploration of what kind of machine learning model you want to use.
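            A minimal sketch of that rapid-prototyping idea, using a dataset bundled with scikit-learn as a stand-in for data you might download from a public repository; the model choice is just a quick baseline, not a recommendation:

                # Minimal sketch: rapid prototyping on an openly available dataset
                # before touching (and cleaning) any internal data.
                from sklearn.datasets import load_breast_cancer
                from sklearn.linear_model import LogisticRegression
                from sklearn.model_selection import train_test_split
                from sklearn.metrics import accuracy_score

                X, y = load_breast_cancer(return_X_y=True)
                X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

                baseline = LogisticRegression(max_iter=5_000)  # simple baseline model
                baseline.fit(X_train, y_train)
                print(f"Baseline accuracy on open data: {accuracy_score(y_test, baseline.predict(X_test)):.2f}")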
            • 113:00 - 113:30 On the other hand, a problem with open data sets is that they might have been published under a copyright framework that makes them unusable in a commercial context. You really have to watch out for this: if you want to bring something into production using openly or publicly available data, make sure it has been published under something like an Apache or MIT license, so that you can actually use it within your business applications. Another problem, and this is very domain-specific, is
            • 113:30 - 114:00 that most of the time you will probably not find exactly the type of data that you have in house, especially if the use case is very specialized. I could imagine, for example, that in the chemical industry you might find some data usable for graph neural networks, going in this polymer direction, but
            • 114:00 - 114:30 maybe you are a producer of very specific chemical products and information on them is not available, perhaps because it is actually your intellectual property. So there are cases where it makes more sense to start working with your own internal data right away instead of searching the internet for data that is sufficiently close to what your data
            • 114:30 - 115:00 looks like. And one big thing in the end: because you do not control how the data was created, there is always some uncertainty left about its quality; sometimes the data quality is not easily assessable, you cannot see whether the data is of sufficiently high quality, and then using public data
            • 115:00 - 115:30 sets comes with its own risk. All right, that is everything on public data sets; do you have any questions about them? Okay, then we can continue with developing partnerships with external organizations. This might be
            • 115:30 - 116:00 sensible in some circumstances, for example where you don't currently have access to the data that you want to use for a specific use case, be it that you haven't collected it yet, that you haven't pre-processed it to the quality standard you set, or that you are simply not able to collect it in some way.
            • 116:00 - 116:30 In these kinds of situations it might make sense to go into partnerships. For example, and we listed this here as one of the potential partners, if you go into a partnership with one of the regional centers or with universities, maybe they have laboratory equipment that you do not possess internally, and they can provide you with lots and lots of images, say electron microscope
            • 116:30 - 117:00 images of some material you want to produce, and maybe you want to go into a long-term partnership with them so that they provide the data for your specific use case. The advantage here is of course that you can externalize costs a little bit, especially when it comes to buying high-end, expensive equipment for creating data points, and you gain
            • 117:00 - 117:30 some time with it. For example, if you have unstructured documents internally and you need time to structure them in order to realize a certain use case, but you want to realize an AI use case in the near term, then you can partner up and realize this AI use case with a different data source. So the advantages
            • 117:30 - 118:00 are that you can often get data quite easily and reliably, and you can communicate what kind of data quality you want from the partner. The disadvantage is that setting the boundaries in the negotiation with the external organization might not be very clear, especially when it comes to processing internal
            • 118:00 - 118:30 documents, where data protection plays a role and intellectual property might play a role, and here you really need to negotiate the ins and outs of the services that the external organization provides to you. Another contra point would be that this creates a
            • 118:30 - 119:00 tendency to stay locked into the model of the data-providing organization: we build an AI use case on top of the data that an external provider provides for us, we put this use case into production, it helps us out, it generates business value for us, but because we are relying on the data being
            • 119:00 - 119:30 provided by the external provider, we are not able to develop this use case further, or we will always carry a lost revenue stream, a financial inefficiency, within our organization. All right, last but not least, we can of course create our own data for AI applications. This is very much dependent on the AI
            • 119:30 - 120:00 ambition that you have, or the AI vision as we call it here, basically what you have already created in your ambition workshops. This process is very specific to the use case you want to realize, and it hinges on what kind of data is needed to realize the use case in the end. The advantages are of course that you have control over the entire pipeline, meaning that you can
            • 120:00 - 120:30 control its quality aspects, and you know at every step where data came from, what has been done to it, and what kind of AI application it has been fed into, increasing the reliability of your AI application and also the flexibility of its creation and building process. But of course this is naturally very cost-intensive, high
            • 120:30 - 121:00 maintenance is associated with it, and you may need to establish the infrastructure internally: on the one hand the infrastructure for actually collecting the data, for example building or buying sensor equipment if you do sensor-based data collection, or you may need to write some
            • 121:00 - 121:30 additional code that collects documents from specific places and puts them into a joint repository, or something like this. So these are the three ways of data collection that we see. Now, coming to data labeling: this is a paradigm that applies to all the different ways of data collection. What does data labeling mean? It basically means that,
            • 121:30 - 122:00 aside from metadata, you assign a ground truth value to your data, thereby making the data more usable. To give an example: if you have images of correctly produced products in your production line, and products with a fault, you
            • 122:00 - 122:30 would give these images to an expert, and they would label each image as a faulty, incorrectly produced part or a correctly produced part. This process is called labeling because you are attaching information, context information that is specific to the use case, to a sample, to a single data
            • 122:30 - 123:00 point. This extends from computer vision, where it is all about images, over to natural language processing, where you mark a certain passage in your text and attach additional information to it, for example that this is a customer email and this is an internal email. You can also think about it in terms of audio processing: if you
            • 123:00 - 123:30 have recordings of some production facility and you label them as "this is how the production facility sounds at 100% utilization" and "this is how it sounds at 85% utilization", you can then use the audio files to determine whether everything runs smoothly in your factory, and here you would label the audio files accordingly.
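            What such labels often look like in practice is simply a mapping from each raw sample to its ground-truth class. A minimal sketch for the faulty/OK visual-inspection example; the file names and label values are invented:

                # Minimal sketch: a label manifest mapping raw samples to ground-truth classes,
                # here for the faulty/OK visual-inspection example. File names are hypothetical.
                import csv

                labels = [
                    {"image": "line3/part_000123.png", "label": "ok"},
                    {"image": "line3/part_000124.png", "label": "faulty"},
                    {"image": "line3/part_000125.png", "label": "ok"},
                ]

                with open("labels.csv", "w", newline="", encoding="utf-8") as f:
                    writer = csv.DictWriter(f, fieldnames=["image", "label"])
                    writer.writeheader()
                    writer.writerows(labels)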
            • 123:30 - 124:00 Labeling can be quite labor-intensive, especially when it is about labeling a large amount of data, so the question arises: how can you get the right labels? The most straightforward thing to do is to do it yourself. This requires you to tell your employees to go through a large data set and assign the correct
            • 124:00 - 124:30 labels to it. Here the question always is: are the labels already stored somewhere within your company, and can you leverage them easily? In the example of faulty and non-faulty parts coming from the assembly line, you probably already have some form of quality control, and an additional step
            • 124:30 - 125:00 here would just be to look into the database of the quality control team and extract the labels they have already assigned to the faulty parts; then you don't need to revisit your past data with an expert, but can collect the labels at the point where the data is generated anyway. Was there a
            • 125:00 - 125:30 question? No, okay. Another way of labeling data would be to contract people, for example freelancers, who do the work within your company; maybe you can also outsource this, for example to a startup that wants to do it for you; or you can use an automated tool for labeling. Here the
            • 125:30 - 126:00 question always is: if there already exists a tool that can label something for you automatically, does it do so well enough to give high-quality results with a low error rate? And if there is already a tool that knows how to label your data, is there already a solution on the market that can do the whole task for you? If it is a
            • 126:00 - 126:30 classification task, there might already exist a tool on the market that can do the classification right away, because buying a tool that labels the data automatically for you and then throwing all your data into it just to train a machine learning model on a classification task that the automatic tool does anyway makes no sense; that is a loop, a snake
            • 126:30 - 127:00 eating its own tail. And last but not least, there is of course the crowdsourcing option. In a business context you can do this with, for example, Amazon Mechanical Turk or Upwork, where you give the data set to another company together with a description, and this company distributes it over a vast number of people, who are all tasked
            • 127:00 - 127:30 with labeling the data for you. They do it manually, but you don't have to give the data to your own employees; somebody external does it for you. The problem with crowdsourcing, if you are considering it, is that everybody who participates in the crowdsourced labeling has a different background, a
            • 127:30 - 128:00 different level of knowledge about the topic, and may have different opinions on a given subject. Here, for example, there is just an example sentence, and the crowdsourcing participants are asked to rate whether it is toxic or non-toxic, and maybe they have very different opinions. For example,
            • 128:00 - 128:30 this person says it is super toxic: the comment they were evaluating is "people still eat at Pizza Huts, gross", and they say, because I love Pizza Hut so much, this is toxic, whereas somebody else says it is non-toxic, it's just an opinion. Then you get variance in the answers people give, and this of course diminishes your data quality. Just keep in mind
            • 128:30 - 129:00 that this is a potential pitfall when crowdsourcing your labeling, and also keep in mind, going back to the slide before, that if you do it yourself, there may already be processes in effect in your company that do the labeling for you. All right, also some words on best
            • 129:00 - 129:30 practices in labeling. If you want to task people with labeling, you of course need to provide clear instructions for the labeling process, for example what kind of things they should pay particular attention to. You should also try to minimize the choices and the number of labels: if you have 1,000 labels and they are super
            • 129:30 - 130:00 finely granular, people will not have an overview of all the labels that potentially exist, and you will get a worse outcome. You should also explain the labels, especially if you are talking to people who are not domain experts, so that they understand what each label means. The next step would be to consolidate the labels in order to improve quality. Here you could,
            • 130:00 - 130:30 for example, prototype your labeling process with a small subset of your data, and once you feel people are sufficiently aligned on how they evaluate different subsets of the data, you can use the same strategy for your entire data set. You could also
            • 130:30 - 131:00 use certain techniques to consolidate the variance between labelers, so that if somebody makes a mistake, it doesn't weigh as much; you could, for example, take the average of the values that multiple people provided for a certain data point, in order to make sure that individual mistakes don't end up in your ground truth labels.
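            The averaging idea can be made concrete; for categorical labels, the analogous consolidation step is a majority vote across annotators. A minimal sketch with invented annotations:

                # Minimal sketch: consolidate labels from several annotators by majority vote,
                # so that an individual labeling mistake does not dominate the ground truth.
                from collections import Counter

                annotations = {
                    "sample_001": ["toxic", "non-toxic", "non-toxic"],
                    "sample_002": ["non-toxic", "non-toxic", "non-toxic"],
                    "sample_003": ["toxic", "toxic", "non-toxic"],
                }

                consolidated = {}
                for sample_id, votes in annotations.items():
                    label, count = Counter(votes).most_common(1)[0]
                    consolidated[sample_id] = {"label": label, "agreement": count / len(votes)}

                print(consolidated)

            Samples with low agreement are natural candidates for review by a domain expert, which ties into the auditing step mentioned next.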
            • 131:00 - 131:30 You can also audit the labels: you can verify their accuracy by randomly selecting some and presenting them to a domain expert, for example. And you can try AI-assisted labeling. I already said this in the morning session, but especially in the age of large language models and their ever-increasing capabilities, there might be use cases where you can do
            • 131:30 - 132:00 certain labeling tasks automatically. I think we already saw this with the Miro AI, even though the Miro AI is not that good, because aggregating sticky notes is basically a labeling process, right? So there might be some potential there. Okay, finally, when it comes to how we
            • 132:00 - 132:30 can get our data into usable shape, here is just a short overview of the steps that need to be done to pre-process data. I will not go through all of them, but just to give you an overview: going from raw data to a potential use case may involve additional pre-processing steps, and if you think about it from a data
            • 132:30 - 133:00 versioning perspective, you should also make clear what steps you took to arrive at the data set you then use for your AI application later, because on the one hand you need to do this to adhere to the AI Act and to data governance requirements, and on the other hand some of the steps taken while manipulating the
            • 133:00 - 133:30 data set to produce the final training data may influence what the AI algorithm can learn from it, especially data cleaning, outlier detection, and handling missing data. There is a big difference between throwing out all the data points that have missing values in some column and using, for example, a data imputation method where you
            • 133:30 - 134:00 replace missing values based on the statistical properties of your entire data set; depending on this, the performance of your AI algorithm may change, and therefore it is important to log it and make transparent what pre-processing steps you did. If you are an AI engineer, you are familiar with these terms; for
            • 134:00 - 134:30 everybody who is not an AI engineer, all these pre-processing steps basically mean that you manipulate your data set in some way to ensure that it meets certain criteria that are important for AI algorithms. For example, step number five: many AI algorithms work best when features are on a comparable, roughly normal scale, and therefore you need to perform
            • 134:30 - 135:00 this normalization step if your data set has a different scaling.
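            A minimal sketch contrasting the two approaches mentioned here, dropping incomplete rows versus imputing them from column statistics, followed by a scaling step; the column names and values are invented:

                # Minimal sketch: two ways of handling missing values (dropping vs. imputing),
                # followed by standardization. Column names are invented for illustration.
                import pandas as pd
                from sklearn.impute import SimpleImputer
                from sklearn.preprocessing import StandardScaler

                df = pd.DataFrame({
                    "temperature": [21.3, 22.1, None, 23.0, 21.8],
                    "pressure":    [1.01, None, 0.99, 1.02, 1.00],
                })

                # Option 1: drop rows with missing values (loses data points).
                dropped = df.dropna()

                # Option 2: impute missing values from column statistics (keeps all rows).
                imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

                # Scale features so they are on a comparable range for the learning algorithm.
                scaled = StandardScaler().fit_transform(imputed)
                print(dropped.shape, imputed.shape, scaled.mean(axis=0).round(3))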
            • 135:00 - 135:30 All right, that was everything in terms of data collection. Do you have any questions or comments, or do you want to discuss anything about data collection processes? All right, cool. Then I would just say one thing on the topic of how much data is enough data, and then I will hand over to Anish. There is a reason why we only have one slide: basically, you never know
            • 135:30 - 136:00 how much data is enough data. You can employ some heuristics, and we have listed them here, but you only discover how much data is enough once you start experimenting with it, experimenting with different AI models, and trying to build something that can actually fulfill your use case. Before that, you
            • 136:00 - 136:30 only have a rough estimate. You can already say that if you only have ten data points and you want to train an artificial neural network on them, it probably will not work, but if you have 100,000 data points, or a million, the question becomes a little different. In general, the characteristics we listed here are, on the one hand, that the more high-quality data points
            • 136:30 - 137:00 you have, the better the performance of your model will be, so if you have some way to generate high-quality data very easily, you should generate as much as you can. On the same page, and also related to data quality: the more complex the problem, the more difficult it is for an AI algorithm, and the more data you need.
            • 137:00 - 137:30 A complex problem in this context might be that you want to integrate data from many different sources into a model that is then supposed to give you a prediction about a very high-value aspect of your company, for example an algorithm that is supposed to assist you in finding the best market strategy for the next five years.
            • 137:30 - 138:00 That is a very complex problem, because it has lots of moving parts and lots of variables, and therefore it naturally requires more data. Then, more from a statistical perspective, the more variance you have in your data, the more data you need. This basically means that if you are collecting data from an environment where things change a lot,
            • 138:00 - 138:30 then, in order to discover the underlying principles captured by the data you are collecting, a machine learning model needs to see a lot of data points; otherwise it will maybe overperform in some situations and completely underperform in others.
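            One practical way to probe this empirically is a learning curve: train on increasingly large subsets and watch whether the validation score is still improving. A minimal sketch on synthetic data; the dataset, model, and subset sizes are illustrative only:

                # Minimal sketch: estimate empirically whether more data still helps,
                # by training on increasingly large subsets (a learning curve).
                import numpy as np
                from sklearn.datasets import make_classification
                from sklearn.linear_model import LogisticRegression
                from sklearn.model_selection import learning_curve

                X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

                train_sizes, _, val_scores = learning_curve(
                    LogisticRegression(max_iter=2_000), X, y,
                    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
                )

                for n, score in zip(train_sizes, val_scores.mean(axis=1)):
                    print(f"{n:>5} training samples -> validation accuracy {score:.3f}")

            If the curve has flattened, collecting more of the same kind of data is unlikely to help much; if it is still climbing, more data probably will.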
            • 138:30 - 139:00 Anish, maybe you can answer the question in the chat about feature engineering, or we can come back to it after this point. So, the last relationship: if you have less data, you need more feature engineering; if you have more data, you need less feature engineering. What does feature engineering mean in this context? It means that before even throwing data into your machine learning model and letting it try to learn relationships between
            • 139:00 - 139:30 data points, or accomplish some task, you as an AI engineer already look into the data and derive certain relationships by hand, and give them to the model, so that the model doesn't need to learn them by itself; you can tell the model, okay, this is a
            • 139:30 - 140:00 specific case. If we think about it from a statistical perspective, there might be variables within your data set, columns if you look at it as a table, and there may exist a certain relationship between two or more of these variables, for example that if one variable
            • 140:00 - 140:30 reaches a certain threshold at some point in time, a certain amount of time later another variable will be below a certain threshold. AI algorithms usually try to learn these patterns on their own in order to maximize their performance, but if you don't have sufficient data points, the AI might not be able to learn this relationship, and then you as a data
            • 140:30 - 141:00 scientist or AI engineer need to go and extract this relationship by hand. So you do manual statistics to discover the relationship, extract it yourself, and condense this complex relationship between all these variables into one variable for the model. For
            • 141:00 - 141:30 example, this variable would then say "you are at this point of a recurring trend", and you give the trend to the model, so the AI model doesn't need to learn the trend by itself but can rely on what the AI engineer provides as already pre-processed information. The disadvantage of feature engineering is of course that you
            • 141:30 - 142:00 need to spend a lot of time on it, and in some situations you might not be able to see certain relationships in the data. Remember, this is one aspect that makes AI algorithms, especially artificial neural networks nowadays, so valuable: they can detect these relationships in vast amounts of data, and these
            • 142:00 - 142:30 relationships might not be visible to humans, because humans cannot pay attention to 20,000 variables at the same time, and neural networks can. So if you have more data, the artificial neural network is able to extract this information on its own, but if you have less data, you need to provide this information to the algorithm in order to reach the desired performance.
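            A minimal sketch of that kind of manual feature engineering, condensing temporal relationships into explicit lag, trend, and threshold columns that a small model can use directly; the column name and threshold are invented:

                # Minimal sketch: hand-crafted features that make a temporal relationship
                # explicit, instead of hoping the model learns it from raw values alone.
                import pandas as pd

                df = pd.DataFrame({"demand": [100, 104, 103, 110, 118, 121, 119, 130]})

                # Lag feature: the value observed one step earlier.
                df["demand_lag_1"] = df["demand"].shift(1)

                # Trend feature: rolling mean over the last three observations.
                df["demand_trend_3"] = df["demand"].rolling(window=3).mean()

                # Threshold flag: did the series cross a level that historically precedes a change?
                df["above_threshold"] = (df["demand"] > 115).astype(int)

                print(df.dropna())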
            • 142:30 - 143:00 I hope this explains feature engineering in this context. Maybe R. you can give a thumbs up if that works for you, or let us know if there is still information you need. Yeah, thanks Anish for the example, thanks a lot. Cool, thank you. And this would be it in terms
            • 143:00 - 143:30 of how much data is needed. It's 16:25, what do you say, Anish, should we take another break, or do you want to continue right away with the rest of the presentation for today? Maybe let's take a pulse check with the audience: how are you guys feeling? If you want, let's say, a five-minute break, just give me a thumbs
            • 143:30 - 144:00 up. Okay, I see a fair few hands, so let's take a five-minute break until 16:30 and then we can continue from there. Awesome, thank you. All right, see you in five minutes.
            • 148:00 - 148:30 All righty, just drop me a reaction if
            • 148:30 - 149:00 you're back, or raise your hand or something. Cool, then I will share my screen, and here is where we
            • 149:00 - 149:30 stopped. Okay, so far we have talked about a lot of things: how data should be perceived as an asset and how you can enable it to be one, what the components and building blocks of a data strategy are, some different ways you could collect data if you don't have data at the moment, and, briefly, how much data could be enough for your specific AI use case or machine learning algorithm.
            • 149:30 - 150:00 Assuming you follow these steps, let's say you define a data strategy and then collect some data, the next logical step is to think about the quality of your data. That is essentially the question we are trying to answer: we have data, but is it good data? And does it even matter whether it is good data? There have actually been a lot of studies about this. One of them, from the book Data Governance: The
            • 150:00 - 150:30 Definitive Guide, estimates that the cost of bad data, and by that they mean bad or unmanaged data quality, is around 15 to 20% of revenue for most companies. Granted, these are companies that are very technology-driven, with a lot of AI use cases, so a lot of their revenue is based on AI use cases, which is why bad data affects them more. But even then,
            • 150:30 - 151:00 having one factor like bad data quality affect 15 to 20% of your company's revenue is quite a large effect, so it is definitely something to consider. As the title says, lacking data quality can be very costly for you. And just to show an example: we talked a little bit before about how much data will be enough for your specific AI use case or machine learning algorithm, but I
            • 151:00 - 151:30 would argue that data quality is actually even more important than that. Let's take an example to understand this. Say we want to create an AI model; say we are the plumbing company for the city of Munich, we have water meters installed at the pipes, and we want an AI model that predicts when a water meter will malfunction, so that we can assign people to fix it, or estimate what the
            • 151:30 - 152:00 maintenance cost will be. That is why we want an AI model to predict malfunctions; the goal is to identify malfunctioning water meters. What does malfunction mean? Maybe they are running backwards, maybe the consumption is off normal, maybe there is a leak: multiple possibilities. Let's say you have all of these water meters, maybe all across Germany, say one million water meters, and from
            • 152:00 - 152:30 them you have 15 million readings, or 15 million data points, across a period of time. Since this is time-series data, you train an RNN, a recurrent neural network: you have a lot of data, which enables you to use a neural network in the first place and should give you decent accuracy, and you use a recurrent architecture because it is time-series data. But then you see that with the RNN, which should really perform well by all
            • 152:30 - 153:00 means, you get unexpected predictions and you don't get a very high accuracy in predicting malfunctioning meters. Why is that? We can speculate about a couple of reasons. One could be that the data itself is wrong: let's say the water company's employees only adjust the bill if the customer complains, but they don't actually note down the issue with the water meter, so they only adjust the bill in
            • 153:00 - 153:30 retrospect. Another possibility is that the data is fake: in reality these readings were not actually collected; they were only taken once every few months, and the rest, the weekly or daily data, was interpolated from the existing readings, so it is artificially generated data which might not reflect the accurate ground truth. There could be these two reasons, and others as well; it could also be that something went wrong
            • 153:30 - 154:00 during your modeling, but this is usually not the case here, because the RNN is a performant algorithm and should do reasonably well out of the box. So let's assume the data is the issue. If you have this problem of inaccurate data, what you might do is throw away one third of the data set, so you throw
            • 154:00 - 154:30 away 5 million of your readings, 5 million data points, and you are left with 10 million; then you train the neural network again, you see that your accuracy increases by a lot, and you get an 85% accurate model, which is more than good enough to make informed decisions. The key takeaway is that less good-quality data is usually worth much more than a lot of bad-quality data; in that sense, data quality is more important than data quantity.
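            A minimal sketch of that pruning step, assuming the suspect readings can be identified, for example by a flag marking interpolated values; the column names and the flag are hypothetical, since how such readings are marked depends entirely on the source system:

                # Minimal sketch: prefer fewer, trustworthy readings over a larger but partly
                # fabricated dataset before training. Column names are invented for illustration.
                import pandas as pd

                readings = pd.DataFrame({
                    "meter_id":     [1, 1, 2, 2, 3, 3],
                    "consumption":  [12.0, 11.8, 5.2, 5.1, 40.3, 39.9],
                    "interpolated": [False, True, False, False, True, True],  # flag for generated values
                })

                # Keep only readings that were actually measured.
                trusted = readings[~readings["interpolated"]]
                print(f"Kept {len(trusted)} of {len(readings)} readings for training")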
            • 154:30 - 155:00 The other thing about data quality is that it is also generally pretty bad. It's not just one or two cases: across organizations and across industries, the quality of data is generally not that high, and it proves to be one of the most important barriers that keep AI use cases from being very valuable.
            • 155:00 - 155:30 There was actually a study in the Harvard Business Review which found that, on average, 47% of newly created data records had at least one critical error, and only 3% of the data quality scores in the study could be rated acceptable using the loosest possible standard. Essentially, this means that almost half of all newly created data records had some data quality issue, so roughly half of the data that is created is unusable
            • 155:30 - 156:00 or somehow needs fixing, and only 3% of the data sets they tested actually met any sort of data quality standard, even the loosest one. So data quality being bad is unfortunately very common, and it has several impacts. Bad data quality wastes your time, because maybe you think you can do a use case with the data, you take it and start working
            • 156:00 - 156:30 on an AI model, and then you see the model just doesn't get any better after a certain point, it's stuck at 50% accuracy; that is a lot of wasted time, which increases your costs. It weakens your decision making, because you don't have trust in your data, so you don't know whether a use case will be profitable, and that goes up to the strategic level: you cannot be sure what direction you want to take. It angers
            • 156:30 - 157:00 customers, because they will not get reliable outputs, and of course it makes the data strategy much more difficult to execute. So these were motivations for why you should think about data quality, along with the observation that data quality is often bad. Now we want to think about how to actually quantify this: how do we measure data quality, and what are we
            • 157:00 - 157:30 looking for in terms of good data quality, what is the ideal we are aiming at? The answer is that it depends on your company, your use case, and your data, and you have to define which measures of data quality are important for you. What we have here is a list to get you started thinking about these things, and you could apply it to pretty much any data set, so it is very generic; but the specific data quality measures, or dimensions as we
            • 157:30 - 158:00 call them that are important to you uh is very specific again to the AI strategy to the use cases to your domain all of these things right so let's look at some common data quality Dimensions that would be pretty much uh relevant for everyone right so okay one of the first one is accessibility and what that means is the extent to which the relevant data is available or easily and quickly collectible right so essentially
            • 158:00 - 158:30 how long does it take to get access to the data right so we want something that has a high degree of accessibility so it's very quick and easy to attain this data set then there is completeness which means that data points in a set are exhaustive and uncorrupted essentially what this boils down to is do you have missing values in your data or uh is your data set complete right do you have readings for every single um you know time point or are there a lot of missing values in your data ideally
            • 158:30 - 159:00 you want a high degree of completeness let's say maybe 80% at least then there's accuracy so this means uh the degree to which the data correctly captures the real life object or phenomena that they are intended to represent so in the example we just talked about with the water meters if the data indeed was interpolated and uh was not really recorded So then it had a low degree of accuracy right so then that was one of the major
            • 159:00 - 159:30 problems then there's the amount of data so that's the extent to which the volume of data is appropriate for the task at hand uh the keyword here again being appropriate you don't necessarily need a huge huge amount of data if you're doing a fairly straightforward machine learning model so yeah of course more is better but um you don't always need the highest amount of data uh then there is operating ease or ease of operation uh and this means the extent to which data is easily managed
            • 159:30 - 160:00 and manipulated so uh how easily you can uh transform this data and this actually depends on your entire data or technical architecture setup like what tools are you using for data um where are you storing your data um what are all the policies so this essentially refers to your data architecture and how easy is it to navigate that data architecture then there is the timeliness of your data so this means the extent to which the data is
            • 160:00 - 160:30 sufficiently up to date for the task so maybe you have a data set that is like 5 years old and um maybe the real life scenario has changed in the last 5 years so maybe that data set is not as relevant if you want to do an AI model right so the timeliness of your data is also important then there is the consistency of your data so this means the extent to which the data is presented in the same format um usually uh consistency is a big problem for organizations because there are many different formats of data
            • 160:30 - 161:00 and it's hard to get them unified into u a consistent way of representing things and this is usually the biggest stumbling block um but this is definitely something you want to think about like if you have data that's uh readily available in a tabular format why would you not take that over an an unstructured format which essentially tells you the same information um also then you have the security of your data so the extent to which access to data is restricted
            • 161:00 - 161:30 appropriately to maintain its security. This is super important also in terms of the confidentiality of your data, or other regulatory aspects that might be at play for your data sets. So you should always follow the principle of least privilege when you're granting access to data, to maintain its security: only those who need the access will have the access, and only the access that they need, not any more than that.
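To make a few of these dimensions concrete, here is a minimal sketch of how they might be measured with pandas; the file name and the columns `measured_at` and `sex` are hypothetical placeholders, not part of the workshop material.

```python
import pandas as pd

# Hypothetical data set; column names are made up for illustration.
df = pd.read_csv("patients.csv", parse_dates=["measured_at"])

# Completeness: share of non-missing cells, overall and per column.
completeness_overall = df.notna().mean().mean()
completeness_by_column = df.notna().mean()

# Amount of data: is the volume appropriate for the task at hand?
row_count = len(df)

# Timeliness: how old is the most recent record?
age_days = (pd.Timestamp.now() - df["measured_at"].max()).days

# Consistency (one narrow slice of it): does a categorical column
# contain only the values we expect?
expected = {"male", "female"}
consistency_ok = set(df["sex"].dropna().unique()) <= expected

print(f"completeness: {completeness_overall:.1%}, rows: {row_count}, "
      f"newest record: {age_days} days old, 'sex' values consistent: {consistency_ok}")
```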
            • 161:30 - 162:00 Okay, so this is a kind of starting list of data quality dimensions that you might think about, say if you don't have any data quality dimensions defined at all at the moment. It's a very generic list that applies to pretty much any data set. But how would you think about the dimensions that matter to you? Let's look at some ways of thinking about that. The dimensions actually depend on your way of looking at the data; you can look at it in different ways, and that's why it's
            • 162:00 - 162:30 a multifaceted question. So for example, you can talk about the quality that is inherent to the data itself; that's an intrinsic property of the data set, and that includes things like believability, meaning how much you can rely on the values that are recorded in that data; its accuracy, how well it represents the real-life scenario; its objectivity, for example if it was manually labeled through a crowd or something, so you have
            • 162:30 - 163:00 multiple opinions to make it objective; and of course the reputation of the data set, so if others have used it as well, what do they think about it. These are things that are a property of the data set itself. Then you can talk about the data set with respect to a specific task or problem, so with respect to the task at hand, and that's contextual data quality: specifically for this use case that I'm thinking about, what is the added value that
            • 163:00 - 163:30 this data set brings what is the relevance of this data set for this AI task right what is the timeliness because you don't always need the most updated data for every problem maybe you're just working on something where the timeliness doesn't matter for you right so that's a consideration the completeness that's also dependent on your AI use case right some uh use cases might require fully complete data sets but some maybe can do with uh like 50% complete data sets or something and the adequacy of your data volume as we
            • 163:30 - 164:00 talked about before, this also depends very much on the task that you're trying to achieve. And then you can also think about it in terms of your tech landscape or your different systems, and there are two things here: the representation of your data and the accessibility of your data. The representation basically means the format of your data, or how you are storing and representing it: is it structured, is it unstructured. And this includes
            • 164:00 - 164:30 the interpretability of your data so do you have uh enough information to understand what each column means uh where it's coming from you know who who sourced the data who collected the data when were these operations performed uh what operations were performed that all adds to interpretability uh Al to ease of understanding right uh can you understand the domain specific uh things that you need uh to understand about this data set to actually make some analysis of it there's the ease of
            • 164:30 - 165:00 operation, which ties directly to the format in which you are keeping your data and how interoperable it is. There's the consistency of your data as well as the conciseness of your data, which relate to the formats you store it in; basically, if you store it all in a tabular format that's very consistent, and concise means you can store it in a compressed format to save space. And then you can also think about the accessibility of your data, so there's the technical accessibility, so
            • 165:00 - 165:30 how easy is it technically for me to uh like me as a data scientist how easy is it for me technically to access this data do I have to go through a long process do I have to gain access to like three new tools that I don't know how to use how easy is the process for me to get access to this data and then also the security aspect right so can anyone just request access to this data set right or uh what kind of approval system is in place who is the one who is monitoring uh the access
            • 165:30 - 166:00 permissions. So now these are all things that you need to think about with the data sets that you have, and when you think through all these four pillars then you can try to identify: okay, these are the data quality dimensions that are important to me, specifically for the use cases that I'm trying to do, for the domain that I'm in, for the kind of data that I have. These pillars help you identify those questions. And then I just want to talk to you also about the data quality
            • 166:00 - 166:30 cycle right so I talked to you about why data quality is important uh how data quality is generally bad everywhere uh what you can use to measure data quality and now let's talk about the entire cycle that you need to consider when you're thinking about data quality right so first of all all you need to identify and assess the degree to which poor data quality impedes your business objectives so this is something I talked about I think in the first slide that basically bad data quality leads to 15 to 20% of
            • 166:30 - 167:00 revenue loss, so you need to assess how bad data quality affects you. This is obviously a difficult question to answer if you don't already have an AI use case that you can refer to, but you can also try to measure it in terms of opportunity cost: if you don't have good data quality you cannot achieve certain AI use cases. You can try to measure it in terms of people or hours cost: if you have bad data quality you will lose all of this manpower
            • 167:00 - 167:30 um yeah if you have specific like AI strategy or a use cases that you want to achieve so these can be the bad data quality can be a blocker for those things so maybe it's a it's a must have in that case so you need to identify first of all strategically how important is data quality for you sorry do we have a question okay um then the next step is to Define business related data quality
            • 167:30 - 168:00 rules perform measurements and set performance targets what this means uh is that you decide on what data quality dimensions are important for you as as a company U depending on the use cases depending on your strategy depending on the data you have you define these Dimensions these are important for me you measure your current data quality you take stock of all the data that you have and you see where you're at in terms of data quality and then you set these performance targets right so let's say uh the completeness of my data sets
            • 168:00 - 168:30 needs to be 100% right or the uh accuracy of my data set needs to be uh above a certain value so you set these performance targets so that uh you know data your data is not a barrier for you but rather it becomes a strategic Advantage for you in terms of your business objectives once you set those targets then the idea is to try to achieve those targets and for that you have these uh quality improvement processes that
            • 168:30 - 169:00 remediate process flaws. What this means is that you can use different ways to improve your data quality. In the example we talked about, we threw away some of the bad data; that's a fair option, just throw away the bad data and keep only the good data, but maybe you don't have that much data to begin with, in which case that's not a good idea. What you can try to do instead is improve your data quality algorithmically; that's not a perfect solution, it performs reasonably well
            • 169:00 - 169:30 but it can also perform reasonably badly at times. The most accurate way would be to collect more data: collect the missing information, or replace the bad information somehow. Maybe you have another data source which you can combine to improve your data quality, or maybe you manually collect some more data, maybe you just use a manual labeling process to improve your labels. These are good ways to get really high-quality, accurate data, but they also
            • 169:30 - 170:00 cost a lot yeah so depending on how you want to approach it uh the criticality of your Dimensions like if accuracy is very important to you you probably want to manually collect more data um if completeness is the most important maybe you algor algorithmically improve your data quality um if if you have a lot of data maybe you just throw away the bad data so depending on the specific instance you define these quality improvement processes and you choose
            • 170:00 - 170:30 them. And then you move these data quality improvement methods and processes to production. In order to complete the cycle, you need to inspect, monitor, and remediate when the quality of your data is not acceptable anymore: you keep checking your data, you keep measuring these data quality dimensions, and you apply these improvement processes when you see they fall below the threshold.
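As a minimal sketch of that cycle, the fragment below defines targets for two dimensions, measures the current state, and flags anything that needs remediation; the thresholds, file name, and choice of dimensions are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical data set

# Step 2: business-related data quality rules expressed as measurable targets.
targets = {
    "completeness": 0.95,   # at least 95% non-missing cells overall
    "amount": 10_000,       # at least 10,000 rows
}

# Measure the current state of the data.
measured = {
    "completeness": float(df.notna().mean().mean()),
    "amount": len(df),
}

# Step 4: inspect, monitor, and flag dimensions that fall below their target.
for dimension, target in targets.items():
    status = "OK" if measured[dimension] >= target else "REMEDIATE"
    print(f"{dimension}: measured={measured[dimension]} target={target} -> {status}")
```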
            • 170:30 - 171:00 Okay, so that was a lot of information. If you're just starting out thinking about data quality, maybe it can be a bit too much, so here are some helpful questions or guidelines to get you started thinking about the topic. These can be the initial starting point: when you're looking at all the data that you have, try to ask yourself these questions to see if they provide value. So for example, how many omitted values does your data set have, and how can you correct
            • 171:00 - 171:30 that? Is there an easy way to correct that? Maybe you can fetch that data from a public source that's available; that would be an easy way. Another question to ask is: is your data set adequate for your task? For the AI use case that you're trying to do, is your data set really relevant, or is there a more applicable data set that you can use or try to look for? Thirdly, is your data set imbalanced? Most real-world data sets are imbalanced. What imbalanced
            • 171:30 - 172:00 means is that uh uh the distribution of your data is not uh uniform or normal it means that there's uh much more focus on uh one value rather than the other one right I mean for example uh let's say you measure the height of all the basketball players in the NBA right so your data set will have a really high average height value right um and maybe that's imbalanced towards
            • 172:00 - 172:30 tall people right and then you cannot use that for um certain use cases right so you need to think about is your data set imbalanced most real world data sets are imbalanced in certain ways and then you need to deal with that uh there are techniques there are algorithmic techniques there's also um yeah I mean you can also just try to model it and then you can see uh how it affects your overall score so there are techniques to deal with that but it's important to think about it uh and the last question would be if your data set was indeed
            • 172:30 - 173:00 annotated by humans, then how tangible is human error? So if you're using manual labeling by humans, you need to consider how tangible human error is, and Yan has talked about this with the pizza example: when multiple people are labeling the same thing, because of their different perspectives and different opinions they can label it in multiple ways. So then you have to think about how much I should account for human error while labeling something
            • 173:00 - 173:30 right, so these are some helpful questions you can ask yourselves to get started. By the way, I just got a notification that says 5 minutes left in your meeting; I hope it doesn't close in 5 minutes, that would be bad. Yeah, let me see if I can try to sort this out, you can continue.
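Two of those questions, class imbalance and human labeling error, can be checked quickly; below is a rough sketch assuming a binary label column and three hypothetical annotator columns, which are not from the workshop data.

```python
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical data set

# Is the data set imbalanced? Look at the class distribution of the label.
counts = df["has_heart_disease"].value_counts()
imbalance_ratio = counts.max() / counts.min()
print(counts)
print(f"imbalance ratio (majority / minority): {imbalance_ratio:.1f}")

# How tangible is human error? If several annotators labeled the same items,
# measure how often they disagree (annotator columns are made up here).
annotations = df[["label_annotator_1", "label_annotator_2", "label_annotator_3"]]
disagreement_rate = (annotations.nunique(axis=1) > 1).mean()
print(f"rows where annotators disagree: {disagreement_rate:.1%}")
```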
            • 173:30 - 174:00 I will continue. And actually it's a good point, because now we just want to jump back into the Miro board a little bit and do another exercise, because we've been talking for a while. So let's jump back to the Miro board here; I will share the link again in the chat, and if you could scroll down here to where you see me on the
            • 174:00 - 174:30 board. So basically the point of this exercise is that we have all of these data quality dimensions that I just explained to you, and the idea is that we have differing opinions about which might be more relevant based on your industry and the use cases that you have. So what we want to do is: we have these sticky notes on the side, and each of you gets three of
            • 174:30 - 175:00 these sticky notes right and I want you to assign it to the data quality Dimensions that you think are most relevant for you okay so go ahead and read all of the dimensions and then just assign it to the dimensions that you think are most important for you and each of you can get three of these so feel free to just duplicate it and um yeah so uh just make sure not to use more than three I trust you guys so I will just give you a couple of minutes to do this voting and then we can take a
            • 175:00 - 175:30 look at the results um johanes would you mind setting up a timer or maybe it's fine we can just play it by uh by looking at it okay super thank
            • 175:30 - 176:00 you. Also, try not to be biased by other people; just think about whatever is most relevant for you, right
            • 176:30 - 177:00 okay if uh no one is uh voting anymore
            • 177:00 - 177:30 maybe we can stop early. Yeah, it's really interesting to look at, actually. So everyone thinks that completeness and accuracy and
            • 177:30 - 178:00 accessibility is the most important um amount not so much uh consistency is fairly important uh but timeliness ease of operation and security are not even uh in anyone's top three so that's really interesting to look at uh and I think that tells us also something about how we're all on the same page because I think these uh columns only become relevant when you have uh a lot of AI use cases and you have like a lot of
            • 178:00 - 178:30 data sets and you have a really large data operation; that's when these things become much more relevant. But for us at KIT+, I think it makes sense that what we're trying to do is get really high-quality data that we can generate some value from; I think that's where the approach is for everyone at the moment. And I totally agree, so the objective should be to identify high-quality data that you already have and how you can make good use of it, and those are the ones voted important: completeness, accuracy, accessibility, consistency. Yeah
            • 178:30 - 179:00 super um cool thank you for that uh let's jump back into the slideshow okay um so that was about data quality and now I just want to talk about how you should store this data right so assuming that you've collected your data you've measured its data quality maybe you have some processes to improve the data quality uh now the question is how should you store it
            • 179:00 - 179:30 right what do you need to think about so essentially when you think about storing your data you have two options uh the first option is on premises uh basically the storage is owned by your company or you have cloud storage so you have uh these third party providers like let's say Amazon or Microsoft or Google uh they have these services and you pay for that service you pay for the storage on the cloud right so it's uh storage as a
            • 179:30 - 180:00 service uh and uh versus on premises where you store you own the storage of your data so you can also own the management um and all the overhead that also comes along with that so these are the two different ways uh that you can typically store data let's go through the pros and cons right so um for on premises the pros are that data does not leave the organization right because you own the storage so it only stays within your uh company uh
            • 180:00 - 180:30 potentially it has lower costs in the long run because you since you pay for the storage the storage itself is actually very cheap if you're only paying for storage so that can uh potentially lower your uh costs but I think this only becomes relevant if you have a lot of data uh also you have faster access to your data because you own the storage so your servers would typically be located somehow close to you so technically you could get really fast access to that data and you would not have to go through uh maybe
            • 180:30 - 181:00 some complicated processes uh the cons for having on premises storage is that you need a high initial cost for setting up the infrastructure uh you need to plan how you're going to store all of this uh data uh Additionally you need the expertise and the maintenance in order to keep this data stored and maintained properly so that's dedicated people who are U experts in this that need to be working only on that uh also you have you run the risk of losing it
            • 181:00 - 181:30 right so maybe there's a fire maybe there's a Data Theft maybe there's a natural disaster and the the physical server is like somehow compromised so there's the risk of losing it right um versus what happens in the cloud storage uh so if you have a contract with one of these providers uh you can actually have storage that's scalable on demand and what that means is basically uh let's say they assign you a few gig uh gigabytes of space
            • 181:30 - 182:00 right now if you need more space all you have to do is click a button and then your storage increases and then you can fill maybe terabytes of data uh the other thing is they provide the cyber security for you so you don't really have to worry about um you know data thefts and all of those things uh because they manage the cyber security aspect so you don't need Personnel specifically for cyber security as you would need here uh the other thing is it's uh pay as you use so it's all subscription models um so you only pay for what you use you don't pay uh a
            • 182:00 - 182:30 large monthly amount or something so that's a flexibility uh additionally there's also multiple backups of your data so there's a less risk of losing it so I mean of course I mean even like Amazon can have fire or theft or natural disaster but they have multiple backups on your data because they can afford it since they have much more data farms uh so for you the risk of losing it is less and also it it depends on them right it's their
            • 182:30 - 183:00 liability um what is a con well the data leaves the organization this could be a huge con depending on the nature of your data maybe you have a data that just cannot leave the organization like let's say maybe you're working in a hospital or something and in that case your patient data just cannot leave the hospital in any way so this is it could be a restriction and of course you have less flexibility in how you want to store and manage your data uh because you don't control everything about it so let's say
            • 183:00 - 183:30 like Microsoft controls how they store your data; you just use it, you just use their services. In general I would say, if you're starting out with data collection and data storage, I always recommend going with cloud. I think it's just a much more feasible option, a much more flexible option to start out with, if you use cloud storage. It's also scalable, it's flexible. I think on premises can be good, but it requires a
            • 183:30 - 184:00 large amount of initial costs and typically it's hard to also convince businesses to spend that amount of money to um store your data on premises unless like you are like a data company or something in that case um yeah so typically I recommend on cloud storage to start with I would say on premises is important if if this is important for you that data does not leave the organization then you definitely have to do on premises uh or you can also later
            • 184:00 - 184:30 migrate to on premises when your company grows in size and you grow strategically as well and then you decide okay now we want to store all of our data because that is our competitive Advantage we don't want to give it to uh Microsoft or something we want to do it ourselves uh that represents a huge shift in your thinking okay um and also now uh whether you choose on premises or Cloud there's another layer which is where do you actually store this data like what what tools do you use to store this data and there's actually two uh archetypes for
            • 184:30 - 185:00 this: one is a data warehouse and another is a data lake, and the difference between these is that the data warehouse is for structured data. So if you have, let's say, tabular information, or information in JSON or YAML or some sort of structured format, then you can store it in a data warehouse. If you have unstructured data, you can store it in a data lake; you can also store structured data in a data lake, basically
            • 185:00 - 185:30 a data lake is for storing any kind of data. If you have any kind of data you can just dump it into a data lake, so it could be images, videos, sound recordings, PDFs, it could be again tables, JSON, YAML; everything can go into a data lake. But because of this restriction of the data warehouse, of having only structured records, what it gives you additionally is that you can access this data via business intelligence interfaces, so you can visualize the data very nicely, you can run SQL queries on
            • 185:30 - 186:00 this data to format it in some way, analyze it, maybe get some insights, so you can query the data, which is really powerful. In data lakes you cannot do anything like that, because the data is not transformed; it's a direct dump of your data, you just migrate the data and that's it, you don't actually do any transformation. And that's really the benefit that the data warehouse gives you, so if you have structured data you should think about putting it into a
            • 186:00 - 186:30 data warehouse because that enables you to do so much more with that structure data so here again the typical use case of when you would use these so you would use uh the data warehouse in the use case that you know which data is needed and you know how it must look like right so you know how you want to process this data then you can um put it into a data warehouse um in the scenario that you have uh data but you don't know what to do with it with it you don't know how you want to process it in that case you should store it in a data Lake and then
            • 186:30 - 187:00 you think about it at some later point in time right so let's say your company is uh working with wind farms or something and you're collecting all of this sensor information but maybe at the moment you don't know what to do with it so you just dump directly all of the sensor information on the data Lake and then later maybe you to structure it in some way and then maybe you want to put it into the data warehouse to be able to use it better right and these uh also represent these two uh two approaches to uh
            • 187:00 - 187:30 migrate or transform your data, and basically it's ETL versus ELT. What these letters stand for is extract, transform, and load; that's what the three letters stand for, and the only difference is the order in which they're executed. So ETL is extract, transform, load, and ELT is extract, load, transform, right
            • 187:30 - 188:00 so essentially ETL represents data warehouses and ELT represents data lakes. What it means is that extract means to get the data from your sources, transform means to put it in some sort of structure, and load means to put it into your use-case database. So this is the typical scenario of a data warehouse: you have some data from your sources, you extract it, you transform it, you put some structure on it, and then you load it into your use-case database, from where
            • 188:00 - 188:30 you can run some SQL queries on it. In a data lake you have the other approach, where basically you extract the data from the source and you load it directly onto your data lake, and then at some point later, maybe in the future, you transform it once you know what you want to do with it. So these are the two different approaches. Whenever you hear about data pipelines you will hear these acronyms, ETL versus ELT, so just to let you know what they mean: they're nothing scary, it's just extract, transform, load versus extract, load, and transform.
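As a hedged illustration of the two orderings, here is a small sketch in which a SQLite file stands in for the data warehouse and a local folder for the data lake; the file names and columns are invented for the example.

```python
import sqlite3
import pandas as pd

def etl(source_csv: str, warehouse_db: str) -> None:
    """ETL: extract, transform, then load structured data into a 'warehouse'."""
    raw = pd.read_csv(source_csv)                       # extract
    tidy = (raw.drop_duplicates()                       # transform: impose structure
               .dropna(subset=["age", "cholesterol"])
               .rename(columns=str.lower))
    with sqlite3.connect(warehouse_db) as conn:         # load into a queryable table
        tidy.to_sql("patients", conn, if_exists="replace", index=False)

def elt(source_csv: str, lake_dir: str) -> None:
    """ELT: extract and load the raw dump as-is; transform later when needed."""
    raw = pd.read_csv(source_csv)                       # extract
    raw.to_csv(f"{lake_dir}/patients_raw.csv", index=False)  # load untouched

# Once the data sits in the 'warehouse', it can be queried with SQL:
# with sqlite3.connect("warehouse.db") as conn:
#     print(pd.read_sql("SELECT AVG(cholesterol) FROM patients WHERE age > 60", conn))
```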
            • 188:30 - 189:00 Okay, so that was about data storage. Any questions so far? Okay, if not, let me go back to the Miro board here. We have a demo for you today which we would like to show you. Yeah, so we have imagined an AI
            • 189:00 - 189:30 use case right um and it's a use case in the healthcare domain and basically it's cardiovascular disease detection uh so you have an AI model that wants to identify given patient data that if this particular patient has cardiovascular diseases or not right if they have a heart disease yes or no that's the goal of the AI model and what you have in order to predict that is you have patient data so you have like the
            • 189:30 - 190:00 patient information like their age height weight uh their glucose blood glucose levels cholesterol levels uh blood pressure all of these uh medical information patient data and from that you want to create this AI model that predicts whether the patient has a heart disease or not right so we imagined this AI use case uh to come up with a demo of how you can uh perform different data operations and what we want to show you in the demo today is these four things so we want to show you data
            • 190:00 - 190:30 transformation flow right so how you uh can get the data from your Source uh and how you can transform it and then you can process it further so kind of like the ETL process uh then we want to show you description of data or your data specifications uh and how you can uh use a data catalog to easily get an overview of your data sets and what properties they possess uh we also want to talk about data quality specifically and I'll show you what a data quality check and a
            • 190:30 - 191:00 data quality report looks like um and lastly this is not exactly data strategy but uh something I always like to mention when you're starting out with AI use cases always think about experiment tracking and I'll explain what that is uh but that's what we want to show you in today's demo um yes okay so let me start with that so this is a Microsoft Azure some of you who might use it uh might already recognize this uh and we're using
            • 191:00 - 191:30 specifically Azure machine Learning Studio um and the first thing I want to show you here is this so this is uh kind of a data pipeline of sorts right so this uh kind of shows you the ETL process of your data uh and we we again imagined let's say we have two different data sources from which we are getting uh information right um let's say this is some sort of uh SQL Server somewhere hosted by a
            • 191:30 - 192:00 third party, and I have a contractual agreement with them, I bought the data from them, so now I have to ingest the data from there. And let's say this is another source, maybe some healthcare professional like a doctor or a nurse; maybe they upload an Excel sheet somewhere on some sort of URL, and that's where I get this information. So I have two sources of data, and now what I do with them is I perform certain transformations. This is the extract phase; the first phase is the
            • 192:00 - 192:30 extract phase, where I get the data from the source. Then I perform transformations: I convert it to a certain format, I remove duplicate rows, I clean up missing data, I normalize data, which Yan has explained earlier, and then I merge these data sets together because I want columns from both of them, so I join the data. And then I select specifically only the columns that I need for this use case of detecting heart diseases, right? So
            • 192:30 - 193:00 maybe there is other patient data that's not relevant to me so I just don't want those so I select only the ones that are relevant to me and then I convert it into a format which maybe my data scientist would want let's say in CSV or something else um and then this is like the load step right so this this is the load step uh the first the first step here was the uh extract step and then all of these ones in between are the Transformations if you have this uh ETL
            • 193:00 - 193:30 pipeline set up here in Azure, not only does it allow you to automate this, so instead of having to do all of these steps manually every month, for example whenever the data is updated, I can just run this data pipeline once a month, or even better I can trigger it to run automatically, like on the first of every month at 6:00 a.m. I schedule it to run automatically and then it automatically updates my data set for me, so it's an automatic data refresh.
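Outside Azure, the same transformation flow could be sketched with plain pandas; the two source files, the `patient_id` join key, and the column names below are assumptions made only for this illustration.

```python
import pandas as pd

# Extract: two hypothetical sources, e.g. an export bought from a vendor and an
# Excel sheet uploaded by a healthcare professional (reading .xlsx needs openpyxl).
vendor = pd.read_csv("vendor_export.csv")
clinic = pd.read_excel("clinic_upload.xlsx")

# Transform: roughly the steps shown in the pipeline.
vendor = vendor.drop_duplicates()
clinic = clinic.drop_duplicates()
merged = vendor.merge(clinic, on="patient_id", how="inner")     # join both sources
merged = merged.dropna(subset=["age", "cholesterol", "ap_hi"])  # clean missing data
merged["age"] = (merged["age"] - merged["age"].mean()) / merged["age"].std()  # normalize
columns_needed = ["patient_id", "age", "cholesterol", "ap_hi", "smoke", "cardio"]
final = merged[columns_needed]                                  # keep only relevant columns

# Load: hand the result to the data scientists in the format they asked for.
final.to_csv("heart_disease_training_data.csv", index=False)
```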
            • 193:30 - 194:00 So that's one thing, the automation. The other thing is that it gives you traceability, or it gives you context: as a data scientist, when you are exploring your data, when you're trying to work on an AI use case, it's very important for you to understand what are all the things that happened to my data for it to reach me, so to understand the lineage of your data, the provenance of your data. This is super important. So here they can clearly see what are all the transformations that took place; they can
            • 194:00 - 194:30 go into each one and view the logs of how that happened, to inspect it, and it gives them a much clearer picture of why this column has this particular value; they can answer that question better. This kind of supplements the domain knowledge, so if you have a domain expert who knows about cardiovascular diseases, it helps the data scientist to understand this data just by looking
            • 194:30 - 195:00 at this pipeline, in a sense. So it also provides you that aspect of data lineage or data provenance. Okay, so this is the data transformation flow. Now I want to show you what happens after you load this data set here. So then, in Azure, you can actually go into Data
            • 195:00 - 195:30 assets here, and I think this is really cool: what you get to see here is a list or a catalog of all the data sets that you have available in this specific space, and you can search by your data source here, you can search by your data sets. So you see I have the sources here, for example, and I have
            • 195:30 - 196:00 the pre-processed data as well. And I think this is really cool, that you can have a list or a catalog of your data sets and you can even search for it; that's really interesting. What this helps you to do is to discover any data sets that you might have: if you have a list of all the data sets that are there in your organization, imagine how much easier it would be to manage those data assets, first of all. Second of all, as a
            • 196:00 - 196:30 data scientist it's so much easier for me to just search uh using this function to find this particular data set that I'm looking for and then once I go inside that maybe I can see all of this other information about that right I can see who it was created by I can see when it was created I can see where it is stored right um I can see other tags that I might have added I can see the versioning of the data which is which is another important aspect where because
            • 196:30 - 197:00 data often evolves over time right like there are multiple modifications made to it over a number of years let's say for example and then um if you don't maintain this versioning of data you don't know what happened to the original data or like how it was transformed but here it maintains this version history so you can always go back and see at this point in time this was the state of this data set right so it maintains this version history uh and you can actually even uh look at a snapshot of your data so you can see these are all the columns of my
            • 197:00 - 197:30 data. I have age, height, weight, this is I think blood pressure, this is cholesterol and glucose levels, whether they smoke or not, whether they drink or not, whether they are physically active, and whether they have the disease or not. Yeah, and you can also look at a profile of your data, so you can actually see: okay, these are all my columns, these are the distributions of my columns, so how balanced or imbalanced these columns are, maybe
            • 197:30 - 198:00 and you can also see the data types of these columns, you can see the min, max, count, some statistics that might be relevant for you. So for me the cool thing is having this list of data assets in your organization; this can be really cool if you can have it. I must say that what I'm showing you here you cannot use across your organization; this specific list only works within one specific workspace. But there are other solutions for having a
            • 198:00 - 198:30 data catalog across your organization. The one we typically talk about is DataHub, which is open source, so it's free to use, and it shows you a holistic overview of your data and allows you to search for your data sets very easily. Let me see if I can find some pictures of this... not this, um, I mean, yeah, maybe
            • 198:30 - 199:00 uh yeah sure so here you can see for example yeah so you can kind of search across all of your data sets across your company you have like different filters if you want to apply what is the origin what is the platform on which it is stored you can put other metrics uh other configuration parameters you can search uh you can actually go by folders as well and then once you go into a data asset you can see many things you can see the schema
            • 199:00 - 199:30 of your data uh you can see the status you can see the ownership of your data which is an interesting thing so I think this is how it looks so you can add who are the owners of my data who are the responsible people for this particular data asset and that provides so much uh discoverability traceability um um like you know context as well you know for example if a certain data scientist in your organization finds this data set created by another team and they think it can be useful for their use case so they know
            • 199:30 - 200:00 who to reach out to in order to further that conversation. So this is super cool; you can see also other details about your data. I think a data catalog is really cool for managing all of your data assets. It does require a high level of initial investment to set it up, but maybe that's something to evaluate when you're thinking about your data strategy: do you want to have a data catalog? Somehow you need to have an overview of what data assets there are in your organization.
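To show the kind of metadata a catalog entry typically carries, here is a tiny, purely hypothetical sketch; it is not DataHub's actual data model, and every value in it is invented.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """A minimal sketch of the metadata a data catalog keeps per data asset."""
    name: str
    owner: str                  # who to reach out to
    source: str                 # where the data originally comes from
    storage_location: str       # where it currently lives
    created: date
    version: str
    tags: list = field(default_factory=list)
    columns: dict = field(default_factory=dict)  # column name -> description

catalog = [
    CatalogEntry(
        name="cardio_patients_preprocessed",
        owner="data-team@example.org",
        source="vendor SQL export + clinic Excel upload",
        storage_location="azureml://datasets/cardio/v3",
        created=date(2024, 1, 15),
        version="3",
        tags=["healthcare", "training-data"],
        columns={"age": "patient age in years", "cardio": "heart disease label"},
    ),
]

# 'Discovery': a trivial search across the catalog.
print([entry.name for entry in catalog if "healthcare" in entry.tags])
```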
            • 200:00 - 200:30 Okay. Yes, apart from that I also want to show you data quality. So I have this code here that basically does some data quality checks; I won't show you the code, but essentially we use another open-source library for doing these data quality checks, and we then output the results in the form of JSON, which is a format often used in
            • 200:30 - 201:00 programming, but it's not really super readable for anyone who's not a technical person. So there's a really nice visualization of the data quality report that we came up with; this is also super easy to do, it's actually a tool provided by Google, and basically you just feed it the JSON and it gives you this really nice-looking data quality report. And you can see here, this is what our suggestion would basically be to generate as a governance
            • 201:00 - 201:30 artifact: when you do your data quality checks, you generate a data quality report that someone can inspect and see if it meets all the criteria. Let's look at one of the checks here. For example, relevance is one of the checks, and what is the exact measure? It's feature relevance, the ratio of features in the data set that are relevant to the given context, to the AI use case that you're trying to do, and the acceptance criterion is at least 80%, let's say for example. But when we measured it for our use case we saw
            • 201:30 - 202:00 that only six out of 11 features are relevant in some way, so this is a data quality check that fails, and now what we need to do is apply quality improvement processes. For this instance it's fairly simple: what you do is you drop anything that's not relevant, so you only keep the six features, and then you fulfill your acceptance criterion of at least 80%. Let's look at another data quality check, let's say completeness, and we
            • 202:00 - 202:30 talked about it; what this means is the presence of null values in your data. Let's say we apply it on the cholesterol column and we say that the acceptance criterion is 100%, so all of the values must be present in the cholesterol column, and we see that this is indeed the case, all of the values are complete, so this passes this particular data quality check. There are some more as well, but that's an overview of how you can measure your data quality.
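The recording does not name the library behind these checks, so here is a generic sketch of producing a JSON quality report with plain pandas; the feature list, thresholds, and file names are illustrative only.

```python
import json
import pandas as pd

df = pd.read_csv("cardio_patients.csv")  # hypothetical data set

relevant_features = ["age", "cholesterol", "ap_hi", "smoke", "active", "gluc"]

checks = [
    {
        "dimension": "relevance",
        "measure": "ratio of features relevant to the use case",
        "value": sum(c in relevant_features for c in df.columns) / len(df.columns),
        "threshold": 0.80,
    },
    {
        "dimension": "completeness",
        "measure": "share of non-null values in the cholesterol column",
        "value": float(df["cholesterol"].notna().mean()),
        "threshold": 1.00,
    },
]

for check in checks:
    check["passed"] = check["value"] >= check["threshold"]

# Write the report so it can be inspected (or visualized) as a governance artifact.
with open("data_quality_report.json", "w") as f:
    json.dump(checks, f, indent=2)
```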
            • 202:30 - 203:00 This is again a free, open-source tool, and I just want to show you: if you don't have your data quality dimensions defined yet, maybe this could be an interesting starting point, because you can go to this website and see all the data quality checks that they already have built in, all the options you can maybe just use out of the box, and then you can filter here by the theme or dimension you're looking for. Yeah, let's see, distribution, so maybe I can look at
            • 203:00 - 203:30 these uh individual data quality checks to see if I can use them and maybe this already gives you an idea of what data quality checks you want to apply for your use cases right okay um yeah and the very last thing I want to show you is um again not exactly data strategy but um something called experiment tracking uh what this means is that whenever you train uh um a machine learning model you should always
            • 203:30 - 204:00 track your uh model training run or we call it an experiment when you train your model uh you should always try to track this to maintain this traceability and to maintain the linking of your code your model and your data together if you can link these three things together you will never have a problem in terms of traceability right uh even when you're working on uh a model as a data scientist you often might get lost about uh you know oh this model uh that I did
            • 204:00 - 204:30 yesterday worked really well but I can't seem to find it which data did I run it on uh what what were the exact parameters that I run it on maybe you might not have always this information with you so if you enable experiment tracking all of this is automatically recorded so this this is one run of the experiment right so this is one experiment uh here you can see it tracks the input data that was used if you click here you go to the input data it tracks the model that was the output you can go to the model from here it tracks
            • 204:30 - 205:00 any other metrics that you also want to log with it like the parameters uh the accuracy and all of this information and then you can actually come here and you can find the code of the model itself as well right and you can find other things like you can find uh certain graphs that you want to output for example right so all of this information is tracked together uh where it links the code the model and the data together U and I always recommend no matter what what Workshop I do whether it's data strategy
            • 205:00 - 205:30 or something else: if you're starting out with AI use cases, always use an experiment tracker. The one I always recommend is MLflow; it's open source and it's really good. You don't necessarily have to use this on Azure, by the way; none of these things are specifically Azure, you can use them on anything, and that's why we try to use open-source tools which you can use on any platform. But MLflow is the thing that we used here for experiment tracking, and it's really easy to use and really helpful, in my opinion.
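A minimal MLflow sketch of what such a tracked run can look like is below; the data file, model choice, and parameters are placeholders and not the setup used in the demo.

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("heart_disease_training_data.csv")      # hypothetical file
X, y = data.drop(columns=["cardio"]), data["cardio"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("cardio-disease-detection")

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                               # parameters of this run
    mlflow.log_metric("accuracy", accuracy)                 # metrics of this run
    mlflow.log_artifact("heart_disease_training_data.csv")  # the exact data used
    mlflow.sklearn.log_model(model, "model")                # the trained model itself
```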
            • 205:30 - 206:00 Okay, so that was all I wanted to show you for the demo. Any questions about anything that I talked about? Are there any questions on the parking
            • 206:00 - 206:30 lot nope okay if not uh let me check my slides all right so I have basically one last topic for you today which is the topic of data governance and uh essentially this relates to data management so we want to talk about some best practices that you can use to try to govern all of the data in your organization um so first of all we want to motivate why we want to do data
            • 206:30 - 207:00 governance. So why is data governance important? I mentioned some of these things already, so let's just go through them. First of all, discoverability: where can I find the data sets in my organization? Data governance enables discoverability, for example via a data catalog, and then there are also other ways; maybe you have really nice documentation somewhere, or you have some sort of dashboard where you can
            • 207:00 - 207:30 visualize these things that's also possible then there is the reproducibility so where can I find the data sets that are connected to my ml experiments that's what I talked about when I was mentioning experiment tracking so U whenever you do these experiments or you train your model uh which version of which data set was used to train that right so this reproducibility so that I can uh duplicate the
            • 207:30 - 208:00 results. Okay, then you have the accessibility of your data: on which data storage system in the company is this located, and how easy is it technically for me to access it? There is the traceability of your data, where you again ask which data set am I using at the moment, on which data is this data set based, which is the data transformation flow I showed you earlier, or the data lineage, and what entity created the data set,
            • 208:00 - 208:30 so who was the person responsible for doing this. Then you have questions that are a bit more about privacy; we call it data governance, but it's really privacy: how can I make sure that sensitive data sets, let's say with a higher degree of confidentiality, have access restrictions in place and are only used in compliance with the company's rules? This is a matter of legal compliance at this point, so this is something that's super important. And then there is transparency, so your data in the
            • 208:30 - 209:00 organization has to be used according to the policies of a company so how can someone how can I enable someone else to audit my data usage right so that's why we need transparency so this is the motivation of why you want you want to think about data governance and now the way I uh we propose that you think about data governance is in the form of a data life cycle right and this is what a data life cycle looks like and we already talked about many of these phases throughout the session today right so you have the
            • 209:00 - 209:30 creation or collection of your data, you have the processing of your data, all the transformations, you have the storage of your data, and then you have the actual usage, where you try to gain some insights from your data. After you use it, at some point you want to archive your data, so when that data is no longer useful, you archive it. And for this data that is archived, you also want to think about destroying it at some point, because you can't just let data sit in the archive forever; that
            • 209:30 - 210:00 just builds up your costs and possibly leaves you exposed to vulnerabilities, so you also want to think about destroying your archived data. So this is the entire data lifecycle at a glance, and now we'll look at best practices for each of these lifecycle phases. So first off, the creation of the data: this phase is characterized by data that is potentially created through different sources, using
            • 210:00 - 210:30 different formats uh these have different uh frequencies maybe so maybe some of them are batch data some of them are streaming data um and you also uh want to create metadata at this point as soon as you're creating also the data uh and the data governance questions you want to ask here are are there any evaluations in place for this data acquired um so what that means is that uh when you're creating this data you want to assess whether um you know it's been created in the correct way right I
            • 210:30 - 211:00 mean do does it fit certain requirements right maybe you can do um some some data quality checks at this point to to see if this meets the criteria of what you want to use this data for um and then there's also the question of any limitation on who can access this data in the future right then there's the processing phase so here you want to integrate multiple data sources you would want to think about your data cleaning processes and
            • 211:00 - 211:30 this is also where your ETL and ELT pipelines come into play, and the governance question to ask here is: are data lineage and classification done? Data lineage again refers to the provenance of your data, or where is my data coming from, so that you can answer that this specific column in my data set comes from this specific data set, which is coming from that server, and this is what it means, and this is how this data was collected. You can answer these questions; that's what data lineage enables you to
            • 211:30 - 212:00 do. And then again, here you want to validate your data quality, to see if you can actually use this further for AI use cases, or which use cases would be suited for this data. Then there is the storage phase: this is typically where you store the data and the metadata, you think about the protection of your storage systems, and you make the choice between a data warehouse or a lake; there are also data
            • 212:00 - 212:30 lakehouses that do both of these. So you have to evaluate where to store your data, essentially, and the governance questions you want to ask here are about the encryption of my data and backups to ensure redundancy of my data. Typically these questions are super important, especially when you're storing the data on premises, because then you need to do the encryption yourself and think about the backups yourself; usually a cloud provider will do them for you, but you can of course specify certain
            • 212:30 - 213:00 requirements like maybe you have uh data that requires a much higher degree of encryption or something so you can specify also those requirements then you have the usage phase of the data life cycle so this is uh where you visualize and analyze your data for insights and you try to make informed business decisions based on this and the governance questions here are is there an access management in place right so can anyone access this data or do you need certain credentials to be able to do that and additionally
            • 213:00 - 213:30 are there any regulatory or contractual obligations for the usage of this data? Then there's the archiving phase: this is where data is removed from active production environments, so you no longer need that data, it's no longer useful for you, so you remove it. That data is not processed, used, or maintained any longer; you only store it in case it might become useful again at some point in time, and the question here is how long
            • 213:30 - 214:00 to retain this uh data right and that's answered again in the destruction phase where of course it's not feasible to store all of your data forever so you need to delete some of it and this is done from the archive storage location right so it's already been taken out of use it's already been archived and then you delete it and the governance questions here are first you want to check if there are any specific obligations for retaining data for a certain period of time so for example if
            • 214:00 - 214:30 your AI use case falls under the AI Act, then you might have to store your data for at least a year before deleting it; that could be one case. Otherwise, depending on your specific domain, maybe there are certain regulations that apply to you, and they will state for sure that you need to maintain the data for a certain duration of time. So it's worth looking into those obligations, and along with that, also understanding federal regulations, industry standards, etc.
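A toy sketch of how such retention rules could be checked during the archive and destruction phases is shown below; the categories and retention periods are invented, not legal guidance.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention policy per data category (e.g. driven by regulation or contracts).
RETENTION = {
    "ai_training_data": timedelta(days=365),
    "marketing_exports": timedelta(days=90),
}

def lifecycle_action(category: str, archived_on: date, today: Optional[date] = None) -> str:
    """Decide whether an archived data set may be destroyed yet."""
    today = today or date.today()
    if today - archived_on > RETENTION[category]:
        return "destroy"
    return "retain in archive"

print(lifecycle_action("ai_training_data", archived_on=date(2023, 1, 10)))
```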
            • 214:30 - 215:00 Okay, so that was the entire data lifecycle and some governance best practices. Now, everything we talked about today, how can you turn it into an output? That output, we propose, is the data management plan. What it is: the data management plan is a living document that evolves with your organization as new data sets are required and new regulations are enacted, and we have these sets of questions that
            • 215:00 - 215:30 Define your data management plan and if you can answer all of these qu questions uh this provides a guiding living document or a living artifact for essentially what is the data strategy of your organization right and this is the answer to that so what are the questions that you want to answer in this data management plan so there's the collection which we talked about what type of data are you collecting where what are the data sources what is the volume of your data uh then the organization of this data so what tools are you using uh to operate
            • 215:30 - 216:00 with this data uh are you using a data warehouse or a lake or both uh what vendors are also in play with these data what contractual obligations do you have uh maybe they're managing some of the data for you or maybe you have uh third party people who uh vendors who are doing the uh data labeling for you uh then there's also storage so where are you storing the data how long are you storing it for uh what protection and backup uh do you provide for this
            • 216:00 - 216:30 data um then policies so are there any specific regulations that apply to you either ethical uh or legal from a legal standpoint and lastly also the roles right so what roles and responsibilities corresponds to all of these things in the data life cycle so who are the people responsible for taking care of all of these activities throughout this data life cycle right and um this roles and responsibilities is also something that will be discussed in a later Workshop uh where we talk about uh the
            • 216:30 - 217:00 roles across an entire machine learning project and not only for data, and that will also include people specifically for data, so that's in a future workshop. But yeah, this is kind of the outcome of your entire data strategy effort: you come up with this kind of data management plan that answers these questions, not just in broad strokes but in detail. It's a living document; you furnish it with more information as your organization evolves and as your AI use cases evolve. So that's the data management plan.
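A skeleton of such a plan might look like the structure below; in practice this would live in a wiki or repository rather than in code, and every entry here is a made-up placeholder.

```python
# Hypothetical skeleton of a data management plan, used only to show the shape of the answers.
data_management_plan = {
    "collection": {
        "data_types": ["patient records", "sensor readings"],
        "sources": ["vendor SQL export", "clinic Excel uploads"],
        "volume": "approx. 50 GB per year",
    },
    "organization": {
        "tools": ["Azure Machine Learning Studio", "DataHub"],
        "storage_archetype": "data lake plus data warehouse",
        "vendors_and_contracts": ["data provider X (licensing)", "labeling vendor Y"],
    },
    "storage": {
        "location": "cloud",
        "retention": "raw data 1 year, then review",
        "protection_and_backup": "encryption at rest, daily backups",
    },
    "policies": ["GDPR", "internal data classification policy"],
    "roles": {"data owner": "domain team", "data steward": "data platform team"},
}
```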
            • 217:00 - 217:30 Okay, so there is a section also on data governance tools that I wanted to show you, but I'm not sure if it is maybe too technical for the audience. Basically, here I just wanted to show you an example of certain things that you might want to do and the tools that you would use for them; maybe I will just roughly give you
            • 217:30 - 218:00 the gist of it. So I talked about data versioning; that's one of the things. What is data versioning, first of all? It is basically maintaining versions of your data. For example, if you use a code versioning system like GitHub, you will know that you have versions of your code, so you can always go back to a certain point in time when you had this specific code, and you can point to it. So essentially it's like maintaining a
            • 218:00 - 218:30 history of your data and all the changes that happened to it; that's what it helps us to do, it helps us to track the changes in these huge files. Also, the data is stored separately, which allows for efficient sharing and gives developers access to the latest version of the data, and it makes it easier to use these artifacts from outside the project, so it makes sharing of your data also much easier if you implement data
            • 218:30 - 219:00 versioning. Yeah, so as an outcome it's much easier to track the workflow within different stages and also to reproduce that outcome; that's the point of data versioning. Typically you want to be versioning your data sets, because otherwise your data will change in a certain way and you can never go back, and you don't want that to happen, because humans make mistakes, or maybe you didn't think of something that you could have thought of before, so
            • 219:00 - 219:30 maybe sometimes you want to go back, and that's why data versioning can be really helpful. On top of that, if you have very sensitive data, then data versioning will be a requirement for you; also under the AI Act, for sure, you need to version your data sets, so that's also something to think about. An open-source tool that we use in this case is DVC, or Data Version Control; it's open source and it's really nice and easy to use as well, so that's something to look at.
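A typical DVC workflow is command-line based; the sketch below shows the usual commands as comments and then reads one tagged version of a file back from Python via `dvc.api`. The paths and the `data-v1` tag are hypothetical.

```python
# Typical DVC usage from the command line (shown as comments):
#   dvc init
#   dvc add data/heart_disease_training_data.csv   # start tracking the file with DVC
#   git add data/heart_disease_training_data.csv.dvc .gitignore
#   git commit -m "Track training data" && git tag data-v1
#   dvc push                                       # upload the data to remote storage

# Later, a specific version of the data can be pulled back from Python:
import io
import pandas as pd
import dvc.api

content = dvc.api.read(
    "data/heart_disease_training_data.csv",  # hypothetical path
    repo=".",                                # the Git repo that tracks the .dvc file
    rev="data-v1",                           # the version (Git tag or commit) to fetch
)
df_v1 = pd.read_csv(io.StringIO(content))
print(df_v1.shape)
```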
            • 219:30 - 220:00 I won't go over this slide, but basically it lists all the challenges that you want to overcome with the help of data versioning. And then I also want to quickly touch upon data catalogs, which, as I mentioned, is my favorite thing in terms of data governance. What it is: it's a single universal view of your data across all of your data stores, across your organization, and what this gives you is much better data discovery and governance. If you have a data catalog it's much, much easier to manage your data sets, so you
            • 220:00 - 220:30 have all of these data sources here; maybe you have different data sources within the same organization, that's a very typical scenario, but if you have this data catalog, it acts as an interface between the data users and the data sources. It provides an inventory for your data, it uses metadata for managing your data, which already enriches your data, and it provides a holistic, organization-wide view over all of your data assets. And then
            • 220:30 - 221:00 look at the variety of roles this could be useful for: a data scientist, a business analyst, a data steward, a data engineer, a chief data officer, a compliance officer; for all of them a data catalog can be useful. Again, these are the challenges that a data catalog tries to overcome, I won't go over that, so it supports you in the ingestion of metadata, the discovery of available data, and adhering to data governance policies, because your
            • 221:00 - 221:30 company might define certain policies, and as we saw in the example image I showed you of the data catalog, you can actually list certain policies within the asset; you can monitor your data quality