AstraZeneca’s Data-Centric Approach to AI in Pharma
Estimated read time: 1:20
Summary
AstraZeneca is leveraging a data-centric approach to use AI and machine learning in the pharmaceutical industry, particularly in cancer therapy and rare diseases. The focus is on providing personalized education to doctors to aid in making timely treatment decisions, given the rapid innovation in targeted therapies. This involves managing a multitude of noisy data sources, such as electronic medical records and insurance claims, and using AI for data enrichment and labeling. Despite the challenges of fragmented data, AstraZeneca's approach highlights the importance of domain experts in the process, as they play a crucial role in ensuring data accuracy and actionable insights. The company has demonstrated success in identifying specific cancer biomarkers, essential for precise treatment recommendations, thus emphasizing the potential of AI to transform healthcare.
Highlights
AstraZeneca focuses on AI to tackle cancer and rare diseases. 🚀
Personalized treatment education for doctors is essential. 💼
Managing noisy data from multiple sources is challenging but key. 🔧
Human expertise remains critical in AI data processes. 🤝
AI has shown success in enhancing cancer treatment with biomarkers. 🧬
Key Takeaways
AstraZeneca is revolutionizing AI in pharma with a data-centric approach for cancer and rare diseases. 🧠
Personalized doctor education is pivotal for timely cancer treatment. 🎓
Overcoming fragmented and noisy data is crucial for effective AI implementation. 🗂️
Expert involvement enriches data, making it actionable for healthcare transformations. 🌟
Successful AI use cases include identifying cancer biomarkers for better treatment targeting. 🧬
Overview
AstraZeneca is at the forefront of utilizing AI in the pharmaceutical industry, especially in cancer therapy. The main challenge they face is the rapid pace of innovation and treatment options, which requires timely and personalized education for healthcare providers. By leveraging various data sources and employing AI, they aim to provide insights that help doctors make informed treatment decisions.
A critical part of AstraZeneca's strategy involves dealing with noisy and fragmented data from EMRs and insurance claims. They engage domain experts to enrich and label this data, which is fundamental to developing accurate predictive models. This collaboration between humans and AI is essential to overcoming the constraints of incomplete or inaccurate data, thus making the data actionable.
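The expert-labeling workflow described above can be approximated with simple programmatic labeling: domain experts encode their heuristics as small labeling functions, and the (possibly conflicting) votes are combined into a weak training label. Below is a minimal stdlib-only sketch; the rule logic and record fields (`codes`, `oncology_visits`) are hypothetical illustrations, not AstraZeneca's actual rules.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical expert heuristics over a claims record (a plain dict).
def lf_targeted_rx(record):
    # A targeted-therapy prescription strongly suggests a positive biomarker.
    return POSITIVE if "targeted_rx" in record["codes"] else ABSTAIN

def lf_no_biopsy(record):
    # No biopsy on file: the biomarker was likely never tested.
    return NEGATIVE if "biopsy" not in record["codes"] else ABSTAIN

def lf_oncology_visits(record):
    # Frequent oncology visits weakly suggest active targeted treatment.
    return POSITIVE if record.get("oncology_visits", 0) >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_targeted_rx, lf_no_biopsy, lf_oncology_visits]

def weak_label(record):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [lf(record) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

record = {"codes": {"targeted_rx", "biopsy"}, "oncology_visits": 4}
print(weak_label(record))  # 1 (positive)
```

In practice a learned label model replaces the majority vote, but the shape of the pipeline — experts write functions, software aggregates them — is the same.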
The company has achieved notable results in its AI initiatives, such as identifying important cancer biomarkers. These achievements underline the transformative potential of AI in healthcare, enabling more precise and individualized treatment interventions. AstraZeneca's approach not only exemplifies the future of pharma but also demonstrates the collaborative power between technological and human intelligence.
Chapters
00:00 - 01:00: Introduction and Speaker Introduction The chapter titled 'Introduction and Speaker Introduction' sets the stage for the talk by introducing the speaker, Ravi Gopalakrishnan, Global Head of Commercial Data Science and AI at AstraZeneca. The introduction is enthusiastic, highlighting the excitement for the upcoming presentation. Ravi is welcomed to the stage, acknowledges the audience, and checks that he can be heard clearly before beginning his talk.
01:00 - 02:00: Overview of AI in Pharma The chapter titled 'Overview of AI in Pharma' begins with an introduction by Ravi Gopalakrishnan, who is part of the AstraZeneca data science and AI team. He sets the stage to discuss the application and impact of artificial intelligence in the pharmaceutical industry, covering aspects such as innovations in drug discovery, efficiency in clinical trials, and improvements in patient outcomes driven by AI technologies.
02:00 - 03:00: Opportunities and Challenges in Personalized Medicine The chapter discusses opportunities and challenges in personalized medicine, particularly in the pharmaceutical industry. One of the key aspects is the use of data-centric AI and machine learning. The central issue being addressed is the provision of personalized therapies, especially for cancer and other rare diseases.
03:00 - 04:00: Coordination of Education and AI Systems The chapter discusses the coordination between education systems and AI in the healthcare sector, emphasizing the importance of educating doctors to make timely and accurate treatment decisions. It highlights rapid innovation in medical care, particularly the development of hundreds of new targeted therapies that are effective for specific conditions. However, it also points out that despite these advancements, many patients miss out on receiving appropriate treatment at the right time. The text suggests that one of the reasons for this issue is the difficulty doctors face in keeping up with these advancements.
04:00 - 05:00: Utilizing Data Sources for Insights The chapter explores the challenges in cancer treatment, emphasizing the rapid pace of innovation and evolving guidelines. It highlights the complexity of treating rare diseases, often requiring a multidisciplinary team approach. This involves collaboration among various specialists such as surgeons, radiation oncologists, and medical oncologists, all contributing to treatment decisions. The narrative underscores the necessity for cohesive teamwork and data utilization to enhance treatment outcomes.
05:00 - 06:00: Challenges in EMR and Claims Data The chapter discusses the challenges in integrating Electronic Medical Records (EMR) and claims data to ensure coordinated and personalized patient education. It highlights the inefficiencies of a one-size-fits-all approach and suggests that building AI software could significantly improve the coordination among various medical specialties. The focus is on creating systems that tailor investigations and educational strategies to individual patient needs, rather than applying a generic solution.
06:00 - 07:00: Data Labeling and Enrichment The chapter discusses strategies for data labeling and enrichment. It emphasizes leveraging both external, real-world data sources and internal data related to drugs and their approved uses. The approach aims to consolidate these diverse and often noisy data sources to better understand and engage with medical professionals.
07:00 - 08:00: Predictive Modeling and AI System Development The chapter discusses the process of understanding patient-level data to identify unmet needs, deviations from standard care, and the reasons behind these issues. By addressing the 'what' and 'why' questions regarding prescribing behaviors, mechanisms, and care gaps, AI systems can be developed to predict optimal interventions. These systems aim to provide actionable insights to improve healthcare outcomes.
08:00 - 09:00: Scaling Challenges and Solutions The chapter "Scaling Challenges and Solutions" discusses the coordination challenges faced by different teams within the Pharma commercial sector. These include commercial teams, medical teams, and field-based representatives. The focus is on enhancing digital experiences for marketers while ensuring effective communication among all these diverse groups.
09:00 - 10:00: Case Studies: Lung and Breast Cancer In the chapter titled 'Case Studies: Lung and Breast Cancer', the focus is on utilizing real-world data to provide insights that are easily consumable by doctors. The discussion highlights the importance of using comprehensive datasets to understand the patient journey, emphasizing the need for breadth in data to overcome certain key challenges. This approach aims to enhance the educational resources available for medical professionals, particularly in the context of studying lung and breast cancer.
10:00 - 12:00: Q&A Session - Tools and Infrastructure The chapter titled 'Q&A Session - Tools and Infrastructure' discusses the challenges of collecting comprehensive data in specific markets or diseases, such as lung cancer. It emphasizes the need to consider data from both Community Cancer Centers and academic institutions due to patient distribution across these settings. A significant challenge highlighted is the fragmentation of Electronic Medical Records (EMR), which complicates comprehensive data collection.
12:00 - 14:00: Q&A Session - Expert Input and AI Relationship This chapter discusses the integration of EMR data with various other data sources, such as insurance claims data, which is often gathered and consolidated by different aggregators. These data sources, while comprehensive in terms of coverage, tend to be noisy and sometimes contain significant amounts of missing data.
14:00 - 15:00: Conclusion and Community Engagement The chapter focuses on the challenges of data capture and accuracy, particularly in biomarker and lab testing data, which is often missing or incomplete due to privacy and business considerations. To overcome this, the chapter discusses the strategy of engaging medical and business experts to help label and enrich the data, thereby enhancing understanding and enabling better insights from the available information.
15:00 - 16:00: Closing Remarks This chapter discusses the importance of investing time and resources into understanding key challenges. The approach involves engaging with domain experts from the beginning to provide the right tools for data labeling and enrichment.
AstraZeneca’s Data-Centric Approach to AI in Pharma Transcription
00:00 - 01:00 We have a really exciting talk coming up next. Please join me in welcoming to the stage the Global Head of Commercial Data Science and AI at AstraZeneca, Ravi Gopalakrishnan. Thanks, Ravi, welcome. Super excited for your talk. The stage is yours. Thank you. Can you all hear me okay? My name is Ravi Gopalakrishnan, and I'm part of the AstraZeneca data science and AI team.
01:00 - 03:00 What I'm going to talk about is one of the key opportunities and challenges in pharma commercial around the use of data-centric AI and ML. The main problem we're trying to address, especially in cancer therapies and other rare diseases, is providing the right level of personalized education to doctors so they can make the right treatment decision at the right time. There is rapid innovation in care: hundreds of new targeted therapies are being approved that are very effective for very specific indications. However, many patients are left behind and don't get the right treatment at the right time. There are multiple reasons for this. One is that it's extremely hard for doctors to keep up with the pace of innovation and all the changing guidelines that accompany it. The second is that rare diseases and cancers are usually treated by a multidisciplinary team that can involve multiple specialties, like surgeons, radiation oncologists, and medical oncologists, all of whom are involved in treatment decisions. Because of that, you need the right level of coordinated education to make sure the patient gets the right sort of investigation across all these different specialties, and a one-size-fits-all approach will not work. So this is a huge opportunity for building AI software and systems to enable this level of coordinated and personalized education.
03:00 - 04:30 So how do we do it? Essentially, the way to think about it is leveraging all the available data sources: both external real-world data sources, which are extremely noisy, and our internal data sources about our drugs, what they're approved for, and how we want to engage with doctors. We bring all these data sources together and try to understand, at a patient level, what's really happening: what are the unmet needs, what are the deviations from standard of care, and why is it happening, that is, what are the drivers of prescribing behaviors, mechanisms, and care gaps. Once you answer the what and the why questions, you can build AI systems to predict the next best intervention and surface those actionable insights to all the different teams engaged in conversations with doctors. Within pharma commercial there are multiple teams: commercial teams, medical teams, field-based reps, and there's a lot of emphasis on delivering more digital experiences through our marketers. The idea is to coordinate all the communication to all these different people in a manner that's easily consumable by the doctors, because most of the education we provide is for doctors.
04:30 - 06:00 So what are the key challenges? We have to use a lot of real-world data sets to provide this level of insight, and one of the key challenges is that we need breadth of data to really understand what's happening from a patient-journey perspective. By breadth of data I mean that for a specific market and a specific disease like lung cancer, you need to look at everything happening across community cancer centers and academic cancer centers, because there are patients everywhere, and we have to reach them effectively through their doctors. The problem is that EMR data sets, as many of you know, are highly fragmented, and it's impossible to get an integrated view with the breadth of EMR data. So we have to rely on other data sources, like insurance claims data, which is usually generated by a bunch of aggregators and captures every medical claim and every pharmacy claim. The problem there is that while it has the breadth, it's extremely noisy and has a lot of missing data.
06:00 - 07:30 A lot of data is masked for privacy or other business reasons, there are very poor capture rates, and data like biomarker testing and lab testing is completely missing, so it's hard to understand what's happening without it. The way we address that is by engaging medical experts and business experts to label this data. Labeling is a huge part of our investment in time and resources. So these are the main challenges, and the approach we're taking is to engage with domain experts right from the beginning: they're involved throughout, and we provide the right set of tools for them to help us with labeling and with enrichments to the data. Enrichments primarily address missing data, like biomarker status, for instance. It's impossible to get that from claims, so we look at the patterns happening before and after and use machine learning and AI to make those enrichments. Enrichments are essential for us to build predictive models that understand which patient is eligible for which drug, and when. That's the process we've been following.
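The "look at patterns before and after and use machine learning" step can be sketched as a tiny naive-Bayes-style classifier: learn per-code log-likelihood ratios from a small labeled cohort, then score unlabeled patients. This is a stdlib-only illustration under stated assumptions; the claim codes (`tki_rx`, `ct_chest`, `chemo_rx`) and cohort are made up, and the real models are certainly richer.

```python
import math
from collections import defaultdict

def train_code_model(labeled_patients):
    """Learn per-code log-likelihood ratios from a labeled cohort.

    labeled_patients: list of (codes, status) where codes is a set of
    claim codes observed around diagnosis and status is True/False for
    a hypothetical biomarker (e.g. EGFR-positive).
    """
    pos_counts, neg_counts = defaultdict(int), defaultdict(int)
    n_pos = sum(1 for _, s in labeled_patients if s)
    n_neg = len(labeled_patients) - n_pos
    for codes, status in labeled_patients:
        for c in codes:
            (pos_counts if status else neg_counts)[c] += 1
    llr = {}
    for c in set(pos_counts) | set(neg_counts):
        p = (pos_counts[c] + 1) / (n_pos + 2)   # Laplace smoothing
        q = (neg_counts[c] + 1) / (n_neg + 2)
        llr[c] = math.log(p / q)
    prior = math.log((n_pos + 1) / (n_neg + 1))
    return llr, prior

def predict_status(codes, llr, prior):
    """Impute biomarker status from the codes a patient actually has."""
    score = prior + sum(llr.get(c, 0.0) for c in codes)
    return score > 0.0

# Tiny illustrative cohort: targeted-therapy codes co-occur with positives.
cohort = [
    ({"tki_rx", "ct_chest"}, True),
    ({"tki_rx"}, True),
    ({"chemo_rx", "ct_chest"}, False),
    ({"chemo_rx"}, False),
]
llr, prior = train_code_model(cohort)
print(predict_status({"tki_rx"}, llr, prior))   # True
```

The enriched status then becomes a feature or label for the downstream eligibility models the talk describes.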
07:30 - 10:00 Like I said, labeling this noisy data is where we spend 80 to 90 percent of our time. We have to look at different categories of data and label them so that we deliver the right insights at the right time for engaging with doctors. The first level is identifying a cohort of patients, which combines medical claims with pharmacy claims, and then imputing all the missing values using data enrichments. The second is understanding the entire patient journey: the patient characteristics, the comorbidities, what treatments they've been given in the past, their tolerance to certain types of drugs, the side effects and adverse events they've experienced. That's the next level of data we have to tag, because different drugs have different side effects and we need to be very careful about how a patient is likely to respond to a drug. Next is how investigation is happening: how are the patients being tested, before or after diagnosis, and if they're being given treatment, how well are they responding to it? There's a lot of testing data we have to tag and impute. Then, what treatments are they getting: what are the treatment characteristics, what is the duration, are there any disparities in treatment, are certain ethnic groups getting preferential treatment over others? There's a need to capture that data as well. Then obviously adverse events: how are patients progressing from a disease standpoint, is the treatment successful or not? And last but not least, identifying care gaps. So you can see that unless we have all of this, we won't be able to develop any AI software that makes reasonable predictions. This is pretty foundational, along with the enrichments, for making noisy data usable.
10:00 - 12:30 So what are the opportunities for doing this at scale? We've spent a good part of a year and a half trying to be efficient at labeling all this data and building AI software, but what we're realizing is that it won't really scale. In order to scale, we need more efficient approaches to do a bunch of different types of labeling, and that's what will make the AI systems more accurate and more actionable. So here are some of the ideas. Slice-based data evaluation, some of the work being done by Snorkel, is a good thing to understand. Programmatic labeling is a way we can definitely improve the speed, accuracy, and efficiency with which we get experts to label the data. Then we have to deal with multimodal data sets, because certain EMR and EHR data sets are very deep, with a lot of detailed information, but only for one hospital. How do you learn from that data and apply it to the broader nationwide claims data? There's a lot of opportunity there. There's also the question of how you represent the knowledge of experts, both within a pharma company and external, who have a really good understanding of the guidelines, the disease profiles, and the side effects of specific drugs. And last but not least, how do you make the labels from all the different experts more consistent and more accurate, so AI systems can learn? These are some of the big opportunities we see.
12:30 - 13:00 We've had some initial successes with the work we've done, but it takes a long time to build AI software that really personalizes every interaction we have with our physicians. I'll give you a couple of examples. Some of the work we've done on lung cancer is around a biomarker, a mutation called EGFR. Essentially, we have a very targeted therapy for lung cancer patients who have this specific EGFR mutation.
13:00 - 14:00 And it works remarkably well, both for early-stage and late-stage lung cancer. The problem we're trying to solve is how to guide the identification of EGFR patients and the treatment of EGFR patients at the right time, when EGFR testing data and EGFR biomarker status are both completely missing from claims. So we took a machine learning approach to enrich and tag every patient record in the United States, every patient diagnosed with non-small-cell lung cancer, with their EGFR status. We got pretty decent results on that, and it fits well. Our next challenge is to see how we can make this more actionable through our sales, marketing, and patient outreach efforts.
14:00 - 16:00 Another example is in breast cancer. We have an amazing, recently approved drug called Enhertu for different classes of metastatic breast cancer patients, and one of the specific indications is a particular biomarker called HER2-low, for which no other drug exists. The problem is finding these HER2-low patients, so it's a biomarker-enrichment kind of problem, and using multimodal data sets is the only way to do it. We used an EMR data set covering a small subset of metastatic breast cancer patients, curated by Flatiron Health. It's a small data set, only about 27,000 patients, but then we had to apply what we learned to the millions of metastatic breast cancer patients across the United States coming from claims. So we developed a series of machine learning models to classify some of these patients in one data set and transfer the learnings from the EMR data set to the claims data. This is a devastating breast cancer profile, and all these patients can now benefit from this new therapy. So those are a couple of quick examples, and again, the results after the enrichments, using the machine learning approaches, are pretty decent and acceptable. We're trying to see how best to use this to ensure that all patients who are eligible for this particular therapy get it at the right time. That concludes my talk, and I'll now pass it on for any Q&A.
16:00 - 17:30 Ravi, thanks so much for that presentation, that was great. There are a few questions coming in now that I'd love to get your take on. Bashisht has a few questions. First: what tools are you using to accelerate data enrichment? — Sorry, can you repeat the question? — What tools are you using to accelerate data enrichment? — Right now we're experimenting with a lot of tools we built in-house, a combination of what we get from AWS and a lot of different tools, because some of them are largely domain-specific business rules, which is where we start. It's a combination of using business rules — I showed the part about tagging all the different data; once you tag the data and have the business rules, you use those — and then a machine-learning-based approach to enrich the data.
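The combination described here — deterministic business rules first, with a trained model as fallback — might be sketched like this. This is a hedged, stdlib-only illustration; the record fields, rule, and labels (`egfr_tki_rx`, `EGFR_POSITIVE`) are hypothetical, and `toy_model` stands in for whatever classifier is actually trained.

```python
def enrich_with_rules_then_model(record, rules, model):
    """Apply deterministic business rules first; fall back to a model.

    rules: list of functions returning a label or None (abstain).
    model: callable returning a label for records no rule covers.
    Returns (label, source) so downstream consumers know provenance.
    """
    for rule in rules:
        label = rule(record)
        if label is not None:
            return label, "rule"
    return model(record), "model"

# Example rule: a drug approved for only one indication pins the label.
def rule_indication_specific_rx(record):
    if "egfr_tki_rx" in record.get("codes", set()):
        return "EGFR_POSITIVE"
    return None

def toy_model(record):
    # Stand-in for a trained classifier applied when every rule abstains.
    return "EGFR_UNKNOWN"

label, source = enrich_with_rules_then_model(
    {"codes": {"egfr_tki_rx"}}, [rule_indication_specific_rx], toy_model)
print(label, source)  # EGFR_POSITIVE rule
```

Tracking whether each label came from a rule or a model keeps the pipeline auditable, which matters for the compliance review discussed later in the Q&A.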
17:30 - 18:00 These are standard modeling approaches: you take a cohort of patients that, per the truth set, clearly have certain characteristics, and you apply that to the broader population. Yes, these are all custom-built tools. — Got it, exciting. The next question: what are the main data-infrastructure bottlenecks in the pharma space at large, and what can be improved to enable AI/ML projects?
18:00 - 19:30 — The scale of data we're dealing with, if you take the US alone — infrastructure has not been an issue for us so far, because it's all about high-quality data sets. From several million patient records, especially for cancer, you get down to hundreds of thousands, and then tens of thousands; that's what we have to process. The current infrastructure through AWS works pretty reasonably; we have not run into any bottleneck from an infrastructure standpoint. The bigger problem for us is how we do domain-expert labeling at scale, because that's what's needed to actually train some of our AI systems. — If only there were solutions focused on that! I'm just kidding. Another question just came in: how do you look at evaluation for these use cases? Are there stricter requirements given the nature of the work you all do at AstraZeneca and in the space at large?
19:30 - 20:30 — We're experimenting with a lot of different evaluation criteria. Luckily, our use cases are more at an aggregate level: we want to personalize the way we engage with doctors, not provide recommendations at the individual patient level, so we don't run into that issue, and we have good compliance. Through our sales and marketing channels, we provide "this is the next best action for this doctor based on the patient profile pool." We try to minimize false negatives, and we validate our results through a combination of the highly curated, well-established data sets we have and review by academic experts and key experts — does it really make sense? It's a combination of both.
20:30 - 21:30 — Got it, that's awesome. For those watching now or later who are interested in following your work, is there a good way to follow along with what you and your team are doing at AstraZeneca, or a preferred way to reach out to you personally? — Yeah, absolutely. Reach me at ravi.gopalakrishna at astrazeneca.com. We want to partner with the data-centric AI community and jointly innovate with you; we can't solve all of this ourselves. Part of the opportunities and challenges I highlighted is how we can leverage what's already happening within the community. We're based in San Francisco, so from a location standpoint I'm happy to connect with as many people as possible to address these problems. We're saving lives, so it's a good problem to work on.
21:30 - 23:00 — That's incredible, thank you so much for the work you all are doing. It looks like another question just came in, and we have a few minutes, so we'll go to it: how do you see the relationship between machine learning models and domain experts maturing in the future? What's the role of human input moving forward, in your eyes? — Human and expert knowledge, given the noisiness and the kinds of issues we're seeing at the stage we're at, has to be there; there has to be an override, for compliance reasons and a lot of other reasons. We're using machine learning mainly as a triage, to filter and then run things by experts. But I think medical experts will have a huge role to play in labeling these data sets, because the science is rapidly evolving; it's going to keep changing, it's not static, so you can't really fill that gap with more data. You really need experts in the loop for everything we do. We can elevate what the experts do by providing them the right kind of decision-support tools, but eventually the experts will have to make the final decision.
23:00 - 24:30 — That's spot on with what I'm observing in my conversations with clients in the field: they just want to make it more efficient to partner with these subject-matter experts, with less friction for them to contribute their domain knowledge, rather than having to keep up with all these advancements in the science themselves and digest it all at a human level. — It's impossible. If you really go talk to oncologists and surgeons — and what role does pharma play, because the science is with pharma — the practitioners and treatment decision-makers are all highly specialized, all graduates of the best universities, and it's still impossible for them to keep up. Just at ASCO in June, from AstraZeneca alone, there were about 100 abstracts that read out within just that week. How is a practicing oncologist going to keep up? AI systems have a huge role to play in helping them keep up and guiding them based on the types of patients they're treating.
24:30 - 25:30 — It's incredibly exciting: you have the AI and machine learning world moving rapidly, with research papers flooding in all the time on the latest advancements, and then your industry is also evolving rapidly on the domain side. Never a boring day, I bet. — Exactly. I think expert labeling at scale is going to be a huge area, and how we generate the right set of tools to do that efficiently is a huge area for us to focus on, by partnering with folks like yourselves and others in the community. And MLOps is going to change; MLOps will be a huge area as well. — Could not agree more: MLOps, monitoring, really the entire life cycle of how the training data gets generated and the models get developed and continuously updated. It's going to be exciting.
25:30 - 26:30 Well, Ravi, on behalf of everyone at Snorkel, thank you, and thanks to the folks at AstraZeneca for all the work you're doing in pharma, improving lives all around the world. Thank you also for engaging with the data-centric AI community — you called that out explicitly, and it's important — and we appreciate you playing an active role here by giving this talk. — Thank you, and I look forward to jointly innovating with you. — Likewise. For the audience: you heard Ravi share his email address if you want to stay in touch or send follow-on questions. In the meantime, we have a poll that just went live, so let us know how you enjoyed the talk by visiting the poll section in the side menu. We're now due for a 30-minute break, and then we'll be back on the applications track for an exciting new talk. Enjoy your break, and we'll see you back here in 30 minutes.