Navigating the Intersection of AI and Regulation
#4 Regulating AI Like a Gatekeeper: Trained, Scraped and Shackled
Estimated read time: 1:20
Summary
In this fourth episode of The Binary Agora, the host delves into the complexities of regulating AI, particularly in the context of EU regulations like the Digital Markets Act (DMA). The discussion centers around the challenges posed by AI training data, especially concerning privacy, data protection, and the economic implications of scraping data from the internet. The episode highlights the struggle between leveraging AI technologies and adhering to legal frameworks designed to protect personal data, exploring how AI deployers can navigate these regulations. It also reflects on the broader implications of potential regulatory solutions and the need for cooperation between legislative bodies and AI developers.
Highlights
- AI's reliance on training data from web scraping raises significant privacy concerns. 🚨
- Regulations like the EU's DMA impose strict rules on AI deployers, complicating data use. ⚖️
- Major tech companies face challenges in aligning their AI operations with privacy laws. 🏢
- Outsourcing and siloing data approaches could be solutions for AI deployers within the EU. 🛠️
- The episode emphasizes the need for regulatory guidance and cooperation to balance AI innovation with legal requirements. 🌐
Key Takeaways
- The fourth episode of The Binary Agora navigates the complexities of AI regulation, focusing on legality and privacy concerns. 📚
- AI training data poses challenges due to its reliance on web scraping, which raises privacy and data protection issues. 🔍
- European regulations like the DMA complicate how AI developers can use personal data, especially for major tech players. 🌍
- Potential solutions to regulatory challenges include outsourcing AI functionalities and segmenting data approaches. 🤖
- The episode suggests the need for collaborative efforts between AI developers and regulatory bodies for effective governance. 🤝
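The data-segmenting idea in the takeaways can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Record` fields, the service names and the region labels are hypothetical, and nothing here reflects how any gatekeeper actually organizes its training data. It groups records per service so that no silo mixes data across services, and it drops records from excluded regions before they reach any training set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One training example; every field name here is hypothetical."""
    text: str
    service: str      # service the record originated from
    user_region: str  # coarse region of the user, e.g. "EU" or "US"


def silo_by_service(records, exclude_regions=frozenset({"EU"})):
    """Group records per service and drop users from excluded regions."""
    silos = defaultdict(list)
    for record in records:
        if record.user_region in exclude_regions:
            continue  # excluded users never enter any training set
        silos[record.service].append(record)
    return dict(silos)


# Made-up sample data, for illustration only.
records = [
    Record("query about hiking boots", "search", "US"),
    Record("comment on a video", "video", "EU"),
    Record("comment on another video", "video", "US"),
]
silos = silo_by_service(records)
print({service: len(items) for service, items in silos.items()})
```

Each silo could then feed a separate training or fine-tuning run, which is the artificial segmentation the episode describes as one possible response.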
Overview
The fourth episode of The Binary Agora tackles the intricate dance between AI advancement and regulatory constraints, primarily through the lens of the European Digital Markets Act (DMA). The host discusses how AI technologies, particularly those involving large language models (LLMs), have accelerated, raising alarms about their compatibility with economic evidence and data protection regulations. The challenges of using scraped data for AI training, including privacy and intellectual property conflicts, are underscored.
Delving deeper, the episode elucidates how the DMA affects major tech giants like Alphabet, Apple, and Meta, highlighting the prohibitive measures placed on them regarding the use of personal data across their services. This segment critiques the feasibility of current regulations, exploring the potential for companies to navigate these rules through consent mechanisms and careful data processing strategies. The episode illustrates the balancing act required between maintaining privacy and fostering innovation.
In conclusion, the series advocates for an ongoing dialogue between AI developers and European legislative bodies to forge a path that honors both technological progress and legal accountability. The episode suggests that while outsourcing AI functionalities or segregating data might mitigate some regulatory challenges, a more comprehensive collaboration and guidance from regulatory authorities are crucial for sustainable AI development within the confines of stringent EU regulations.
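The episode's transcript sets out two alternative readings of how consent and the other GDPR legal bases lift the DMA's data-combination prohibition. The sketch below encodes those two readings as the speaker describes them. The function name and the reading labels are hypothetical, and the logic is a simplification of the spoken argument rather than a statement of what the DMA requires.

```python
from enum import Enum


class LegalBasis(Enum):
    """The six legal bases of Article 6(1) GDPR."""
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGAL_OBLIGATION = "legal obligation"
    VITAL_INTERESTS = "vital interests"
    PUBLIC_INTEREST = "public interest"
    LEGITIMATE_INTEREST = "legitimate interest"


def may_combine_data(basis, reading):
    """Return True if cross-service combination is allowed under a reading.

    reading="exclusive": only consent lifts the prohibition.
    reading="subsidiary": consent is preferred, other bases also work.
    Under both readings legitimate interest never lifts it.
    """
    if basis is LegalBasis.LEGITIMATE_INTEREST:
        return False
    if reading == "exclusive":
        return basis is LegalBasis.CONSENT
    if reading == "subsidiary":
        return True
    raise ValueError(f"unknown reading: {reading!r}")


for reading in ("exclusive", "subsidiary"):
    allowed = [b.value for b in LegalBasis if may_combine_data(b, reading)]
    print(reading, allowed)
```

Either way, legitimate interest, which the episode identifies as the basis most AI deployers rely on, is unavailable to gatekeepers.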
Chapters
- 00:00 - 01:00: Introduction and Overview The chapter introduces The Binary Agora series, focusing on legal tech challenges, innovations, and the impact of top researchers. The speaker discusses their latest paper on the interplay between scaling LLMs via training data in generative AI and regulation through the European Digital Markets Act.
- 01:00 - 02:00: AI and Economic Reality The chapter titled 'AI and Economic Reality' delves into the challenges and misconceptions surrounding the impact of AI on economic structures. It emphasizes the need for understanding the practical application of deep learning architectures, such as transformers, within the market. The current AI surge, ignited by groundbreaking technologies post-2017, raises questions about marrying technological advancements with economic practicality.
- 02:00 - 03:30: Generative AI and Privacy Concerns The chapter discusses the rapid adoption of generative AI by companies and major digital players over recent years. It highlights how this technology has been integrated across various sectors, becoming a central technology of our era. Scholars have increasingly raised concerns about the risks that AI poses to fundamental rights, particularly in the realm of privacy.
- 03:30 - 05:30: Data Scraping and Sources The chapter discusses the challenges and considerations regarding data scraping and the sources of data used in training AI models. It emphasizes the complexities involved in ensuring compliance with data protection and privacy laws. Specifically, it highlights the uncertainty surrounding whether the use of data in AI model training constitutes a breach of these laws, as the details of data usage in development and deployment are not always transparent.
- 05:30 - 08:00: Legal Frameworks and Regulations The chapter discusses the complexities and challenges involved in the creation and regulation of Large Language Models (LLMs). Key issues include the difficulties in understanding and defining the datasets used to train these models, and the regulatory concerns stemming from their creation, particularly in relation to data protection, privacy laws, and cybersecurity. Regulators are actively examining how training data for AI systems aligns with existing legal frameworks.
- 08:00 - 13:00: Prohibition of Data Combining The chapter discusses the prohibition of data combining, focusing on the origin and use of training data. It highlights legal challenges and conflicts with intellectual property (IP) and copyright rules, referencing significant cases like NY Times v. OpenAI in the United States. It also mentions the AI Act as part of the legal framework, a piece of risk regulation that adds transparency requirements surrounding AI training data.
- 13:00 - 16:30: Impact on Gatekeepers The chapter 'Impact on Gatekeepers' discusses the requirements placed upon AI deployers regarding the data processes involved in the development and deployment of large language models (LLMs). It highlights two main phases in the development of LLMs: pre-training with large datasets and fine-tuning, as well as the subsequent data input from users through their prompts and reactions to the AI's responses.
- 16:30 - 20:00: Possible Solutions and Conclusions In the chapter titled 'Possible Solutions and Conclusions', the discussion revolves around the data collection practices employed by AI developers. A significant portion of these practices involves consensual data extraction and data sharing agreements. However, much of the data used in widely-used Large Language Models (LLMs) is sourced from web scraping. Web scraping is the automated process of retrieving online content, typically from publicly accessible websites. The chapter likely delves into the implications, challenges, and potential ethical considerations surrounding these data aggregation methods.
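To illustrate the scraping process described in the last chapter, here is a minimal offline sketch using only the Python standard library. The page content, the URL, the `demo-bot` user agent and the robots.txt rules are all made up and inlined so that nothing is fetched from the network. It shows how a scraper reduces a public profile page to unstructured text, and how personal details on the page end up in that text without the person's involvement.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical robots.txt rules and page, inlined so nothing is fetched.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

page_url = "https://example.com/profile/alice"
page_html = (
    "<html><body><h1>Alice</h1><script>x=1</script>"
    "<p>Lives in Madrid.</p></body></html>"
)

# A well-behaved scraper checks robots.txt before processing a URL.
if robots.can_fetch("demo-bot", page_url):
    scraped_text = extract_text(page_html)
else:
    scraped_text = ""
print(scraped_text)
```

The robots.txt check is a voluntary convention on the scraper's side. It is not a consent mechanism in the data protection sense discussed in the episode.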
#4 Regulating AI Like a Gatekeeper: Trained, Scraped and Shackled Transcription
- 00:00 - 00:30 Welcome to the fourth episode of The Binary Agora. In this series, we dive deep into the world of legal tech, breaking down its challenges, its innovations, and the real-world impact of top researchers. Today I will be addressing my latest working paper, which navigates the interplay between the scaling of LLMs through training data in the generative AI context and the regulation imposed upon gatekeepers through the European Digital Markets
- 00:30 - 01:00 Act. When thinking about AI, alarms go off in our heads when we talk about its risks. But sometimes we do not really know how well those concerns fare against economic evidence or the practical reality of how these LLMs function. Uncovering deep learning architectures, namely transformers, and applying them to existing and new technologies ignited this artificial intelligence boom. Since 2017, AI
- 01:00 - 01:30 technologies and applications have progressively become the technology of our time. In particular, generative AI adoption by companies and the major digital players ramped up during the last couple of years and streamlined the use of AI across different sectors. At this point, many scholars have voiced their concerns about the risks that AI poses for fundamental rights at
- 01:30 - 02:00 large, especially relating to the rights to the protection of personal data and to privacy. As a matter of fact, in the particular case of training data and what data is input into the training and fine-tuning stages of AI models, we cannot really assert that their deployment and development entail a direct breach of data protection and privacy laws and principles, simply because we do not know
- 02:00 - 02:30 and we do not have enough knowledge to determine what data sets go into those LLMs. Nonetheless, training data and the way in which it can be extracted from the internet have posed several challenges from the economic but also from the legal perspective. For instance, data protection regulators are considering how AI training data reconciles with privacy laws or with cybersecurity
- 02:30 - 03:00 requirements. Similarly, the origin of training data has also been put into question, especially when it conflicts with IP and copyright rules, as we now know from renowned cases such as NY Times v. OpenAI in the US. And finally, the legal framework surrounding AI training data is also rounded out by risk regulation such as the AI Act, which includes a few transparency
- 03:00 - 03:30 requirements imposed upon AI deployers. From the practical viewpoint, training data originates in two types of processes involved in an LLM's development and deployment. First, large data sets are input into the pre-training and the fine-tuning phases of the development. And second, the LLM also feeds off the data input by users via their prompts and also via their reactions to the answers displayed
- 03:30 - 04:00 by the LLM. And although AI deployers collect training data via consensual extraction and also via data sharing, billions of tokens built into popular LLMs derive from web scraping. Scraping refers to the retrieval of content available online through automated tools. Normally, scraping takes place over publicly accessible websites
- 04:00 - 04:30 including, for instance, social media profiles. Technically, paywalled websites can also be subject to scraping. And as a matter of fact, most AI deployers now use Common Crawl, the largest freely available collection of scraped data, with more than 9.5 petabytes of data dating back to 2008, as a baseline for training their models on a variety of filtered versions of it.
- 04:30 - 05:00 On the other side, the Large-scale Artificial Intelligence Open Network, that is, LAION, also operates as a database of copyrighted images used to develop multimodal generative AI. On top of that, inference attack methods have also been applied to existing LLMs and demonstrated that they were trained on paywalled websites, copyrighted content and books also
- 05:00 - 05:30 scraped from the web. Preliminarily, we can therefore establish that if training data is mainly based on scraped data, some personal data may also be contained in those unstructured data sets used for training the AI model. For example, Meta recently disclosed its intention to scrape its social networks to train its AI systems, including posts, photos, and also
- 05:30 - 06:00 content uploaded by users. In this sense, the embedding of scraped data into LLMs at both the pre-training and the fine-tuning stages brings risks meriting regulatory scrutiny on two different fronts. First, the impact on the rights to privacy and the protection of personal data caused by scraping and processing personal data as a means to power LLMs. Users' expectations over the protection
- 06:00 - 06:30 of their personal data, even if they have published bits and pieces of it online, do not align with an indiscriminate scraping of the web happening without their prior knowledge. And given that users are not given a chance to consent to the scraping of their data, it is also at odds with the functioning of the notice and consent mechanisms implemented by privacy regulation. And additionally, the scraping of unprecedented volumes of online data entails direct and material
- 06:30 - 07:00 harm to consumers, since LLMs face a risk of memorizing their data when it is input for training and fine-tuning, and they can then leak its content verbatim upon a user prompt. On the right-hand side of the slide, I have listed some of the immaterial and material harms that personal data embedded in AI models may also provoke in terms of the data protection framework. In trying to cope with all of
- 07:00 - 07:30 these challenges, the European Data Protection Board (EDPB) decided to classify AI models into two distinct groups. First, those which can be classified as anonymous because they are not designed to provide training data, so that the likelihood of data extraction upon their deployment is insignificant; and second, those which cannot be classified as anonymous under the data protection parameters. Those lying within the first group do not face
- 07:30 - 08:00 the imposition of the legal requirements under data protection regulations, whereas those which cannot be classified as anonymous remain captured by these frameworks. For those cases where AI deployers do not surpass the anonymity threshold, the GDPR and other regulations compel data controllers to process personal data based on a legal basis, in this case Article 6(1) of the GDPR. Other data protection frameworks
- 08:00 - 08:30 also require such justification in other jurisdictions, such as in China or in Canada. From the long list of legal bases, data protection authorities have proposed two of them as the most feasible alternatives to justify the processing of personal data in this context. On one side, an affirmative and specific action by the data subject, that is, consent, could justify such activities. Yet obtaining consent in
- 08:30 - 09:00 third-party contexts, such as those scenarios arising from scraping content on social media, is not feasible in practice due to the enormous volume of data processed and its diverse origins. For instance, OpenAI did not even contemplate including consent as a valid legal basis for processing data because it was practically impossible to obtain consent from every data subject whose personal data might be scraped. This option
- 09:00 - 09:30 therefore seems to be barred from being used by AI deployers in this context. On the other hand, most AI deployers state that they process personal data in their models based on their legitimate interests. Legitimate interests designate the broader impacts and benefits the controller or a third party enjoys when engaging in the processing of personal data. In this case, the AI deployer's
- 09:30 - 10:00 capacity to extract value from publicly and readily available information online. OpenAI's ChatGPT, Meta's Llama and Alphabet's Gemini justify the processing of personal data in this particular fashion. However, the legal avenue grows slimmer for some economic operators captured by the DMA. The regulation applies to seven economic operators that have been designated by the European Commission: Alphabet, Apple, Amazon,
- 10:00 - 10:30 ByteDance, Booking.com, Meta, and Microsoft. Thus, only these participants in the market will be subject to the limitations spelled out for AI deployers. Building on the experience of cases surrounding Meta's processing activities, the DMA introduces the prohibition embedded in Article 5(2) of the DMA. The provision compels the economic agents designated by the European Commission not to process,
- 10:30 - 11:00 combine or cross-use personal data from their services into their other services, either first party or third party. In particular, processing of personal data using services of third parties is only barred for those cases where it is performed to provide online advertising services. By imposing this prohibition, the DMA therefore seeks to end the barriers to entry raised by the data accumulation capacity
- 11:00 - 11:30 of these incumbent digital platforms. The regulation exempts the prohibition in those cases where the end users consent to the processing and the combination of personal data. And on top of that, Article 5(2) also provides an additional caveat to the whole prohibition. The prohibition is also without prejudice to the gatekeeper processing personal data relying on the legal bases set out in Article 6(1) of the
- 11:30 - 12:00 GDPR. However, the DMA explicitly details that gatekeepers cannot rely on the legitimate interest legal basis. If we apply the prohibition without bearing in mind the wider interplay with the legal bases of consent and legitimate interest, the result would be as shown now on the slide. For instance, Google's innovations in developing and implementing AI Overviews in its Google
- 12:00 - 12:30 Search functionality, powered by its foundation model Gemini, would be barred from combining and cross-using personal data with the rest of its services, namely its ad services, YouTube or Chrome. And in a similar vein, ByteDance's proprietary foundation model Seed-Thinking, to the extent that it may be incorporated into its CPS TikTok, would have no access to the data generated in first-party contexts such as TikTok Ads, News Republic or Helo
- 12:30 - 13:00 services, even if some of them are only available in jurisdictions other than the EU. And within the slide you also have some more examples of the concerned gatekeepers and the potential implications and impacts of the application of Article 5(2) of the DMA. And as if such a consequence was not dire enough, the DMA complicates the issue even more. Recital 36 establishes that it excludes the
- 13:00 - 13:30 legitimate interest legal basis for the processing of personal data. The first question that comes to mind, therefore, is whether the exclusion of the legitimate interest legal basis comprises all types of processing of personal data by gatekeepers as a whole, and also how the legal bases interlace with the gatekeeper's collection of consent for exempting the prohibition. In my own mind, there are two alternative readings that one can draw out from Article 5(2) of the DMA.
- 13:30 - 14:00 The first one points to the fact that consent is the one and only legal basis for exempting the gatekeeper's prohibition of combining and cross-using personal data across services. In the absence of consent, therefore, gatekeepers will only be capable of processing personal data based on the rest of the legal bases listed under the DMA, therefore not including legitimate interest, but only outside of the prohibition's scope of application. And in
- 14:00 - 14:30 turn, the second scenario presents an alternative view. Consent should be interpreted as the preferable legal basis for the gatekeeper to seek the prohibition's exemption. But if consent is not feasible, then other legal bases are readily available to exempt the prohibition, again not including legitimate interest. So on one hand, the combination and the cross-use of personal data across CPSs can only take place upon the condition of
- 14:30 - 15:00 consent. So such processing is not feasible in the context of AI training data. On the other hand, those same processing activities may take place either via consent or via further justification under the GDPR. And still, in the context of AI training data, this second scenario results in the fact that the gatekeepers cannot really exempt the prohibition by justifying the value that they generate in the market by offering
- 15:00 - 15:30 these AI services. And if we take a step further, recital 36 seems even more complex to apply in the AI context. Aside from the fact that consent vis-à-vis the rest of the available legal bases may operate as exclusive or subsidiary conditions to exempting the prohibition, there is a broader impact of the provision and its interplay with LLM training and fine-tuning. As you'll remember, recital 36 highlights that the
- 15:30 - 16:00 prohibition applies without prejudice to the gatekeeper's capacity to process personal data based on the legal bases under Article 6(1) of the GDPR, except the gatekeeper's legitimate interest. The interpretation of the clause "processing personal data" therefore plays a significant role in narrowing down further the possibilities of gatekeepers joining or carrying on within this AI race. The first
- 16:00 - 16:30 possibility is that processing of personal data is understood to be equivalent to the concept under Article 4(2) of the GDPR. By this token, gatekeepers would be barred from processing personal data based on the legitimate interest legal basis for all of their processing activities, not only those concerning the prohibition's scope of application. LLM development and deployment would therefore be completely out of the question for any designated
- 16:30 - 17:00 gatekeeper. The second interpretation to be drawn out is that this processing of personal data refers back to those scenarios where the prohibition does not apply, that is, all other processing activities performed by the gatekeepers within their CPSs and their services. The narrowing down of the GDPR legal bases therefore entails that the gatekeepers cannot process personal data within the CPS context
- 17:00 - 17:30 with all of the available legal bases under the data protection framework. And finally, the most inconsequential possibility, bearing in mind the stakes of the game, would be to interpret processing of personal data in the sense of Article 5(2)(a) of the DMA. The limitation of these legal bases would only touch upon the processing of personal data for the purposes of delivering advertising, and as such not all of their processing activities would
- 17:30 - 18:00 remain restricted in principle by the DMA. In the face of all these challenges, there are also some potential solutions to address this challenging interplay between the training of AI foundation models and the DMA. In practice, AI deployers categorized as regulated targets under the DMA would have no real possibility of developing their own foundation models and applying them to downstream applications in the
- 18:00 - 18:30 EU as they do in other jurisdictions. In turn, the option open to them would be to outsource their generative AI reliant functionalities. If they are not in charge of the AI model, and therefore of its decision making, and cannot be categorized as data controllers in the sense of the GDPR, then Article 5(2) of the DMA would have no bearing over them. Apple's choice to integrate OpenAI's ChatGPT and Alphabet's Gemini into its Apple Intelligence feature on
- 18:30 - 19:00 its operating systems iOS and iPadOS is also a good example of such a detachment from liability. By far, this would be the most harrowing consequence of the DMA's application on generative AI models developed abroad. And alternatively, AI deployers could also simply exclude every single EU user from any kind of processing surrounding the model, be that in training, in fine-tuning or in deployment, which would also complicate development but
- 19:00 - 19:30 would also de facto remove the effects of the prohibition. As a matter of fact, Microsoft's compliance report relating to the integration of LLMs into its LinkedIn functionality remarked that it did just that to dodge the tension between these two fields of action. And in the same sense, gatekeepers could also start by siloing their approach to developing and deploying LLMs by taking into account CPS categories and
- 19:30 - 20:00 artificially segmenting the training data and the unstructured data sets necessary for pre-training and also for fine-tuning. All of these potential solutions, however, entail that an inverse Brussels effect takes place, to the extent that EU-based outputs in AI models would run scarce and the gatekeepers' incentives to integrate AI reliant functionalities would diminish substantially. In any case, the
- 20:00 - 20:30 tension I explored throughout the video demonstrates the need for guidance and cooperation coming from the European Commission towards GPA to try to square this circle of gatekeeper development and deployment of AI technologies and functionalities. And that's a wrap for this episode of The Binary Agora. If you'd like to learn more about the topic, check out my working paper, which is also included in the description below. And
- 20:30 - 21:00 if you have any thoughts or feedback, drop a comment below.