Summary
In this episode of the Whitepaper Companion Podcast by Kaggle, the hosts dive deep into the fascinating world of embeddings and vector stores. They explain that embeddings serve as a cheat sheet for AI, unlocking new efficiencies and insights by enabling machines to understand the meaning of text, images, and various data forms. Focusing on a recent white paper by Naria Ren and Sugnet, the discussion explores how embeddings map data into simpler numerical representations, allowing machines to process and compare vast amounts of data efficiently. Highlighted applications include retrieval, recommendations, and the groundbreaking concept of joint embeddings, which integrate multimodal data. The detailed process of creating, using, and evaluating embeddings is discussed, offering insights into their profound impact on AI and machine learning projects.
Highlights
Embeddings serve as AI's cheat sheet for understanding various data forms. 🤓
Like geographic coordinates, embeddings map complex data into a simpler space without losing meaning. 🌍
They're essential in processing data like text, audio, and images seamlessly. 🖼️
Joint embeddings mix text and image data, broadening analytical possibilities.
The podcast delves into Google's advancements in embedding models.
Metrics such as precision and recall help evaluate the usefulness of embeddings.
Key Takeaways
Embeddings are game-changers in data analysis, acting like cheat sheets for AI. 🤓
They transform complex data into simple numeric forms, making processing efficient. 🚀
The podcast emphasizes exploring techniques from a pivotal white paper by experts Naria Ren and Sugnet.
Listeners learn the exciting applications of embeddings in retrieval, recommendations, and more.
Joint embeddings break data barriers by integrating text and images seamlessly.
The show highlights the significance of using advanced vector databases for efficient data handling.
Key metrics like precision and recall are crucial in evaluating embeddings' effectiveness.
Overview
The world of embeddings is transforming AI, offering a revolutionary way for machines to comprehend text, images, and more complex data. This episode breaks down these concepts with clarity, drawing on a recent whitepaper to showcase how embeddings have become the backbone of modern data processing and analysis.
Embeddings convert raw data like texts and images into vectors, allowing vast amounts of information to be processed faster and more efficiently. Discussions include how these vectors maintain semantic meaning, enabling automated systems like search engines to perform better and recommendation systems to be more accurate and nuanced.
Joint embeddings integrate different data types into a single cohesive vector space, opening new possibilities for AI development. By comparing these embeddings, systems can improve accuracy and relevance, paving the way for innovative AI applications such as advanced search engines and intelligent recommendation algorithms.
Chapters
00:00 - 00:30: Introduction to Embeddings and Vector Stores This chapter introduces the concept of embeddings and vector stores, highlighting their importance in AI and machine learning. It explains how embeddings function as a 'cheat sheet' for machines to interpret text, images, and various forms of data, thus enhancing their efficiency and understanding. With the advent of large language models, embeddings are becoming a critical component for unlocking deeper insights and performance capabilities, effectively serving as the 'secret sauce' in the development of advanced AI technologies.
00:30 - 01:00: Understanding Embeddings The chapter "Understanding Embeddings" aims to demystify the concept of embeddings, exploring various techniques and applications. It uses the newly released white paper "Embeddings and Vector Stores" by Naria Ren and Sugnet as a foundational resource. The chapter intends to explain how embeddings help in managing large datasets and how they can be utilized to create innovative projects.
01:00 - 02:00: Basics of Embeddings The chapter introduces the concept of embeddings, explaining that they are low-dimensional numerical representations designed to capture the underlying meaning and relationships of different data types. The transcript mentions a white paper that breaks down these concepts comprehensively and acknowledges the contributions of many individuals to the work.
02:00 - 03:00: Applications of Embeddings The chapter discusses the concept of embeddings, describing them as compact vectors that retain essential semantic information, making it easier to process and compare large datasets. The analogy of latitude and longitude is used to explain how embeddings map complex data into a simpler space without losing meaning. The chapter also explores the significance of embeddings, particularly for a technically inclined audience.
03:00 - 04:00: Joint Embeddings for Multimodal Data The chapter discusses the power and efficiency of using joint embeddings for multimodal data, which includes text, audio, images, video, and structured data as found in databases. It emphasizes the ability of embeddings to not only save storage space but also to unveil patterns and relationships that might otherwise be elusive. The concept of semantic relationships is introduced, describing how embeddings can capture the similarity or difference between two data pieces, as explained in a white paper.
04:00 - 05:00: Measuring Embedding Effectiveness The chapter discusses the concept of embedding spaces, where words with similar meanings are positioned closer together, such as 'king' and 'queen', in contrast to unrelated words like 'king' and 'bicycle'. This highlights the transformation from raw data to meaningful numerical representations. Furthermore, these embeddings are lower dimensional, allowing for more efficient large-scale processing and storage. The chapter also references a white paper that discusses key applications of these embeddings.
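To make the "closer in the embedding space means more similar in meaning" idea concrete, here is a minimal sketch with hand-made toy vectors (not real model outputs); a real system would compare vectors produced by a trained embedding model, typically with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, hand-crafted 4-dimensional vectors purely to illustrate the geometry.
embeddings = {
    "king":    np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":   np.array([0.8, 0.9, 0.1, 0.1]),
    "bicycle": np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))    # high (~0.99)
print(cosine_similarity(embeddings["king"], embeddings["bicycle"]))  # low  (~0.12)
```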
05:00 - 06:00: Training and Evaluating Embedding Models This chapter discusses the use of embeddings in the context of retrieval and recommendation systems, similar to how Google search operates. The process involves pre-computing embeddings for a vast number of web pages, and when a search query is received, it's converted into an embedding to find relevant web pages.
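A minimal sketch of that retrieval loop is shown below. The `embed()` function here is a hypothetical stand-in (a deterministic random projection), so the scores are meaningless; the point is the shape of the pipeline: embed documents once offline, embed each query at search time, then rank by similarity.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: one deterministic unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

# 1. Pre-compute embeddings for the corpus (done offline, once).
documents = ["how to bake bread", "intro to neural networks", "best hiking trails"]
doc_matrix = np.stack([embed(d) for d in documents])

# 2. At query time, embed the query into the same vector space.
query_vec = embed("beginner guide to deep learning")

# 3. Rank documents by similarity (dot product == cosine, since all vectors are unit-norm).
scores = doc_matrix @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:+.3f}  {documents[idx]}")
```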
06:00 - 07:00: Different Types of Embeddings This chapter discusses the concept of embeddings and their application in search and recommendation systems. It explains that in vector space, embeddings that are close to each other indicate semantic similarity. This principle is used to retrieve search results and power recommendation systems by identifying items or content similar to what a user has interacted with. The chapter concludes by introducing the white paper's concept of joint embeddings, which the next chapter covers in detail.
07:00 - 08:00: Text Embeddings This chapter discusses the concept of joint embeddings, emphasizing its capability to handle multimodal data, which involves integrating different types of data like text and images. The key advantage of joint embeddings is in applications like searching for videos using text descriptions, where both text and images are mapped into the same embedding space, facilitating comparison and matching across different modalities. This approach effectively breaks down the barriers between various data types.
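A rough sketch of text-to-image search with joint embeddings follows, assuming hypothetical `embed_text` and `embed_image` encoders that map both modalities into the same shared space (they are random stand-ins here so the example runs; a real system would call a joint text/image model).

```python
import numpy as np

def _stand_in_vector(key: str, dim: int = 32) -> np.ndarray:
    """Deterministic stand-in vector; a real pipeline would call a joint text/image model."""
    rng = np.random.default_rng(abs(hash(key)) % (2**32))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

# Hypothetical encoders: the crucial property is that text and images land in the SAME space.
def embed_text(text: str) -> np.ndarray:
    return _stand_in_vector("text:" + text)

def embed_image(image_path: str) -> np.ndarray:
    return _stand_in_vector("image:" + image_path)

def search_images_by_text(query: str, image_paths: list[str], top_k: int = 3):
    """Text-to-image retrieval: rank images by cosine similarity to the text query."""
    q = embed_text(query)
    scored = sorted(((float(q @ embed_image(p)), p) for p in image_paths), reverse=True)
    return scored[:top_k]

print(search_images_by_text("a dog catching a frisbee",
                            ["park.jpg", "beach.jpg", "office.jpg"]))
```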
08:00 - 09:00: Document Embeddings The chapter 'Document Embeddings' discusses the effectiveness of embeddings in various tasks. The main focus is on how to evaluate their performance, which depends on the specific task at hand. In general, the goal is to assess how well embeddings help in retrieving relevant items and filtering out irrelevant ones, akin to separating the signal from the noise. For search tasks, classic metrics such as precision and recall are often used to measure their effectiveness.
09:00 - 10:00: Improvements in Embedding Models The chapter titled 'Improvements in Embedding Models' discusses key metrics used to evaluate the performance of embedding models. Precision and recall, two fundamental metrics, are defined and explained in terms of their role in reflecting the success of the model in hitting the target (precision) and covering all relevant items (recall). The white paper further delves into variations like 'Precision at K' and 'Recall at K', which assess the accuracy of the top K predictions. These metrics are essential for understanding the effectiveness of a model, particularly in ensuring that the top predictions are accurate and inclusive of all pertinent options.
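These two metrics are easy to compute by hand; the small example below uses the standard definitions (it is illustrative, not code from the white paper).

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are actually relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked list returned by the retriever
relevant = {"d1", "d3", "d8"}                # ground-truth relevant documents

print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.40
print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ≈ 0.67
```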
10:00 - 11:00: Creating Image and Multimodal Embeddings The chapter discusses the importance of ranking in search results, using Google as an example. It introduces normalized discounted cumulative gain (nDCG), a metric used to evaluate the effectiveness of a ranking algorithm. The main idea of nDCG is that higher scores are awarded when the most relevant items appear at the top of the results list rather than on subsequent pages, emphasizing the value of surfacing relevant information early.
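For rank-aware evaluation, nDCG can be computed as DCG divided by the DCG of the ideal ordering; the sketch below uses the common log2 discount and works for graded or binary relevance labels.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: each result's relevance is discounted by its rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the best possible ordering, so 1.0 means a perfect ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of the results in the order the system returned them (3 = highly relevant).
print(ndcg([3, 0, 2, 0, 1]))   # ~0.92: relevant items are not all at the top
print(ndcg([3, 2, 1, 0, 0]))   # 1.00: perfect ordering
```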
11:00 - 12:00: Graph and Structured Data Embeddings The chapter discusses the importance of benchmarks for evaluating and comparing embedding models. Standardized collections of datasets and tasks, such as those provided by benchmarks like BEIR and MTEB, allow fair and consistent evaluation and avoid the problem of 'comparing apples and oranges' by providing a level playing field. The chapter further suggests using established evaluation libraries such as trec_eval or pytrec_eval to conduct the evaluations.
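Assuming pytrec_eval is the library in question (one common thin wrapper around trec_eval), a standardized evaluation over relevance judgments and a system run looks roughly like this:

```python
# pip install pytrec_eval
import pytrec_eval

# Ground-truth relevance judgments (qrels): query id -> {doc id: relevance grade}
qrels = {"q1": {"d1": 1, "d2": 0, "d3": 2}}

# System output (run): query id -> {doc id: retrieval score}
run = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.3}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg"})
print(evaluator.evaluate(run))  # per-query metrics, e.g. {'q1': {'map': ..., 'ndcg': ...}}
```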
12:00 - 13:00: Training Process of Embedding Models The chapter discusses the importance of a standardized measurement process for embedding models to ensure reproducibility and minimize errors.
13:00 - 14:00: Vector Search Techniques The chapter discusses the importance of balancing factors such as speed and accuracy when generating embeddings, especially for real-time applications. It introduces Retrieval Augmented Generation (RAG), a technique that uses embeddings to enhance language models, highlighted as an especially hot topic for Kagglers.
14:00 - 15:00: Efficient Vector Search Algorithms The chapter 'Efficient Vector Search Algorithms' details a two-stage process for enhancing language models through the use of a knowledge base. It begins with creating an index by breaking documents into chunks and generating embeddings using a document encoder. These embeddings are stored in a vector database, forming a searchable library to improve the accuracy and relevance of responses from language models.
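A compact sketch of that indexing stage, using FAISS (which the episode mentions later) as the vector index and a random stand-in encoder in place of a real document embedding model:

```python
# pip install faiss-cpu numpy
import faiss
import numpy as np

DIM = 64

def encode(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in encoder; a real pipeline would call an embedding model here."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), DIM)).astype("float32")
    faiss.normalize_L2(vecs)              # unit-norm so inner product == cosine similarity
    return vecs

# 1. Chunk the source documents (here, trivially: one chunk per document).
chunks = [
    "Embeddings map text to dense vectors.",
    "Vector databases support similarity search at scale.",
    "RAG augments prompts with retrieved context.",
]

# 2. Embed every chunk with the document encoder.
chunk_vectors = encode(chunks)

# 3. Store the vectors in a vector index.
index = faiss.IndexFlatIP(DIM)
index.add(chunk_vectors)
print(index.ntotal, "chunks indexed")
```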
15:00 - 16:00: Utilizing Vector Databases This chapter discusses the use of vector databases in handling query requests. It explains how numerical embeddings are created and utilized to process user queries. Upon receiving a question from a user, the system converts it into an embedding using a query encoder. A similarity search is then conducted within the vector database to find and retrieve chunks of data whose embeddings closely match the query embedding. This process helps in efficiently finding relevant information, akin to locating a needle in a haystack.
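And the query stage, continuing the indexing sketch above (it reuses the `encode`, `chunks`, and `index` objects from that sketch): embed the question with the query encoder, run a similarity search, and splice the retrieved chunks into the prompt.

```python
# Continues the indexing sketch above: `encode`, `chunks`, and `index` are reused.

def retrieve(question: str, k: int = 2) -> list[str]:
    """Query stage of RAG: embed the question, fetch the k nearest chunks, return their text."""
    query_vec = encode([question])             # query encoder (same stand-in as above)
    _scores, ids = index.search(query_vec, k)  # similarity search in the vector index
    return [chunks[i] for i in ids[0]]

question = "How does RAG use a vector database?"
context = retrieve(question)

# The retrieved chunks are prepended to the prompt that goes to the language model.
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```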
16:00 - 17:00: Operational Considerations for Embeddings The chapter discusses the significance of embedding models in improving the speed and relevance of query responses. It emphasizes the importance of efficient vector databases for searching by meaning rather than by keywords. Significant progress in embedding models is highlighted, specifically the improvement in Google's embeddings, whose average BEIR score jumped from 10.6 to 55.7. The key takeaway is to design systems so that newer, better embedding models can be swapped in easily.
17:00 - 18:00: Applications of Embeddings and Vector Stores The chapter explores the applications of embeddings and vector stores, highlighting the benefits of upgrading to newer embedding models. It emphasizes the importance of having evaluation pipelines in place to confirm that upgrades actually improve results. The chapter references a code snippet in the white paper demonstrating a practical implementation on the NFCorpus dataset, showing how to embed questions and documents with Google Vertex AI APIs and how to use the FAISS library for efficient similarity search.
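The white paper's snippet itself isn't reproduced here, but a call to the Vertex AI text-embeddings API typically looks something like the sketch below; the project ID is a placeholder and the model name may differ from the one used in the paper, so check the current Vertex AI documentation.

```python
# pip install google-cloud-aiplatform
# Illustrative only -- project ID is a placeholder and model names change over time.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")

model = TextEmbeddingModel.from_pretrained("text-embedding-004")
embeddings = model.get_embeddings([
    "What foods are high in vitamin C?",      # a question
    "Citrus fruits are rich in vitamin C.",   # a candidate document
])

vectors = [e.values for e in embeddings]      # each is a plain list of floats
print(len(vectors), len(vectors[0]))
```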
18:00 - 19:00: Conclusion and Future Directions The chapter discusses the evaluation of retrieval quality in similarity searches using various metrics, emphasizing the importance of hands-on demonstrations for understanding. It highlights the challenge of needing labeled data to train embedding models, which can be costly and time-consuming. The chapter also mentions emerging approaches to overcome this challenge, such as the synthetic data generation used in the development of Google's Gecko embedding model.
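The exact recipe behind Gecko isn't reproduced here, but the general idea of LLM-generated training pairs can be sketched as follows; `llm_generate` is a hypothetical placeholder for whatever LLM API you use.

```python
def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; returns a canned answer so the sketch runs."""
    return "What are the health benefits of green tea?"

def make_synthetic_pair(document: str) -> tuple[str, str]:
    """Ask an LLM to invent a search query that the given passage answers well."""
    prompt = (
        "Write one search query that the following passage would answer well.\n\n"
        f"Passage: {document}\nQuery:"
    )
    query = llm_generate(prompt).strip()
    return query, document   # (query, positive passage) pair for contrastive fine-tuning

passage = "Green tea contains antioxidants called catechins, which may reduce inflammation."
print(make_synthetic_pair(passage))
```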
Transcript
00:00 - 00:30 all right welcome to the Deep dive today we're going deep on embeddings and Vector stores oo embeddings yeah so for our kaggle audience out there think of it like this embeddings are like giving machines a cheat sheet for understanding the meaning of text images all kinds of data it's like unlocking a whole new level of efficiency and insight yeah it's like speaking the language of AI right and you know with the rise of these large language models that are getting more and more powerful embeddings are kind of becoming the secret sauce yeah exactly they're
00:30 - 01:00 helping us make sense of this massive amount of data that we're dealing with now it's pretty exciting so our mission today is to really unpack how these embeddings work explore different techniques and then see how you can use them to build really cool Innovative projects so for this deep dive we're going to be using this excellent white paper it's called Uh embeddings and Vector stores and it was released let me see just last month in February 2025 and it's by naria Ren and sugnet a great team yeah and a bunch of other
01:00 - 01:30 contributors I'm not going to try to list them all here but huge shout out to everyone who worked on this yeah there's a big list there yeah really good stuff okay so let's start with the basics the white paper does a fantastic job of breaking this down essentially embeddings are like these low dimensional numerical representations of well almost anything and they're designed to capture the underlying meaning and relationships between different pieces of data so instead of dealing with uh like raw text or pixels
01:30 - 02:00 you're working with these compact vectors that preserve the essential semantic information right and that makes it way easier to process and compare huge amounts of data exactly the white paper actually uses this really cool analogy of latitude and longitude just like you can pinpoint any location on Earth with those two numbers embeddings kind of map complex data into a simpler space without losing its meaning yeah makes sense so all right let's get into why this matters why are embeddings so important especially for our technically minded audience well for
02:00 - 02:30 starters they're incredibly efficient you can represent all sorts of data text audio images video even structured data like you'd find in a database and it's not just about you know saving storage space it's about making it possible to find patterns and relationships that would be really hard to spot otherwise yeah and that's where the idea of semantic relationships comes in the white paper explains that embeddings can actually capture how similar or different two pieces of data are
02:30 - 03:00 so for example in an embedding space the word king would be closer to Queen than it would be to say bicycle right it's all about capturing that underlying meaning which is pretty amazing when you think about it right so you're going from this messy world of raw data to this elegant numerical representation that actually encodes meaning it's super cool yeah and because those embeddings are lower dimensional than the original data they let you do large scale processing and storage much more efficiently okay so the white paper highlights a couple of key applications
03:00 - 03:30 where embeddings really shine retrieval and recommendations what's the basic idea there well think about how Google search works it's a massive retrieval system right you type in a query and it finds web pages that are relevant to what you're searching for with embeddings the process goes something like this first you pre-compute embeddings for billions of web pages then when someone enters a search query you convert that query into an embedding too and finally you find the web pages
03:30 - 04:00 whose embeddings are the nearest neighbors to the query embedding in that Vector space and that's basically how you get your search results so the closer the embeddings are in that space the more semantically similar they are yeah exactly and recommendation systems work on a similar principle you find items or content that have similar embeddings to what a user has liked or interacted with in the past so it's all about finding those patterns and making predictions based on similarity in this embedding space now the white paper also mentions this concept of joint
04:00 - 04:30 embeddings which sounds even more powerful can you explain that yeah joint embeddings are all about handling multimodal data where you have different types of data like text and images combined so imagine you want to search for videos based on a text description joint embeddings allow you to map both the text and the images into the same embedding space making it possible to compare them and find matches across those different modalities so it's like it's breaking down the barriers between different types of data allowing for
04:30 - 05:00 more nuanced understanding yeah exactly that's a really exciting area so okay we've talked about how cool embeddings are but how do we know if they're actually doing a good job how do we measure their effectiveness well it depends on the task but in general we want to see how well they help us retrieve relevant items and filter out irrelevant ones it's kind of like separating the signal from the noise and for search tasks we often rely on these classic metrics like precision and recall right right Precision is all about how many of the retrieved items
05:00 - 05:30 are actually relevant it's like are you hitting the Target and recall tells you what proportion of all the relevant items you actually managed to find you know are you casting a wide enough net right are you getting all the good stuff yeah and the white paper also mentions Precision at K and recall at K these are variations that Focus specifically on the top K results so for example Precision at five tells you how many of your top five predictions were correct which is often what we care about in
05:30 - 06:00 practice right because nobody really goes past the first page of Google results right exactly now for cases where the order of the results matters you might use something like normalized discounted cumulative gain or nDCG okay that sounds complicated it's not that bad really the main idea is that it gives higher scores when the most relevant items are at the very top of the results list so finding a highly relevant document on page two isn't as good as finding it on page one yeah that makes sense so a higher nDCG means you're doing a better job of ranking
06:00 - 06:30 things in a way that's actually useful to the user now the white paper also mentions these benchmarks like BEIR and MTEB what are those all about well those are like standardized collections of data sets and tasks that allow you to evaluate and compare different embedding models in a fair and consistent way it's like having a Level Playing Field so it helps you avoid you know comparing apples and oranges exactly and the white paper suggests using these established libraries like trec_eval or pytrec_eval to do the actual evaluation
06:30 - 07:00 this makes the process more reproducible and less prone to errors right it's all about having a standardized way to measure things but it's not just about those you know standard metrics is it there are other practical considerations when you're choosing an embedding model oh yeah definitely the white paper lists a few key factors like model size embedding dimensionality latency and cost so for example a really accurate model might be great in theory but if it's so large that it's impossible to deploy in a real world setting it's not going
07:00 - 07:30 to be very useful right and if it takes forever to generate those embeddings well that's not going to work for real-time applications exactly you have to balance all of these factors to find the sweet spot for your particular application okay let's get a little more concrete the white paper dives into this technique called retrieval augmented generation or RAG which is all about using embeddings to make language models even smarter for our Kagglers this is a really hot topic right now can you break down how it works sure the basic idea is that you use embeddings to find
07:30 - 08:00 relevant information from a knowledge base and then use that information to kind of boost the prompts that you give to your language model this helps the language model generate more accurate and relevant responses so the white paper outlines this two-stage process first you create an index this involves breaking down your documents into chunks generating embeddings for each chunk using a document encoder and then storing those embeddings in a vector database so it's like you're creating the searchable library of knowledge
08:00 - 08:30 represented by these numerical embeddings exactly then when a user asks a question you go into the query processing stage you take the user's question turn it into an embedding using a query encoder and then perform a similarity search in the vector database to find the chunks whose embeddings are closest to the query embedding those are the chunks that are most likely to contain the relevant information okay so it's like it's finding the needle in the haystack of your knowledge base right and because language models need
08:30 - 09:00 this information quickly to generate their responses the speed of that query phase is super important that's where these efficient Vector databases really shine so you're not just searching for keywords you're searching for meaning in that embedding space right and the white paper points out how much progress has been made in embedding models particularly Google's embeddings their average BEIR score has jumped from 10.6 to 55.7 that's huge wow that's a massive Improvement so the takeaway here is to build your systems in a way that
09:00 - 09:30 allows you to easily upgrade to newer better embedding models as they become available yeah and to have good evaluation pipelines in place so you can make sure that those upgrades actually lead to better results the white paper even includes a code snippet snippet 1 that walks through a practical example using the NFCorpus data set oh cool so it shows how to you know actually implement this stuff exactly it shows how to embed questions and documents using Google Vertex AI APIs how to use the FAISS library for efficient
09:30 - 10:00 similarity search how to evaluate the retrieval quality using pytrec_eval with all those metrics we talked about it's a really good Hands-On demonstration that's super helpful now one challenge that the white paper mentions is the need for labeled data to train these embedding models that can be a real bottleneck right oh absolutely getting high-quality labeled data can be expensive and time-consuming but there are some interesting approaches emerging to address this for example the white paper talks about how in the development of Google's Gecko embedding model they
10:00 - 10:30 used large language models to generate synthetic question and document pairs for training wow so it's like AI helping to train better AI yeah pretty much and this can really help scale up training data and potentially lead to even better performance okay let's shift gears a bit and talk about the different types of embeddings out there the white paper explains that the main goal is always to create these low dimensional representations that capture the essential information and it categorizes embeddings based on the type of data you're dealing with so let's start with
10:30 - 11:00 text embeddings which are kind of the foundation for a lot of NLP applications right text embeddings allow you to represent words sentences paragraphs even entire documents as these dense numerical vectors which is what makes them so powerful for all sorts of NLP tasks and the white paper divides them into token or word embeddings and document embeddings but before we dive into those it kind of walks through this life cycle of how text gets converted into embeddings starting with tokenization right tokenization is all
11:00 - 11:30 about breaking down a text string into these smaller meaningful units called tokens this could be splitting by words or subword units like word pieces or even characters then each unique token in your entire data set gets assigned a unique numerical ID so you're basically replacing all these words and punctuation marks with numbers yeah and the white paper also briefly mentions one hot encoding as a way to represent these token IDs as binary vectors there's a TensorFlow example in snippet
11:30 - 12:00 2 that illustrates this but as the white paper points out those raw integer IDs and one hot encoded vectors don't really capture the semantic relationships between words that's where embeddings come in exactly so let's start with word embeddings okay so what are word embeddings well they're basically these dense fixed length vectors that aim to capture the meaning of individual words you know how those words relate to each other semantically and the white paper talks about these early breakthroughs in this area like GloVe SWIVEL and the super famous word2vec
12:00 - 12:30 ah word2vec I remember when that came out as a big deal yeah it really was and the core principle behind word2vec which is highlighted in the white paper is that you shall know a word by the company it keeps meaning the meaning of a word is largely determined by the words that tend to appear around it so it's all about context exactly and there are two main architectures for word2vec CBOW and skip gram CBOW or continuous bag
12:30 - 13:00 of words tries to predict a Target word based on the context words that surround it skip gram kind of does the opposite it uses a Target word to predict the surrounding words so which one is better it depends CBOW is usually faster to train and works well for common words but skip gram tends to be better for infrequent words and smaller data sets there's a good visual representation of this in figure five okay so you have to choose the right tool for the job exactly and the white paper also mentions fastText which is an extension to word2vec that
13:00 - 13:30 takes into account the internal structure of words at the subword level interesting so it's like it's getting even more granular with the meaning right now one limitation of word2vec is that it mainly captures these local relationships between words within a small window of context GloVe on the other hand tries to capture more Global Information about how words co-occur in the entire Corpus oh okay so how does GloVe do that well it starts by building this co-occurrence Matrix which basically counts how often each word appears in the context of every other
13:30 - 14:00 word in the entire data set then it uses a matrix factorization technique to learn word vectors that encode these global co-occurrence patterns so like it's looking at the big picture right and SWIVEL is another method that also uses a co-occurrence matrix it's really efficient for large data sets because it's designed for distributed processing the white paper actually includes a code snippet snippet 3 that shows how to load and visualize pre-trained word2vec and GloVe embeddings in 2D space using
14:00 - 14:30 gensim and t-SNE and how to find similar words based on their proximity in this space you can see this visually in figure six oh that's cool so you can actually see those semantic relationships yeah it's pretty neat Okay so we've covered word embeddings let's move on to document embeddings okay so document embeddings are all about representing the meaning of larger chunks of text right like paragraphs or whole documents exactly and the white paper traces their development from these early bag of words models to the much more sophisticated large language
14:30 - 15:00 models we have now so let's start with the bag of words models the white paper mentions LSA and LDA can you give us the like the eli5 version of those okay so LSA or latent semantic analysis uses a matrix of word counts across documents and applies dimensionality reduction techniques to kind of uncover the hidden semantic relationships between words and documents like it's looking for the underlying themes and LDA which stands for latent Dirichlet allocation takes a more probabilistic approach it models
15:00 - 15:30 each document as a mixture of topics and each word as having a certain probability of belonging to each topic so it's like each document is a recipe and the words are the ingredients yeah that's a good way to think about it and then there are TF-IDF based models which are statistical methods that weigh words based on how frequently they appear in a document compared to how often they appear in the entire Corpus BM25 is highlighted as a really strong Baseline okay so all these bag of words models they all kind of ignore the order of
15:30 - 16:00 words right they just treat each document as like a bag of words right which is obviously a limitation that's where doc2vec came in it's inspired by word2vec but extends it by adding this paragraph Vector to the model there's a nice diagram of this in the white paper this paragraph Vector learns to represent the entire document and snippet 4 shows how to train doc2vec models using gensim so it's like word2vec but for whole documents yeah pretty much now the real game changer in recent years has been these deeper pre-trained large language
16:00 - 16:30 models right like BERT and all the models that have come after it exactly the white paper emphasizes the use of deep neural networks massive amounts of training data and this idea of fine-tuning BERT which was introduced back in 2018 really set a new standard it uses this Transformer architecture and was trained on these huge data sets using techniques like masked language modeling and next sentence prediction figure eight illustrates this really well yeah and BERT produces these contextualized embeddings which means
16:30 - 17:00 the embedding for a word can actually change depending on the words around it right it's not just a static representation of the word itself it takes into account the context that's super cool yeah and the CLS token is often used as a representation of the entire input sequence okay and BERT has led to a whole family of models like Sentence-BERT SimCSE and E5 which are designed to produce really good sentence embeddings and then there are these even larger language models like T5 PaLM Gemini GPT and LLaMA so many acronyms I know
17:00 - 17:30 right but the point is that these models are getting more and more powerful and they're leading to even better embedding models like GTR and Sentence-T5 and the white paper mentions this exciting new development of embedding models powered by Google's Gemini architecture on Vertex AI which are achieving really impressive results on benchmarks wow so embeddings are like constantly evolving yeah it's a fast moving field and then there's this idea of Matryoshka embeddings which let you choose the dimensionality
17:30 - 18:00 of the embeddings based on what you need for your specific task and for dealing with documents that contain both text and images there are these multi-vector embeddings like ColBERT XTR and ColPali okay so we're way beyond bag of words now yeah we've come a long way and the white paper does a good job of contrasting these deep neural network models with the earlier approaches highlighting the need for a lot more data and computational resources but also the much greater ability to understand context and meaning snippet 4 shows how
18:00 - 18:30 to use these pre-trained document embedding models from TensorFlow Hub and Vertex AI and snippet 5 even demonstrates how to use Vertex generative AI text embeddings with LangChain and BigQuery so the tools are getting more sophisticated but they're also becoming more accessible exactly and all this applies to images too right oh right image embeddings so how do those work well you can get image embeddings by training convolutional neural networks or CNNs or vision transformers on large image data sets the
18:30 - 19:00 activations from one of the later layers in these models often serve as good image embeddings they capture the Learned features of the images so it's like the model is learning to see the important parts of the image right and then there are multimodal embeddings which combine these image embeddings with other types of embeddings like text embeddings to create these joint representations that capture the relationships between different modalities snippet 6 actually shows how to compute image and multimodal embeddings using the vertex AI API and
19:00 - 19:30 the white paper mentions ColPali which can even do retrieval on documents that contain both text and images based on a text query without needing to do OCR that's pretty amazing wow that's really cool okay so what about structured data can we create embeddings for things like tables in a database yeah definitely though the white paper points out that structured data embeddings tend to be more application specific because the meaning of the data is so dependent on the schema and the context right it's
19:30 - 20:00 not as straightforward as with text or images yeah for general structured data like a table with rows and columns you might use dimensionality reduction techniques like PCA to create embeddings for each row these can be used for things like anomaly detection or as input features for other machine learning models and for user and item data which is crucial for recommendation systems the goal is to map both users and items into the same embedding space so you can find users and items that are similar to each other indicating
20:00 - 20:30 potential matches or recommendations so it's all about finding those hidden connections right and you can even combine these structured embeddings with unstructured embeddings like product descriptions or user reviews to create even richer representations now the last type of embedding that the white paper mentions is graph embeddings this is where things get really interesting because now you're not just representing individual data points you're representing relationships exactly graph embeddings are all about representing objects and their
20:30 - 21:00 relationships within a network like think of a social network where people are nodes and connections are edges a graph embedding for a person would capture not just their attributes but also their position in the network and who they're connected to so it's like it's encoding the social fabric right and you can use this for all sorts of cool things like predicting who might know each other categorizing users and building graph-based recommendation systems there are a bunch of algorithms out there like DeepWalk node2vec LINE and
21:00 - 21:30 GraphSAGE so many algorithms so little time okay we've talked about all these different types of embeddings let's get into how they're actually trained what's the typical process well many modern embedding models use this dual encoder architecture there's an encoder for the query and an encoder for the documents or images or whatever you're working with and they're trained using contrastive loss which basically tries to pull the embeddings of similar data points closer together while pushing dissimilar ones farther apart so it's like
21:30 - 22:00 attracting the good stuff and repelling the bad stuff yeah that's a good way to think about it and the training usually involves two main stages pre-training and fine-tuning pre-training is all about training the model on a massive data set to learn the general representations and it's becoming really common to initialize these models with weights from these huge Foundation models like BERT T5 GPT Gemini and CoCa this lets the embedding model benefit from all the knowledge that's already been learned
22:00 - 22:30 during that pre-training phase so it's like it's getting a head start right and then you fine-tune the model on a smaller task specific data set to really dial it in for your specific application and there are a bunch of ways to create these fine-tuning data sets you can do manual labeling or you can generate synthetic data or you can use techniques like model distillation or hard negative mining a lot of options there yeah and once you have these trained embeddings you can use them for all sorts of Downstream tasks like if you want to build a classifier you can just add a classification layer on top of the
22:30 - 23:00 embedding layer and train it and you have a few options there right you can freeze the embedding model or you can fine-tune it along with the new layers right it depends on how much data you have and how specific your task is vertex AI actually provides some really cool tools for customizing and fine-tuning text embedding models and snippet 7 shows how to build a classifier using a trainable layer from tensorflow Hub Okay so we've got our embeddings now how do we actually search through them efficiently at scale that's where Vector search comes in yeah Vector
23:00 - 23:30 search is all about finding items based on their meaning not just on matching keywords so you compute embeddings for all your data store those embeddings in a vector database then embed your query into the same space and find the data items whose embeddings are closest to the query embedding okay and because we're often dealing with huge data sets we need to use approximate nearest neighbor or ANN search techniques to speed things up right we can't just compare our query embedding to every single embedding in the database that would take forever exactly ANN methods
23:30 - 24:00 let us find really good matches really quickly even with millions or billions of data points so what are some of these ANN techniques well the white paper talks about locality sensitive hashing or LSH and tree based methods LSH uses these hash functions to map similar items to the same bucket so you only have to search within those buckets the white paper uses this analogy of postal codes which is pretty helpful tree based methods like KD trees and ball trees work by recursively partitioning the
24:00 - 24:30 data space snippet 8 actually demonstrates using brute force search ball tree and LSH with scikit-learn and there are also techniques that combine hashing and tree based approaches like FAISS with HNSW and ScaNN okay let's talk about HNSW what's that all about so HNSW which is part of the FAISS library builds this hierarchical proximity graph it starts with these long range connections to quickly get you in the right neighborhood and then it uses shorter range connections to refine the search snippet 10 shows how to use FAISS with HNSW and snippet 9 even
24:30 - 25:00 shows how to build an ANN index with Vertex AI Vector Search which also uses HNSW and how to integrate it into a RAG pipeline with LangChain so HNSW is like the go-to algorithm for a lot of applications yeah it's a really solid performer now Google has this other technique called ScaNN which stands for scalable approximate nearest neighbors it's used in a lot of Google's products and is available through Vertex AI Vector Search and Google Cloud databases for really large data sets it can even
25:00 - 25:30 do this partitioning step to further speed up the search so it's like it's breaking down the problem into smaller more manageable chunks right and then it uses a combination of scoring techniques to find the best matches snippet 11 shows how to use ScaNN with TensorFlow Recommenders Okay so we've got all these cool algorithms for searching through embeddings but we need specialized systems to manage all of this at scale that's where Vector databases come in right traditional databases weren't designed for this kind of
25:30 - 26:00 high-dimensional data and similarity based queries Vector databases are built specifically for this purpose so it's like they're purpose-built for the world of embeddings exactly and the white paper mentions that even traditional databases are starting to add Vector search capabilities which is called hybrid search figure 16 shows the typical workflow embed your data index the vectors embed your query and then search for similar items makes sense and there are a bunch of options out there Vertex AI Vector Search AlloyDB and Cloud SQL
26:00 - 26:30 for PostgreSQL with extensions for Vector search Pinecone Weaviate ChromaDB they all have different strengths and weaknesses so you have to choose the right one for your needs so it's a growing ecosystem yeah it's really exciting and the white paper also talks about some of the operational considerations when you're working with embeddings and Vector stores at scale right because it's not just about the algorithms it's about the whole infrastructure exactly there are challenges with scalability availability consistency updates backups security and
26:30 - 27:00 the white paper emphasizes that embeddings themselves can change over time as the underlying models are updated so you need to think about how to manage those updates and reindex your vector store it's a lot to keep in mind yeah and the white paper also points out that embeddings might not be the best way to handle every type of information sometimes you might need to combine them with traditional full text search techniques so it's about choosing the right tool for the job exactly okay so we've covered the basics of embeddings the different types the
27:00 - 27:30 training process the search algorithms the vector databases now let's talk about the applications what can we actually do with all of this right because that's what really matters in the end absolutely the white paper summarizes a whole bunch of applications information retrieval recommendation systems semantic text similarity classification clustering reranking and when you combine embeddings with Vector stores and ANN search you can build even more powerful applications like RAG
27:30 - 28:00 large scale search engines personalized recommendation systems anomaly detection and even few-shot classification okay so there's a lot you can do yeah the possibilities are pretty much endless and the white paper highlights how embeddings are often used in the first stage of ranking for these large scale applications to quickly narrow down the search space ANN techniques like ScaNN are really good at this and then of course there's RAG which we talked about earlier it's a really powerful way to make language models more accurate and less prone to hallucinations yeah
28:00 - 28:30 RAG is definitely a game changer right and the white paper also stresses the importance of providing sources for the information that the model retrieves so users can verify things and trust the results more figure 17 shows a good example of this okay so the key takeaway here is that embeddings and Vector stores are incredibly powerful tools but you need to understand their strengths and weaknesses and choose the right combination for your specific needs yeah and for our kaggle audience out there this is a really important skill set to
28:30 - 29:00 develop because it's going to become even more critical as language models get more and more powerful so we encourage all of you to dive deeper into the white paper explore the different tools and techniques and start experimenting see what you can build yeah with the advancements in embedding models like the Gemini based models and Vector search algorithms like ScaNN we're just scratching the surface of what's possible who knows what kind of amazing applications will be built in the next few years so go out there and build the future thanks for joining us for this deep dive