Exploring BERT's Capabilities

BERT Consult with Rola - GenAI Essentials

Estimated read time: 1:20

    Summary

    In this engaging video, Andrew Brown and Rola dive into the intricacies of BERT (Bidirectional Encoder Representations from Transformers), a model originally developed by Google. The discussion covers BERT's architecture, how it works, and its applications in natural language processing (NLP). Although often perceived as outdated, BERT remains a valuable tool for tasks like sentiment analysis and text classification. They also clear up conceptual misunderstandings, such as what 'bidirectional' actually means, and discuss the size, optimization, and training of models.

      Highlights

      • Andrew and Rola demystify BERT, a transformative model in NLP that's more relevant than you'd think. 🎓
      • The conversation reveals that BERT is an encoder-only model, making it well suited to text analysis rather than generation. 💬
      • Rola clears up the 'bidirectional' confusion, explaining BERT's true strength in recognizing language context. 📚
      • They explore BERT's versatility, from sentiment analysis to embeddings, showcasing its enduring utility. 📈
      • A deep dive into the evolution and variants of BERT reveals the model's adaptable nature and continued innovation. 🌟

      Key Takeaways

      • BERT is not obsolete! It still serves unique purposes in NLP tasks like sentiment analysis and classification. 🤖
      • 'Bidirectional' in BERT is a misnomer; it's more about understanding context, not directionality. 🔄
      • BERT excels in tasks where the goal is to analyze text and reach conclusions, rather than generate new text. 🧠
      • Understanding the intricacies of Transformers, like BERT, requires digging past the surface - it's more than just code lines. 💡
      • BERT, despite being an older model, serves as a crucial baseline in NLP experiments, showing its lasting influence. 📊
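The 'bidirectional' takeaway can be illustrated with a toy attention computation. This is a simplified sketch, not BERT's actual implementation: the 2-dimensional token vectors are made-up values, the scores are plain scaled dot products with no learned projections, and the function name `attention_weights` is only illustrative. It contrasts an encoder-style pass, where every token can weigh every other token at once, with a decoder-style causal mask, where a token can only weigh itself and earlier tokens.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(vectors, causal=False):
    """Return one row of attention weights per token.

    causal=False: encoder-style, each token sees all tokens.
    causal=True:  decoder-style, each token sees only itself and earlier tokens.
    """
    dim = len(vectors[0])
    rows = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # masked out: weight becomes 0
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        rows.append(softmax(scores))
    return rows

# Hypothetical toy vectors for the sentence "A killed B".
tokens = ["A", "killed", "B"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

encoder_style = attention_weights(vectors)
decoder_style = attention_weights(vectors, causal=True)

for name, rows in [("encoder-style", encoder_style), ("decoder-style", decoder_style)]:
    print(name)
    for token, row in zip(tokens, rows):
        print(f"  {token:>6}: {[round(w, 2) for w in row]}")
```

In the encoder-style output every weight is non-zero, so the first token already "sees" the last one; in the decoder-style output the first token can only attend to itself.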

      Overview

      Join Andrew Brown and expert Rola as they peel back the layers of BERT, an influential transformer model in the realm of natural language processing. Right off the bat, they challenge the notion that BERT is outdated, showing its critical role in specific NLP tasks. Rola brings clarity to common misconceptions, like the misinterpretation of 'bidirectional' in BERT's context, making the discussion both enlightening and accessible.

        Emphasizing BERT's capabilities, the video showcases its use in tasks requiring text analysis and classification. This encoder-only model stands apart by not generating new text; instead, it encodes inputs into representations that are used to reach specific outcomes. Rola and Andrew also highlight BERT's significance as a baseline for NLP experiments, emphasizing its historical and ongoing importance in the field.

          The talk ventures into BERT's diverse applications, including sentiment analysis and beyond, highlighting its flexibility and enduring relevance. They touch on the evolving landscape of NLP models, with discussions on variants and model size optimization, offering both technical insights and practical advice for utilizing such powerful AI tools effectively.
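The model-size points from the conversation can be turned into back-of-the-envelope arithmetic. The sketch below rests on two stated assumptions: the roughly 20-tokens-per-parameter ratio that Rola attributes to the Chinchilla paper, and weights stored as 32-bit floats (4 bytes each). The function names are illustrative, and the 340 million figure is the parameter count quoted in the discussion.

```python
def chinchilla_tokens(num_parameters, tokens_per_parameter=20):
    """Rule-of-thumb training set size in tokens (ratio as quoted in the talk)."""
    return num_parameters * tokens_per_parameter

def weight_bytes(num_parameters, bytes_per_parameter=4):
    """Storage for the weights alone, assuming 32-bit floats by default."""
    return num_parameters * bytes_per_parameter

bert_params = 340_000_000  # parameter count mentioned in the conversation

print(chinchilla_tokens(bert_params))    # 6800000000 tokens
print(weight_bytes(bert_params) / 1e9)   # 1.36 (GB, weights only)
```

Neither number depends on how long the model was trained: the parameter count, and therefore the size of the weights, is fixed by the architecture. Training only changes the values stored in those parameters.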

            Chapters

            • 00:00 - 01:30: Introduction and Overview The chapter titled 'Introduction and Overview' features an introductory dialogue between Andrew Brown and Rola. Andrew acknowledges the expertise of Rola in the subject matter, especially compared to his own experience level. The focus is on understanding how BERT (Bidirectional Encoder Representations from Transformers) works. Andrew has prepared some slides for the session, indicating a structured presentation, and Rola is invited to contribute her knowledge to enhance the information being discussed. The chapter serves as a primer to the topic of BERT within the broader context of machine learning.
            • 01:30 - 03:00: BERT Model Introduction The chapter introduces the BERT model, which some may consider outdated due to the emergence of large language models (LLMs). However, it's highlighted that BERT still holds value in specific use cases where LLMs may be overly complex or excessive. This discussion was prompted by an event at the AWS Montreal user group, sparking further exploration into the applications of BERT. The chapter sets up a presentation by Rola to delve deeper into understanding the BERT model and its relevance.
            • 03:00 - 04:30: Transformer Architecture The chapter delves into the structure and research behind the Transformer architecture, specifically highlighting the work done by Google on Bidirectional Encoder Representations from Transformers (BERT). It describes a misconception or simplification that BERT simply involves splitting the Transformer architecture and stacking the first part, focusing on its encoder component. The encoder is responsible for processing and understanding natural language inputs.
            • 04:30 - 06:00: BERT vs. GPT The chapter discusses the basic mechanism behind encoding and decoding in large language models (LLMs). It begins by explaining how natural language inputs are transformed into mathematical representations through an encoder. This mathematical information can then be passed through a decoder to convert it back into natural language. The discussion highlights that the first generation of LLMs operated on an encoder-decoder setup, where natural language was inputted and processed through this system to output language again. It mentions both encoder-decoder examples and briefly touches upon GPTs in this context.
            • 06:00 - 09:00: Bidirectional Encoding The chapter discusses the concept of bidirectional encoding with a focus on how the BERT model functions as an encoder-only architecture. It highlights the difference between encoder-only and decoder-only models, explaining that encoder-only models like BERT take natural language as input and convert it into mathematical representations.
            • 09:00 - 13:00: Emergent Tasks and Use Cases In this chapter titled 'Emergent Tasks and Use Cases,' the discussion centers around the capabilities of BERT (Bidirectional Encoder Representations from Transformers) and its application in various tasks. The chapter explains how by adding layers to neural networks, different representations and functionalities can be achieved. BERT is particularly highlighted for its proficiency in tasks such as sentiment analysis and text classification. Essentially, it processes large amounts of text to produce simplified outputs like a word or a sentiment, among others.
            • 13:00 - 18:00: Model Training and Parameters The chapter discusses the classification process and references the BERT model in the context of the Transformer architecture. There is mention of an attempt to view an architectural diagram more clearly, highlighting difficulties in doing so.
            • 18:00 - 24:00: Distillation and Model Optimization The chapter discusses the Transformer architecture, highlighting the distinction between BERT and GPT models. BERT stacks the left (encoder) side of the architecture, while GPT models are described as 'decoder only' models, which essentially means stacking the right side of the Transformer architecture.
            • 24:00 - 29:00: BERT Applications and Fine-Tuning The chapter begins with a clarification about the nature of GPT models, emphasizing that they are not encoder-decoder models but rather decoder-only models.
            • 29:00 - 32:00: Complexity in Model Application This chapter discusses the significance of word sequence in understanding meaning in language models. It highlights the difference in interpretation of phrases like 'A killed B' versus 'B killed A,' demonstrating how order affects comprehension. Traditionally, recurrent neural networks (RNNs) and their variations have been used to process and understand sequences to maintain this contextual importance.
            • 32:00 - 33:30: Naming and Legacy of BERT This chapter discusses the foundational aspects of BERT, focusing on its architecture, specifically the Transformer model. The key highlight is the non-sequential processing, where BERT does not read text in a traditional left-to-right or right-to-left manner. Instead, it processes the entire sequence simultaneously using attention mechanisms, allowing for a more comprehensive understanding of context and semantics. The chapter underscores how this transformative approach defines BERT's ability to grasp nuances in language effectively.
            • 33:30 - 34:10: Conclusion and Acknowledgments The chapter focuses on the concept of the attention mechanism, particularly in sequences, and its ability to highlight certain words in association with others. The discussion draws a parallel with the character Dr. Manhattan from a comic book movie, who is described as all-knowing and possessing an elevated level of understanding.
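The training loop Rola describes in the transcript (initialize parameters randomly, predict, measure the error, adjust, repeat) can be sketched on a two-parameter linear regression. This is an illustrative toy in the spirit of the notebook she mentions, not a copy of it: the data, learning rate, step count, and the function name `train_linear_model` are all assumptions chosen for the example.

```python
import random

def train_linear_model(xs, ys, steps=2000, learning_rate=0.05, seed=0):
    """Fit y = weight * x + bias with plain gradient descent on mean squared error."""
    rng = random.Random(seed)
    # The parameters exist, with random values, before any training happens.
    params = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # [weight, bias]
    n = len(xs)

    def mean_squared_error(p):
        return sum((p[0] * x + p[1] - y) ** 2 for x, y in zip(xs, ys)) / n

    initial_loss = mean_squared_error(params)
    for _ in range(steps):
        # Predict, measure how far off each prediction is, then adjust.
        errors = [params[0] * x + params[1] - y for x, y in zip(xs, ys)]
        grad_weight = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_bias = sum(2 * e for e in errors) / n
        params[0] -= learning_rate * grad_weight
        params[1] -= learning_rate * grad_bias
    return params, initial_loss, mean_squared_error(params)

# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

final_params, loss_before, loss_after = train_linear_model(xs, ys)
print(len(final_params), "parameters before and after training")
print("loss before:", round(loss_before, 4), "loss after:", round(loss_after, 8))
print("weight, bias:", [round(p, 3) for p in final_params])
```

The model has two parameters at the start and two at the end; only their values change. The same holds at BERT's scale, which is why the weights themselves do not grow with more training.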

            BERT Consult with Rola - GenAI Essentials Transcription

            • 00:00 - 00:30 you hear me well yeah I hear you good cool hey everyone it's Andrew Brown and welcome to another video I have Rola who is definitely an expert comparative to me which I I know nothing as you just heard me talking about Bert but let's see if we can actually find out really how Bert works um and so I have my slides pulled up here let's bring them up here and uh Rola is going to help us uh enrich the information I have so when I learned about Bert I actually uh learned about it in the machine learning
            • 00:30 - 01:00 um a certification I didn't think much of it because I thought it was old and maybe uh folks don't use it anymore but I was at an event with Rola um at the AWS Montreal user group and I heard her say there are use cases where you can still use Bert or should use Bert where llms um might be uh too much too much work and so I wanted to kind of dig a little bit more into that um and so I have my slides up for display here for Rola and so I understand Bert to be this
            • 01:00 - 01:30 bidirectional encoder representation from Transformers it was uh researched by Google and you know I saw a video which showed the Transformer architecture and what they did is they just split it in half and took the first part of the architecture and stacked it and they called that Bert so is that true is it just cut in half and then stacked well so um Transformers come I they have two pieces right they they've got the encoder piece which encodes data so it takes um natural language and
            • 01:30 - 02:00 encodes it uh into mathematical representation and then you've got a decoder where if you've got mathematical representation you can pass it through a decoder and get natural language right and um the first generation I think of um llms came as an encoder decoder couple and so you would put in natural language it goes through this mathematical system and then it would you would still get um uh language out of it exactly that so there's a encoder decoder um gpts for
            • 02:00 - 02:30 example are a decoder only but Bert itself is an encoder only model um and so what that means is it takes you have to think of these things as what is the input and what is the output and so uh an encoder only model uh takes in natural language or or takes in um for for in our case in a language model takes natural language and then it comes out with a mathematical
            • 02:30 - 03:00 representation right and so what Bert is and then you add a few layers of um you know these these things are uh neural Nets at the at the heart of them and so you add a few layers and and you can um come up with different representations and so what Bert is really good at is things like sentiment analysis for example or classification um so you think of it as input is text a lot of text and output is less text so either a word or uh um a s a sentiment a
            • 03:00 - 03:30 classification something like that does that make sense yeah and one thing that I I I I think I got from that as well is that uh you were saying that Bert if we go back here to the architectural diagram and I just want to confirm if this is what I understood I'm trying to get one of these images a little bit larger here it's a little bit hard today I'm not sure why it's uh making it hard for me but I want to go back to that Transformer architecture um and I just want to make it a little bit larger can here so just
            • 03:30 - 04:00 give me a moment here to get the link so I'm just copying the link address here and so here is the Transformer architecture and you said well I said Bert's just this side stacked uh over onto itself and that's what we get for Bert but uh I think you had suggested that GPT is the right side that is stacked upon itself or is it if you just took the right side would that be considered GPT yeah GPT are decoder only models I mean in a nutshell yes uh but
            • 04:00 - 04:30 yes so they're they're not uh they're not encoder decoder they're decoder only models okay interesting uh so let's go over into the next part of Bert um and so yeah so Bert is bidirectional meaning it can read uh text both from left to right right to left to understand the context of text sorry that's not exactly what it means it's supposed to be non the naming is a misnomer um it's non-directional it should be named non-directional it's called bidirectional the idea is that it
            • 04:30 - 05:00 doesn't the sequence of words often matter right um and so if you say um I don't know um a killed b or B killed a it it's a very different understanding of how things are and so traditionally the way it worked is um we had rnns and different variations of recurrent neural networks and the way they did that is through sequential understanding um
            • 05:00 - 05:30 the the Transformer architecture changes that and attention changes that in that um that sequence is understood differently not not sequentially and this is why the non- directionality comes in I think okay so all right so that makes sense so it takes all of the data at once it doesn't read right to left nor left to right it just takes in all the data uh but internally it works in a mechanism where um it is not reading
            • 05:30 - 06:00 through a sequence it's really the the attention mechanism makes it so that it can highlight certain words um in association to others okay so so uh have you ever seen the movie um um uh it's a comic book movie um have you heard Dr Manhattan have you ever heard of Dr Manhattan so Dr Manhattan is uh uh he's all knowing like he becomes um uh I don't
            • 06:00 - 06:30 know he becomes like kind of like a God and he and he's aware of everything all at once at all time so would we interpret its understanding as it contextually is aware of all the words in all directions would that be a better description or um I think the I it is aware aware of all of the words but it is able to give certain words more importance based on what it understands of language so there's a combination of things
            • 06:30 - 07:00 where one language uh where one word kind of enhances another um and the it starts to pick up these nuances of word connections based on its own training so it is aware of all of the words it just understands which words it really needs to focus on right all of the articles for example doesn't necessarily care for um does that make sense yeah yeah it's contextually aware Okay so
            • 07:00 - 07:30 I see so instead of think instead of thinking of like it's reading left to right just understand that it's contextually aware of each word and its importance in a sentence okay exactly and the movie I was thinking of is called Watchmen if anybody cares I don't know why I I just uh I had to I had to remember what the movie was um so Bert is a pre-trained model on the following task so two things that it says it does is masked language modeling so MLM and the other one is next sentence prediction NSP so you know my thoughts is like you train the model based on these two
            • 07:30 - 08:00 things does that mean in its pre-trained state like it's been trained for these two things that that's all it can do at this point is like it can it can uh uh infer if you were to do these two tasks it would be able to do it in this state because I was trying to figure out what can you do with this model I assume you can't do anything with the model unless you train it again but can you do anything with it in it in the in the uh non-trained state or sorry the pre-trained state um I think there's a lot so there's there's some something called emergent tasks in llms where you train
            • 08:00 - 08:30 it for a specific thing but it turns out it does this other thing really well and we call that an emergent task okay um and we're learning that the bigger the model uh size then the more emergent tasks it it has so the fact that it is pre-trained on these things then we know that it can do these things because it has been trained on them um but there has been a lot of emergent tasks and a lot of uh I think a lot of people were surprised at how much for example the gpts can do
            • 08:30 - 09:00 um in terms of solving different problems and that's because of the size now Bert is a model that is about 340 million so it is a couple of orders of magnitude smaller than our bigger models you have to remember this is one of the first one of the few first models that came in the Transformer architecture came out in 2017 this is this came out in 2018 um so it is one of the first models it is one of the smaller models at 340 million um the gpts right now are about
            • 09:00 - 09:30 1.2 trillion so it's about a couple of orders of magnitude bigger so I'm sure it can do more than that um I have to remember use cases I have seen it we have projects where we are running sentiment analysis and and classification uh with Bert but um I'm sure there's other things that it can do I just haven't uh played with it enough okay so so that that word emergent emergent uh did you say emergent models
            • 09:30 - 10:00 or emergent tasks emergent tasks yeah things that can do so so again to to ground it to a real world example I got to keep doing this as a real world example is like Ozempic was uh supposed to be for something else and now it's used for weight loss is that kind of like exactly you discovered that it can do this um right so but was that the case with Bert they like did not expect it to like or was it like they made the Transformer architecture and then they said okay we're going to just take off the first half and see what we can do with it and then they were they're like oh I guess I can do this
            • 10:00 - 10:30 or um I think the first thing that I have to go back through the the literature um I think they were just ex experimenting with different models and things like that I don't know that this is not the one for the sequence uh sequence to sequence I have to go back and see um well if you want I have a pause feature I can pause if you want yeah let's let's pause pause let's pause and see all right
            • 10:30 - 11:00 we're back from uh from pause land here and so we did a bit more research in here and I don't know if we remember exactly what we asked because we had a really good sidebar conversation but um I think we were talking about um the discovery of Bert and what came before it uh or or you know like did did GPT or Bert come first I think that was it Bert came first or GPT came second so it depends on different references we looked at two different references and
            • 11:00 - 11:30 so like we said the Transformer paper which is the Attention Is All You Need paper uh came out in 2017 and that was a uh a work between uh Google but also the University of Toronto I should say that as a Canadian people forget that it's University of Toronto it's a you know uh and then um there in 2018 uh Bert came out as a an encoder only model and the gpts came out also uh as a decoder only model and based on my
            • 11:30 - 12:00 reference on the book I'm reading it said that Bert came first but when we looked in the internet there were references that said gpt1 came first so I I would say they came at the first in the same year um they they're the first two I guess one of the first two that came out well that kind of makes sense because you know if they had this Transformer architecture and then they split it in half and played around with it I could see them doing that I can't imagine that they made them in isolation and attached
            • 12:00 - 12:30 them or maybe they did I'm not sure but uh all we know is that all these things happen close together um and uh we are we are where we are now um but anyway we'll continue on here so Bert can be fine tuned to perform uh the following tasks now I only know Bert for one thing and that's to use it for embeddings I shouldn't say that because I did run Bert a few times uh in this GenAI Essentials course for different fine tuning but I did not realize how broadly it could be applied to things so we had named entity recognition question and
            • 12:30 - 13:00 answering sentence pair task summarization feature extraction embeddings uh and more I actually have another example I can't remember what it is but it's on the next slide um and so there's all these things that you can do for fine-tuning what do you see folks using Bert the most for uh in the industry and or like today like what would people gravitate towards uh Bert to utilize for so I see it uh for I've seen it work with the classification uh
            • 13:00 - 13:30 so you give it a piece of text and then you classify um that as um a particular label you're interested in um and that makes sense all of these things if you think about what they are named entity recognition uh I see sentiment analysis as well we we some people use that which is also a label in a sense um they're they're all the same type of work so encoder only models are really good at um analyzing text to reach a conclusion
            • 13:30 - 14:00 right to to do some sort of task to reach a conclusion not for uh to generate a sequence to sequence system so yeah this is all in line um and I think yeah the most I've seen is classification um and sentiment analysis okay so classification sentiment analysis that's good because those are the two main or sorry embeddings and then the other two as you mentioned were the ones that I were utilizing it for um I'm surprised I can do question and answering but uh I I always wonder if like the very early um
            • 14:00 - 14:30 oh so I think even if it does question answering it's fairly short like the context out the output context window would be fairly limited well this wouldn't be something that people would use as uh something for their um website say as a as a customer service uh bot it would be too incapable of it um yeah I don't think that's what it's made for it's not a um I think the context would be fairly short mm uh so there are multiple sizes of Bert we
            • 14:30 - 15:00 have 100 million parameters 240 million parameters 4 million parameters and they said there's like 24 other varying models um I'm assuming Bert base uncased is the most used model I that's the one I keep seeing the most I'm not as sure as to why but obviously I would imagine that you're going to get different levels of performance based on the amount of uh parameters being used is Bert tiny for me uh 4 million is there
            • 15:00 - 15:30 any use case for such a small model or um other than learning I well I think if you can put it on edge or something like that potentially uh it depends again what you want to do with it and how well it does so you'd have to it's really hard to know what a model will do in any particular situation what we do know though is that sorry uh what we do know is that so there's a really good paper that came out called the chinchilla paper I don't know if you've heard about that one okay
            • 15:30 - 16:00 so the Chinchilla paper looks at optimizing model the model sizes versus the data versus the compute so it it looks at um let me pull it actually let me pull it in uh but some really cool things came out of of that paper um and the idea is that oh let me get you the name of that
            • 16:00 - 16:30 um and so it really looked at the relationship between the model uh the model size the data set size the compute resources required and what it came out with it is it realized that um obviously the bigger the model the more information it can understand the the better uh the more things it can do uh but also the bigger the model the more data it needs right so um it came up with the idea that many models seem to be overparameterized and under trained so
            • 16:30 - 17:00 um can we do a lot with 340 million I think we can uh it depends on what data it's seen and what data you use to give it so um yeah we you can if you fine-tune it specific so sorry the one key thing that I the one thing I heard that was sounded very important was that uh there are models that are overparameterized but undertrained meaning that they have a lot of um connections between their nodes but
            • 17:00 - 17:30 the amount of passes I'm assuming that's when we mean training is uh is as few or the data plus passes going through is not enough yeah so there's a lot of space in that model to learn it's like having a big brain but not actually filling it with with much in a sense right like you so the idea of um the size of the model we have to understand so is is 340 million enough well it
            • 17:30 - 18:00 depends on what data it's seen right so these things tend to be very difficult to judge like that um and so it's a combination and and the more and the bigger the model is then the more data it should see for it to do a decent job and so the Chinchilla paper came out with the conclusion that a lot of models seem to be overparameterized and under um and undertrained so they they they would get used they would do better if they see more data and just as a side note I think they came up with the
            • 18:00 - 18:30 number that the data set size in tokens should be about 20 times the number of parameters so if for a 340 million parameter model you'd have to multiply that by 20 to understand what the size of the data set that's ideal for that model okay so I guess my other thought is that if you have a model that has a lot of parameters but you don't do many passes of training and your data set's not that large is the
            • 18:30 - 19:00 size of the model going to be smaller because the training was less intense and or the parameter size is going to basically determine the size model the the parameter model the parameter size which is the which is in a sense dict just uh a how many model how many parameters this model has that's constant so even before it um before a model is trained at all it still has the exact same parameter models and what
            • 19:00 - 19:30 ends up happening in um the in in the training world is those let's say 340 million parameters they're initialized to random weights they're initialized randomly and then as it sees the data what it does is it um it makes a prediction and then it goes back uh it we say it it optimizes objective function so it estimates how how off it is and then it adjust those parameters accordingly and then it Loops over and
            • 19:30 - 20:00 over it it makes a prediction calculates it its errors readjusts and then it does that over and over and over until it minimizes its error margin and so in the beginning of the training you still have you have 340 million parameters that are just random and at the end of it you have 340 million parameters that have some sort of representation of the real world of what it trained not of the real world but of what it trained on um and
            • 20:00 - 20:30 so the the data comes in in that how good are these parameters tuned like how good are these parameters to allow you to to complete a task well if that makes sense I actually uh I I I could show I maybe I should share this with you I did um a little bit of a representation on a linear uh and a logistic on a linear regression about how it starts and how it adjusts you could see it on a notebook I'll share that with you it's
            • 20:30 - 21:00 it's really nice how you see it train and adjust itself that sounds really cool um I one thing I was trying to try to figure out was that um or for for folks that might think is like is a larger model in file size mean that it's more trained or more intelligent but what I'm hearing is that the size of the model is based on the amount of parameters because that's the amount of data it's being held within the model weights and so all you're doing is adjusting those weights so you're not going to they're just numbers right so
            • 21:00 - 21:30 it's not going to be bigger or smaller I mean there might obviously be some flux to some degree but it's not going to be you know if you did you know 10 times more training it's going to be 10 times larger it's really the parameter size that's going to determine its end size exactly so regardless of what where the model is whether it's good or bad it it'll have the same size it is a constant size um but what what really matters is how well what those numbers are these 340 million parameters what are the and are they because they you have to
            • 21:30 - 22:00 think about it is they they they in that's a mathematical model right and so that's what matters um the file size though some people tell me oh the file size increases the file size should not increase the reason the file size increases is because certain programs attach metadata to the system and so when you when you train the the optimizer sometimes or or um attaches metadata of of training States that's what makes a bigger file the
            • 22:00 - 22:30 model itself should always be constant at whatever level you set it at okay that makes sense so so it there are cases where the data gets larger but it's not for the reasons that you think it's because you know as you're saying there's additional data being attached um as metadata and that makes total sense there um just continue on here with Bert yeah there's obviously a bunch of variants I didn't look at any of the variants I just assume like I know I obvious see the word distill so I assume that that's just a more optimized
            • 22:30 - 23:00 optimized model like I don't know if it's a pruned model or whatever you call it but I just know that um they're generally more performant like two three times and they're cutting Corners to to do that not in a bad way I mean like they're they're figuring a way to make it more performant um I haven't looked at all these either um but you have to think about of these as so I think Roberta let me uh they all came out later like Roberta came out in 2019 so a year later you have to think of this as a natural like you know how man evolved um you
            • 23:00 - 23:30 know the evolution of man where we and then and you have to think of it like that right every year there's new models um based on new understanding based on New Concepts based on better optimizations and so um this is the Natural Evolution of it yeah so so yeah I understand that I guess I guess I just really getting hung up on the word distill because I keep seeing it and every time I come across a distilled model uh they always describe it as like two
            • 23:30 - 24:00 three four times faster and they've done something to it whereas to um again I'm not sure it's pruning or they're doing some kind of optimization to the model to make it more proficient like for instance whisper has distilled whisper and I I just keep seeing the term whis uh distill so I guess my real question is like when you see these terms like distill does that do you know what that means or is just that it is they're just putting that name there and it's suggesting that like is there a rhyme and reason to the name across models or there's no convention
            • 24:00 - 24:30 and people can attach whatever they want? Like sometimes I see models that have Omni in front of them. Are they just putting Omni in front of everything when they decide that Omni sounds cool? Yeah, I think these are researchers in labs, right, or people in different companies, and there's no real naming convention, so they can realistically name things the way they want. We just talked about how bidirectional is really somewhat of a misnomer and should be non-directional, but, you know, the
            • 24:30 - 25:00 namings, I don't attach too much to them. And they also want it to be really cool when you make an acronym out of it, so sometimes it is a misnomer for the sake of the acronym. But a distilled model: there are a lot of ways that you can optimize a model. Like you said, some of it is pruning, some of it is these teacher models. AWS actually, at re:Invent, they now, in
            • 25:00 - 25:30 Bedrock, you can create these smaller student models, and what that allows you to do is to take this really big model and create a smaller model that is very optimized for your task. And so the idea is, and we talked about this in the previous system we ran, but maybe we can mention some of this: these models can be really,
            • 25:30 - 26:00 really big, right? They're fairly huge. And a huge model means more latency, it means more resources, it means more cost. And so there's this idea of creating smaller models, distilled models, in various ways. You could do this in many ways: you can prune, you can teach a
            • 26:00 - 26:30 student model, you can do different things. The idea is to reduce cost, latency, and resources by doing that, and there are various ways we can do that. I can potentially put up some resources for you as to the different ways that is done. So I think what I'm seeing here is that distillation, distill, could suggest that it's this teacher transfer thing, where you have a more intelligent model transferring to a more
            • 26:30 - 27:00 cost-effective model, but also to warn that naming conventions are not necessarily set in stone, so someone could put distill in front of it and the meaning could be different. And I think a general warning that we might suggest is that you have to look at everything and see what it does, regardless of its name tag, its model name. I think conventions are not there yet in terms of consistency
            • 27:00 - 27:30 across different teams. Yeah, like I might create a model and then decide that it's distillation, but I actually don't know what distillation means. Yeah, I came from a chemistry background, and I think the word distillation means to take the essence of, right? You know, when you distill a chemical, you're really taking out all of the impurities and getting the essence of that chemical. See, I would just make a model
            • 27:30 - 28:00 smaller somehow and then I would put distill in front of it, and everyone would be like, what are you doing, Andrew? But I'd be like, I don't know, I thought that's what it was. Yeah, so then I put this: while BERT is an older model, it's still used as a ubiquitous baseline in natural language processing. That's specifically what Wikipedia said. So they're saying that it is, I guess, a comparative for when you're doing NLP experiments, to use it as a baseline. I don't know what they mean by that, I just know that they say that, so I'm not sure how you use it as a
            • 28:00 - 28:30 baseline, but that's what they say. I'm not sure. I think the biggest thing in my mind, like we said, is that it is an encoder-only model, so it's great for when you have text that you want to do a very specific task on, where you want to reach a particular conclusion, and that's where these encoder-only models come in. And so here I have an example of BERT doing sentiment analysis. Here we're using the Hugging Face pipeline, we're using bert-base-uncased, and
            • 28:30 - 29:00 then we have a couple of sentences, and then I guess what it does here is it downloads the classifier model. I'm not sure how you like to work with models, and I'm not sure if Hugging Face is too high an abstraction for what you want to do day to day, but as far as I understand, what's happening here is that when you use the Hugging Face pipeline and you say sentiment analysis, even though we're specifying the BERT base uncased model, it's going to download a version of the BERT base
            • 29:00 - 29:30 uncased model that is fine-tuned for sentiment analysis. I think that's what's happening here, as that's what the documentation kind of suggests. But yeah, this is an example here of BERT, and that's all I got. I'm not sure if you have any thoughts on the code, or if there's anything to talk about around the code for pre-trained models, or for fine-tuned models, sorry. Well, I mean, I think
            • 29:30 - 30:00 going back to the re:Invent keynote from Werner, the idea of simplexity, like everybody wants the simplexity, remember that? That was a really good talk, and this idea of abstracting for you. So I think the fact that you can do this with a huge model with a few lines of code is really impressive, but what that also means is that there's a whole lot of complexity under the hood, and the way you need to deal with these is to
            • 30:00 - 30:30 actually understand how the system works, how the API and all of these libraries work, what this provider deals with. And so the way I like to work is to stick to a single provider, or at least a few providers, so that you can understand their way of working and how to look at their documentation and do all of that. Why? Because, again, there is a huge amount
            • 30:30 - 31:00 of complexity that is removed and abstracted for you, and you can just call a few lines of code. And while that sounds great, to really understand what is going on you have to peel the lid off a little bit. And so if you get used to particular providers, then you can understand how their documentation works, where to find particular information, where to read particular things, if that makes sense. So that
            • 31:00 - 31:30 word simplexity, it almost sounds like they're opposites, but the idea is that you have a complex system, but they simplify it for you, and so you understand these complex tools through this simple interface. So you are doing complex things, but things are abstracted away from you. Exactly. Like, look at that: is that enough? The amount of things that happen for it to understand that sentence: there's a process of tokenization, it's cutting these
            • 31:30 - 32:00 into tokens, and then it's embedding them, and then it is running them through a mathematical system, and then it's bringing them back to a label. So there's a whole lot happening, and you write two lines of code, right? It's interesting, but as the person that's using it I should generally understand what's actually happening, like the complexity, and appreciate how much is being abstracted away from me, or the fact that there are cases where
            • 32:00 - 32:30 yes, this is good in this use case, but you might have to go and do this at a lower level. It's the same thing with high-level languages and lower-level languages: with more complexity you get more flexibility. Yeah. But yeah, that's BERT, and I don't know, do you think they named it BERT because of the Sesame Street character? Like somebody had a kid, somebody had a
            • 32:30 - 33:00 kid and their favorite was Bert, and they're like, let's just figure out a way to name it BERT? It's very possible. You know, I worked in labs for a long time, and a lot of times we did reverse engineer the name to fit the acronym, so yeah, very possible. So you're part of the problem, is what you're telling me. I mean, it's marketing; it has to be easy, it has to roll off the tongue. Well, I appreciate your expert input on
            • 33:00 - 33:30 BERT here. I definitely understand it a lot better. I hope that people who are watching, when they hear me talking, don't realize how much information is missing from what I'm saying, and I'm just trying to give them the most practical route, the best way I understand it. It's kind of like Chef Ramsay, where sometimes he'll have videos and he'll talk about the process of something and it's totally wrong, but the outcome of what he does is correct. And so that's what we call a practitioner: someone that's good at doing something but doesn't necessarily
            • 33:30 - 34:00 know exactly how they're doing it and gives inaccurate explanations. And so folks just need to understand that as I'm making content here, and that's why I have you here, to expose the lack of knowledge I have. So I appreciate your time, and we will see Rola in more videos, and we're going to continue on with the course. Ciao ciao.
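The teacher and student idea discussed around 25:00 to 26:30 is commonly implemented by training the small model to match the large model's softened output distribution. The following is a minimal sketch of that loss in plain Python, assuming a temperature-scaled softmax and a KL-divergence objective. The function names, the temperature value, and the example logits are hypothetical illustrations; none of them come from the video or from a specific library.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions.

    The loss is zero when the student reproduces the teacher's distribution
    and grows as the two distributions diverge.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the big model
    q = softmax(student_logits, temperature)  # predictions from the small model
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a three-class task.
teacher = [2.0, 0.5, -1.0]
untrained_student = [0.0, 0.0, 0.0]
print(distillation_loss(teacher, untrained_student))
```

In practice this term is usually combined with the ordinary loss on the true labels, and, as the conversation notes, a model with "distill" in its name may have been made smaller by other means such as pruning.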
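The sentiment analysis example described around 28:00 to 29:30 is not reproduced in the transcript, so the following is only a sketch of what such a call typically looks like, assuming the Hugging Face `transformers` package is installed. The helper names `classify_sentiment` and `format_results` are hypothetical. As far as I'm aware, the pipeline loads whichever checkpoint is named, and `bert-base-uncased` on its own ships without a trained sentiment head, so this sketch names a checkpoint that was fine-tuned for sentiment instead.

```python
# Checkpoint fine-tuned for sentiment; an assumption, not the one in the video.
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


def classify_sentiment(sentences, model_name=SENTIMENT_MODEL):
    """Run the tokenization, embedding, model, and label steps via one pipeline call."""
    # Imported lazily so this file can be loaded before transformers is installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # Returns one dict per sentence with "label" and "score" keys.
    return classifier(sentences)


def format_results(sentences, results):
    """Pair each sentence with its predicted label and confidence score."""
    return [
        f"{result['label']} ({result['score']:.3f}): {sentence}"
        for sentence, result in zip(sentences, results)
    ]
```

Calling `classify_sentiment(["I love this course", "This was confusing"])` downloads the model weights on first use, which is the download step mentioned in the conversation, and the output can then be passed to `format_results` for printing.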