Understanding Naive Bayes Classification

Naive Bayes, Clearly Explained!!!


    Summary

    In this video, Josh Starmer from StatQuest provides a comprehensive explanation of the Naive Bayes classification algorithm, focusing specifically on the Multinomial Naive Bayes classifier. Starting with a fun introduction, the video walks step by step through using word histograms to classify normal and spam messages based on word probabilities. Key concepts such as likelihoods and prior probabilities are discussed, along with common challenges like zero probabilities, which are countered using Laplace smoothing. The "naivety" of Naive Bayes, due to its assumption of word independence, is also highlighted. Despite these oversimplifications, Naive Bayes is shown to perform surprisingly well in practice. The video wraps up with a few promotions for further learning materials and ways to support the creator.

      Highlights

      • Josh explains Naive Bayes using fun and approachable language, making complex concepts easy to understand. 🎉
      • Histograms of word occurrences help calculate probabilities for message classification. 📈
      • Laplace Smoothing resolves the zero probability issue, ensuring smooth calculations. 🤓
      • The 'naive' part of Naive Bayes refers to its assumption of word independence, which is both a limitation and a strength. 🤔
      • Though simplistic, Naive Bayes consistently performs well in separating normal messages from spam. 💪

      Key Takeaways

      • Multinomial Naive Bayes is effectively used to classify text data, distinguishing between normal and spam messages (a short scikit-learn sketch follows this list). 📧
      • A histogram of words is used to calculate probabilities for word occurrence in normal and spam messages. 📊
      • Initial guesses, or prior probabilities, play a crucial role in Naive Bayes calculations. 🔍
      • Zero probability obstacles are overcome with methods like Laplace Smoothing. ➿
      • Despite treating language as a 'bag of words', Naive Bayes performs well, illustrating its practicality in real-world applications. 🌍
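
As a practical aside (not shown in the video), scikit-learn's CountVectorizer and MultinomialNB implement this same histogram-plus-smoothing pipeline off the shelf. A minimal sketch with made-up toy messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy training data, for illustration only.
messages = ["dear friend", "dear lunch money friend", "money money money dear", "friend money money"]
labels   = ["normal", "normal", "spam", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count histograms, one row per message
model = MultinomialNB(alpha=1.0)         # alpha=1.0 is the Laplace smoothing described above
model.fit(X, labels)

print(model.predict(vectorizer.transform(["dear friend"])))  # expected: ['normal']
```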

      Overview

      Josh Starmer kicks off the video with an engaging introduction, setting the stage for a deep dive into the world of Naive Bayes classification. This video focuses on the Multinomial Naive Bayes classifier, which is a crucial tool in natural language processing and text classification.

        Through detailed examples, Josh explains how we can use word frequency histograms to determine the likelihood of a message being 'normal' or 'spam,' allowing us to tackle spam filtering effectively. He addresses key elements like likelihoods, prior probabilities, and potential challenges such as zero probabilities, which are cleverly overcome with Laplace Smoothing.
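
To make the training step concrete, here is a minimal sketch (my own illustration with made-up toy messages, not code from the video) that builds a word histogram for each class and turns it into the likelihoods and prior probabilities described above:

```python
from collections import Counter

# Made-up toy training messages; the video's actual messages are not shown.
normal_messages = ["dear friend", "dear dear friend", "lunch money dear friend"]
spam_messages   = ["money money money", "dear money"]

def word_histogram(messages):
    # Histogram of word counts across all messages in one class.
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

def likelihoods(histogram):
    # P(word | class) = count of the word / total number of words in the class.
    total = sum(histogram.values())
    return {word: count / total for word, count in histogram.items()}

p_word_given_normal = likelihoods(word_histogram(normal_messages))
p_word_given_spam   = likelihoods(word_histogram(spam_messages))

# Prior probabilities ("initial guesses") estimated from the training data.
n_total  = len(normal_messages) + len(spam_messages)
p_normal = len(normal_messages) / n_total
p_spam   = len(spam_messages) / n_total
```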

          Highlighting the 'naive' aspect, Josh discusses how Naive Bayes assumes that words occur independently of one another. Despite this simplification, the technique is praised for its practical efficacy. The video concludes with some promotional content, urging viewers to dive deeper into statistical learning with additional resources offered by StatQuest.
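
Classification then multiplies the prior by the likelihood of every word in the new message; because only the word counts enter the product, word order never matters, which is the 'bag of words' naivety described above. A short sketch, using the normal-message likelihoods and prior quoted in the video:

```python
import math

# Likelihoods and prior for normal messages, as quoted in the video.
p_word_given_normal = {"dear": 0.47, "friend": 0.29, "lunch": 0.18, "money": 0.06}
p_normal = 0.67

def score(message, prior, word_likelihoods):
    # Multiply the prior by P(word | class) for every word in the message.
    # Logs are summed and then exponentiated to avoid underflow on long messages.
    log_score = math.log(prior)
    for word in message.split():
        log_score += math.log(word_likelihoods[word])
    return math.exp(log_score)

print(score("dear friend", p_normal, p_word_given_normal))  # ~0.09
print(score("friend dear", p_normal, p_word_given_normal))  # identical: word order is ignored
```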

            Chapters

            • 00:00 - 00:30: Introduction and Sponsorship The video opens with the StatQuest theme song, sung at home during lockdown, followed by a mention of the episode's sponsor, JADBio.
            • 00:30 - 01:00: Introduction to Naive Bayes This chapter introduces the concept of Naive Bayes, focusing mainly on the Multinomial Naive Bayes classifier. It notes that while the Multinomial Naive Bayes is frequently discussed, there is also a Gaussian Naive Bayes classification, which will be covered elsewhere.
            • 01:00 - 02:30: Histogram and Probabilities for Normal Messages The focus of this chapter is on distinguishing between normal and spam messages using a histogram. Initially, it discusses receiving normal messages from friends and family as well as spam messages. The main goal described is to filter out the spam messages. The process begins by creating a histogram of words from the normal messages to aid in this differentiation.
            • 02:30 - 03:30: Histogram and Probabilities for Spam Messages After noting that P(lunch | normal) is 0.18 and P(money | normal) is 0.06, a second histogram is built from all the words in the spam. The probability of seeing 'dear' given spam is 2, the number of times 'dear' occurs in the spam, divided by 7, the total number of words in the spam, giving 0.29; the remaining spam word probabilities are calculated the same way.
            • 03:30 - 05:30: Terminology and Initial Guess A terminology alert explains that, because these probabilities are for discrete individual words rather than something continuous like weight or height, they are also called likelihoods, and here the terms are interchangeable. Given a new message that says 'dear friend', classification begins with an initial guess about the probability that any message is normal. This guess, called a prior probability, is commonly estimated from the training data: since 8 of the 12 messages are normal, the prior is 0.67.
            • 05:30 - 07:30: Calculating Scores for 'Dear Friend' Message The normal prior of 0.67 is multiplied by P(dear | normal) and P(friend | normal), giving a score of 0.09. The same calculation with the spam prior of 0.33 and the spam likelihoods gives 0.01. Because 0.09 is greater than 0.01, 'dear friend' is classified as a normal message (the arithmetic is reproduced in the sketch after this chapter list).
            • 07:30 - 09:30: Introduction to Complex Example with 'Lunch Money' Message After a recap of the steps so far, a harder message is classified: 'lunch money money money money'. Since 'money' occurs far more often in spam than in normal messages, it seems reasonable to expect the message to be classified as spam.
            • 09:30 - 12:30: Handling Zero Probability with Alpha Because 'lunch' never appears in the spam training data, P(lunch | spam) is zero, which forces the spam score to zero no matter how many times 'money' appears. To work around this, one extra count, referred to with the Greek letter alpha (here alpha = 1), is added to every word in both histograms so that no probability is ever zero; for example, P(lunch | spam) becomes 1 / (7 + 4) ≈ 0.09. Adding counts does not change the priors, and with smoothing the spam score now exceeds the normal score, so the message is classified as spam.
            • 12:30 - 14:30: Why Naive Bayes is Naive Naive Bayes is 'naive' because it treats all word orders the same: 'dear friend' and 'friend dear' receive identical scores. It ignores grammar and common phrases and treats language as a bag of words, yet it still separates normal messages from spam surprisingly well; in machine-learning terms, it has high bias but low variance.
            • 14:30 - 15:30: Conclusion and Promotions The video closes with pointers to the bias and variance StatQuest, the Naive Bayes study guide, and other ways to support StatQuest.
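
For reference, the arithmetic behind the 'dear friend' decision can be reproduced in a few lines. The spam likelihood for 'friend' is not stated explicitly in the summary; 1/7 ≈ 0.14 is assumed here because it matches the 0.01 spam score quoted in the video:

```python
# Priors estimated from the 12 training messages (8 normal, 4 spam).
p_normal, p_spam = 8 / 12, 4 / 12

# Likelihoods from the word histograms (17 words in the normal messages, 7 in the spam).
p_dear_normal, p_friend_normal = 8 / 17, 5 / 17   # 0.47 and 0.29 in the video
p_dear_spam,   p_friend_spam   = 2 / 7,  1 / 7    # 0.29; the 1/7 for friend is assumed

normal_score = p_normal * p_dear_normal * p_friend_normal   # ~0.09
spam_score   = p_spam   * p_dear_spam   * p_friend_spam     # ~0.01

label = "normal" if normal_score > spam_score else "spam"
print(f"{normal_score:.2f} vs {spam_score:.2f} -> classified as {label}")
```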

            Naive Bayes, Clearly Explained!!! Transcription

            • 00:00 - 00:30 I'm at home during lockdown working on my StatQuest yeah I'm at home during lockdown working on my StatQuest yeah StatQuest hello I'm Josh Starmer welcome to StatQuest today we're gonna talk about naive Bayes and it's gonna be clearly explained this StatQuest is sponsored by JADBio just add data and their automatic machine
            • 00:30 - 01:00 learning algorithms will do the rest of the work for you for more details follow the link in the pinned comment below note when most people want to learn about naive Bayes they want to learn about the multinomial naive bayes classifier and that's what we talk about in this video however just know that there is another commonly used version of naive Bayes called Gaussian naive Bayes classification and I cover that in a
            • 01:00 - 01:30 follow-up StatQuest so check that one out when you're done with this quest BAM now imagine we received normal messages from friends and family and we also received spam unwanted messages that are usually scams or unsolicited advertisements and we wanted to filter out the spam messages so the first thing we do is make a histogram of all the words that occur in the normal messages
            • 01:30 - 02:00 from friends and family we can use the histogram to calculate the probabilities of seeing each word given that it was in a normal message for example the probability we see the word dear given that we saw it in a normal message is 8 the total number of times dear occurred in normal messages divided by 17 the total number of words in all of
            • 02:00 - 02:30 the normal messages and that gives us 0.47 so let's put that over the word dear so we don't forget it likewise the probability that we see the word friend given that we saw it in a normal message is 5 the total number of times friend occurred in normal messages divided by 17 the total number of words in all of the normal messages and that
            • 02:30 - 03:00 gives us zero point two nine so let's put that over the word friend so we don't forget it likewise the probability that we see the word lunch given that it is in a normal message is 0.18 and the probability that we see the word money given that it is in a normal message is 0.06 now we make a histogram of all the words that occur in the spam
            • 03:00 - 03:30 and calculate the probability of seeing the word dear given that we saw it in the spam and that is 2 the number of times we saw dear in the spam divided by 7 the total number of words in the spam and that gives us zero point two nine likewise we calculate the probability of seeing the remaining words given that they were in the spam BAM now because
            • 03:30 - 04:00 these histograms are taking up a lot of space let's get rid of them but keep the probabilities oh no it's the dreaded terminology alert because we have calculated the probabilities of discrete individual words and not the probability of something continuous like weight or height these probabilities are also called likelihoods
            • 04:00 - 04:30 I mention this because some tutorials say these are probabilities and others say they are likelihoods in this case the terms are interchangeable so don't sweat it we'll talk more about probabilities versus likelihoods when we talk about Gaussian naive Bayes in the follow-up Quest now imagine we got a new message that said dear friend and we want to decide
            • 04:30 - 05:00 if it is a normal message or spam we start with an initial guess about the probability that any message regardless of what it says is a normal message this guess can be any probability that we want but a common guess is estimated from the training data for example since 8 of the 12 messages are normal messages our initial guess will be 0.67 so let's
            • 05:00 - 05:30 put that under the normal messages so we don't forget it oh no it's another dreaded terminology alert the initial guess that we observe a normal message is called a prior probability now we multiply the initial guess by the probability that the word dear occurs in a normal message and the probability that the word friend occurs in a normal message now we just plug in the values that we've worked out earlier and do the math
            • 05:30 - 06:00 beep-boop beep-boop and we get 0.09 we can think of 0.09 as the score that dear friend gets if it is a normal message however technically it is proportional to the probability that the message is normal given that it says dear friend so let's put that on top of the normal messages so we don't forget
            • 06:00 - 06:30 now just like we did before we start with an initial guess about the probability that any message regardless of what it says is spam and just like before the guess can be any probability we want but a common guess is estimated from the training data and since four of the twelve messages are spam our initial guess will be 0.33 so let's put that under the spam so we
            • 06:30 - 07:00 don't forget it now we multiply that initial guess by the probability that the word dear occurs in spam and the probability that the word friend occurs in spam now we just plug in the values that we worked out earlier and do the math BIP BIP BIP BIP BIP and we get 0.01 like before we can think of 0.01 as the
            • 07:00 - 07:30 score that dear friend gets if it is spam however technically it is proportional to the probability that the message is spam given that it says dear friend and because the score we got for the normal message 0.09 is greater than the score we got for spam 0.01 we will decide that dear friend is a normal message double
            • 07:30 - 08:00 BAM now before we move on to a slightly more complex situation let's review what we've done so far we started with histograms of all the words in the normal messages and all of the words in the spam then we calculated the probabilities of seeing each word given that we saw the word in either a normal message or spam then we made an initial guess about the probability of seeing a
            • 08:00 - 08:30 normal message this guess can be anything between zero and one but we based ours on the classifications in the training data set then we made the same sort of guess about the probability of seeing spam then we multiplied our initial guess that the message was normal by the probabilities of seeing the words dear and friend given that the message was normal then we multiplied our initial
            • 08:30 - 09:00 guess that the message was spam by the probabilities of seeing the words dear and friend given that the message was spam then we did the math and decided that dear friend was a normal message because 0.09 is greater than 0.01 now that we understand the basics of how naive Bayes classification works let's look at a slightly more complicated example
            • 09:00 - 09:30 this time let's try to classify this message lunch money money money money note this message contains the word money four times and since the probability of seeing the word money is much higher in spam than in normal messages then it seems reasonable to predict that this message will end up being spam so let's do the math calculating the score for a normal
            • 09:30 - 10:00 message works just like before we start with the initial guess then we multiply it by the probability we see lunch given that it is in a normal message and the probability we see money four times given that it is in a normal message when we do the math we get this tiny number however when we do the same calculation for spam we get zero
            • 10:00 - 10:30 this is because the probability we see lunch in spam is zero since it was not in the training data and when we plug in zero for the probability we see lunch given that it was in spam then it doesn't matter what value we picked for the initial guess that the message was spam and it doesn't matter what the probability is that we see money given that the message was spam because
            • 10:30 - 11:00 anything times zero is zero in other words if a message contains the word lunch it will not be classified as spam and that means we will always classify the messages with lunch in them as normal no matter how many times we see the word money and that's a problem to work around this problem people usually add one count represented by a
            • 11:00 - 11:30 black box to each word in the histograms note the number of counts we add to each word is typically referred to with the Greek letter alpha in this case alpha equals one but we could have set it to anything anyway now when we calculate the probabilities of observing each word we never get 0 for example the probability of seeing lunch given that
            • 11:30 - 12:00 it is in spam is 1 divided by 7 the total number of words in spam plus 4 the extra counts that we added and that gives us 0.09 note adding counts to each word does not change our initial guess that a message is normal or the initial guess that the message is spam because adding a count to each word did not change the number of messages in
            • 12:00 - 12:30 the training data set that are normal or the number of messages that are spam now when we calculate the scores for this message we still get a small number for the normal message but now when we calculate the value for spam we get a value greater than zero and since the value for spam is greater than the one for a normal message we classify the message as spam spam
            • 12:30 - 13:00 now let's talk about why naive Bayes is naive the thing that makes naive Bayes so naive is that it treats all word orders the same for example the normal message score for the phrase dear friend is the exact same as the score for friend dear in other words regardless of
            • 13:00 - 13:30 how the words are ordered we get 0.08 treating all word orders equally is very different from how you and I communicate every language has grammar rules and common phrases but naive Bayes ignores all of that stuff instead naive Bayes treats language like it is just a bag full of words and each message is a random handful of them naive Bayes ignores all the rules
            • 13:30 - 14:00 because keeping track of every single reasonable phrase in a language would be impossible that said even though naive Bayes is naive it tends to perform surprisingly well when separating normal messages from spam in machine learning lingo we'd say that by ignoring relationships among words naive Bayes has high bias but because it works well in practice naive Bayes has
            • 14:00 - 14:30 low variance shameless self-promotion if you are not already familiar with the terms bias and variance check out the Quest the link is in the description below triple spam oh no it's one last shameless self-promotion one awesome way to support StatQuest is to purchase the naive Bayes StatQuest study guide it has
            • 14:30 - 15:00 everything you need to study for an exam or job interview it's eight pages of total awesomeness and while you're there check out the other StatQuest study guides there's something for everyone hooray we've made it to the end of another exciting StatQuest if you liked this StatQuest and want to see more please subscribe and if you want to support StatQuest consider contributing to my Patreon campaign becoming a channel member buying one or
            • 15:00 - 15:30 two of my original songs or a t-shirt or a hoodie or just donate the links are in the description below alright until next time quest on
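
The zero-probability fix described around 09:30-12:00 in the transcript can be reproduced the same way. The spam counts for 'friend' and 'money' are not spelled out individually, so 1 and 4 are assumed here to make the spam histogram total the 7 words mentioned; this is a sketch, not StatQuest's own code:

```python
alpha = 1  # one extra count added to every word in both histograms (Laplace smoothing)

normal_counts = {"dear": 8, "friend": 5, "lunch": 3, "money": 1}  # 17 words in total
spam_counts   = {"dear": 2, "friend": 1, "lunch": 0, "money": 4}  # 7 words in total (friend and money assumed)

def smoothed_likelihoods(counts):
    # Add alpha to every word so that no likelihood is ever exactly zero.
    total = sum(counts.values()) + alpha * len(counts)
    return {word: (count + alpha) / total for word, count in counts.items()}

p_word_given_normal = smoothed_likelihoods(normal_counts)  # P(lunch | normal) = 4/21
p_word_given_spam   = smoothed_likelihoods(spam_counts)    # P(lunch | spam)  = 1/11, about 0.09

def score(words, prior, likelihoods):
    result = prior
    for word in words:
        result *= likelihoods[word]
    return result

message = ["lunch", "money", "money", "money", "money"]
normal_score = score(message, 8 / 12, p_word_given_normal)  # a tiny number
spam_score   = score(message, 4 / 12, p_word_given_spam)    # greater than zero thanks to smoothing

print(normal_score, spam_score)  # spam_score is larger, so the message is classified as spam
```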