Speech Based Emotion Recognition #CH30SP #swayamprabha

Estimated read time: 1:20


    Summary

    In this engaging lecture, Abhinav Dhall from IIT Ropar discusses the importance of understanding emotions through speech analysis as part of affective computing. Emphasizing voice as a critical modality, the lecture delves into how machines can interpret human emotions based on vocal cues. The talk covers real-life applications ranging from man-machine interaction to mental health diagnostics and addresses the challenges of voice data collection and emotion recognition across different languages and cultural contexts. Additionally, it highlights the need for better emotional speech synthesis and cross-lingual emotion recognition technologies.

      Highlights

      • Abhinav Dhall introduces Affective Computing, focusing on voice-based emotion recognition. 🎓
      • Voice modality can significantly convey user emotions, even without visual cues. 👂
      • Practical applications include enhancing man-machine interaction and aiding mental health diagnostics. 💡
      • Research challenges include cross-lingual recognition and preserving privacy in data handling. 🛡️
      • Emerging areas like emotional speech synthesis are advancing, yet still face technical hurdles. 🚀

      Key Takeaways

      • Voice is a powerful modality for emotion recognition in affective computing. 🎤
      • Understanding emotions through speech has various applications, from tech to health. 🧠
      • Challenges include data collection, language barriers, and privacy concerns. 🔍
      • Emotional speech synthesis and cross-lingual recognition are emerging fields. 🌍
      • Continuous exploration is needed to improve emotion-based human-machine interactions. 🤖

      Overview

      Abhinav Dhall from IIT Ropar leads an insightful lecture on speech-based emotion recognition, a key aspect of affective computing. He emphasizes the unique role of voice as a medium to understand emotions, even when visual cues are absent, highlighting its importance across multiple applications from tech to healthcare.

        The lecture dives into practical scenarios like smart devices interpreting user emotions to tailor responses, and the role of voice in understanding mental health states. There's a focus on existing challenges of voice emotion datasets, cultural differences in expression, and privacy concerns that impact the collection and processing of such sensitive data.

          Advanced topics are explored such as emotional speech synthesis, which aims to create more emotionally aware responses from machines. The speaker touches on the limitations of current technologies in recognizing emotions across languages and explains ongoing research efforts to bridge these gaps, paving the way for more intuitive human-machine interfaces.

            Chapters

            • 00:00 - 00:30: Introduction to Affective Computing The chapter "Introduction to Affective Computing" begins with a warm welcome to the audience by Abhinav Dhall from the Indian Institute of Technology, Ropar. The session opens with applause and the Swayam Prabha tagline of a digital and educated India before introducing the lecture series on affective computing.
            • 00:30 - 01:00: The Importance of Speech in Emotion Recognition The chapter discusses the significance of speech in recognizing emotions. It focuses on how analyzing voice can help in understanding a user's emotions. The lecture aims to cover various aspects of the voice modality, beginning with examples that demonstrate why speech is a crucial component in emotional recognition.
            • 01:00 - 03:00: Applications of Speech-Based Emotion Recognition The chapter begins with a focus on understanding the user's emotions through speech and voice. It then explores several applications where speech-based emotion recognition and affective computing are already in use. The latter part of the chapter introduces the first essential component needed to build a voice-based affective computing system: labeled datasets. Additionally, the chapter discusses various types of labeled data and their importance.
            • 03:00 - 05:00: Challenges in Emotion Recognition through Voice The chapter explores the difficulties involved in recognizing emotions through voice. It mentions the importance of data attributes and recording conditions in capturing accurate emotional expressions. An example is provided where the speaker describes their feelings, illustrating how subtle variations in voice and expression can affect emotion recognition.
            • 05:00 - 06:00: Voice-Based Affect Computing Systems The chapter discusses scenarios where voice-based affect computing systems are used to detect emotions. It gives an example of two different situations: in the first, the speaker's face is clearly visible and the voice is clear; in the second, the face is partially visible but the voice remains clear. The speaker mentions that in the first case, they were showing a neutral expression. This highlights how such systems may rely on a combination of vocal and facial cues to interpret emotions accurately, even when some visual information is obscured.
            • 06:00 - 08:00: Speech Emotion Databases and Annotation In 'Speech Emotion Databases and Annotation', the chapter explores the significance of utilizing voice as a key modality in affective computing. It illustrates an example where despite the individual's face being downwards and not visible to a camera, the speech conveyed a happy emotion. This emphasizes the potential of voice in analyzing emotions effectively.
            • 08:00 - 10:00: Limitations and Challenges in Data Sets This chapter discusses the limitations and challenges encountered in data sets. It uses the analogy of understanding a friend's emotions through facial expressions and speech to illustrate the complexity of analyzing various aspects of data, particularly focusing on voice analysis. The chapter emphasizes the importance of considering multiple modalities in data interpretation to gain a comprehensive understanding.
            • 10:00 - 11:00: Conclusion: Overview of Speech-Based Emotion Recognition The chapter provides an overview of speech-based emotion recognition. It includes an example from the audio-visual group affect dataset to illustrate how emotion can be interpreted through audiovisual cues. The example emphasizes the observation of body language in the subjects as a key aspect of emotion detection.

            Speech Based Emotion Recognition #CH30SP #swayamprabha Transcription

            • 00:00 - 00:30 [Applause] Swayam Prabha digital India educated India hello and welcome I am Abhinav Dhall from the Indian Institute of Technology Ropar friends this is the lecture in the series of
            • 00:30 - 01:00 Affective Computing today we will be discussing how we can recognize emotions of a user by analyzing the voice so we will be fixating on the voice modality so the content which we will be covering in this lecture is as follows first I will introduce to you give you some examples about why speech is an extremely important cue for us to
            • 01:00 - 01:30 understand the emotion of a user then we will discuss several applications where speech and voice based affective computing are already being used and then we will switch gears and we will talk about the first component which is required to create a voice based affective computing system which is labeled data sets and in this pursuit we will also discuss the different
            • 01:30 - 02:00 attributes of the data and the conditions in which it has been recorded now if you look at me let's say I have to say a statement about how I am feeling today and I say well today is a nice day and I am feeling contented now let me look down a bit
            • 02:00 - 02:30 and I say today is a wonderful day and I'm feeling great now in the first case you could hear me and you could see my face very clearly and in the second case my face was partially visible but you could hear me clearly and I'm sure you can make out that in the first case I was showing neutral expression
            • 02:30 - 03:00 and in the second case even though my face was facing downwards not directly looking into the camera I was sounding to be more positive right so there was a more happy emotion which could be heard from my speech so this is one of the reasons why we are using voice as one of the primary modalities in affective computing
            • 03:00 - 03:30 you talk to a friend you understand their facial expressions you look at the person's face but in parallel you're also listening to what that person is speaking so you can actually tell how that person is feeling from their speech and that is why we would be looking at different aspects of how voice can be used in this
            • 03:30 - 04:00 environment so here is an example so this is a video which I'm going to play from the audio visual group affect data set so let me play the video protects the other delivers protects the other delivers so if you notice in this case the body language of the subjects
            • 04:00 - 04:30 here that is trying to be a bit aggressive so this looks like a training video but if you hear the voice over the explanation voice in this video you can tell that there is no fight going on there is no aggressive behavior it is simply a training going on and how are we able to find that by simply looking at the tonality of The Voice
            • 04:30 - 05:00 if it was let's say actually a fight or some aggressive behavior shown by the subjects in the video and the voice was also from one of the subjects we would also hear a similar pattern which would tell us that let's say the subjects could be angry but in this case even though the body language facial expression says that they are in an aggressive pose but from The Voice we can tell that this is actually a training video so so it's the
            • 05:00 - 05:30 environment is actually neutral now let's look at and hear another video now in this case the video has been blacked out you can hear the audio and you can tell that there are several subjects in the audio video sample and the subjects are happy right how are we able to tell that we can hear the laughter now if I was to
            • 05:30 - 06:00 play the video now this is the video which we had earlier blacked out so you can look at of course the facial expressions but even without looking at the facial expressions and the gestures just by hearing you can tell that the subjects are happy so this gives us enough motivation to actually pursue voice as a primary modality in affective computing now as I mentioned earlier here there are a large number of
            • 06:00 - 06:30 applications where speech of the user that is being analyzed to understand the affect and friends this is similar to how when we discussed in the last lecture about facial expression analysis the several applications were there in health and in education we find the similar use cases for voice based affect but
            • 06:30 - 07:00 which are applicable in different circumstances in scenarios where it could be non-trivial to have a camera look at the user of course there's a privacy concern which comes with the camera as well so instead we can use microphones and we can analyze the spoken speech and the information which is there in the background
            • 07:00 - 07:30 now the first and quite obvious application of voice in affective computing is understanding the man machine interaction on a natural basis what does that mean let's say there is a social robot now the robot is greeting the user and the user has just entered into the room the robot greets the user
            • 07:30 - 08:00 and the user replies back now based on the voice of the user and the expression which is being conveyed by the user the machine which is the robot in this case is able to understand the emotion of the user and then the feedback to the user can be based on the emotional state let's say if the user is not so cheerful so the
            • 08:00 - 08:30 robot reacts accordingly and then tries to follow up with a question which could either make the user more comfortable relaxed or the robot tries to investigate a bit so that it can have a conversation which is appropriate with respect to the emotion of the user
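
            To make the robot-feedback idea above concrete, here is a minimal Python sketch of mapping a recognized emotion to an appropriate reply. The `predict_emotion` placeholder and the reply texts are illustrative assumptions, not part of the lecture or any specific system.

```python
# Minimal sketch: adapt a social robot's greeting reply to the user's detected emotion.
# `predict_emotion` is a placeholder for any trained speech emotion recognizer.

def predict_emotion(audio_path: str) -> str:
    """Placeholder: run a trained speech emotion recognizer on the user's reply."""
    raise NotImplementedError  # e.g. acoustic features + a classifier would go here

def choose_reply(emotion: str) -> str:
    """Pick a feedback style appropriate to the detected emotional state."""
    replies = {
        "happy":   "Great to see you in a good mood! Anything fun planned today?",
        "neutral": "Welcome back. How was your day?",
        "sad":     "You sound a bit low. Would you like to talk about it?",
        "angry":   "I sense some frustration. Shall I put on some calming music?",
    }
    return replies.get(emotion, "Welcome back. How can I help?")

if __name__ == "__main__":
    detected = "sad"  # in practice: predict_emotion("user_reply.wav")
    print(choose_reply(detected))
```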
            • 08:30 - 09:00 the second use case we see is in entertainment particularly looking at content such as movies so friends in this case we are talking about the aspect of indexing let's say you have a large repository of movies okay so these are let's say a large repository now the user
            • 09:00 - 09:30 wants to search let's say the user wants to search all those videos which are belonging to a happy event okay you can think of it as a set of videos in your phone's gallery and you want to fetch those videos which let's say are from events such as birthdays which are generally cheerful right so from this audio visual sample we can analyze the audio which would mean the spoken content
            • 09:30 - 10:00 by the subject and the background voices could be music so we can analyze and get the emotion now this emotion information it can be stored as a
            • 10:00 - 10:30 metadata into this Gallery so let's say the user searches for all the happy videos we look through the metadata which tells us that when we analyze the audio these are the particular audio video samples which based on their spoken content and the background Voice
            • 10:30 - 11:00 or music sound cheerful so the same is then shown to the user now moving on to another very important application here let me first clear the screen a bit so looking at the aspects of operator safety let's say there is a driver and the driver is operating a complex
            • 11:00 - 11:30 heavy machinery you can think of an environment for example in mining where a driver is handling big machinery which has several controls what does that imply well harsh working environment a large number of controls of the machine and the high cost of error so the driver
            • 11:30 - 12:00 would be required to be attentive right now you can clearly understand the state of the driver by listening to what they are speaking and how they're speaking from The Voice pattern one could easily figure out things such as if the person is sounding tired
            • 12:00 - 12:30 is not attentive has negative emotion so if these attributes can be figured out the machine let's say the car or the mining machine it can give a feedback to the user an example feedback can be please take a
            • 12:30 - 13:00 break right before any accident happens please take a break because when I analyzed your voice I could figure out that you sounded tired distracted or an indication of some negative emotion which can hamper the productivity and affect the safety of the user and of the people who are are in this environment
            • 13:00 - 13:30 around the user now friends the other extremely important aspect of where we use this voice based affect is in the case of health and well-being now an example of that which is right now being experimented with in a large number of academic and industrial labs is looking at the mental
            • 13:30 - 14:00 health through the voice patterns so an example of that is let's say we want to analyze data of patients and healthy controls who are in a study where the patients have been clinically diagnosed with unipolar
            • 14:00 - 14:30 depression so when we would observe the psychomotor retardation which I briefly mentioned in the facial expression recognition based lecture as well the changes in the speech in terms of let's say the frequency of words which are spoken the intensity the pitch you could train a machine learning model which can predict the intensity of
            • 14:30 - 15:00 depression similarly from the same perspective of objective diagnostic tools which can assist clinicians let's say there is a patient with ADHD so when a clinician or an expert is interacting with the patient we can
            • 15:00 - 15:30 record the speech of the interaction we can record the voices and then we can analyze how the patient was responding to the expert to the clinician and what was the emotion which was elicited when a particular question was asked that can give very vital useful information to the
            • 15:30 - 16:00 clinician now another aspect where voice based affective computing is being used is for automatic translation systems now in this case a speaker would be let's say communicating with another party through a translator right so let me give you an example to understand this let's say we have speakers of
            • 16:00 - 16:30 language one you know group of people who are in a negotiation deal trying to negotiate a deal with group of people who speak language two and both parties do not really understand each other's language now here comes a translator could be a machine could be a real person who is listening to group one translating to group two and vice versa now along with the task of
            • 16:30 - 17:00 translation from language one to language two and vice versa there is a very subtle yet extremely important information which the translator needs to convey since the scenario is about negotiation let's say a deal is being cracked the emotional aspect of what is the emotion which is conveyed when the speakers of language one are trying to make a point to the other team that also
            • 17:00 - 17:30 needs to be conveyed and based on this simply by understanding the emotional and the behavioral part one could indicate one could understand if let's say the communication is clear and if the two parties are going in the direction as intended you can think of it as an interrogation scenario as well let's say the interrogator speaks
            • 17:30 - 18:00 another language and the person who's being interrogated speaks another language right so how do we understand that and what is the direction of communication are they actually able to understand each other and when the context of the communication has changed all of a sudden a person let's say who was cooperating is now not cooperating but speaks another language so that is where we analyze the voice when you analyze the voice you can understand the emotion and that is a very extremely useful cue in this kind of dyadic
            • 18:00 - 18:30 conversation or multiparty interaction and of course in this case the same is applicable to the human machine interaction as well across different languages friends also another use case is mobile communication so let's say you are talking over a device could be using a mobile phone now from a strictly privacy aware health
            • 18:30 - 19:00 and well-being perspective can the device compute the emotional state of the user and then let's say after the call or communication is over maybe in a subtle way suggest some feedback to the user to to perhaps let's say calm down or simply indicate
            • 19:00 - 19:30 that you have been using the device for n number of hours this is actually quite long you may like to take a break right now of course you know in all these kinds of passive analysis of the emotion of the user the privacy aspect is extremely important so either that information is analyzed used as is on the device
            • 19:30 - 20:00 and the user is also aware that there is a feature like this on the device or it could be something which is prescribed suggested to the user by an expert so the confidentiality and privacy need to be taken care of now this is a very interesting aspect friends on one end we were saying well when you use a camera to understand
            • 20:00 - 20:30 the facial expression of a person there is a major concern with the privacy therefore a microphone could be a better sensor so analysis of voice could be a better medium however the same applies to your voice based analysis as well because we can analyze the identity
            • 20:30 - 21:00 of the subject through the voice and also when you speak there could be personal information so where the processing has to be done to understand the affect through voice is it on the device of the user where is it stored so these are all extremely important considerations which come into the picture when we are talking about these applications
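
            As a rough illustration of the health and well-being use case discussed above, the sketch below extracts a few prosodic descriptors (pitch and energy statistics) with librosa and fits a simple regressor against clinician-assigned severity scores. The file names, scores, and feature choice are hypothetical and only stand in for the kind of pipeline the lecture alludes to.

```python
# Minimal sketch: prosodic features from speech -> regressor for a severity score.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

def prosodic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a tiny feature vector: pitch level/variability and energy level/variability."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)    # frame-wise fundamental frequency (Hz)
    rms = librosa.feature.rms(y=y)[0]                # frame-wise energy
    return np.array([np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()])

if __name__ == "__main__":
    # Hypothetical recordings of patients / healthy controls with severity scores.
    clips = ["patient_01.wav", "patient_02.wav", "control_01.wav"]
    severity = [14.0, 9.0, 2.0]                      # e.g. questionnaire-based scores
    X = np.stack([prosodic_features(p) for p in clips])
    model = Ridge(alpha=1.0).fit(X, severity)
    print(model.predict(X[:1]))                      # predicted severity for the first clip
```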
            • 21:00 - 21:30 now let us discuss some difficulties in understanding the emotional state through voice so according to Bordon and others there are three major factors which are the challenges in understanding the emotion through voice the first is what is
            • 21:30 - 22:00 said now this is about the information of the linguistic origin and depends on the way of pronunciation of words as representatives of the language what did the person actually say right the content for example I am feeling happy today right so the content what is being spoken the interpretation of this based on the pronunciation of
            • 22:00 - 22:30 the speaker that could vary if that varies if there is any noise in the understanding of this content which is being spoken then that can lead to noisy interpretation of the emotion as well the second part the second challenge is how it is said you know
            • 22:30 - 23:00 how is a particular statement said now this carries again paralinguistic information which is related to the speaker's emotional state an example is let's say you were in discussion with a friend and you asked okay do you agree to what I'm saying the person replies in scenario one yes yes I agree to what you are saying in scenario 2 the person
            • 23:00 - 23:30 says yes I agree now in these two examples there is a difference right the difference in which how the same words were said the difference was the emotion let's say the confidence in this particular example of how the person agreed if the person agreed or not or was a bit you know hesitant
            • 23:30 - 24:00 so we have to understand how the content is being spoken which would indicate the emotion of the speaker now looking at the third challenge third difficulty in understanding emotion from voice which is who says it okay so this means you know the the cumulative information regarding the speaker's uh basic
            • 24:00 - 24:30 attributes and features for example uh the age uh gender and even body size so in this case let's say a young individual saying I'm not feeling any pain you know as an example versus an individual uh you know uh adult saying I'm not feeling any pain
            • 24:30 - 25:00 right a young individual versus an adult speaking the same content I am not feeling any pain maybe the young individual is a bit hesitant maybe the adult who is speaking this is too cautious so what that means is the attributes the characteristics of the speaker which are not only based on just their
            • 25:00 - 25:30 age gender and body type but also their cultural context so in some cultures it could be a bit frowned upon to express a certain type of emotion in a particular context right so that means if we want to understand the emotion through voice of a user from a particular culture or particular age
            • 25:30 - 26:00 range we need to equip our system our affective computing system with this meta information so that the machine learning model then could be made aware during the training itself that there could be differences in the emotional state of the user based on their
            • 26:00 - 26:30 background their cultures so this means to understand emotion we need to be able to understand what is spoken okay so you can think of it as speech to text conversion then how it is said a very
            • 26:30 - 27:00 trivial way to explain would be you got the text uh what was the duration in which the same was said were there any breaks were there any umms and repetition of the same words you know so that would indicate how it is being said and then the attributes of the speaker so we would require all this information when we would be designing
            • 27:00 - 27:30 this voice based affect Computing system now as we have discussed earlier when we were talking about facial expression analysis through cameras the extremely important requirement for creating a system is access to data which has these examples which you
            • 27:30 - 28:00 could use to train a system right now when we are talking about voice based affect then there are three kinds of databases you know the three broad categories of databases which are existing which have been proposed in the community now the attributes of these databases are essentially based
            • 28:00 - 28:30 on how the emotion has been elicited so we'll see what does that mean and what is the context in which the participants of the database have been recorded okay so let's look at the categories so the first
            • 28:30 - 29:00 is natural data okay now this you can very easily link to facial expressions again we are talking about spontaneous speech in this case spontaneous speech is what you are let's say creating a group discussion you give a topic to the participants and then they start discussing on that
            • 29:00 - 29:30 let's say they are not provided with much constraints it is supposed to be a free form discussion and during that discussion within the group participants you record the data okay so that would be spontaneous replies spontaneous questions and within that we will have the emotion which is represented by a particular speaker now other environments scenarios where
            • 29:30 - 30:00 you could get this kind of spontaneous speech data which is reminiscent of representative of a natural environment is for example also in call center conversations okay so in this case you know let's say a customer calls in there's a call center representative a conversation goes on and if it's a real one then that could give you spontaneous speech similarly you could have you know cockpit recordings during abnormal
            • 30:00 - 30:30 conditions now in this case what happens right let's say there is an adverse condition there is an abnormal condition the pilot or the user they would be communicating based on you know how they would generally communicate when they're under stress and in that whole exercise we would get the emotional speech right then also conversation between a patient and a doctor and I already gave you an example right when we're talking about how voice could be used for affect
            • 30:30 - 31:00 computing in the case of health and well-being right a doctor asking questions to a patient a patient replying to the questions of the doctor and in that case you know we would have these conversations about emotions same goes for you know these communications which could be happening in public places as
            • 31:00 - 31:30 well now the other category for the voice based data sets friends is simulated or acted now in this case the speech utterances the voice patterns they are collected from experienced trained professional artists so in this case you know you would have let's say uh actors who would be coming to a recording studio and then they could be given a
            • 31:30 - 32:00 script or a topic to speak about and you know that data would be recorded now there are several advantages when it comes to the simulated data the advantage is well it's relatively easier to collect as compared to natural data since the speakers are already you know informed about the
            • 32:00 - 32:30 content or the theme which they are supposed to speak they also have given an agreement so you know the as compared to your natural data privacy could be better handled in this case or I should say easier to handle now the issue of course is when you're talking about simulated data acted
            • 32:30 - 33:00 data then not all examples which you are capturing in your data set could be the best examples of how the user behavior will be in the real world now the third category friends is elicited emotion which is induced so in this case an example of course you know let's say you show a
            • 33:00 - 33:30 stimulus a video which contains positive or negative affect and after the user has seen the video you could ask them to answer certain questions about that video and the assumption is that the stimulus would have elicited some emotion in the user right and that affect would be
            • 33:30 - 34:00 represented shown when the speaker the user in the study is answering questions now let's look at databases which are very actively used in the community the first is the Aibo database by Batliner and others now this contains the interaction between children and the Aibo
            • 34:00 - 34:30 robot contains 110 dialogues and the emotion categories the labels are anger boredom emphatic helpless ironic and so forth so the children are interacting with this robot the robot is a cute you know Sony Aibo dog robot so the assumption here is that the participant would get a bit comfortable with the
            • 34:30 - 35:00 robot and then emotion would be elicited within the participant and we can have these labels these emotion categories you know labeled afterwards into the data which has been recorded during the interaction between the robot and the children the other data set which is very commonly used is the Berlin database of emotional speech which
            • 35:00 - 35:30 was proposed by Burkhardt and others in 2005 now this contains 10 subjects and you know the data consists of 10 German sentences which are now recorded in different emotions notice this one is the acted one okay so this is the acted type of data set wherein the content was already provided to the participants the actors and you know they
            • 35:30 - 36:00 try to speak it in different emotions so what this means is now the quality of emotions which are reflected are based on the content and the quality of acting by the participant now friends the third data set is the Ryerson audio visual
            • 36:00 - 36:30 database of emotional speech and song now again this is an acted data set contains professional actors and these actors were given the task of vocalizing two statements in a North American accent now of course you know again the cultural context is coming into the picture as well you know this is also acted and if you compare that with the first data set the Aibo data then that was more spontaneous okay
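
            As a small, hypothetical example of how one of the acted corpora above might be organised for experiments, the sketch below assumes a metadata file `labels.csv` with columns path, speaker, and emotion, and performs a speaker-independent train/test split so that no actor appears in both partitions.

```python
# Minimal sketch: speaker-independent split of a labeled speech emotion corpus.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("labels.csv")  # assumed columns: path, speaker, emotion
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(meta, groups=meta["speaker"]))

train, test = meta.iloc[train_idx], meta.iloc[test_idx]
print(len(train), "training clips,", len(test), "test clips")
print("train speakers:", sorted(train["speaker"].unique()))
print("test speakers: ", sorted(test["speaker"].unique()))
```

            Splitting by speaker rather than by clip is one way to surface the limited-speaker variability issue that the lecture raises later.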
            • 36:30 - 37:00 of course you know you would understand that getting this type of interaction is non-trivial so it is extremely important to be careful about the privacy and all the ethics approvals which are required to be taken now in these kinds of databases where you have actors it is relatively easier to scale the database because you know you could hire actors and you can have multiple
            • 37:00 - 37:30 recording sessions and you can give different content as well now moving on to other databases so friends the next is the IITKGP-SESC data set which was proposed in 2009 by Koolagudi and others again an acted data set 10 professional artists now this is a non-English data set it's in an Indian language
            • 37:30 - 38:00 Telugu now each artist participant here they spoke 15 sentences trying to represent eight basic emotions in one session another data set is again from the same lab called the IITKGP-SEHSC again by Rao and Koolagudi and others proposed in 2011 now in this case you again have 10 professional
            • 38:00 - 38:30 actors but the recording is coming from radio jockeys okay so from All India Radio so these are extremely good speakers and the emotions are eight categories again acted but if you have high quality actors then the assumption is that we would be able to get emotional speech as directed during the creation of the data set now moving forward
            • 38:30 - 39:00 friends the number of utterances here you know this is a fairly good size data set 15 sentences eight emotions and then 10 artists and they were recording 10 sessions so we have 12,000 samples which are available for the learning of an emotional speech analysis system now in the
            • 39:00 - 39:30 community there are several projects going on now they are looking at different aspects of affect and behavior so one such is the EMPATHIC grant in the EU so there as well you know there are these data set resources which are used for analysis of emotions in speech and another extremely useful very commonly used platform is the
            • 39:30 - 40:00 Computational Paralinguistics Challenge platform by Schuller and Batliner so this is actually hosted as part of a conference called Interspeech now this is a very reputed conference in speech analysis so in the ComParE benchmarking challenge the
            • 40:00 - 40:30 organizers have been proposing every year different sub challenges which are related to speech analysis and a large number are related to emotion analysis and different tasks and different settings in which we would like to understand the emotion of a user or a group of users now there would be some acted and some spontaneous uh data sets which
            • 40:30 - 41:00 are available on this benchmarking platform now moving on from the speech databases let's say the databases have been collected could be acted could be spontaneous the next task is to generate the annotations the labels of course before the decoding is done
            • 41:00 - 41:30 the design of the experiment would already consider the type of emotions which are expected to be annotated generated from the data so one popularly used tool for annotation of speech is the aino tool now here is a quick go-through of how the labeling is done let's say friends here is the waveform
            • 41:30 - 42:00 representation the labeler would listen to the particular chunk let's say this chunk can relisten can move forward backwards and they also could have access to the transcript what is being spoken during this time so what they can do is you know they can then add the emotion which they interpret
            • 42:00 - 42:30 from this audio data point and they can label you know different things such as the topic of the spoken text and also things such as you know who is the speaker and the metadata so once they listen to the content they generate the labels then they can save and then they can move forward now it is extremely important to have the right annotation tool because you may be planning to
            • 42:30 - 43:00 create a large data set representing different settings so in that case you would also have multiple labelers right so if you have multiple labelers the tool needs to be scalable and as friends we have already discussed in the facial expression recognition lectures as well if you would have multiple labelers you
            • 43:00 - 43:30 will have to look at things such as consistency for each labeler how consistent they are in the labeling process with respect to the sanity of the labels so you would like the labels to be as little affected as possible by things such as confirmation bias so after you have the database
            • 43:30 - 44:00 collected annotated by multiple labelers you may like to do the statistical analysis of the labels for the same samples where the labels are generated from multiple labelers so that at the end we have one or multiple coherent labels for that data point
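
            The label-consistency checks mentioned above can be made concrete with a small sketch using made-up annotations: Cohen's kappa to measure agreement between two labelers, and a majority vote across labelers to arrive at one coherent label per clip.

```python
# Minimal sketch: inter-labeler agreement and a simple consensus label per clip.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

labeler_a = ["happy", "neutral", "angry", "happy", "sad"]
labeler_b = ["happy", "neutral", "happy", "happy", "sad"]
labeler_c = ["happy", "neutral", "angry", "happy", "neutral"]

print("kappa(a, b):", cohen_kappa_score(labeler_a, labeler_b))

# Majority vote across the three labelers (ties would need an explicit policy).
consensus = [Counter(votes).most_common(1)[0][0]
             for votes in zip(labeler_a, labeler_b, labeler_c)]
print("consensus labels:", consensus)
```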
            • 44:00 - 44:30 now let's look at some of the limitations and this is also linked to the challenges in voice based affect analysis what we have seen till now is that there is limited work on non-English language based affect analysis you already saw the IIT KGP data sets which were around the Telugu language and then the Hindi language there are a few uh German speaking uh
            • 44:30 - 45:00 data sets as well uh Mandarin as well but they are lesser in number smaller in size as compared to English only based data sets so what that typically would mean is let's say you have a system which is analyzing the emotion of a speaker speaking in English you use that data set and then you train a system on that data
            • 45:00 - 45:30 set now you would like to test it on other users who are speaking some other language now this cross data set performance across different languages that is a big challenge right now in the community why is it a challenge because you've already seen the challenges which are there three challenges you know what is being said who said it and how it was
            • 45:30 - 46:00 said so these will vary across different languages the other is limited number of speakers so if you want to create emotion detection in a system based on voice which is supposed to work on a large number of users on a large scale you would ideally like a data set where you can
            • 46:00 - 46:30 get a large number of users in the data set so that we learn the variability which is there when we speak right different people will speak differently will have different styles of speaking and expressing emotions now with respect to the data sets of course there's a limitation based on the number of speakers which you can have there is a practicality limit let's say you wanted to create a spontaneous data
            • 46:30 - 47:00 set so if you try to increase the number of participants in the data set there could be challenges such as getting the approvals getting the approval from the participants themselves and so forth now on the same lines friends the issue is there are limited natural databases and I've already explained to you right creating spontaneous data is a challenge because if the user is aware they are being
            • 47:00 - 47:30 recorded that could add a small bias the other is the privacy concern which needs to be taken into the picture so for the spontaneous conversations you know whether the proper ethics approvals and the permissions have been taken or not you know whether the ethics based considerations are there or not so all that affects the number and size of the
            • 47:30 - 48:00 natural databases now this is fairly new but extremely relevant as of today there is not a large amount of work on emotional speech synthesis so friends till now I have been talking about you have a speech pattern of someone someone spoke the machine
            • 48:00 - 48:30 analyzed we understood the emotion but remember we have been saying right affect sensing is the first part of affective computing and then the feedback has to be there so in the case of speech we can have emotional synthesis done so the user speaks to interacts with the system the system understands the emotion of the user and then the reply back let's say that
            • 48:30 - 49:00 is also through speech that can have emotion in it as well right now with respect to the progress in synthetic data generation of course we have seen large strides in the visual domain face generation facial movement generation but comparatively there is a bit less progress in the case of emotional speech and that is due to
            • 49:00 - 49:30 of course you know the challenges which I have just mentioned above so this of course is being you know worked upon in several Labs across the globe but that is currently a challenge how to add emotion to the speech some examples you can check out uh for example uh from this link uh from uh developer.amazon.com you know there are a few Styles few emotions which are added but essentially the issue is as
            • 49:30 - 50:00 follows let's say I want to create a text to speech system which is emotion aware so I could input into let's say this TTS text to speech system the text this is the text from which I want to generate the speech and let's say as a one-hot vector the emotion as well now this will give me the emotional speech but how do you scale
            • 50:00 - 50:30 across a large number of speakers typically high quality TTS systems are subject specific you will have one subject's text to speech model of course there are newer systems which are based on machine learning techniques such as zero shot learning or one shot you know where you would require a lesser amount of data for
            • 50:30 - 51:00 training or in zero shot what you're saying is well I have the same text to speech system which has been trained for a large number of speakers along with the text and emotion I would also add the speech from the speaker for which I want the new speech to be generated based on this text input right so that is the challenge how do you scale your text to speech system across
            • 51:00 - 51:30 different speakers and have the emotion synthesized the other friends extremely important aspect which is a limitation currently is cross-lingual emotion recognition I've already given you an example when we're talking about the limited number of non-English language based emotion recognition works so you train on language one a system for
            • 51:30 - 52:00 detecting emotion test it on language two generally a large performance drop is observed but one thing to understand is let's say for some languages it is far more difficult to collect data to create databases as compared to some other languages right some languages are spoken more there are a larger number of speakers other languages could be older languages spoken by a smaller number of people so obviously creating data
            • 52:00 - 52:30 sets would be a challenge therefore in the pursuit of cross-lingual emotion recognition we would also like to have systems where let's say you train a system on language one which is very widely spoken and the assumption is that you can actually create a large data set then can we learn systems on that data
            • 52:30 - 53:00 set and later borrow and do things such as domain adaptation adapt from that learn from that borrow information and fine tune on another language where we have smaller data sets so that you know now we can do emotion recognition on data from the other smaller data set (a short sketch of this fine-tuning idea follows after the transcript) now another challenge
            • 53:00 - 53:30 limitation is this is applicable to not just voice or speech but other modalities as well when you're looking at the explanation part of why given a speech sample the system said the person is feeling happy if you use traditional machine learning systems for example your decision trees
            • 53:30 - 54:00 or support Vector machines it is a bit easier to understand why the system reached at a particular emotion why the system predicted a certain emotion however speech based emotion through deep learning so this is deep learning friends DL deep learning based methods even though it has the state-of-the-art
            • 54:00 - 54:30 performance even then the explanation part of why you reached a certain consensus based on the perceived emotion through the speech of a user that is still a very active area of research so we would like to understand why the system reached a certain point with respect to the emotion of the
            • 54:30 - 55:00 user because if you're using this information about the emotional state of the user in let's say a serious application such as health and well-being we would like to understand how the system reached that consensus so friends with this we reach the end of lecture one for the voice based emotion
            • 55:00 - 55:30 recognition and we have seen why speech analysis is important why is it useful for emotion recognition and then what are the challenges in understanding of emotion from speech from there we moved on to the different characteristics of the databases the data which is available for learning voice-based emotional recognition systems and then we concluded with the
            • 55:30 - 56:00 limitations which are currently there in voice-based Emotion recognition systems thank you
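
            Finally, here is the promised sketch of the cross-lingual adaptation idea from the lecture: train an emotion model on a well-resourced language, then fine-tune only its classification head on a small dataset from another language. The network, the commented-out checkpoint name, the feature dimensionality, and the random tensors standing in for language-two data are all hypothetical.

```python
# Minimal sketch: fine-tune a pretrained speech emotion classifier on a small
# second-language dataset by freezing the shared feature encoder.
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    def __init__(self, n_feats: int = 40, n_emotions: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_feats, 128), nn.ReLU(),
                                     nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, n_emotions)

    def forward(self, x):
        return self.head(self.encoder(x))

model = EmotionNet()
# model.load_state_dict(torch.load("language1_pretrained.pt"))  # assumed checkpoint
for p in model.encoder.parameters():          # keep the language-1 features fixed
    p.requires_grad = False

optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Tiny, made-up language-2 batch: 16 clips x 40 acoustic features, 8 emotion classes.
x, y = torch.randn(16, 40), torch.randint(0, 8, (16,))
for _ in range(20):                           # a few fine-tuning steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print("final fine-tuning loss:", loss.item())
```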