Technical Guide to CFA

Confirmatory Factor Analysis in R with lavaan

Estimated read time: 1:20

    Summary

    In this engaging seminar led by UCLA's Johnny Lin, participants are introduced to the essentials of performing Confirmatory Factor Analysis (CFA) using R and the lavaan package. The seminar, tailored for those with a quantitative psychology background or interest in latent variable modeling, thoroughly outlines the theoretical underpinnings of CFA, covering topics from variance-covariance matrices to model fitting. It provides hands-on exercises, explores different standardization methods, and highlights common pitfalls in model fitting, offering strategies to improve model accuracy. The session is enriched with interactive quizzes, diagrams, and real-world application scenarios.

      Highlights

      • Johnny Lin, a consultant at UCLA, introduces CFA and its applications in lavaan.
      • Participants are guided through the initial setup and execution of CFAs in R.
      • Key concepts like latent variables, matrix algebra, and model fit checks are explored.
      • Quizzes and discussions break down complex ideas into digestible bits.
      • Real-world datasets demonstrate the practical utility of CFA in research.

      Key Takeaways

      • CFA with lavaan in R is accessible even for beginners with a background in regression.
      • Understanding the role of latent variables and their identification is critical.
      • Model fitting involves nuanced considerations like degrees of freedom and model fit indices.
      • Interactive exercises help solidify concepts and refine analytical skills.
      • A relaxed and supportive learning environment enhances comprehension and engagement.

      Overview

      Johnny Lin, with his expertise in quantitative psychology, commenced the seminar by explaining the basics of Confirmatory Factor Analysis in R using lavaan. Emphasizing latent variable modeling, Lin shared his academic journey to contextualize the learning pathway for attendees. The seminar covered various central concepts, like understanding latent variables, structural equation modeling, and interpreting model outputs.

        Participants engaged through live coding sessions, where they learned to install necessary libraries and execute CFAs. Lin's explanations of how to set up a CFA, whether using the marker method or variance standardization, equipped participants with a foundational understanding of analyzing model structures. Exercises were strategically integrated to facilitate hands-on learning, enabling attendees to practice executing models with real-world data.

          The seminar concluded with a focus on model fitting and its challenges. Lin stressed the importance of fit measures such as the chi-square test, CFI, TLI, and RMSEA. Through quizzes and participant interaction, he encouraged problem-solving and critical thinking, ensuring that complex statistical theories were approachable. This comprehensive approach left attendees better prepared to tackle the nuances of CFA in their research endeavors.
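          For readers who want to preview the workflow before watching, here is a minimal sketch of a one-factor CFA in lavaan using the package's bundled HolzingerSwineford1939 data; the factor name, indicator names, and reported indices are illustrative stand-ins, not the seminar's SAQ code.

            library(lavaan)                                  # CFA/SEM package used in the seminar
            # one-factor CFA with three indicators; lavaan's default "marker method"
            # identifies the factor by fixing the first loading to 1
            model <- 'visual =~ x1 + x2 + x3'
            fit <- cfa(model, data = HolzingerSwineford1939)
            summary(fit, fit.measures = TRUE, standardized = TRUE)
            # the fit statistics highlighted in the seminar
            fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea"))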

            Chapters

             • 00:00 - 02:30: Introduction to Confirmatory Factor Analysis (CFA) The chapter titled 'Introduction to Confirmatory Factor Analysis (CFA)' is part of a seminar series. This is the third seminar and focuses on CFA using R and a package called lavaan. The speaker, Johnny Lin, introduces himself as a consultant from the Statistical Consulting Group at UCLA with a PhD in quantitative psychology.
            • 02:30 - 12:30: Understanding CFA and EFA The chapter introduces the topic of latent variable modeling, focusing specifically on Confirmatory Factor Analysis (CFA) and structural equation modeling. The speaker, with a background in these areas from UCLA, plans to discuss these topics during the session. Additionally, an update on the availability of a recording is provided, stating that it typically takes about a month for the recordings to be processed and uploaded.
            • 12:30 - 22:00: Covariance and Correlation Matrices The chapter discusses the availability of resources related to statistical consulting through a specific YouTube channel managed by IDRE. It mentions that a playlist dedicated to statistical consulting currently has only one video, but there are plans for more content in the future. Viewers are directed to the channel's link which is shared in the chat.
            • 22:00 - 29:29: Factor Analysis Linear Equations The transcript discusses the plan to upload previous videos along with closed captioning, which will take some time. It mentions the availability of the playlist and encourages participants to unmute themselves to ask questions during the seminar, reminding them to re-mute to avoid background noise distractions.
            • 29:29 - 38:30: Path Diagrams and Covariance Model The chapter introduces the concept of path diagrams and covariance models, which are essential tools in statistical analysis.
             • 29:29 - 38:30: Degrees of Freedom in CFA The chapter titled 'Degrees of Freedom in CFA' begins with Johnny Lin inviting participants to ask questions directly over unmuted microphones. Before delving into the main content, there are preparatory steps for setting up the necessary software environment: installing required libraries, setting up R and RStudio, and loading the appropriate packages. There are also references to downloadable code and slides to assist with the learning process.
             • 55:30 - 74:29: CFA Syntax and Interpretation The chapter 'CFA Syntax and Interpretation' begins with an interactive lecture involving polls and questions to engage participants. The first part focuses on delivering the lecture content; the second part consists of practical exercises, with three exercises planned, although completing two is considered sufficient.
            • 74:29 - 96:30: Two-Factor CFA Analysis This chapter begins with an introduction explaining that the seminar will have two main parts: a lecture and an interactive exercise. The audience is encouraged to ask questions at any time, though the first part will mainly be a lecture. Participants are asked if they have any questions about the structure of the lecture before it proceeds.
            • 96:30 - 107:30: Model Fit Indices and Interpretation The chapter introduces the concept of model fit indices, emphasizing their importance in the interpretation of statistical models. It starts with an initial interaction asking if the audience can see the shared PowerPoint slides, which are central to the presentation. Further details on model fit indices and methodologies are implied to be covered through the visuals and subsequent explanations in the presentation.
            • 107:30 - 127:30: Advanced Exercise and Conclusion The chapter provides an overview of a seminar on confirmatory factor analysis in R. The speaker thanks an individual named Andy for sharing the slide link and assures the audience that they are in the correct session for learning about this specific type of factor analysis. The speaker shares feedback from previous sessions and aims to keep the seminar concise and focused.

            Confirmatory Factor Analysis in R with lavaan Transcription

             • 00:00 - 00:30 welcome to the third seminar in our series this is introduction to confirmatory factor analysis in r with lavaan my name is johnny lin i am one of the consultants at the idre or statistical consulting group at ucla just a little bit about my background um i i got my phd in quantitative psychology
            • 00:30 - 01:00 here at ucla as well and during that time i kind of focused on latent variable modeling so including cfa and structural equation modeling so that's kind of my training and so that's what i'm going to be talking about today as well so someone asked if the recording will be available so we are working on processing the recordings a little bit faster but um it does take typically a month for us to upload it so
            • 01:00 - 01:30 but here is where it's going to be uploaded to the full link is here maybe i should put that in the ch i think i put that in the chat but i'll put it again so this is our idre uh youtube channel it does include other departments within idre but uh more specifically you go on this playlist you'll see the statistical consulting playlist and currently there's only one video right now but we're um in the coming
             • 01:30 - 02:00 months we'll be uploading all the previous videos uh pretty soon it just we have to work on closed captioning so that's gonna take a little bit but otherwise you'll you're going to be able to find the the playlist there okay any questions about that and during this seminar just feel free to unmute yourself if you want to ask a question just make sure to re-mute yourself because sometimes we hear like background noise and that's kind of distracting so just re-mute yourself every time okay
            • 02:00 - 02:30 the other thing is uh we have other consultants here so andy lynn is the supervisor of our group so he will be moderating the chat along with siavash jalal so if you have any questions that you don't want to ask over the microphone you can put them in the chat and then andy and sivash will be moderating it and answering your questions there okay for me personally i i don't have enough resources to go back and forth between the chat
            • 02:30 - 03:00 so if you want to ask me a question directly just unmute your microphone and my name is johnny lin just call me johnny uh you can do that if you want and all right so the first thing you kind of want to do before we get started is really just to make sure you have the libraries installed in r make sure you have r and r studio set up and then load the packages okay you can download the code here and then the slides
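             A minimal sketch of that setup step, assuming the package names mentioned in the seminar (the install lines only need to be run once per machine):

               install.packages("lavaan")    # fits the CFA/SEM models
               install.packages("foreign")   # reads the SPSS .sav data file into R
               library(lavaan)
               library(foreign)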
            • 03:00 - 03:30 all right the exercises we're going to work on later so how this is going to work is i'm first going to kind of do a kind of a lecture style but i want to make it interactive so i'm going to include some polls i'm going to include some questions for you to answer and then uh the second part though will kind of be involved more like exercises for you to work on just i have three exercises set up but if we just get the two that's fine
            • 03:30 - 04:00 and it's more of a place for you to ask questions and interact so um but throughout the seminar feel free to ask questions at any time it's just the first part is going to be a lecture format the second part is going to be more of an interactive exercise format okay and yeah and then so do you guys have any questions so far about how this the lecture's gonna be set up okay if not we're gonna go into the
            • 04:00 - 04:30 powerpoint slide so make sure you can click on the powerpoint slide there and i'm just going to share my screen there so you should be able to see something that looks like this let me know if you don't see that and this is where we're going to work off of most for most of the presentation okay all right so if any questions before we
            • 04:30 - 05:00 start otherwise we're just gonna jump right into it yeah and thanks andy for including the link to the slide so if you're here for confirmatory factor analysis in r you're in the right place okay all right so here's an outline of what we're going to be talking about you know the previous feedback i got from other seminars is try to keep it kind of concise okay so we're going to focus
             • 05:00 - 05:30 on very specific topics that are basically going to help you run your analysis in lavaan all right we're not going to go over every single like advanced topic in cfa that's beyond the scope of this seminar it's really just to get your hands wet in terms of the theory and then in terms of how to run a cfa in lavaan we're not going to go over like advanced topics like multi-group cfa or something like that so it's really just going to be very intro as long as you kind of understand
             • 05:30 - 06:00 how to run the code and how to interpret the output i think that's what we're going to accomplish today so i like kind of uh you know applying what i know to a data set okay so instead of always just focusing on the theory what we're going to do is kind of motivate this example with a fake data set called the saq and i'll i'll explain what that means i come from a psychology background so i'm i'm very familiar with questionnaires and likert-type scales and that's
            • 06:00 - 06:30 kind of where um factor analysis came from too we're going to talk briefly about the variance covariance matrix so if you're not familiar with matrix algebra it's just a kind of a way to kind of describe the cfa model and it's a very important element of it so i want all of us to understand that before we jump into cfa and then um the model itself what what a factor analysis model is okay if you've if you've taken
            • 06:30 - 07:00 regression before then um this is going to be uh kind of a bridge between regression and and this then i'm going to talk about the model implied covariance matrix which is basically how to recreate that variance covariance matrix using the model that i'm talking about here the factor analysis model and if matrices are kind of scary and unfamiliar path diagrams are going to help you because it's a way to visualize that okay and in terms of getting straight into
            • 07:00 - 07:30 the cfa you do have to kind of understand what are parameters what are free parameters what are degrees of freedom um it's a little technical but bear with me because it really helps you to understand uh you know the the concept of identification in cfa which is important so the basic cfa is a three item cfa and how do you identify that i'll talk about that later because like basically you can't just run that cfa
             • 07:30 - 08:00 because you have something called a latent variable which is a factor so it's not like a regression where everything's identified so you have to uh fix something to be identified in cfa and i'll show you how to do that and then i'll show you how to run that cfa in lavaan so once you have a cfa there are certain cfas where you are able to assess for example how how well the model fits how well does your cfa fit and um that's using something called the chi-square or you have other non non-exact indices which are called
            • 08:00 - 08:30 approximate indices and i'll explain why we need that later but there's the cfi the tli and the rmsea and then finally we'll just do a really brief example of a two-factor cfa with correlated factors then we'll take a quick break and then we'll go into exercises okay so that's going to be the format of our seminar but before we do that i kind of want to uh start our first poll it's it's not content related it's just basically i want to understand like how
            • 08:30 - 09:00 well your kind of background and if you've had experiences with like linear regression and r or if you've heard of like cfa or efa or things concepts like that and if you don't mind taking just like 30 seconds to kind of answer that that'll help me to target my seminar or at least change the pace a little depending on your familiarity with these topics
            • 09:00 - 09:30 all right i'll give maybe like 10 more seconds for you to respond three two and one let's show the results
            • 09:30 - 10:00 okay so you guys are okay some of you use r frequently or sometimes most of you use it sometimes that's okay okay so that's good all of you have taken a course in linear that's perfect really or most of you and you've heard of cfa but never learned it okay awesome and 40 of you have taken a workshop and see if that's good okay all right perfect thank you for answering that by the way that's actually really helpful
            • 10:00 - 10:30 so that means i can kind of talk about concepts like um the linear model right like the regression model or like the null hypothesis maybe all right perfect okay so we're gonna just introduce what cfa is kind of all right so before you talk about cfa though you kind of have to see it in the broad context of just the latent variable models in general so so what i've drawn the circle is basically what we call a latent variable a set of
            • 10:30 - 11:00 latent variable models okay there's other latent variables besides this but these are the main three ones so there's efa there's cfa and then there's sem efa stands for exploratory factor analysis and by the way the link underneath there is a link to one of my pages that i wrote so you can click on that link if you want like a thorough treatment on the website for that topic so efa is is called exploratory factor analysis and it is a latent variable model
            • 11:00 - 11:30 because the factor is unobserved so this is more appropriate when you're just exploring a study and you have no idea how the i the items in your survey kind of relate to each other so if you're developing a new depression inventory okay and you just collect kind of questions that you think are related to depression but you don't know exactly if it measures depression that would be
            • 11:30 - 12:00 appropriate for an efa because you can use that to explore the structure of the the survey you can remove items you can you can add items you can you know use it to kind of figure out is it a one factor is a two factor and that's kind of what i explained in that in the seminar there cfas if you already have a developed hypothesis so if you already know that you have a depression inventory like the beck depression inventory let's say or cesd these are popular psychology depression inventories and
             • 12:00 - 12:30 you just kind of want to see how well this survey is conducted in your particular sample you want to test the hypothesis that uh your sample kind of replicates the the what we call the covariance structure of the the beck depression inventory then you would use cfa so it's testing a particular hypothesis there's actually a null hypothesis that you're trying to either refute or just unrefute okay so that's that's cfa and sem is kind of
             • 12:30 - 13:00 related to cfa in that it's a broader framework where where it's not just factor analysis but it's actually relating factors to each other and i talked about that in in my intro to sem seminar it's also in lavaan but it's basically uh allowing you to run kind of regressions on the factors where cfa only is concerned with what we call the measurement model so in terms of software traditionally too it's uh it's kind of different um efas traditionally has
            • 13:00 - 13:30 uh has has been a precursor to cfa so efa was developed before cfa and so uh programs like spss there's a function called spss factor that does efa but for example spss itself does not do cfa so recently they expanded kind of a package called amos which was traditionally separate from spss
             • 13:30 - 14:00 which does do cfa but it looks very different from spss because it's not it's not traditionally part of spss so um another example of a cfa package is called mplus that's a very powerful program and i recommend if you're really into studying like cfa and sem and you want to use that in your analysis i highly recommend considering mplus beyond lavaan if it's beyond a basic analysis mplus is very powerful and speaking of uh mplus it also does sem okay so sem and cfa kind of go
             • 14:00 - 14:30 hand in hand so like i would like kind of like group sem and cfa together and then efa is kind of its separate own entity okay so efa is more exploratory like the name implies and cfa and sem are more confirmatory any questions about kind of what the conceptual difference is between efa and cfa all right
            • 14:30 - 15:00 so today we're going to be talking about cfa okay so to motivate the example we're going to talk about the spss anxiety questionnaire so maybe for some people they are scared of using statistical packages like spss so this is just an example of eight items from the saq and for example statistics makes me cry you know like um for me a statistics makes me happy right but
             • 15:00 - 15:30 for some people it makes them cry so these are example items from the survey and we hypothetically collected 2571 people for this particular sample and for each of these items they are scaled in what we call a likert-type scale going from strongly disagree to strongly agree one to five okay so that's some people ask like okay can i do kind of regression or cfa on likert
             • 15:30 - 16:00 scales yes um just know that um you know it ideally it should be kind of normally distributed and if it's not normal then you have to consider other kind of non-normal adjustments to your your cfa or sem but we won't talk about that in the seminar that's more of an advanced topic okay but um i would say this is a good one because it has five categories now it wouldn't be as appropriate if you just had like zero or one like yes or no that that's something called item factor analysis where you have to kind of look
            • 16:00 - 16:30 at kind of a logistic regression for that okay but otherwise you can treat this as a continuous or ordinal scale okay so like i said like before we start the seminar if you want to be part of the exercise portion of this class you're you'd want to install this uh these two packages the reason why we want the foreign package is because i want to um download this spss data set you'll notice the data set is an spss data set
             • 16:30 - 17:00 dot sav saq dot sav and uh foreign converts that to an r data set but primarily you want to work with lavaan which is going to be the cfa or sem package that we're going to be talking about today and if you're able to use this uh command to download the data set then you're good to go for the exercises okay so but we're not going to get into
             • 17:00 - 17:30 the exercises for now we're going to be mostly talking about kind of the lavaan syntax and the output and then we're going to go into exercises after the intermission okay but before we uh jump into factor analysis i kind of wanted to just review concepts of the covariance or correlation matrix okay because that's important for understanding factor analysis so basically uh factor analysis is looking at the
             • 17:30 - 18:00 correlations among items right so what you want to consider is basically how well do these items correlate and then across all of them how well how well in general do they correlate okay because factor analysis kind of reduces that dimension of your items down to just one or two let's say you have one or two factors and you want to make sure then that all the items are interrelated with each other because if you if you think that these items are measuring spss anxiety they should all be generally correlated
             • 18:00 - 18:30 and if some of them are not correlated you have to question whether it really measures spss anxiety right because maybe you're just measuring something else so the first thing you'll notice is that there is a one on the diagonal just make sure we call this a diagonal of this matrix but that's because um you know everything correlates with itself perfectly right so remember our correlation ranges between negative one and one
            • 18:30 - 19:00 okay so one just means it's perfectly correlated all right so the other thing to note is what is the correlation uh on the off diagonal so that was the diagonal now this is an element in the off diagonal right so what does that mean this is the correlation of item one with item two and let's see what item one was item one was statistics makes me cry and then item two is my friends will think i'm stupid for
            • 19:00 - 19:30 not being able to cope with spss so you notice that that's a negative correlation but it's kind of low okay so so that means item one and item two are negatively correlated now what about um item two and item one okay so we looked at item one and item two which is here what about item two and item one well you notice it's the same correlation right so correlations are symmetric meaning like basically
            • 19:30 - 20:00 one and two is the same as two and one so is that true for everything yes well if you look at three and one it's the same as uh one and three is the same as three and one right so if you go through all the lists of of some symmetry you'll basically find that there's something called the upper triangle here and then there's something called the lower triangle so i'm drawing two triangles and i claim that the upper triangle
            • 20:00 - 20:30 is exactly the same as the lower triangle right and you can verify that's true prove me wrong if i'm not this is a property called symmetry right like i said so what's the benefit of having symmetry well it because if you estimate three and one the correlation you don't have to estimate one and three right so that reduces the number of parameters okay it saves you some um estimation and if you were a computer
             • 20:30 - 21:00 you'd like that because then you can literally just do one and then you get you know two for one so so so that's a good property to have is that it's this covariance matrix is symmetric okay or this correlation matrix the other thing you have to kind of just note is that um this is a correlation matrix sem however or not just sem cfa and sem use what we call the covariance matrix okay so then how do you know the difference between a correlation and covariance matrix well you basically see
            • 21:00 - 21:30 that the um the diagonals are no longer one if these you see are no longer one then you know it's a covariance matrix because then uh it's it's not it's not in the correlation metric because correlation remember is a standardized covariance so right now we're looking at a correlation matrix not a covariance matrix any questions about that
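             As a sketch of how those matrices are computed in R, assuming the SAQ file has been downloaded locally and that its first eight columns are the items (the file name and column positions are placeholders, not confirmed by the transcript):

               # read the SPSS file into a data frame; keep the numeric 1-5 codes
               dat <- foreign::read.spss("SAQ.sav", to.data.frame = TRUE,
                                         use.value.labels = FALSE)
               items <- dat[, 1:8]
               round(cor(items), 2)     # correlation matrix: 1s on the diagonal, symmetric
               round(cov(items), 2)     # covariance matrix: variances on the diagonal
               isSymmetric(cov(items))  # TRUE - upper and lower triangles match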
            • 21:30 - 22:00 right so this is where we get into a little bit of linear equations and if this is a little scary to you don't worry we're just going to like kind of gloss over and then you don't really have to know the technical details of it to understand cfa but since all of you most of you have taken regression you probably have seen this linear equation before this is this is basically a linear regression right so you have your outcome your d dependent variable you have your intercept you have your slope okay and then you
             • 22:00 - 22:30 have your predictor and then your residual well cfa isn't that different from a linear regression you have an outcome which is your item let's say it's statistics makes me cry that's your item you have an intercept that's called tau one you have a coefficient or slope kind of you can call it a slope but basically called a loading this in um cfa is called the loading but it is a coefficient and then the eta what is eta this is the
            • 22:30 - 23:00 um predictor right but this is actually the factor okay so instead of being called a predictor it's just the factor and what's so unique about the factor it's unobserved okay so what's the difference then between linear regression and um factor analysis is that the the factor is unobserved but the predictor in the linear regression is observed okay so that's the main difference and then you also have kind of a residual term here so why did i put a 1 here well the one
            • 23:00 - 23:30 is saying kind of this is a multivariate equation okay and this is a univariate multivariate means i have multiple outcomes and what are my multiple outcomes i have eight remember i have statistics makes me cry my friends will think i'm stupid for not being able to cope with spss standard deviations excite me those are the first three outcomes right each of those has a separate intercept
            • 23:30 - 24:00 a separate loading separate residuals but what do they have in common they have a single factor in common why is that important because we're trying to say that this single factor is the predictor of all three items they correlate because of this factor called spss anxiety does that make sense and if you write out this is called the matrix
            • 24:00 - 24:30 formulation if you just write it out in table in terms of equations you'll see that it's just three separate equations basically three separate regressions with the same unobserved predictor or factor any questions about that yeah and so just fyi like this is the more theoretical part but i promise we'll get
             • 24:30 - 25:00 into the equations i mean the lavaan part later okay but i i i feel like this is important for you to at least understand why we do cfa okay so now that we kind of know the covariance matrix and then the the the factor model basically this is the covariance matrix
            • 25:00 - 25:30 and then this is the model implied covariance matrix just know that the uh covariance matrix comes from the data and then this model implied covariance matrix comes from your model as it states so why why how is this different from what i showed you before well first thing i showed you before was a correlation table that came from our data this comes from
            • 25:30 - 26:00 the population so that's what you have to keep in mind is is that um the the upper greek letters mean population so these are in the population not in my sample and um how do i get this uh model covariance matrix well i show it in the appendix in the web page but basically it comes from the that lambda remember that lambda coefficient if i take the covariance of that factor model i'm going to get that
            • 26:00 - 26:30 lambda this is the covariance of the factor and then this is the covariance of the residual okay so again i write it out here this is lambda this is the covariance of the factor or actually it's the variance right because the covariance of itself is variance and then multiplied by the the loading and then this is the residual covariance this is what's left over that epsilon remember that i take this is for example the variance
            • 26:30 - 27:00 of the three residuals and then these are the covariances i remember this is also symmetric so i don't have to estimate these okay any questions about that all right so if that was scary then basically i have a picture for you to look at okay pictures for me are less scary than equations but the beautiful thing about a path diagram is that you're able to map
            • 27:00 - 27:30 the equations to these pictures so before we do that though we kind of have to understand the elements of each path diagram right so the element that you're probably most familiar familiar with if you've heard of linear regression is is this one all right this is the observed indicator this is your observed outcome your dependent variable that you use in regression now um what's you may have not seen before is the latent variable okay so this is f
            • 27:30 - 28:00 or what i can call eta remember that so this is a factor right it's it's unobserved so circle means unobserved and square means observe triangle means intercept so remember the tau that we had or we had the beta's not you know you can call it whatever you want that's the intercept term and then this is a path right so what goes in here so this could be your beta 1 right or it
            • 28:00 - 28:30 could be your lambda right that's your path and then this is a variance if that's one one arrow if you have two arrows it's called the variance or covariance remember what was one of our variances that was the psi right that's the variance of the factor so those are the basic elements of the path diagram and then if you wanted to kind of see how it works so you know
            • 28:30 - 29:00 this double arrow means variance right and the circle means factor right so the the variance of the factor is psi one one the factor is eta now the eta predicts the outcome one right and that's the loading any questions about that how to translate the equation to that diagram see if there's okay good
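             Restating the pieces of that path diagram as equations, a LaTeX sketch using the notation from the slides (tau are intercepts, lambda loadings, psi the factor variance, theta the residual covariance matrix):

               \begin{aligned}
               y_i &= \tau_i + \lambda_i \eta_1 + \varepsilon_i, \qquad i = 1, 2, 3 \\
               \Sigma(\theta) &= \Lambda \, \psi_{11} \, \Lambda' + \Theta_{\varepsilon},
               \qquad \Lambda = (\lambda_1, \lambda_2, \lambda_3)'
               \end{aligned}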
            • 29:00 - 29:30 if you guys are asking questions on chat that's good feel free to unmute your mic too um okay um so so what i mean by so all of these are measurement models okay so what i like to do is kind of think about okay which part of it is coming from the factor model and then which part of it is coming from the covariance model they're the same model though okay so
            • 29:30 - 30:00 they're both the measurement models i just wanted to clarify that but if you if you kind of go through this path diagram okay so first let's look at the dependent variable right so this is this is basically statistics makes me cry what else my friends will think i'm stupid for not being able to cope with spss and then standard deviations exciting for example okay so those are your outcomes right
            • 30:00 - 30:30 they're observed that's why they have a square now what's predicting the outcome the latent variable that's the factor right eta one how many factors do i have here one i only see one circle but i have three outcomes okay so what is the regression path from the from the factor to the outcome these are the
             • 30:30 - 31:00 loadings i have three loadings right so that i just checked those off those are in the factor model equation right what's left well i have the intercepts tau 1 tau 2 and tau 3. those are unique to every item every item has a unique intercept what does this one mean if you know regression basically it's that um column i mean the design matrix of x where you have the first column of x being the
            • 31:00 - 31:30 intercept okay that's where it comes from that one it's part of the design matrix and then you have finally the residuals right the error terms here all right so those come from the factor model uh now that's in green right but remember we have to we have another equation which is the uh covariance equation and there i have what terms well i already already have the uh loadings so i don't need those but i do
            • 31:30 - 32:00 have these right that is what what is that that's the factor variance right where does that come in well that's the double arrow remember of the factor which is a circle that's called the factor variance that's a factor variance of eta one which is the first factor that's the spss anxiety factor right okay what's left in blue
            • 32:00 - 32:30 we have the covariance of the residuals that that covariance matrix well what did i say here then remember that the diagonals are the variances so i have three variances of the residuals one two and three what does it mean when i have zero on the off diagonal and remember it's symmetric so the upper triangle is the same as the lower triangle when i say zero in the off diagonal
            • 32:30 - 33:00 this is the the covariance of item two and item one right so what i'm saying is that covariance is zero i'm also saying the covariance of 3 and 1 is 0 and 2 and 3 are 0. now what would happen if i didn't set those to 0. how would i draw that path if i didn't set those to zero and i
            • 33:00 - 33:30 allowed the covariance do you guys have an idea you just put arrows between uh epsilon 1 and epsilon 2 and so on and so forth exactly great great great great yes so you would draw a double path right between these right perfect so that would be theta one two theta two three and i'm missing one more right so that's how you would draw what we
            • 33:30 - 34:00 call a correlated error model perfect good job all right so do you guys understand basically how to translate the path notation to your linear equation and more specifically too is which parts are the covariances and which parts are coming from the factor model itself because i i for the longest time when i tried to learn this i was like how come some of them are like covariances and some of them are just like terms well the point is that they come from
            • 34:00 - 34:30 different kind of equations okay that's why i kind of wanted to separate these out in colors but all of this is called a measurement model and that is distinguished from the structural equation model or the structural model okay which we won't talk about here okay but i'm just laying it out okay structural but basically we are only talking about the measurement model and that means
            • 34:30 - 35:00 a model that relates uh latent to observe variables right latent to observed variables it's basically how well our our measure is being measured our survey is being measured okay that's why it's called a measurement model just think about measurement in terms of like a ruler right if we have um height it's relatively easy to measure um with the ruler think about a concept like in psychology like depression
            • 35:00 - 35:30 how difficult is it to measure depression it's pretty difficult right like you know human behavior is very difficult to measure and that's why we need measurement models because we we need to assess how well our our our construct is being measured it's not like physics where you can kind of just measure the speed of light okay there's going to be a lot of error involved in that measurement and that's why we have these errors any questions about these concepts that
            • 35:30 - 36:00 we've been talking about i have a quick question yeah um how do you decide whether to go for a correlated error model or not ah that's a good question and on uh i see that often um the question is basically like okay when do you add these correlated errors honestly i the point is not to add them if you don't have to okay because what i see that happens a lot is people if
            • 36:00 - 36:30 we're gonna talk about model fit but they look at the model fit and they're like oh it looks terrible so i'm gonna try to improve my model fit and by adding more covariances of the residuals it artificially inflates the model fit i don't like that because that's basically kind of like a confirmation bias it's saying i want this model to fit i'm going to force it to fit by allowing these to covariate why does it fit better because obviously the more things you add the more um the better the fit right
            • 36:30 - 37:00 it's kind of like linear regression the more predictors you add the better your r squared i argue that that's artificially improving the fit of your model without a strong hypothesis the only context where i've seen that it actually makes sense is for and we won't go over this because this is a more advanced topic is if you have like let's say husband and wives okay so i've seen that before like a husband depression and a wife's depression like basically because husbands and wives live so closely together
            • 37:00 - 37:30 maybe if the wife is depressed the husband would be depressed too right just because they're kind of like correlated so in that case then maybe i would correlate the error of husband and wife okay because what you're saying is that okay so not only this is like depression not only is it just depression that's accounting for the covariance between husband and wives but it's something else it's just the fact that they're just living together right and maybe like there's random
            • 37:30 - 38:00 factors from living together that makes them depressed not just depression itself okay does that answer your question yes thank you yeah yeah so in general don't do correlated errors if you don't have a strong hypothesis about that any other questions all right if you don't have any questions um that was kind of like the theory about the factor analysis model now we're going to jump into kind
            • 38:00 - 38:30 of like one factor cfa but before we do that we have to do something called degrees of freedom and this is a little technical bear with me but i promise it'll help you understand why we're doing it okay and then we're going to talk about an actual one-factor cfa okay so this is this is basically what we're trying to do we're trying to fit uh one factor cfa with those just three items that we have in the saq let's let's uh see if i can find okay so
            • 38:30 - 39:00 item three for example is standard deviations excite me item four is i dream that pearson is attacking me with correlation coefficients i have nightmares about coefficients correlations and number five is i don't understand statistics so what you're saying is then okay those i'm assuming that those three items adequately assess spss anxiety i don't know if that's true but that's my theoretical kind of uh idea before going in to my factor analysis this is i have an
            • 39:00 - 39:30 idea that three items is sufficient okay so that's the that's the cfa we're trying to fit before we do that though we have to know like whether or not we can fit a cfa and to do that we have to think talk about degrees of freedom okay but degrees of freedom comes with some other things that we have to go over so we've seen these before right we've
            • 39:30 - 40:00 seen these before we've we've seen this population covariance matrix right remember it's symmetric right we've seen this model implied covariance matrix we haven't seen this one yet well this is actually what i was showing you before this is called the sample covariance matrix okay and the sample covariance matrix is
             • 40:00 - 40:30 an estimate of the population so it's an estimate of this that's why you have a hat there this is called an estimate that's an estimate of the popul population covariance matrix basically i'm using my data to recreate the the sample covariance the the population covariance matrix using the sample data so so how do i know first of all that this is a covariance matrix well you look at the diagonal and they're not one second tip-off is i'm using cov and not
             • 40:30 - 41:00 cor in r okay and i also know that it's symmetric so it maintains the properties of the population covariance matrix it's just that these are sample estimates okay so we got that out of the way right now in terms of degrees of freedom okay what we want to look at is the number of terms in the population covariance
            • 41:00 - 41:30 matrix using the sample covariance matrix though we can do the same thing because the the properties of the sample covariance matrix are the same right so that's the population in terms of the number of term elements in that matrix so if we have a three by three this is a three by three three items by three items we have nine elements right it's the same nine elements in the sample covariance matrix well how many elements do we have then well instead of counting it one by one
             • 41:30 - 42:00 there's a little trick i can use is just say i want to multiply the number of items times the number of items plus one divided by two so for a three item cfa i have three items times three plus one which is 4 so 3 times 4 is 12 divided by 2 which is 6 right why is it 6 and not
            • 42:00 - 42:30 9. you guys have any idea i kind of talked about this before hey oh yeah go ahead because um the matrix is symmetric yes perfect okay so um basically just like you said is um you notice these are the same as this this is the same as this okay so you get my point right yes it's symmetric so you don't have to duplicate that right that's perfect
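             That counting rule, sketched as one line of R (p is the number of observed items; the p = 8 case is only illustrative):

               knowns <- function(p) p * (p + 1) / 2   # unique elements of a p x p covariance matrix
               knowns(3)   # 6, as counted above
               knowns(8)   # 36, if all eight SAQ items were modeled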
            • 42:30 - 43:00 so just counting the number right um then you have one two three four five six okay perfect all right so but that's not that's not like the the number of parameters right that's the total that's the total number you have to work with those are what we call known values i i think i i saw that in a term somewhere a book somewhere and then i just stuck with it so those are the total number you have
             • 43:00 - 43:30 to work with now if my model parameters exceed that total six i can't estimate the cfa so let's look at our model implied covariance matrix remember we talked about the model implied we have our loadings right we have our um we have the variance of the factor and then we have the variance of the residuals how many do we count now count the unique ones right okay so let's say we didn't constrain the
            • 43:30 - 44:00 residual okay so let's count it one two three four i have this one i have this one i have this one and then i have nine here right uh well not really right but we have six because they're symmetric okay so six six plus four is ten okay you see how i got ten now even though i i i counted the non-unique ones like i cancelled these ones and i canceled these ones that's not
            • 44:00 - 44:30 enough still we have six to work with but we actually have 10. so how can we do that well we have to do something called fixed fixing a parameter fixing a parameter just means i want to predetermine the parameter to have a specified value this is not realistic but it's just a very like easy example of saying how about we just fix everything to ones and zeros you can do that right oh well you can it
            • 44:30 - 45:00 just doesn't make sense but let's say we fix the loadings to one we fix the variance of the factor to one we fix the variance of the residuals to one and then the covariances to zero well how many have we fixed there we fixed well i mean like unique ones right we fixed 10 unique ones right okay so um so the number of unique parameters
             • 45:00 - 45:30 minus the number of fixed parameters is what we call the number of free parameters so in terms of our number of unique parameters we had 10 and we fixed 10 right so 10 minus 10 is 0. this is a very arbitrary like non-realistic example so how many number of free parameters we have zero well then the degree of freedom is the number of known values minus the number of free parameters
            • 45:30 - 46:00 which is six okay and realistically you will never get zero as the number of free parameters i'm just showing you an example of an extreme case so we we count as six right now what does that mean for our model well you have three cases in our case our degree of freedom was positive that means the number of known values is greater than the number of free parameters
             • 46:00 - 46:30 this is called an over identified model the benefit of having an over-identified model in cfa or sem is that model fit can be assessed we can talk about model fit when we have an over-identified model well let's look at the other extreme case where df is negative that means the number of known values is less than the number of free parameters so let's say we had five and then we had six then we get a degree of freedom of negative one that can never
             • 46:30 - 47:00 happen that's called under identified you cannot run a cfa a model that is under identified you will never see a negative degree of freedom in your output if you see that tell me and i'll buy you a house in laguna hills all right so that will never happen now then if you the the kind of just right scenario is when the degree freedom equals zero and that means the number of known values
             • 47:00 - 47:30 is equal to the number of free parameters let's say we had six minus six that happens actually a lot um if you guys know actually linear regression is a case where the degree of freedom is zero you have no degrees of freedom to work with and that's called a just identified or saturated model that's that's okay you can you can have these models a linear regression is a perfect example of that but it's just you can't assess the model fit the fit is as good as it can be okay
            • 47:30 - 48:00 in cfa or sem we want to be able to assess model fit so you want to be able to over identify your model meaning you want the degree freedom to be positive any questions about degree freedom i know it's kind of technical but basically just look at the degree freedom to see if it's positive and then if you see a zero then you know why you can't assess model fit
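             The same bookkeeping as a small R sketch (the counts are entered by hand here purely for illustration; lavaan reports the degrees of freedom itself in its output):

               model_df <- function(p, n_free) {
                 df <- p * (p + 1) / 2 - n_free   # known values minus free parameters
                 status <- if (df > 0) {
                   "over-identified: model fit can be assessed"
                 } else if (df == 0) {
                   "just-identified (saturated): fit cannot be assessed"
                 } else {
                   "under-identified: the model cannot be estimated"
                 }
                 data.frame(df = df, status = status)
               }
               model_df(p = 3, n_free = 0)   # the extreme all-fixed example: df = 6
               model_df(p = 3, n_free = 6)   # six free parameters: df = 0
               model_df(p = 3, n_free = 7)   # seven free parameters: df = -1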
            • 48:00 - 48:30 this is a good time to kind of uh test your knowledge so i'm going to administer the next poll see if you kind of uh got an understanding of what i was saying and this will help you to understand too if you get it wrong then you can kind of think about okay what was i misunderstanding so it's in the slides it's also in the poll and basically the first question is there's one degree of freedom in my model which means that my model is over identified true or false
            • 48:30 - 49:00 the second question i have three items in my study the number of known values is six and third question i have three items in my study there are six unique parameters and no fixed parameters my model is just identified that the last one may be kind of tricky yeah and we're going to go over the last one for sure and then and then uh you tell me if i'm
            • 49:00 - 49:30 right or wrong
            • 49:30 - 50:00 and feel free to refer to the slides before because you know the the last one especially you kind of have to look at that equation again so for the third one just if you're still um kind of answering that think about how many uh free parameters you have
            • 50:00 - 50:30 maybe i'll give you guys like 30 more seconds if you're still thinking okay maybe 10 more seconds and don't worry if you don't have a response yet we'll just go over it and
            • 50:30 - 51:00 you can kind of see at least you thought about it i think that's the most important thing is to kind of think about the question and see if you understand it okay so let's look at this so the first one is there's one degree of freedom in my model which means that my model is over identified that is true right because um if the degree of freedom is positive that means you have an over-identified model so you guys most of you got it that
            • 51:00 - 51:30 right good job i have three items in my study the number of known values is six that's true because remember that formula we had p times p plus one divided by two so those were kind of straightforward the third one maybe i tricked you or something because um and you tell me if i'm right or wrong because maybe i missed something but let me um let me stop sure okay so i it says i have three um items of my study so first of
            • 51:30 - 52:00 all i know from two that the number of known values is six right okay now um it says there are six unique parameters um and no fixed parameters right so remember the the number of free parameters is equal to the number of unique parameters minus the number of
             • 52:00 - 52:30 fixed so what is that 6 minus 0 equals 6. and then the degree of freedom is the number of known values minus the number of free so what is that 6 minus 6 is 0 and when my degree of freedom is 0 what is that just identified right
            • 52:30 - 53:00 does that make sense were you guys like kind of confused about something maybe or if i did something wrong you tell me no no questions about that all right okay so that's just to get you thinking about um
            • 53:00 - 53:30 degrees of freedom really like it's just a small part of your output but you really need to know degrees of freedom to be able to identify your cfa so here's our three item cfa we're now we're gonna get into actually working with our cfa okay so now you know all the kind of theory behind how to identify a cfa okay so this is the exact same path diagram i drew before right we're just going to
             • 53:30 - 54:00 fit this now in lavaan but if you calculate this um the number of degrees the degrees of freedom of this model i'm not going to do it here but if you do the homework and you calculate it i guarantee you it's not going to be identified it's going to be under identified because you'll have more parameters than you have uh known values the degree of freedom is going to be negative okay now
            • 54:00 - 54:30 what i argue without proving anything is that it's going to be exactly one parameter that's under identified so what what what are some ways you think we can kind of uh resolve that issue do you guys any have any ideas now that we you know all the concepts now do you have any ideas about how we can reduce that identification problem maybe we can fix the varieties yeah good okay so that's one way we can fix the variance what about
             • 54:30 - 55:00 is there another way fixing the residuals uh okay that's a good idea now typically we want to keep the residuals as it is because we want to know if if there's error in our measurement okay so we don't want to fix those so that but that's a good suggestion can you fix one of the factor loadings yes we can choose one of the factor loadings to fix to one okay so that's really good okay the
             • 55:00 - 55:30 intercepts that we usually want to keep as is and the residuals you want to keep as is and and um feel free to keep answering and uh even if you don't get it exactly right i think it's great that you're volunteering and i highly encourage that and it actually helps you to kind of sync what uh we learn into your brain because you're like oh i got that wrong i'm going to remember that for the rest of my life okay because that's that's me like when i hear negative things and then i was like oh okay i'm going to remember that forever okay so feel free to keep up doing that and
            • 55:30 - 56:00 i'm not going to judge you because this is very difficult material to cover in just like three hours okay so you guys are doing a great job understanding okay so like we uh like some of us were saying we can choose two ways to identify this model we can either fix the loading or we can fix the variance okay so how do we fix loading basically we just set one of them to one you can you set it to five yeah you can but um that's not really useful because you
             • 56:00 - 56:30 want to set basically the loading to be the scale that you want it to be so what does that mean though when we set the loading to one that just means that we set the the scale of that factor to that item so the item is uh let's see what's the first item i'm looking at the saq statistics makes me cry so what you're saying by fixing the first loading is that i want
             • 56:30 - 57:00 to set the scale of everything else to be statistics makes me cry right so let's say you have a mix and match kind of a survey or even just like a set of dependent variables where it's like inches and height you're going to be setting one of it to inches right and otherwise if it's height then you're going to be setting it to height does that necessarily make sense no okay so you have to think about when you do a marker method you want to make sure that your items are all measured on the same scale if that's not true
             • 57:00 - 57:30 then you have to either standardize your scale first and then do marker method or you have to standardize it using the variance standardization method okay the the by default though lavaan specifies the marker method which means fix the first loading to one can we fix the second loading to one and free the first one of course can we do the third one yes but by default it sets the first one
            • 57:30 - 58:00 so just make sure you know what your first one is and if you're okay with setting the scale to that item the other thing is it says uh by doing that it sets the factor variance to that scale as well okay so when you're talking about the variance of that factor you're talking about in the units of this statistics makes me cry item everything else we leave the same though right okay well we leave it as a parameter a free parameter
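             A sketch of that marker-method default in lavaan syntax, with q03, q04, q05 as stand-ins for the three SAQ items and dat as the data frame loaded earlier (substitute the dataset's real column names):

               # default behavior: the first loading is fixed to 1 (marker method)
               m_marker <- 'saq =~ q03 + q04 + q05'
               fit_marker <- cfa(m_marker, data = dat)
               # the same model with the fixed loading written out explicitly
               m_marker_explicit <- 'saq =~ 1*q03 + q04 + q05'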
            • 58:00 - 58:30 Now, like we said before, the second way is to standardize the variance. Recall that psi is the variance of the factor, remember? I've drawn the path diagram here: that's the variance of the factor, and we're going to set it to 1 for the variance standardization method. So we fix that to one. Now what do we do with the loading? We freely estimate it; it becomes a free
            • 58:30 - 59:00 parameter. So those are basically the two methods of standardization, and if you calculated out the degrees of freedom, I argue that it's going to be zero. Okay, so before we get into how to do that in lavaan, which is really the goal of this seminar, we want to understand this cheat sheet of lavaan syntax. Most of you have worked with R before, maybe not lavaan,
            • 59:00 - 59:30 but you've done linear regression, so I'm pretty sure you're familiar with the syntax lm(y ~ x), where y is your dependent variable and x is your independent variable. You know that notation, right? Well, it's the same notation in lavaan: when you have an observed variable predicting an observed outcome, you basically use the same syntax. Now, the thing with lavaan,
            • 59:30 - 60:00 though, is that it doesn't estimate the intercept by default. So unlike lm, which estimates the intercept, lavaan doesn't by default. In order to estimate the intercept you have to write y ~ 1 + x; if you just do y ~ x in lavaan, it's not going to estimate the intercept. Just making sure you know that; see the sketch below.
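To make that intercept point concrete, here is a minimal sketch (not from the seminar files; `dat`, `y`, and `x` are placeholder names) comparing lm() with the equivalent lavaan call:

```r
library(lavaan)

# lm() estimates the intercept automatically
fit_lm <- lm(y ~ x, data = dat)

# lavaan: "y ~ x" alone estimates only the slope;
# add "1 +" to also estimate the intercept
m_reg   <- 'y ~ 1 + x'
fit_lav <- sem(m_reg, data = dat)
summary(fit_lav)
```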
            • 60:00 - 60:30 Now, lavaan goes beyond linear regression because it allows unobserved latent factors, and this is where the =~ operator comes in. The =~ is basically saying I want a measurement model: I want to link a latent variable to an observed variable, for example eta to y1. The only difference is that the sides are flipped; so let's call the eta f, and you say something like
            • 60:30 - 61:00 f =~ y1. The sides are flipped, but that doesn't mean f is an outcome, so don't get that confused. If you actually want the items to cause the factor, that's called a formative construct, which we're not going to go over, but a lot of people who come in for stats consulting draw the arrows the wrong way; sometimes I even get confused. But you're not doing this; that is not what we're
            • 61:00 - 61:30 talking about. We're talking about this one. So again, that other one is called formative, or a causal-indicator model, and this is the one we want to talk about, where the factor causes the items. So don't get the
            • 61:30 - 62:00 direction confused of which side goes into which side. The double tilde, ~~, means I want a covariance. So what does it mean when I say I want f ~~ f, where f is the factor? The variance of the factor? Yes, exactly: the covariance of a factor with itself is its variance. Perfect.
            • 62:00 - 62:30 And in that case we called it psi one-one, so really good job. And then 1* fixes a parameter. So let me clear my screen: if I have this factor and I write =~ followed by y1 + y2 + y3, putting 1* in front fixes the first loading
            • 62:30 - 63:00 to one. Now, what did I say before? This is the default in lavaan, so I actually don't need to do that. I'll show you how to override it later, but you basically use this thing called NA*. When I put NA* there, it unconstrains the loading so it's no longer fixed to one; it frees up the parameter,
            • 63:00 - 63:30 so it's now a free parameter. We won't talk about variable labels or value labels because they're not important for this seminar, but do you understand the lavaan syntax now? These are basically the essential things you need to know for running a CFA; see the cheat sheet sketched below. I think you're understanding this pretty well, so good job.
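As a rough cheat sheet of the operators just described (a sketch with placeholder names `f`, `y1`-`y3`, not the seminar's exact slides):

```r
# ~    regression: observed outcome on observed predictor(s)
# =~   measurement model: "f =~ y1 + y2 + y3" means latent f is measured by y1-y3
# ~~   (co)variance: "f ~~ f" is the variance of f (the psi parameter)
# 1*   fixes the parameter that follows to 1
# NA*  frees a parameter that lavaan would otherwise fix

m_cheat <- '
  f =~ 1*y1 + y2 + y3   # marker method: first loading fixed to 1 (the lavaan default anyway)
  f ~~ f                # factor variance, freely estimated
'
```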
            • 63:30 - 64:00 Okay, now we are finally getting into lavaan. Basically what we want to run, remember, is a one-factor CFA with three items, and I argue that this is a just-identified model. So here are all the things we've talked about, pulled into one little bit of syntax. Honestly, the syntax is not that hard; what's hard is understanding how to implement it, what lavaan
            • 64:00 - 64:30 is doing, and how to interpret the output. That's really what's hard about CFA or SEM. We've seen this before, right? We know exactly what it means. These are item three, item four, and item five in the SAQ, so keep referring to that beginning slide. Let's see, where is it, "statistics makes me cry"?
            • 64:30 - 65:00 I keep losing that one. No, that's the first item. The third one is "standard deviations excite me," the fourth is "I dream that Pearson is attacking me with correlation coefficients," and the fifth is "I don't understand statistics." I chose these because I looked at the loadings and some of the others were a bit off, so I just chose these three. But remember, this is saying that those three items measure SPSS anxiety, which may or may not hold, and that's why we're
            • 65:00 - 65:30 testing the hypothesis. Now, why do I have these single quotes? That just means it's a string, not a number. Why does it need to be a string? Because we're going to store that string in an object called m1a (model 1a), and then pass that string into lavaan's confirmatory factor analysis function, cfa. This is basically a wrapper, a proxy for
            • 65:30 - 66:00 the actual underlying function in lavaan, which is just called lavaan. The reason you want cfa is that it's designed for CFA, so it will tell you when something's off and you've specified the model incorrectly. It's good if you just want to run a CFA instead of using the lavaan function directly. And then this is the data set we're using. Why do I draw an arrow here? This says I want to store that lavaan cfa output in
            • 66:00 - 66:30 an object called onefac3items_a, and then I pass that object into summary, and that's how I get the output; a sketch of the whole call is below. Any questions about the lavaan syntax? During the exercises we're going to interact with this and really get into it, but for now just trust me that this is the syntax, and if I'm wrong you can correct me later when you actually run it. Any questions about the setup of this
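A minimal sketch of the call being described; the object names (m1a, onefac3items_a) follow the slides as read aloud, and `dat` is assumed to hold the SAQ data frame:

```r
library(lavaan)

m1a <- 'f =~ q03 + q04 + q05'              # the model, stored as a string
onefac3items_a <- cfa(m1a, data = dat)     # cfa() is a CFA-friendly wrapper around lavaan()
summary(onefac3items_a)                    # loadings, variances, residual variances
```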
            • 66:30 - 67:00 syntax? It's pretty straightforward, right? If you've used regression before, it's similar: you pass the fitted object in here and then put it in summary. The only additional thing is the string, because lavaan requires the model as a string. Okay, so remember there are two ways to identify the CFA. Why do I show you this one first? Well, like I said, let me go back to the syntax:
            • 67:00 - 67:30 by default lavaan does the marker method, so I didn't have to say anything here. This is where the theory comes into play, because if you didn't know that this is the marker method, you wouldn't know what lavaan is doing. What lavaan is doing is fixing the first loading to one and freeing the factor variance; it's effectively adding a 1* in front of q03. Let's see if that's true.
            • 67:30 - 68:00 Well, this is the summary output, and I have the estimate column; this is where you check and verify that what you did was right. How do I know the marker method was implemented? Go ahead, anyone want to say something? How do I know I was using the marker method? Basically, I see a one here. How do I know that's not actually an estimate? There's no standard error, no z-value, and no p-value;
            • 68:00 - 68:30 that's how I know it's a fixed parameter. All these other ones are free (you know that commercial: free, free, free), so these are all free, and free just means estimated. So what are these? These are the loadings, the lambdas. These
            • 68:30 - 69:00 are the loadings of item four and item five. The thing is, since you scaled these in the units of item three, which is "standard deviations excite me," these are in the units of item three now. So let's try to interpret one of them; let's see, I think what I meant was item four, so change that to item four. So, for a one-unit increase in item four...
            • 69:00 - 69:30 or sorry, in SPSS anxiety; that's what I meant. When I said item three, I meant in the scale of item three. So for a one-unit increase in SPSS anxiety, because remember SPSS anxiety is our predictor here. If you've interpreted regression coefficients it's the same idea: a one-unit increase in the predictor leads to a blank-unit increase in the outcome.
            • 69:30 - 70:00 So for a one-unit increase in SPSS anxiety, item 4 goes down by 1.13 points, in the scale of item 3, though. So is it easy to interpret? Not necessarily. Now, the variance of the factor right here is also scaled in units of item three; that is the variance of the factor in the scale of item three. These are the residual variances of
            • 70:00 - 70:30 items three, four, and five, and I know that because there's a dot in front of each, which means residual. You can look at the p-value, but honestly all it's testing is that the coefficient differs from zero, which I don't know is that interesting for a loading. And then basically I represented each of those parameters in the path diagram, and you can see that you
            • 70:30 - 71:00 know this is the marker method because I'm fixing the loading of the first item to one. And I interpret these for you. So, any questions about the interpretation? "I have a quick question about the model: why didn't we specify a y-intercept?" Ah, that's a good question. Yes, in the other diagram I drew it, and the reason is tradition.
            • 71:00 - 71:30 Factor analysis and SEM came from a tradition of only having the covariance matrix, your S. In older programs from the 90s, like EQS, a program my advisor created, and also LISREL, the first programs for SEM or CFA only took the covariance matrix:
            • 71:30 - 72:00 the only input for your data was a covariance or correlation matrix. So by default, by tradition, means aren't estimated. I think that's only because back then computer memory wasn't available, or people didn't keep the raw data. Now we have the full data available, so we can estimate the intercepts. But for some reason, by default, many packages still
            • 72:00 - 72:30 don't estimate the intercept. But remember, we are able to get it. So what do we have to add to this syntax to get the intercept? A "1 +", right. You can try that on your own: if you add the "1 +", you're going to get the intercept.
            • 72:30 - 73:00 But see, this is why you need to understand the theory and the specification of the syntax, what lavaan is actually doing: if you just throw the model in there you get output, but you wouldn't know, for example, that the default is not to estimate the intercepts, and that's why you need to add the "1 +". So that's a good question; a hedged sketch is below. Any other questions?
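A hedged sketch of two ways to get the item intercepts (means) estimated; the seminar phrases it as adding a "1 +", and in a CFA model string that usually takes the form of "~ 1" lines for the items, or a mean structure requested in the cfa() call. The item names and `dat` are assumptions, as before:

```r
library(lavaan)

# Option 1: add an intercept term for each item in the model string
m1a_means <- '
  f =~ q03 + q04 + q05
  q03 ~ 1
  q04 ~ 1
  q05 ~ 1
'
fit_means <- cfa(m1a_means, data = dat)

# Option 2: same idea via the mean-structure option
fit_means2 <- cfa('f =~ q03 + q04 + q05', data = dat, meanstructure = TRUE)
```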
            • 73:00 - 73:30 Okay, this is where you have to think a little bit. You're definitely paying attention, but it's not just a matter of looking at the syntax and understanding it. Remember that NA* is what we need to free a parameter, and 1* fixes a parameter to one. So what is this doing? I think you have the tools to understand what this is doing.
            • 73:30 - 74:00 Does anyone want to volunteer? If not, it's okay, I can go over it, but I think you have the tools. No volunteers? It is kind of challenging. Oh, somebody already answered in the chat. Yeah, it is the variance standardization method, but I want to go over the details of how it's doing that.
            • 74:00 - 74:30 Number one, lavaan uses the marker method by default, which means it fixes the first loading to one. That means we have to override the default marker method, and the override is to free the first loading. But then, to do the variance standardization method, you have to fix
            • 74:30 - 75:00 the variance to one, and that's what this is doing. And like someone said before, the double tilde means covariance, but the covariance of something with itself is the variance, so this is the variance of the factor. Any questions about why we use the NA* and the 1*?
            • 75:00 - 75:30 There is an easier way to do this; I don't know if I cover it here, but it is on the web page, so you can look there. Basically, instead of doing it this manual way, in the cfa() call you can add std.lv = TRUE, and that's another way to
            • 75:30 - 76:00 do exactly what we just did: you can leave the model string as the default marker method and just set std.lv = TRUE. You can see that on the website, so that's more incentive to read it; there's way more material on the website. Both versions are sketched below.
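A sketch of both routes to variance standardization, again assuming the same three SAQ items and a data frame called `dat`:

```r
library(lavaan)

# Manual way: free the first loading (NA*) and fix the factor variance to 1
m1b <- '
  f =~ NA*q03 + q04 + q05   # override the marker-method default
  f ~~ 1*f                  # factor variance fixed to 1
'
onefac3items_b <- cfa(m1b, data = dat)

# Shortcut: keep the plain model string and let std.lv = TRUE do the same thing
onefac3items_b2 <- cfa('f =~ q03 + q04 + q05', data = dat, std.lv = TRUE)
summary(onefac3items_b)
```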
            • 76:00 - 76:30 Okay, so let's look at the output, the estimate column. How do I know that this is now the variance standardization method? I think you know, based on what we've talked about. Because there's no blank row for any of the variables, right? The loadings are all filled out. Yeah, that's one hint. Anything else? And the factor variance estimate is one. Yeah, and more specifically we're estimating
            • 76:30 - 77:00 the loadings freely here, right? So these are the loadings and these are the variances. Really good job; that's exactly how we know we've done the variance standardization method: the factor variance is fixed to one. Now, if we hadn't added the NA*, we would have seen a one here for the first loading as well. Is that okay? Technically it'll still run, I guarantee you, but you're fixing too many things: you want a just-identified model, but you're artificially saying that that
            • 77:00 - 77:30 loading is one, which you don't want. Okay, really good job. So how do we interpret, say, item q04 now? The factor variance is now one, the factor is standardized, so you can think of it as a standard deviation increase. So for a one standard deviation increase in SPSS anxiety...
            • 77:30 - 78:00 this is SPSS anxiety, remember; it's a latent variable, so it's a circle. For a one standard deviation increase in SPSS anxiety, item four goes down by 0.665 points. Now be careful, because the item itself isn't scaled to a variance of one; it's still in the original metric of item four. But the variance of the factor is now
            • 78:00 - 78:30 scaled to one. So for a one standard deviation increase in SPSS anxiety, "I dream that Pearson is attacking me with correlation coefficients" goes down by 0.665 points on the Likert scale. Any questions about that? Okay, and then finally, there's another way of standardization I haven't talked about. We've talked about the marker method,
            • 78:30 - 79:00 we've talked about the variance standardization method, and then there's something called the standardize-all (std.all) method. First, the syntax: in the summary we can use the same model, and all we have to do is say standardized = TRUE. Honestly, this is the shortcut way to get everything at once; you don't need any of that NA* stuff I just showed you (see the sketch below).
            • 79:00 - 79:30 If you just put in standardized = TRUE, how do I know it gives me all three things? Well, you know. Let me show you: column one, column two, and column three. You can put it in the chat: which column is the variance standardization method, column one, two, or three?
            • 79:30 - 80:00 Okay, I got "two." Good. Which one is the marker method? Yeah, exactly, one is the marker method; you know that because it fixes the loading to one, so good job. Now the only column we haven't seen is three. So what is three? Three means:
            • 80:00 - 80:30 if you've looked at standardized beta coefficients in linear regression, that's basically what this is doing. Not only is it standardizing the variance of the factor to one, like Std.lv, it's also standardizing the items themselves, like dividing by the standard deviation of the item too. So it's like a correlation coefficient: it standardizes both, the way the
            • 80:30 - 81:00 standardized beta coefficient in linear regression standardizes by both the predictor and the outcome, which here means the factor and the item. So for a one standard deviation increase in SPSS anxiety, item four goes down by 0.701 standard deviation units; that's the difference, standard deviation units. Essentially this is like a correlation, and it's what you see more typically when people talk about standardized
            • 81:00 - 81:30 loadings. When you see "standardized loadings," it typically means this column, not Std.lv. The reason we like these standardized loadings is that you can compare them relative to each other: for example, the magnitude of the loading of item 3 is lower than the loading of item 4, 0.54 versus 0.701,
            • 81:30 - 82:00 but the sign is flipped. We can also look at them relative to 1 or negative 1. So which one has the highest loading? We can say item four has the highest loading in magnitude. The sign tells us the relationship between the items and the factor: basically item three has a positive relationship with SPSS anxiety, and items four and five
            • 82:00 - 82:30 have a negative relationship with SPSS anxiety, and you have to check whether you have some reverse coding going on, because ideally you don't want the signs flipping the other way. You can check that the variance of the factor is scaled to one by looking at this: the variance is one. Notice, though, that the residuals differ from the Std.lv column: even though the factor variance is one in both, the
            • 82:30 - 83:00 residuals are different because this column also standardizes by the items themselves, whereas that one leaves the items in their original scale. Any questions about the interpretation of these columns? Okay, I know that's a lot of information, so what we're going to do is take a break and come back
            • 83:00 - 83:30 at maybe 2:30. Feel free to ask me questions before then. "Could I ask a quick question about adding the intercept? When would you want to do that?" Okay, so generally it's good to estimate the intercepts if you have the full data, so if you don't just have the
            • 83:30 - 84:00 correlation table or the covariance table and you have the full data, it's usually good to estimate the intercepts. I just didn't want to show you the intercepts because they add more parameters to your output, but it's definitely something you can estimate. The interpretation is just not that interesting, and most people care about the loadings.
            • 84:00 - 84:30 For CFA, if you just look at the loadings, that's really what's important; some people don't really care about the intercepts, and that's why they're not part of the default output. But you can certainly estimate them, and I encourage you to do so if you have the full data available. "But it wouldn't change the loadings, right?" No. You can try it, but I don't think it will change the loadings.
            • 84:30 - 85:00 "Can you say that again? You're cutting in and out, I can't really hear you." "Adding the intercepts to the model, does it keep the model just identified?" That's a good question. The mean structure is separate from the covariances: the observed means give you additional known values,
            • 85:00 - 85:30 but you also estimate the corresponding intercepts, so they cancel out and the model stays just identified even with the intercepts. Basically, your known values become p(p + 1)/2 + p, and by estimating the p intercepts you cancel out that extra p, so your degrees of freedom stay the same; a quick sketch of the counting is below.
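A quick sketch of that counting for the three-item model (marker method assumed):

```r
p <- 3
knowns_cov   <- p * (p + 1) / 2      # 6 variances and covariances
free_cov     <- (p - 1) + 1 + p      # 2 loadings + 1 factor variance + 3 residuals = 6
knowns_means <- knowns_cov + p       # add the 3 observed means -> 9
free_means   <- free_cov + p         # add the 3 intercepts -> 9
c(df_without = knowns_cov - free_cov,
  df_with    = knowns_means - free_means)   # both 0: still just identified
```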
            • 85:30 - 86:00 Any other questions? So this is the thing, and the exercises will help, Kim: I was looking at that too, and the relationships do look a bit weird with these three items, but that's basically because we're not using the full survey. By using only three items, is it really assessing SPSS anxiety,
            • 86:00 - 86:30 or is it assessing something else? The more items you have, the more confident you can be, and when you run the exercises you'll see that the patterns actually change. So this is my argument: the more items you have, the better, the more reliably you're going to estimate the factor. Here we're only taking a small subset of the SAQ; the SAQ actually has 20 items and I'm only taking eight of those,
            • 86:30 - 87:00 so definitely look at the exercise a little later and see whether the loadings change a bit. So basically, having just those three items doesn't necessarily reflect SPSS anxiety, and you're going to see some strange relationships there. The other thing I want to tell you is that this data set is completely made up. If you
            • 87:00 - 87:30 have heard of Andy Field, I borrowed the data set from him, and I think he literally just used the computer to randomly generate numbers. So don't over-interpret anything here; the directions and so on may be messed up only because it's a fake data set, and that's the real reason you're seeing strange patterns. "Johnny, there are a couple of questions about when you should use the marker method versus
            • 87:30 - 88:00 the variance standardization method." Oh, okay. Honestly, in my experience you don't interpret either of them; you use the standardize-all method. The marker method and the variance standardization method are literally just the default and an alternative. Both are going to be standardized with std.all anyway, so regardless of marker method or variance standardization method, that's
            • 88:00 - 88:30 just to identify the CFA; for interpretation, you usually use std.all. Now, there's a special circumstance when you don't want to use std.all, and that's when you have categorical predictors. We're not going to go over that, but basically if your predictor is a categorical
            • 88:30 - 89:00 variable, like a 0/1 dummy variable, you don't want to standardize by that item; you want to leave it unstandardized. So that's about the only situation where you would use std.lv. Just think of marker versus variance standardization as ways to identify the model; the interpretation really comes from standardize-all.
            • 89:00 - 89:30 Oh, it's 2:30. Okay, did you get enough of a break? [Laughter] Maybe it's not enough of a break. All right, we just have two more topics; this is the harder one, so if you have enough brain power left to pay attention, I appreciate it. It's going to go quicker, though, because I
            • 89:30 - 90:00 removed some of the very, very technical material based on my colleagues' suggestions. The last two things we're going to talk about, and this is really the major thing, is model fit. We talked about degrees of freedom, and we want the degrees of freedom to be positive so that we have an over-identified model; this is where we ask, now that we have an over-identified model, how well does it fit? Did we have an over-identified model
            • 90:00 - 90:30 with our three-item CFA? No, because that was just identified. So can we even talk about model fit there? No. That's why for model fit you really want to consider models with more than three items: a three-item CFA is just identified, so you want more than that. This is the situation where you have more than three items, and we'll do that in the exercises. The first thing you want to know is the model chi-square, and the second is the approximate fit indices.
            • 90:30 - 91:00 This is where it gets a little math-stat heavy, so if it isn't quite clicking, and I know it's a lot to process, don't worry about it too much beyond how to interpret the p-value. All right, so the hypothesis in a CFA, and this applies to SEM too if you're interested in SEM: the null hypothesis is that the model-implied covariance matrix
            • 91:00 - 91:30 equals the population covariance matrix. If you think about what that means conceptually, it's basically saying that my model reproduces the population: is my model good enough to perfectly reproduce the population covariance matrix? Note that this is an example of exact fit; it means exactly equal, which may or may not be true, and that's why it's called an exact
            • 91:30 - 92:00 fit hypothesis; we'll talk about close fit later. Basically, you want this to happen; this is called an accept-support test. That's counterintuitive, because in a linear regression model you typically have a reject-support test: you want to reject the null hypothesis, because in linear regression you want the beta coefficients to be nonzero. But in SEM or CFA you actually don't
            • 92:00 - 92:30 want to reject, because rejecting says the model-implied covariance matrix is not equal to the population covariance matrix, and that's a bad thing. So what we do want is to "accept," quote unquote, or fail to reject, the null hypothesis. And what does that mean for our p-value: do we want it to be less than 0.05 or greater than 0.05, assuming our
            • 92:30 - 93:00 alpha is 0.05? Yes, Kim said greater, and that's exactly right: with a p-value greater than 0.05 we fail to reject the null hypothesis, which is what an accept-support test wants. Very good, so we want that. Now, an aside that's actually kind of important: given that we want to retain the null
            • 93:00 - 93:30 hypothesis, sample size is actually a deterrent, because sample size, n, is positively related to power, the ability to reject the null hypothesis. So that's a catch-22, a conundrum: by increasing your sample size you gain more power, but you're more
            • 93:30 - 94:00 likely to reject your null hypothesis. That's part of why other measures were developed besides the chi-square, because this exact-fit test is based on the chi-square. All right, so that was an aside. As you know from linear regression, we are estimating the population with the sample,
            • 94:00 - 94:30 so the null hypothesis is about the population, not the sample. This is the population here, and this is the sample on the second row; this is the model side and this is the covariance side. So remember, we have the model-implied covariance matrix, we have
            • 94:30 - 95:00 the population covariance matrix, and now we have the estimated (sample) model-implied covariance matrix and the sample covariance matrix. We've seen all these terms; the only one we haven't seen is this one. What it's saying is that I'm going to estimate all the terms you see above, where the hat means "estimated": lambda-hat, psi-hat, lambda-transpose-hat, plus theta-epsilon-hat. What we've seen before
            • 95:00 - 95:30 are all the hats: all of these were estimated from our sample. Those loadings are hats, those residual variances are hats. What's the point of that? It allows us to create what we call the residual covariance matrix, and don't confuse that with the residual variances. What this is saying is: the sample covariance matrix
            • 95:30 - 96:00 minus the estimated model-implied covariance matrix. Why is that interesting? Because we're trying to say something about the population, but what we have is the sample. In the population, the corresponding quantity is sigma minus sigma of theta; the pieces are written out below.
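To keep the symbols straight, here is one way to write the quantities just described (a sketch; the notation follows the slides as read aloud):

```latex
% Null hypothesis of exact fit (population level):
H_0:\; \Sigma(\theta) = \Sigma
% Sample-based, model-implied covariance matrix (hats mean "estimated"):
\Sigma(\hat{\theta}) = \hat{\Lambda}\hat{\Psi}\hat{\Lambda}' + \hat{\Theta}_{\epsilon}
% Residual covariance matrix, the thing we actually inspect in the sample:
S - \Sigma(\hat{\theta}) \approx 0 \quad \text{if the model fits well}
```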
            • 96:00 - 96:30 In the population, what is the difference sigma minus sigma of theta? Look at the null hypothesis: under the null hypothesis, yeah, it's zero, good job.
            • 96:30 - 97:00 That's a great response. But in our sample, is it necessarily equal to zero? I argue it's not, because S may be off in estimating the population covariance matrix, and the model may be off in estimating the model-implied matrix; there's sampling error. If it's truly zero in the population, you should be getting values like 0.01 or maybe −0.05,
            • 97:00 - 97:30 values that are close to zero. If you get five million, that's not good. So you want this to approach zero, and that's really the goal: you want your model, in your sample, to reproduce the sample covariance matrix as closely as possible. So here's the second poll, to see if that makes sense to you.
            • 97:30 - 98:00 Okay, if you have some time, I'm going to launch the poll now; I'll read through the questions, just three of them. One: the residual covariance matrix is defined as the population covariance matrix minus the model-implied covariance matrix; it will never be exactly zero but can approximate zero. Two: the goal of SEM (I actually mean CFA,
            • 98:00 - 98:30 it's the same thing) is to recreate the population covariance matrix using model parameters; therefore we do want to reject the null hypothesis. And finally, three: the larger the sample size, the more likely we will reject the null hypothesis. I'm talking about CFA too; SEM and CFA are basically the same thing here. Okay, let me get something while you do that.
            • 98:30 - 99:00 okay i'll give you guys maybe like um 30
            • 99:00 - 99:30 more seconds
            • 99:30 - 100:00 i would say the number one is probably the trickiest and don't worry if you get it wrong like you know sometimes i'm just trying to trick you on purpose and my intention isn't to make you feel bad it's really just to get you thinking about like okay what exactly am i saying here and once you get it wrong then you'll know exactly why you got it wrong right that's part of the learning experience
            • 100:00 - 100:30 Okay, I'll close it in about 10 seconds: 5, 4, 3, 2, and 1. Okay, so the first one was kind of hard, I guess. The residual covariance matrix is defined as the sample covariance matrix minus the sample model-implied covariance matrix. So remember, and you can move the poll window if it's
            • 100:30 - 101:00 bothering you, what I'm talking about is this thing right here, not the thing on the top row. What I was asking is whether the residual covariance matrix is the population difference. No, because in the population, under the null hypothesis, that difference is exactly zero; the residual covariance matrix is the sample version, which should only approximate zero.
            • 101:00 - 101:30 So the residual covariance matrix is about the sample, because in the population, under the null hypothesis, the difference will always be zero; that's why number one is false. Number two is false because this is an accept-support test, not a reject-support test: we actually want to not reject the null hypothesis, and you got that one. And three:
            • 101:30 - 102:00 the larger the sample size, the more likely we will reject the null hypothesis, and that is exactly why we have approximate fit indices. Any questions about the quiz before we go on? Okay, so this is what we're talking about in terms of the model chi-square. This is the model we just fit, and this is a
            • 102:00 - 102:30 model you're going to fit in the exercise, with eight items. Remember, the model we just fit is just identified; we know that because the degrees of freedom are zero, and you now know exactly what that means. That also means the test statistic is zero, because you can't really assess fit with it. You know what these are, right? Now let's say we have an eight-item, one-factor model; then we have 16 free parameters, and the counting is sketched below.
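The counting behind those numbers, as a quick sketch (one factor, eight items, marker method assumed):

```r
p <- 8
knowns <- p * (p + 1) / 2      # 36 variances and covariances
free   <- (p - 1) + 1 + p      # 7 loadings + 1 factor variance + 8 residuals = 16
knowns - free                  # 20 degrees of freedom -> over-identified
```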
            • 102:30 - 103:00 Now we actually have positive degrees of freedom, which means it's an over-identified model, and we can actually assess the test statistic. Look at the p-value: what is it saying about our null hypothesis? The null is rejected. Right, and that's actually bad, because if we wanted this to be
            • 103:00 - 103:30 our model, we just said nope, this is not the model we want, we rejected it. That's a problem, because our sample size, remember, was 2,571, and with such a large sample size we're probably going to get a p-value less than 0.05, which means we're probably going to reject our model. So that's the conundrum I was talking about, and this is where the
            • 103:30 - 104:00 approximate fit indices come in. What we talked about was exact fit; exact fit means sigma of theta equals sigma in the population. Approximate fit doesn't force that to be exactly true, as long as you're reasonably close. So there are two types of approximate fit indices. Oh, is there a question?
            • 104:00 - 104:30 All right, so there are two types of fit indices: absolute and incremental. You don't actually have to know what that distinction means, but basically there's the CFI, the TLI, and then there's the RMSEA; those are the main ones you want to know. But before we get to those, we have to know what a baseline model is.
            • 104:30 - 105:00 So remember the variance-covariance matrix; this is sigma of theta. What is the baseline model saying? If you look at the path diagram, what are these estimating? Well, these are estimating residual variances, but there's nothing predicting the items, right? If there's nothing predicting them,
            • 105:00 - 105:30 no factor predicting them, then what do you think the residual variance is? If there's nothing predicting an item, there's no residual: the residual variance is literally just the variance. And what are we doing with these? We're estimating the variance of
            • 105:30 - 106:00 y1, the variance of y2, all the way down to the variance of y8. Do you see covariances in this diagram? No covariances, right? That means that in our model-implied covariance matrix, if you look at the highlighted part, we are estimating the variances of y1, y2, y3, through y8, but we are not estimating
            • 106:00 - 106:30 any of the covariances. This is called the baseline model. Why is it called a baseline model? It's the worst model, because in reality variables are probably related to each other, and that's why we're doing the factor analysis. Now, what do you think the orange one is? What is it saying? Just think about
            • 106:30 - 107:00 what the opposite of a baseline model would be. Yeah, right: not only are we estimating the variances, we are estimating all possible covariances, the whole upper triangle of the matrix. In that case the number of known values
            • 107:00 - 107:30 equals the number of parameters, and our degrees of freedom are zero. Doesn't that look familiar? What kind of model is this? Saturated, yeah, exactly, and the other term for that is just identified, which is exactly what we've been doing: just identified, or saturated.
            • 107:30 - 108:00 Really good job; you're still awake. To understand fit indices, you have to understand the difference between a baseline model and a saturated model, and think about where your model lies: your model is somewhere between the baseline model and the saturated model, and that is the premise of fit indices. So bear with me here; this is as
            • 108:00 - 108:30 simplified as I can make it without getting too technical. What these approximate fit indices are saying, especially the CFI and TLI (not necessarily the RMSEA): think about the denominator. Remember the baseline is the worst model and the saturated model is the best. You can take the difference between the
            • 108:30 - 109:00 saturated model and the baseline; that's the biggest difference you can get, and that's the denominator. What about the numerator? Your model, called the user model, is somewhere between the baseline and the saturated model, but how far between is it? You take that difference and divide it by the difference between the best and the worst.
            • 109:00 - 109:30 What would this ratio be if your user model were the saturated model? One, good job, perfect, you're getting it. That is why we want, for example, the CFI and TLI to approach one. Now think about the opposite: if your user model shifts
            • 109:30 - 110:00 towards the baseline model, the ratio approaches zero, and that's bad. That is why the CFI and TLI should fall between 0 and 1 but approach one; a rough sketch of the ratio is below.
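A rough sketch of that ratio; the CFI as usually defined is slightly different in that each model's "badness" is measured by its chi-square minus its degrees of freedom (this is my paraphrase, not a formula from the slides):

```latex
\text{relative fit} \;\approx\;
  \frac{\delta(\text{Baseline}) - \delta(\text{User})}
       {\delta(\text{Baseline}) - \delta(\text{Saturated})},
\qquad \delta(\text{Saturated}) = 0
% With \delta = \max(\chi^2 - df,\, 0), this becomes the usual CFI:
\text{CFI} = 1 - \frac{\max(\chi^2_{U} - df_{U},\, 0)}
                      {\max(\chi^2_{B} - df_{B},\; \chi^2_{U} - df_{U},\, 0)}
```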
            • 110:00 - 110:30 So that's the rationale for wanting a CFI or TLI close to one. The RMSEA is a little different. Remember we talked about exact fit; that's exact fit for the model chi-square. The RMSEA does something a bit different, where we talk about close fit. You don't have to know much about the null hypothesis here, but it's based on what we call a non-central chi-
            • 110:30 - 111:00 square; it doesn't matter. Basically what you have to know is: if you get an RMSEA of less than 0.05, it satisfies close fit; if it's between 0.05 and 0.08, it fails close fit but isn't poor, so it's kind of in between; and anything greater than or equal to 0.10 is poor fit, which is not good. So if you have close fit or approximately close fit, you're good to go for the RMSEA.
            • 111:00 - 111:30 Okay, so this is just a summary. The model chi-square, remember, is likely to be rejected given a large sample size; it gets rejected more often the larger the sample. The CFI ranges between zero and one, and how big should it be? Approaching one; they say between
            • 111:30 - 112:00 0.90 and 0.95 is good, so usually you want the CFI to be around 0.95. The TLI can be a little lower, around 0.90; the CFI is typically going to be bigger than the TLI. And for the RMSEA you want it to be less than or equal to 0.05 for close fit, but honestly I see a lot of people with approximately close fit, so between
            • 112:00 - 112:30 0.05 and 0.08 is okay. So let's look at the fit; because we couldn't get fit measures for the three-item model, I'll just show you the fit for the eight-item one, which you'll be fitting in the exercises. This is a quick overview of what you're going to look at. In terms of the lavaan syntax, you add fit.measures = TRUE right here; remember standardized = TRUE is just to get the standardized loadings. A sketch of the call is below.
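A hedged sketch of the call for the eight-item model discussed next; `dat` and the item names are assumptions:

```r
library(lavaan)

m_eight <- 'f =~ q01 + q02 + q03 + q04 + q05 + q06 + q07 + q08'
fit8 <- cfa(m_eight, data = dat)
summary(fit8, fit.measures = TRUE, standardized = TRUE)  # chi-square, CFI/TLI, RMSEA, SRMR + loadings
```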
            • 112:30 - 113:00 That's the only thing you add in the summary, and you'll be fitting this a little later. Now, we know the number of observations is 2,571; that's big, so we're probably going to reject our null hypothesis. And there's the model chi-square, with 20 degrees of freedom. This is the baseline, remember: the model where we specify only the
            • 113:00 - 113:30 variances and not the covariances. It has higher degrees of freedom because the user model frees up eight additional parameters relative to it. Don't worry about its p-value, but that is the baseline model, and it's what you're going to use to calculate the CFI in the exercises. Now the important part: let's look at the CFI. What do you think? The criterion was 0.95; honestly,
            • 113:30 - 114:00 we don't even satisfy 0.90, so it's not great, but it's not terrible. Let's look at the RMSEA: it basically says poor fit. Now, the confidence interval is actually kind of useful: you want to see whether the lower bound reaches close fit
            • 114:00 - 114:30 and the upper bound stays away from poor fit. If the confidence interval looked more like this, I'd be happier, because then you'd satisfy close fit at the lower bound and approximate fit at the upper bound. But here the lower bound doesn't even satisfy 0.08, which is kind of our requirement,
            • 114:30 - 115:00 and the upper bound goes pretty much up to 0.10. You see that? So that's not really good, and the p-value says you reject the close-fit hypothesis, which is also not good. Finally, you have the root mean square residual, which comes from S minus sigma-hat; we haven't covered the details, but basically that's where it comes from, and then you standardize it and take the square root.
            • 115:00 - 115:30 You want this close to zero, and this one is actually okay for the standardized root mean square residual (SRMR). So my point here is that you need to take all the fit measures you're given into context. From the overall picture, though, it looks like this isn't the best-fitting model. So, can you think of ways to improve the fit of a model in CFA?
            • 115:30 - 116:00 Can we add more factors? That is good. So for example, maybe there are actually two factors, and the items load onto two factors instead of one; that's a good suggestion. What's another suggestion? Because you're going to face this problem, I guarantee it: you'll run your analysis and think, oh no, my fit is terrible, what am I going to
            • 116:00 - 116:30 do? Can you look at the fit of the individual items and then potentially remove some? That is really good. So look at the loadings, and I'm talking about standardize-all: if some of them are 0.8 and some are 0.2, maybe you want to remove the ones around 0.2 and then reassess the fit of your model. Those are all excellent suggestions, and honestly, CFA
            • 116:30 - 117:00 is kind of iterative; you have to play around with it. Just know that the more things you do to it, the more you capitalize on chance. I won't talk about this in this beginner seminar, but there's something called cross-validation that I recommend: splitting your data set into a training set and a testing set. You start with the training set and then you test. So let's say you have about 2,500 observations; the first
            • 117:00 - 117:30 1,571 we train on, and the next 1,000 we test on. You can fit as many models as you want on the training set, remove items, add more factors, and then you settle on one final CFA, run it on the test set, and see whether the fit is just as good there. If it is, then you've cross-validated your CFA; a rough sketch is below. Does that make sense? And if not, request an advanced seminar for CFA.
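A rough sketch of that train/test idea (the row counts follow the example above; in practice you would usually split at random, and `dat` is assumed as before):

```r
library(lavaan)

m_eight <- 'f =~ q01 + q02 + q03 + q04 + q05 + q06 + q07 + q08'
train <- dat[1:1571, ]      # explore and modify the model here
test  <- dat[1572:2571, ]   # touch only once, with the final model

fit_train <- cfa(m_eight, data = train)
fit_test  <- cfa(m_eight, data = test)
fitMeasures(fit_train, c("cfi", "tli", "rmsea"))
fitMeasures(fit_test,  c("cfi", "tli", "rmsea"))   # similar values suggest the model cross-validates
```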
            • 117:30 - 118:00 All right, so those are great suggestions for improving model fit. Okay, we're not going to take a break because this next part is really quick. So basically, like someone suggested (sorry, I should know your names by now, and I think you've been responding a lot), we can think about adding two factors
            • 118:00 - 118:30 instead of one factor. Just know that lavaan does something by default here too: not only does it use the marker method by default, it also covaries the factors by default. So by default lavaan will covary the factors, and if that's not what you want, you have to turn it off. The other thing is,
            • 118:30 - 119:00 notice how many items load onto factor two. Let's say factor one is SPSS anxiety and factor two is more like computer anxiety, overall anxiety with technology; that's what I mean by a second factor. SPSS anxiety and computer anxiety should be correlated, right? But what if I only have two items, like "I hate using computers" for y6 (I don't even
            • 119:00 - 119:30 remember exactly), and let's say the other one is something like "I don't know how to fix a computer." Say you only have two computer items in your survey. But remember what we said about identification: we needed three items. I won't go over the details; this is basically going to be in my advanced seminar, if you take it and if you want it. But
            • 119:30 - 120:00 by correlating the factors, a two-item factor automatically becomes identified. So you have to have a correlation with another factor to identify a two-item factor; without that, it would be under-identified. We won't cover that in detail today. All right, so what standardization method are we using here, marker or variance
            • 120:00 - 120:30 standardization? Variance standardization, right. Can we use the marker method? Of course. Can you mix and match? You could, I think, do the marker method for one factor and variance standardization for the other, but I wouldn't recommend it; just pick one. Now, in terms of lavaan, you basically just add a second factor, sketched below:
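A hedged sketch of that two-factor model; exactly which SAQ items sit on which factor is an assumption here (q06 and q07 as the two "computer" items), and `dat` is assumed as before:

```r
library(lavaan)

m_twofac <- '
  f1 =~ q01 + q03 + q04 + q05 + q08   # SPSS anxiety (five items)
  f2 =~ q06 + q07                     # computer/technology anxiety (two items)
'
fit_twofac <- cfa(m_twofac, data = dat, std.lv = TRUE)  # variance standardization
summary(fit_twofac, standardized = TRUE)
# Note: by default lavaan also estimates the factor covariance f1 ~~ f2.
```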
            • 120:30 - 121:00 Instead of calling it f, we call them f1 and f2; two items load onto f2, and one, two, three, four, five items load onto f1. Now, maybe this is a trick question, but what method am I using here? Variance standardization? Yes. How do we
            • 121:00 - 121:30 know that? Because we have this right here. But it is a bit of a trick question, because by default lavaan specifies the marker method; it's actually overriding the marker method by specifying std.lv = TRUE. So good job. And finally, when we say standardized = TRUE, we get all three columns. So basically, you can look at the
            • 121:30 - 122:00 output and see that this isn't the marker method, because the first loading isn't one. Honestly, the std.all column is the only one I really look at when I look at a CFA. And you'll notice these loadings... what do you think? The standardized loadings should be between negative one and one, where one is really high and negative one is really high in the opposite
            • 122:00 - 122:30 direction. These don't look that high to me. Why aren't they high? Well, look at the variance explained by the factor: you square the loading. I didn't talk about that in this seminar, but I do in the EFA one. So 0.619 squared is about 0.38, which means
            • 122:30 - 123:00 38% of the variance in "statistics makes me cry" is explained by SPSS anxiety. Let's look at this one: 0.498 squared is only about 0.25, so only about 25% of the variance in q08 (you tell me what that item is) is explained by SPSS anxiety. To me that's not great: if you have a loading around 0.4 or 0.5, that's actually not that good,
            • 123:00 - 123:30 and that's part of the reason we're getting a low CFI too; the loadings largely determine your CFI. So this is not the best fit. Honestly, none of these are really great factor analysis models; I'm just showing you the real world, because textbooks always show you perfect data, and in the real world you're not going to get perfect data, and sometimes the
            • 123:30 - 124:00 direction of a relationship doesn't make sense. That's reality. So those are the loadings. Now, the new term we have here is this, and remember the double tilde means covariance. This is the covariance, and you can actually think of it as a correlation because the factors are standardized. So this is the correlation of factor one and factor two, and that is
            • 124:00 - 124:30 pretty high to me. That means SPSS anxiety and fear of technology are highly related, but they're not perfectly related, so you're not going to get a one. And again, these are the residual variances, and these are the variances of the factors, and we know they're standardized because they're ones, which means this is not the marker method; this is the variance standardization method.
            • 124:30 - 125:00 And this column is not the variance standardization method; it's standardize-all, because the residual variances are different from the variance standardization column. Any questions about that? I think that's the major thing I wanted to talk about. And then finally, what if you want to uncorrelate the factors? What would you do to uncorrelate them?
            • 125:00 - 125:30 Set the covariance to zero, exactly, good job. So basically, right here is the extra term: f1 covaried with f2, set to zero; the syntax is sketched below. Now, you probably don't expect this, but this is the output you get. Anyone have an idea of why it's giving me this warning? Now you're the tech support.
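A sketch of turning that covariance off (same assumed two-factor model as above):

```r
library(lavaan)

m_twofac_uncorr <- '
  f1 =~ q01 + q03 + q04 + q05 + q08
  f2 =~ q06 + q07
  f1 ~~ 0*f2          # fix the factor covariance to zero
'
fit_uncorr <- cfa(m_twofac_uncorr, data = dat, std.lv = TRUE)
# With only two indicators and no covariance with another factor,
# expect a warning along the lines discussed next: f2 is not identified.
```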
            • 125:30 - 126:00 "Can you say that again?" Oh, were you talking to me? So why is this giving me a warning? I ran it, and it's not the syntax, I guarantee you it's not the syntax. This is where you have to know about identification. "Because now f2 will not be identified; we have only two items." Yes, perfect. So like I said before,
            • 126:00 - 126:30 and perfect job understanding that: we have a two-item factor here, and when I said it's identified, that was when the factors were correlated; now I've turned that off, and it's no longer identified. Very good job. Okay, last poll, and then we get to the exercises; thanks for bearing with me. Number one: by default lavaan
            • 126:30 - 127:00 correlates the factors in a two-factor CFA. These are pretty easy. Number two: either the marker or variance standardization method can be used for a two-factor CFA. And number three, probably the trickiest one: turning off the factor covariance is an assumption; it doesn't mean there's actually no factor covariance. Amazing,
            • 127:00 - 127:30 yeah i i made this one a little easier
            • 127:30 - 128:00 because I know the other ones were super challenging in terms of trying to trick you; this one should be pretty straightforward, so trust your instinct. I'll give you maybe 20 more seconds. All right: five, four, three, two, and one. By default lavaan correlates the factors in a two-factor CFA: that's true. If you just leave lavaan at its defaults, it's going to do the marker
            • 128:00 - 128:30 method and covary the factors. Does that mean I can use the variance standardization method? Of course; you can use any of them, you just have to say which one you want, and honestly you can get all three with standardized = TRUE in the summary. The third one you got right, but I just want to be clear, because a lot of people ask: if I remove the covariance, does that mean my factors aren't
            • 128:30 - 129:00 correlated anymore? Yes, but that's an assumption: just because you say it's zero doesn't mean it's actually zero. The data are the data; you're still going to have a covariance structure. But what does that imply about your model fit? If there is a covariance between the factors and you constrain it to zero, that's an incorrect assumption, so your model-implied covariance matrix
            • 129:00 - 129:30 isn't going to match the population, and you're going to get poor fit. Good job. Okay, so this is the intermission. Yeah, go ahead. "Under what circumstances should you set the factor covariance to zero?" Almost never, because you can always covary them first, see if the correlation is near zero, and then fix it later. But remember, you
            • 129:30 - 130:00 have to do cross-validation if you want to do that. All right, so this concludes the lecture portion of the seminar. We're going to take a quick break while you set up RStudio, and then we're going to go over three exercises if we have time, otherwise two. This is where we get more interactive and you can play around with lavaan. So let's resume at 3:15.
            • 130:00 - 130:30 I'll give you some time to set up lavaan and R.
            • 130:30 - 131:00 i had a real quick question yes so
            • 131:00 - 131:30 with the data that we're dealing with
            • 131:30 - 132:00 right now — and i'm guessing a lot of the cfas and sems people run are going to be dealing with likert-scale type data — you were saying earlier, and i just want to make sure i understood you clearly, that you don't really have to specify that the data is ordinal in that instance, as long as it's normally distributed?
            • 132:00 - 132:30 yeah, that's a great question. honestly, ordinal regression models were developed much later than factor analysis — ordinal logistic regression wasn't developed until they came up with generalized linear models, and that was, i don't know, the 80s or something, whereas factor analysis was developed in the early 1900s. so you're not going to get a lot of overlap there, and really
            • 132:30 - 133:00 as long as your data has, let's say, around four to seven categories, i think that's okay — as long as it's not two, like a binary yes/no; there are other methods to deal with that. so you wouldn't treat it as ordinal, you would treat it as continuous. okay, and then if it fails normality, i think there are some other estimation methods you could use to handle that? exactly — there's something called the chi-squared
            • 133:00 - 133:30 correction, something called the satorra-bentler correction — bentler was my advisor — and it adjusts the chi-square so that it works with non-normal or skewed data. there are also other estimation methods like weighted least squares. so those are the kinds of solutions they've come up with if you have non-normal data or categorical data. and then mplus is definitely the way to go if you want to work with categorical
            • 133:30 - 134:00 data — there is something called binary factor analysis or categorical factor analysis, and to my knowledge lavaan doesn't do that. so that's really a limitation of the software, because those techniques were developed much later — probably in the 80s versus the early 1900s. lavaan is apparently still beta software, so it's not actually that well developed compared to
            • 134:00 - 134:30 mplus, for example. okay, great — yeah, i'll have to look into that. i've tried lisrel before. wow, you've tried lisrel — that is the heart of sem, and if you understand lisrel you will understand every other program. that's great, because that's kind of how i taught my sem class, or seminar. but yeah, if you understand lisrel you will understand mplus for sure. thank you very much.
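A minimal sketch of the estimator options referred to above — `estimator = "MLM"` requests the Satorra-Bentler scaled chi-square, and weighted least squares can be requested the same way; the model and the data frame `dat` are placeholders:

```r
library(lavaan)

m1 <- 'f1 =~ q01 + q02 + q03 + q04'

# Satorra-Bentler scaled chi-square for non-normal or skewed continuous data
fit_mlm <- cfa(m1, data = dat, estimator = "MLM")
summary(fit_mlm, fit.measures = TRUE)

# weighted least squares is another option, e.g. estimator = "WLS"
```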
            • 134:30 - 135:00 okay, so thank you all for staying till the end. this is the part where you get to get your hands wet a little bit, because we're going to do some exercises and you're going to work with lavaan yourself. exercise one is kind of what i was doing before when i showed the model fit indices — remember, with a three-item cfa you're not going to get any meaningful model fit indices — so i want you to practice getting some out of
            • 135:00 - 135:30 lavaan by fitting a cfa with all eight items in the saq, so that's q01 to q08, and fit it with all three methods. you can do it manually if you want more practice — i encourage you to try it manually — and then do it with the shortcut method that i showed you. then interpret the loadings, assess the fit, and name some reasons for the poor fit.
            • 135:30 - 136:00 okay, so what i'm going to do is give you maybe five minutes to do that, so until 3:23 let's say. yeah, lavaan does not do efa. so i'll give you until 3:23 — you don't have to finish it, you just have to start it so you get an idea. okay, and i'm not going to tell you where the
            • 136:00 - 136:30 answers are, but they're there somewhere, so i would encourage you not to look at the answers while you're doing this and just try it — the real way to learn is to do it yourself. the beauty of lavaan is that the syntax is so easy that i'm sure you can pick it up. and this is where i encourage you to be interactive — if you have a question, just unmute, or if you
            • 136:30 - 137:00 want to share how you're coding it or how you interpret it. are we doing one factor for this one? yes, one factor. and i'm going to show you my code and see if you can catch the error, because sometimes you code it and you're like, wait, why is there an error, and then you realize you flipped the sign or you
            • 137:00 - 137:30 left out a quotation mark or something. so, nandini, i'm going to leave some time after the exercises to answer any other questions, so hold off on that and we can address them after the seminar. all right, so how do you guys feel about doing that? it's not too bad, right — i think it's just getting the syntax right, and the good thing is we already kind of did this during the
            • 137:30 - 138:00 lecture, so it's really just for you to practice. okay, i'm going to share my screen, and then you tell me what i did wrong, because i literally made this mistake. so this is my model: i specify the factor using these eight items, then i pass model one into the cfa function, and then i request fit.measures = TRUE and standardized = TRUE. okay, so do you guys see anything wrong
            • 138:00 - 138:30 with this? what was the error? yeah, violet — okay, so i think i just flipped the equals with the tilde. let's see if that works. there we go, okay.
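The slip being described is writing `~=` instead of `=~` in the measurement line. A minimal sketch of the corrected exercise-1 model — the factor name is illustrative and `dat` stands in for the SAQ data frame:

```r
library(lavaan)

# one-factor CFA with all eight SAQ items; "=~" (not "~=") defines the factor
m_ex1 <- 'f =~ q01 + q02 + q03 + q04 + q05 + q06 + q07 + q08'

fit_ex1 <- cfa(m_ex1, data = dat)
summary(fit_ex1, fit.measures = TRUE, standardized = TRUE)
```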
            • 138:30 - 139:00 so yeah, the syntax ordering really matters. and this is a one-factor cfa like we were saying. anyone want to volunteer to interpret either the marker method or the variance standardization? that's kind of hard — pick an item, let's say
            • 139:00 - 139:30 no? it's kind of scary, right. okay, i'll interpret the marker method. let's look at item four: this is saying that for a one-unit increase in spss anxiety, the
            • 139:30 - 140:00 item 4 score increases by 1.3 units on the scale of item 1. okay, how about this one?
            • 140:00 - 140:30 okay, no volunteers. all right, so this one is saying that a one standard deviation increase in spss anxiety leads to a 0.55 increase in item five, in the original metric
            • 140:30 - 141:00 of item five okay and finally what about the last column here for one standard deviation increase in our factor
            • 141:00 - 141:30 item 5 goes up by 0.574 standard deviation units — perfect, thank you. what is your name? dana — daniel — how do you spell that? it's d-a-n-i-l-o — danilo.
            • 141:30 - 142:00 all right, good job, danilo. so like danilo said, these are basically standardized units in both the y and the x — the predictor and the outcome — so you can kind of read these as correlations. then what i look at is, okay, are any of my loadings near 0.8? no. and what's my worst loading? item two —
            • 142:00 - 142:30 that's really what i look at. and some programs even cut off the loadings — for efa they often cut loadings at 0.4, so if a loading is less than 0.4 they drop it. okay, so let's look at the fit. well, we already saw the fit: the cfi was 0.87, the tli is 0.81, and the criteria should be 0.95 and 0.9, and then the rmsea fails the test
            • 142:30 - 143:00 of close and approximately close fit. so honestly this isn't the best model. and like we said, to improve the fit — what's one way? well, we could maybe remove item two, but honestly it just doesn't look like there's a strong correlation between a lot of them, like q8. we can kind of reconsider our
            • 143:00 - 143:30 hypothesis, and like someone said, consider a two-factor model, where maybe some items load more strongly onto each of those factors. and if you try a million things and you still don't see it panning out, it means you need to reconsider your hypothesis. if your hypothesis is that this is measuring spss anxiety and you're getting super low loadings, it means that
            • 143:30 - 144:00 these measures are not reliably measuring the same thing, so think about changing your hypothesis — maybe this isn't spss anxiety that you're measuring. yeah — go ahead. if i'm adapting a scale, then all i need to show is that all the items are loading onto the factor, right? that is the first step, yes — you have to make sure that the loadings are high, that you get a high
            • 144:00 - 144:30 cfi and tli and a low rmsea. that's the first step, and then — i won't talk about the other stuff, but basically there's something called validity: you have to make sure that your scale correlates with an established scale. that's called construct validity, but i'm not going to talk about that here. all right, so how do you guys feel about that one? not too bad — maybe the interpretation was a little tricky, but honestly most people don't interpret the marker
            • 144:30 - 145:00 method or the variance standardization method; they really just interpret the last column, which is the standardize-all (std.all) method. okay, so second exercise: like someone said, let's fit the first four items to factor one and the second four items to factor two. choose any standardization method — you can do all three like we did. then what i want you to do is remove
            • 145:00 - 145:30 the item with the lowest loading — how about remove the item with the lowest loading in each factor, that's what i meant. so not all the low items, just the lowest loading in each factor. now compare the fit of that to the first model you fit. and then finally, uncorrelate the factors, because by default they're correlated,
            • 145:30 - 146:00 and compare the uncorrelated model to the correlated one — and what i mean by the uncorrelated model is: uncorrelate the model in (b). does that make sense? i'll give you five minutes to do that. you don't have to finish — this might take more than five minutes, it might take less — just get through as many as you can. yeah, by absolute value, exactly.
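A minimal sketch of the starting point for exercise two — factor names are illustrative and `dat` is again a placeholder for the SAQ data frame:

```r
# (a) two-factor CFA: first four items on f1, last four on f2;
#     by default lavaan uses the marker method and lets f1 and f2 covary
m_ex2 <- '
  f1 =~ q01 + q02 + q03 + q04
  f2 =~ q05 + q06 + q07 + q08
'
fit_ex2 <- cfa(m_ex2, data = dat)
summary(fit_ex2, fit.measures = TRUE, standardized = TRUE)
```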
            • 146:00 - 146:30 for the uncorrelated two-factor model, do you want us to use all eight items again? no, use the one that you whittled down. johnny, would you be able to say a word or two about aic and bic? yeah, okay — so aic and bic
            • 146:30 - 147:00 are not that commonly used in sem, because there are so many other fit indices, like the cfi, the tli, and the rmsea — those are pretty good for what you need to do. in general i only see the use of aic and bic when you're running a model that doesn't have the tli or cfi available. this can occur with more complex models that were developed much later, for example
            • 147:00 - 147:30 when you have categorical factor analysis — there are some models that don't even output the cfi or tli. i know this is especially true if you have interactions, like moderated factor analysis, which definitely doesn't have tli and cfi. in those cases you're pretty much left with the aic and bic, and that is only a relative fit index, so basically
            • 147:30 - 148:00 you're comparing it to another model. i wouldn't recommend using aic and bic unless you absolutely have to — if you have the cfi and tli available, just use those. thank you. okay, how do you guys feel about this one?
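For reference, the main indices (and AIC/BIC, which are available under maximum likelihood estimation) can be pulled from a fitted object by name — a small sketch using `fit_ex2` from the exercise above:

```r
fitMeasures(fit_ex2, c("chisq", "df", "pvalue",
                       "cfi", "tli", "rmsea", "srmr",
                       "aic", "bic"))
```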
            • 148:00 - 148:30 okay, so first of all, which items did you guys end up removing? okay — yeah, so two and eight. did everyone get two and eight? let me share my screen. so this is my first one — two and eight, right? good.
            • 148:30 - 149:00 which columns did you guys look at? basically i just look at std.all — it's hard to tell otherwise. you can't really use the marker method to compare items; you can kind of use the variance standardization method also. and then i removed two from factor one and eight from factor two. okay, now look at the fit first: cfi and tli are 0.88 and 0.82. now let's fit the one
            • 149:00 - 149:30 where we remove them. did you guys get 0.917 for the cfi and 0.844 for the tli? so if you get that, what is that telling you? that's basically saying — okay, good, anna — that means this is a better model, right? why? because we removed the ones with the lower
            • 149:30 - 150:00 loadings. so like i said, cfa can be an iterative process where you try to free or remove some of the low-loading items, but this is where it crosses the line of: is it cfa or is it efa? because it's supposed to be confirmatory, not exploratory. so this is why you need to do something called cross-validation, like i said, to really make sure you're not capitalizing on chance. there are also things called modification indices, which i didn't talk about today, but that's another way to look at it — if
            • 150:00 - 150:30 you don't really know, you let the algorithm tell you which parameters to add. but — yeah, go ahead. hi johnny, i have a question: the first time i did it i forgot to remove eight, so after i removed it the cfi went up and the tli went down a little bit, like 0.01 — how could that happen? wait, so you said the tli went down? yeah — oh, i typed it wrong, because the first time i forgot
            • 150:30 - 151:00 to remove eight. hmm, let me do this again. oh yeah — so before i remove eight the tli is 0.861, and after i remove eight it's 0.844. oh, interesting. okay, so can you repeat what you did again —
            • 151:00 - 151:30 did you do this? yeah, i did this. okay, let's see — you got this? yep. oh, hmm, that's interesting — the strange thing is that the cfi went up but the tli went down, is that what you're saying? yeah. that is interesting — it might have something to do with the
            • 151:30 - 152:00 formula itself, because the tli uses something called the relative chi-square and the cfi uses basically the non-central chi-square, so there could be some slight differences in the pattern. but i think in general that's why they developed those criteria — you just want to see if they approach 0.9. and the other thing is, if you're getting different results you want to confirm it with other things too — how about the rmsea,
            • 152:00 - 152:30 did the rmsea go down? how about the chi-square, did the chi-square go down? okay, so you want to confirm it with the other indices too. okay, thank you. all right, so i got that — and then you guys know how to uncorrelate the factors, right, by setting the covariance with 0*. now what did you notice about the fit? yeah, okay, so 0.65,
            • 152:30 - 153:00 0.42, rmsea 0.22. so just because you constrain your factors to be uncorrelated doesn't mean the model is actually reproducing your data — it doesn't mean your data itself is uncorrelated. you want to make sure that your model accurately reflects the data, and that's really the point of these fit indices. so good job.
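A minimal sketch of parts (b) and (c) of the exercise as discussed — dropping the lowest-loading item in each factor (q02 and q08) and then fixing the factor covariance to zero with the `0*` operator:

```r
# (b) drop q02 from f1 and q08 from f2; factors still covary by default
m_ex2b <- '
  f1 =~ q01 + q03 + q04
  f2 =~ q05 + q06 + q07
'
fit_ex2b <- cfa(m_ex2b, data = dat)

# (c) same measurement model, but with the factor covariance constrained to zero
m_ex2c <- '
  f1 =~ q01 + q03 + q04
  f2 =~ q05 + q06 + q07
  f1 ~~ 0*f2
'
fit_ex2c <- cfa(m_ex2c, data = dat)

# the uncorrelated model should fit noticeably worse
fitMeasures(fit_ex2b, c("cfi", "tli", "rmsea"))
fitMeasures(fit_ex2c, c("cfi", "tli", "rmsea"))
```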
            • 153:00 - 153:30 okay, so do you guys have the mental capacity to do another one, or do you want to end here and ask questions? you can type it in the chat — do you want to do another exercise or just end here and ask questions? the last exercise is hard, so just letting you know — you
            • 153:30 - 154:00 will get something out of the exercise. oh, you want to do one more? oh good — wow, you guys are amazing, you have a really good attention span. okay, because i promise you you'll learn something from it. all right, let's go back to the exercise. this is the last exercise — i call it advanced; it's not really advanced, but it just requires you to think a little bit more,
            • 154:00 - 154:30 and you really have to know what the baseline model is — although the secret is it's on the web page, you just have to find it — and the saturated model you kind of have to think about. but let me read the exercise. basically the point of this exercise is to calculate the cfi yourself, and that way you'll kind of understand what the cfi
            • 154:30 - 155:00 is. remember, it's a comparison of the baseline model to the saturated model, so what you have to do is first fit the baseline model, then fit the saturated model, use the chi-square and degrees of freedom from those, and manually compute the cfi — and i'll show you the cfi equation here. so the first thing you need to do is get the
            • 155:00 - 155:30 model chi-square — you know how to get the model chi-square, right, from the model test user section. basically you need two chi-squares: the baseline and the saturated. so you need two chi-squares and two degrees of freedom, and those give you your deltas. actually, i don't think you need the saturated — i'm just showing you so you can fit the saturated
            • 155:30 - 156:00 yourself. to calculate this one you need two: the baseline and then your user model. i was just showing you the saturated because it's useful to learn what a saturated model looks like, but you don't actually need it. so let me rephrase that: first you need the baseline chi-square and its degrees of freedom, then you need your user chi-square and its degrees of freedom, and then you take the difference of
            • 156:00 - 156:30 the chi-square and the degrees of freedom — that's your delta. the cfi is the delta for the baseline minus the delta for the user model, divided by the delta for the baseline. you don't need the saturated model, but i leave it as an exercise because i wanted you to understand what the saturated model is — and this is for an eight-item factor analysis, all eight items, remember. okay, so i'll leave this up, and you can also find this equation on the web page,
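For reference, the equation being described is the usual CFI definition; writing delta for chi-square minus degrees of freedom (and, in the standard definition, flooring each delta at zero):

```latex
\delta = \chi^2 - df, \qquad
\mathrm{CFI} \;=\; \frac{\delta_{\text{baseline}} - \delta_{\text{user}}}{\delta_{\text{baseline}}}
\;=\; 1 - \frac{\delta_{\text{user}}}{\delta_{\text{baseline}}}
```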
            • 156:30 - 157:00 and i'll give you a hint in a second. go ahead — sorry. can i ask a quick question about fit measures? yes. so if you're using the cfi, the srmr, and the rmsea — let's say those are your three fit measures — and you pass two out of the three cutoffs, so you said the cfi has to be 0.95 for example, and you have that, and the same with the
            • 157:00 - 157:30 srmr — you meet the cutoff for what's considered good — but you don't pass the rmsea, is that okay, or considered acceptable, if you pass two out of three of those? yeah, so that's why so many indices were developed — you really want to take it in the context of your whole model. so if you're at two out of three, okay, then you have to look at your loadings and make sure your loadings are high —
            • 157:30 - 158:00 you don't want standardized loadings that are in the 0.2 or 0.4 region; if you see that, then you think about removing those items. additionally, if you're at two out of three and your loadings are all pretty high, then yeah, that's fine. so it really is in the context of other things too. okay, thank you, that's super helpful.
            • 158:00 - 158:30 and do you guys know how to set up the baseline model and the saturated model? that honestly is a little advanced, but i hope you can derive it from the slides — you can go back to the baseline slide and the saturated slide. remember, the baseline is everything without the covariances and the saturated is everything with the covariances. so to fit the covariance model,
            • 158:30 - 159:00 the hint is: you need all the variances and covariances in the saturated model, and you can use the plus operator to add multiple covariances in one line. so that's the hint.
            • 159:00 - 159:30 so the key operator here is the double
            • 159:30 - 160:00 tilde — you need the double tilde to fit all these, so you need double tildes, and then you need plus, and that's pretty much all you need to fit these models. and then you just need a
            • 160:00 - 160:30 calculator to calculate the cfi — you can also type it into r and let r calculate it for you.
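A minimal sketch of the hint above, assuming the same eight SAQ items and the placeholder data frame `dat`: the baseline model estimates only the variances, and the saturated model adds every pairwise covariance (double tilde for variances and covariances, plus to string several onto one line):

```r
# baseline model: variances only, no covariances among the eight items
m_baseline <- '
  q01 ~~ q01
  q02 ~~ q02
  q03 ~~ q03
  q04 ~~ q04
  q05 ~~ q05
  q06 ~~ q06
  q07 ~~ q07
  q08 ~~ q08
'
fit_baseline <- cfa(m_baseline, data = dat)

# saturated model: variances plus all pairwise covariances (df should be 0)
m_saturated <- '
  q01 ~~ q01 + q02 + q03 + q04 + q05 + q06 + q07 + q08
  q02 ~~ q02 + q03 + q04 + q05 + q06 + q07 + q08
  q03 ~~ q03 + q04 + q05 + q06 + q07 + q08
  q04 ~~ q04 + q05 + q06 + q07 + q08
  q05 ~~ q05 + q06 + q07 + q08
  q06 ~~ q06 + q07 + q08
  q07 ~~ q07 + q08
  q08 ~~ q08
'
fit_saturated <- cfa(m_saturated, data = dat)

summary(fit_baseline)    # chi-square and df for the baseline model
summary(fit_saturated)   # degrees of freedom should be zero
```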
            • 160:30 - 161:00 okay so don't worry if you didn't finish
            • 161:00 - 161:30 but
            • 161:30 - 162:00 okay
            • 162:00 - 162:30 so you can see the answer is already there what did you guys get did anyone get the um cfi
            • 162:30 - 163:00 oops, i think i closed the window. oh, 0.8 — yeah, okay. did anyone else get like 0.869 or 0.87? and that should match the — sorry, i'm sure that should match the first model, right? let's see, i don't even remember what i did.
            • 163:00 - 163:30 no, i think it was something else — no, it's the one where you have all eight, right? which one — where did i put that — yeah, the one where we have — no, what did i do wrong? one, two, three, four, five, six — oh wait, i copied this
            • 163:30 - 164:00 wrong. yeah, i think it's probably rounding error too — let me just make sure. yeah, i think it's rounding error here. okay, so
            • 164:00 - 164:30 anyone want to share how they did it? it should be around 0.87. okay, so the baseline model was directly from the web page — basically you want to make sure you get all the variances and no covariances; by default lavaan doesn't estimate those covariances. so what you want to do is just say,
            • 164:30 - 165:00 you know, item one with item one, and do that for all the items — that's how you do the baseline model. then in the output you look at the model test user model section — not the baseline test statistic lavaan prints, although you can verify that it matches, since your user model here is the baseline model — and you should get a test statistic of 4164 and 28 degrees of freedom. now how did you fit the saturated model?
            • 165:00 - 165:30 well, basically it's the variances plus the covariances. so i add the variances here, i copy that, and now i add the covariances — and you don't want to duplicate them, so you want one with two, three, four, five, six, seven, eight; two with three, four, five, six, seven, eight; et cetera, until q7 with q8. and that's what i meant — you can simplify it with the plus operator so that it estimates the covariances. so you basically have the variances and covariances, and then
            • 165:30 - 166:00 that's the saturated model — and you know that because the degrees of freedom is zero. now which one do you use for the cfi? this one? no — i'm just showing it for demonstration. what i think you should use, and correct me if i'm wrong, is the model test user model here: 554 with 20 degrees of freedom — and you can verify that your baseline model matches what lavaan reports in the
            • 166:00 - 166:30 model test baseline model section. okay, so if you subtract the 20 from the 554, subtract the 28 from the 4164, and then follow the formula, you should get the cfi. let me know if you guys get something else, but here's the answer.
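Putting the numbers from this walkthrough into R, as suggested earlier — a minimal sketch using the chi-square and df values read off the lavaan output above:

```r
# baseline model:            chi-square ~ 4164, df = 28
# user (one-factor) model:   chi-square ~ 554,  df = 20
delta_baseline <- 4164 - 28
delta_user     <- 554  - 20

cfi <- (delta_baseline - delta_user) / delta_baseline
cfi   # roughly 0.87, matching the CFI reported by summary(fit, fit.measures = TRUE)
```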
            • 166:30 - 167:00 all right, so i think that's a good time to stop. thank you all for participating — you were excellent participants, one of the best audiences i've had, and you looked engaged throughout the whole seminar, so thank you for that. i hope you found it helpful. if you are interested in a more advanced cfa seminar, let me know, or a more advanced sem seminar, let me know. otherwise this is the conclusion of the seminar, and you are free to go, or you can stay and ask questions
            • 167:00 - 167:30 about things you didn't understand or just general questions about cfa okay thank you so much