Regression: Crash Course Statistics #32

Hi, I’m Adriene Hill and welcome back to Crash Course Statistics. There’s something to be said for flexibility. It allows you to adapt to new circumstances, like how a Transformer is a truck, but it can also be an awesome fighting robot. Today we’ll introduce you to one of the most flexible statistical tools: the General Linear Model, or GLM. The GLM will allow us to create many different models to help describe the world. The first we’ll talk about is the regression model.

INTRO

General Linear Models say that your data can be explained by two things: your model, and some error. First, the model. It usually takes the form y = mx + b, or rather, y = b + mx in most cases. Say I want to predict the number of trick-or-treaters I’ll get this Halloween using enrollment numbers from the local middle school, so I can make sure I have enough candy on hand. I expect a baseline of 25 trick-or-treaters, and for every middle school student, I’ll increase the number of trick-or-treaters I expect by 0.01. So this would be my model:

trick-or-treaters = 25 + 0.01 × (middle school students)

There were about 1,000 middle school students nearby last year, so based on my model, I predicted that I’d get 35 trick-or-treaters. But reality doesn’t always match predictions. When Halloween came around, I got 42, which means that the error in this case was 7. Now, error doesn’t mean that something’s WRONG, per se. We call it error because it’s a deviation from our model. So the data isn’t wrong, the model is. And these errors can come from many sources: variables we didn’t account for in our model, including the candy-crazed kindergartners from the elementary school, or just random variation. Models allow us to make inferences, whether it’s the number of kids on my doorstep at Halloween or the number of credit card frauds committed in a year.
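The trick-or-treater model can be sketched in a few lines of code. The numbers are the ones from the example; the function name is just for illustration:

```python
def predict_trick_or_treaters(middle_schoolers):
    """Model: a baseline of 25, plus 0.01 per middle school student."""
    return 25 + 0.01 * middle_schoolers

predicted = predict_trick_or_treaters(1000)  # 35.0
observed = 42                 # what actually showed up on Halloween
error = observed - predicted  # 7.0 -- data = model + error
print(predicted, error)
```

The last line is the whole General Linear Model idea in miniature: whatever the model misses ends up in the error term.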
General Linear Models take the information that data give us and portion it out into two major parts: information that can be accounted for by our model, and information that can’t be. There are many types of GLMs. One is linear regression, which can also provide a prediction for our data. But instead of predicting our data using a categorical variable like we do in a t-test, we use a continuous one. For example, we can predict the number of likes a trending YouTube video gets based on the number of comments that it has. Here, the number of comments would be our input variable and the number of likes our output variable. Our model will look something like this:

likes = b + m × comments

The first thing we want to do is plot our data from 100 videos. This allows us to check whether we think the data is best fit by a straight line, and to look for outliers: points that are really extreme compared to the rest of our data. These two points look pretty far away from our data, so we need to decide how to handle them. We covered outliers in a previous episode, and the same rules apply here. We’re trying to catch data that doesn’t belong. Since we can’t always tell when that’s happened, we set a criterion for what an outlier is and stick to it. One reason we’re concerned with outliers in regression is that values that are really far away from the rest of our data can have an undue influence on the regression line. Without this extreme point, our line would look like this; with it, like this. That’s a lot of difference for one little point! There are a lot of different ways to decide, but in this case we’re gonna leave them in.

One of the assumptions we make when using linear regression is that the relationship is linear. So if there’s some other shape our data takes, we may want to look into other models. This plot looks linear, so we’ll go ahead and fit our regression model. Usually a computer is going to do this part for us, but we want to show you how this line fits. A regression line is the straight line that’s as close as possible to all the data points at once. That means it’s the one straight line that minimizes the sum of the squared vertical distances from each point to the line. The blue line is our regression line, and its equation looks like this:

likes = 9,104 + 6.468 × comments
This number, the y-intercept, tells us how many likes we’d expect a trending video with zero comments to have. Often, the intercept might not make much sense. In this model, it’s possible that you could have a video with 0 comments, but a video with 0 comments and 9,104 likes does seem to conflict with our experience on YouTube. The slope, a.k.a. the coefficient, tells us how much our likes are determined by the number of comments. Our coefficient here is about 6.5, which means that on average, an increase of 1 comment is associated with an increase of about 6.5 likes.

But there’s another part of the General Linear Model: the error. Before we go any further, let’s take a look at these errors, also called residuals. The residual plot looks like this, and we can tell a lot by looking at its shape. We want a pretty evenly spaced cloud of residuals. Ideally, we don’t want them to be extreme in some areas and close to 0 in others. It’s especially concerning if you can see a weird pattern in your residuals, like this one, which would indicate that the error of your predictions depends on how big your predictor variable is. That would be like if our YouTube model were pretty accurate at predicting the number of likes for videos with very few comments, but wildly inaccurate on videos with a lot of comments.
This is where statistical tests come in. There are actually two common ways to do a null hypothesis significance test on a regression coefficient; today we’ll cover the F-test. The F-test, like the t-test, helps us quantify how well we think our data fit a distribution, like the null distribution. Remember, the general form of many test statistics is this:

test statistic = (observed − expected) / average variation

But I’m going to make one small tweak to the wording of our general formula to help us understand F-tests a little better. The null hypothesis here is that there’s NO relationship between the number of comments on a trending YouTube video and the number of likes. If that were true, we’d expect a kind of blobby, amorphous-cloud-looking scatter plot and a regression line with a slope of 0. It would mean that the number of comments wouldn’t help us predict the number of likes; we’d just predict the mean number of likes no matter how many comments there were. Back to our actual data. This blue line is our observed model. And the red is the model we’d expect if
the null hypothesis were true. Let’s add some notation so it’s easier to read our formulas. Y-hat (ŷ) represents the predicted value for our outcome variable; here it’s the predicted number of likes. Y-bar (ȳ) represents the mean value of likes in this sample. Taking the squared difference between each data point and the mean line tells us the total variation in our data set:

SST = Σ(y − ȳ)²

This might look similar to how we calculated variance, because it is: variance is just this sum of squared deviations, called the Sum of Squares Total, divided by n. And we want to know how much of that total variation is accounted for by our regression model, and how much is just error. That would allow us to follow the General Linear Model framework and explain our data with two things: the model’s prediction, and error.

We can look at the difference between our observed model, with its slope of 6.468, and the model we’d expect if there were no relationship, with its slope of 0, at each point. We’ll start here with this point. The green line represents the difference between our observed model, which is the blue line, and the model that would occur if the null were true, which is the red line. And we can do this for EVERY point in the data set. We want negative differences and positive differences to count equally, so we square each difference so that they’re all positive. Then we add them all up to get part of the
numerator of our F-statistic. The numerator has a special name in statistics: it’s called the Sum of Squares for Regression, or SSR for short. Like the name suggests, this is the sum of the squared distances between our regression model and the null model:

SSR = Σ(ŷ − ȳ)²

Now we just need a measure of average variation. We already found a measure of the total variation in our sample data, the Total Sum of Squares. And we calculated the variation that’s explained by our model. The other portion of the variation should then represent the error: the variation of data points around our model, shown here in orange. The sum of these squared distances is called the Sum of Squares for Error (SSE):

SSE = Σ(y − ŷ)²

If data points are close to the regression line, then our model is pretty good at predicting outcome values, like likes on trending YouTube videos, and our SSE will be small. If the data are far from the regression line, then our model isn’t too good at predicting outcome values, and our SSE is going to be big.

Alright, so now we have all the pieces of our puzzle: Total Sum of Squares, Sum of Squares for Regression, and Sum of Squares for Error. Total Sum of Squares represents ALL the information that we have from our data on YouTube likes. Sum of Squares for Regression represents the portion of that information that we can explain using the model we created. And Sum of Squares for Error represents the leftover information, the portion of the Total Sum of Squares that the model can’t explain. So the Total Sum of Squares is the sum of SSR and SSE.
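That decomposition, SST = SSR + SSE, can be checked numerically for any least-squares fit. A small sketch with made-up data (not the YouTube dataset):

```python
# Made-up (x, y) data, purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the model
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # leftover error

print(sst, ssr + sse)  # the two should match, up to rounding
```

The identity only holds for the least-squares line; that’s one reason least squares is the standard fitting criterion.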
Now we’ve followed the General Linear Model framework and portioned our data into two categories: regression model and error. And now that we have the SSE, our measurement of error, we can finally start to fill in the bottom of our F-statistic. But we’re not quite done yet. The last step to getting our F-statistic is to divide each sum of squares by its respective degrees of freedom. Remember, degrees of freedom represent the amount of independent information that we have. The Sum of Squares for Error has n, the sample size, minus 2 degrees of freedom. We had 100 pieces of independent information from our data, and we used 1 to calculate the y-intercept and 1 to calculate the regression coefficient, so the Sum of Squares for Error has 98 degrees of freedom. The Sum of Squares for Regression has one degree of freedom, because we’re using one piece of independent information to estimate our coefficient, our slope.

We have to divide each sum of squares by its degrees of freedom because we want to weight each one appropriately. More degrees of freedom mean more information. It’s like how you wouldn’t be surprised that Katie Mack, who has a PhD in astrophysics, can explain more about the planets than someone taking a high school physics class. Of course she can; she has way more information. Similarly, we want to make sure to scale the sums of squares based on the amount of independent information each has. So we’re finally left with this:

F = (SSR / 1) / (SSE / (n − 2))
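As a sketch, that formula is just a couple of lines. The video’s 59.613 comes from the YouTube dataset, which isn’t reproduced in the transcript, so the numbers below are placeholders:

```python
def f_statistic(ssr, sse, n):
    """F = (SSR / df_regression) / (SSE / df_error) for simple regression.

    df_regression = 1 (one slope estimated); df_error = n - 2 (the slope
    and intercept used up two pieces of independent information).
    """
    return (ssr / 1) / (sse / (n - 2))

# Placeholder sums of squares, just to show the mechanics:
print(f_statistic(ssr=50.0, sse=100.0, n=102))  # 50 / (100 / 100) = 50.0
```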
Using an F-distribution, we can find our p-value: the probability that we’d get an F-statistic as big as, or bigger than, 59.613. Our p-value is super tiny: about 0.00000000000099. With an alpha level of 0.05, we reject the null hypothesis that there is NO relationship between likes and comments on trending YouTube videos. So we reject that the true coefficient for the relationship between likes and comments on YouTube is 0. The F-statistic allows us to directly compare the amount of variation that our model can and cannot explain. When our model explains a lot of variation, we consider it statistically significant. And it turns out, if we did a t-test on this coefficient, we’d get the exact same p-value. That’s because these two methods of hypothesis testing are equivalent; in fact, if you square our t-statistic, you’ll get our F-statistic!
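That equivalence is easy to verify numerically. Here’s a sketch on made-up data, where the t-statistic is the slope divided by its standard error (none of this is the video’s actual dataset):

```python
import math

# Made-up (x, y) data, purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * x for x in xs]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

f_stat = (ssr / 1) / (sse / (n - 2))
se_slope = math.sqrt((sse / (n - 2)) / sxx)  # standard error of the slope
t_stat = slope / se_slope

print(f_stat, t_stat ** 2)  # the two agree, up to floating-point rounding
```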
We’ll talk more about why F-tests are important later. Regression is a really useful tool to understand. Scientists, economists, and political scientists use it to make discoveries and to communicate those discoveries to the public. Regression can be used to model the relationship between increased taxes on cigarettes and the average number of cigarettes people buy, or to show the relationship between peak heart rate during exercise and blood pressure; not that regression alone can tell us whether one thing causes changes in the other.

But more abstractly, we learned today about the General Linear Model framework. What happens in life can be explained by two things: what we know about how the world works, and error, or deviations, from that model. Like say you budgeted $30 for gas last week and only ended up needing $28. Reality deviated from your guess, and now you get to go to The Blend Den again! Or how angry your roommate is that you left dishes in the sink can be explained by how many days you left them out, with a little wiggle room for error depending on how your roommate’s day was. Alright, thanks for watching, I’ll see you next time.

49 Replies to “Regression: Crash Course Statistics #32”

  1. this example does not make sense for those students who have no knowledge of cheesy 'halloween'. use a better example woman! using too much animation also distract from effective learning.

  2. Only crash course can make statistics interesting. Thank you for making quality educational videos for free! 😀

  3. Okay, right off the bat she lost me when she introduced 0.01 to the equation at 1:00. I just wanted to know about the regression briefly. Will I understand why she used that number(0.01) if I watched this course from the beginning?

  4. Hi at 3:51 is it sum of(observed value minus predicted value)^2 or is it sum of(observed value minus average of values observed)^2

  5. This right here is the most entertaining and intriguing statistical video Ive ever watched.. it actually made stats fun, thanks for incorporating art and creativity to this piece ,,instead of old and boring numbers presented in a monotonic go to sleep now voice

  6. At 9:33, she says 'The sums of squares for regression (SSR) has one degree of freedom as one degree is consumed in calculating slope of the model line'. How is that o.O

  7. I don't get why in 9:35 she says that we only need 1 degree of freedom to calculate the slop. I understand the 98 DF for SSE but I don't get why SSR has only 1 DF

  8. Some unnecessarily confusing parts:

    It would have been helpful to explain that our zero-coefficient line IS the line y='y hat'.

    The point referred to at 7:05 is not highlighted or pointed out (and as it sits far above its distance for SSR it isn't instantly recognizable as connected).

    Positioning of the equations at 8:50 gives strong and erroneous implication that each refers specifically to the diagram above.

    The equation given for F-statistic at 8:58 is then instantly revised as not being correct.

    The correct f-statistic equation is only on screen at 10:07 for a fraction of the time needed to read it – let alone fully digest it.

  9. This video is really helpful and the explanations are easy to understand. I think however some quiet music might be able to break the tension a bit, especially as the sound effects tend to get repetitive. Some quiet Lo-Fi instrumentals in the background could really help the videos seem even more polished.
    Keep up the good work. ^^

  10. that was a gr8 one. currently, maybe one of the most important. nice work aH ^_^/💜✨🌹💫

  11. Thank you this helped me so much! Will you do a video on multiple regression and econometrics in general? Keep up the good work you guys rock!

  12. Very interesting. One comment though. "The regression line is the one straight line that minimizes the sum of the squared distances of each point to the line" (3:50) can be slightly misleading. It seems to suggest the actual distance from each point to the line, which (except for a horizontal line) would not be vertical. It should say, "…minimizes the sum of the squared vertical distances from each point to the line."

  13. This series is amazing! I have majored in Statistics and still this series explains everything much better than college classes.

  14. Don't forget to factor in the number of dishes there are. You might want dirty dishes as a percentage of all dishes owned and a percentage of space in the sink. Higher numbers are bad for both values. Maybe instead of working out the math and plotting data, you could just do the @#!$ dishes already. I can't even wash them at this point without taking them out to make space.

  15. You mean it's all of us human being random number generators against YouTubes mechanical algorithms. "Of course you realize, this means war!"

  16. Can anyone recommend an exercise book or a site with practice questions for statistics? I feel like I need to practice it on my own. Cheers.

  17. NOTE: This video uses the abbreviation "GLM" incorrectly (or at least very misleadingly) throughout.

    The general linear model is NOT usually what is meant by "GLM". Instead, GLM stands for generaLIZED linear model, which is a special kind of linear model that (among other things) allows for a response variable that is not normally distributed. (Yes, this is extremely confusing. Don't even get me started on the word "linear", which doesn't even mean "straight lines" in this context.)

    Bottom line: substitute simply "linear model" whenever Adrienne says "GLM" in this video, and you'll be fine.

  18. I finally figured out the issue with this series, why it is so hard to follow. The animations are too much, too fast for statistics. I can barely follow through with the examples, or cannot follow at all. Example: the calculations; you can't remove each line before the next. I would want to see what numbers went where, and it is not that long of a calculation that you need to have space. Other than that, I think everything else is fine. Crash course Economics was awesome btw.

  19. Interesting. I just factchecked the theory about the comment-to-likes ratio, and it met pretty well: At the time I've written this, there were 41 comments and 391 likes, which is just the value "4000/100" shown in the diagram… As it turned out, this time it's above the regression line, but with an increase in the y-value by less than 35%
