model <- lm(Y ~ X1 + X2 + X3, data = mydata)
summary(model)Generalised Linear Model
This chapter introduces the generalised linear models: linear regression, logistic (binomial, ordinal, and multinomial), poisson, and negative binomial regression.
This chapter is meant to be for application purposes. The estimation processes and details are contained in chapter 3 and chapter 4.
Overview
The generalised linear model (GLM) are a series of models that help explain the relationship between a series of explanatory variables \(X_1, X_2, \dots, X_p\) and an outcome variable \(Y\), for individual observations \(t = 1, 2, \dots, n\).
- Note that relationships cannot be interpreted causally, unless the treatment in question was randomly assigned and meets the condition of independence (definition 5.9). If you are interested in causal estimation, see the next chapter.
The generalised linear model (GLM) is a grouping of many different models that take the following form:
\[ g(\mu) = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \dots + \beta_p X_{tp} \]
Where \(\mu\) is the parameter of interest of variable \(Y\) (depends on the model) and \(g(\cdot)\) is some link-function that allows the GLM to be applied to a variety of different types of \(Y\). \(\beta_0, \dots, \beta_p\) are the parameters of the model that need to be estimated.
GLMs have two main purposes. First, they can be used for predicting new values of \(Y\) given \(X_1, \dots, X_p\) values. Second, they can be used to understand the correlation between \(X_1, \dots, X_p\) and \(Y\). The choice of model depends on your purpose (prediction or correlation), and the type of outcome \(Y\) you have:
| Model | When to Use |
|---|---|
| Linear Regression | Continuous \(Y\) for both prediction and correlation. Correlation with count \(Y\), and prediction (but can produce poor predictions, so use poisson or negative binomial). Correlation with ordinal \(Y\) (not good for prediction, use ordinal or multinomial logistic). |
| Linear Probability | Correlation with binary \(Y\) (do not use for prediction, use binomial logistic). |
| Binomial Logistic | Binary \(Y\) prediction, but can be used for correlation as well (but harder correlation interpretation than linear probability). |
| Ordinal Logistic | Correlation or prediction with ordinal \(Y\) (but prediction may be better/more flexible with multinomial logistic). |
| Multinomial Logistic | Correlation and prediction with categorical \(Y\). Prediction with ordinal \(Y\). |
| Poisson | Prediction and correlation with count \(Y\) (Use only if \(\V Y < \E Y\), if not, use negative binomial). |
| Negative Binomial | Prediction and correlation with count \(Y\) (Works no matter what type of \(Y\), but poisson may be more efficient if \(\V Y < \E Y\)). |
| Type | Description | Example |
|---|---|---|
| Continuous | Can take any value within an interval \([a, b]\). | GDP of a country, temperature. |
| Binary | Can only take value 0 or 1. | Yes/no, true/false, did/didn’t. |
| Ordinal | Can only take a finite set of discrete outcomes \(\{a, b, c\}\), but these outcomes can be ordered. | strongly disagree - disagree - neutral - agree - strongly agree |
| Categorical | Can only take a finite set of discrete outcomes \(\{a, b, c\}\), and these outcomes have no natural order. | Country of birth. |
| Count | Can only take an integer value \(\{0, 1, 2, \dots\}\). | Number of cases of a disease, number of phone calls received at a call centre in an hour. |
The final section on model specification issues introduces a few different topics (categorical explanatory variables, transformations, interactions) that can be applied to all of the models.
Linear Regression Model
The linear regression model is used for continuous outcome variables \(Y\), and a set of explanatory variables \(\set X = \{X_1, \dots, X_p\}\) for observations \(t = 1, \dots, n\):
\[ Y_t = \underbrace{\beta_0 + \beta_1X_{t1} + \beta_2 X_{t2} + \dots + \beta_p X_{tp}}_{\E(Y_t | \set X_t)} + \eps_t \]
Where \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the parameters that explain the relationship between \(X_j\) and \(Y\), and \(\eps_t\) is the error term that accounts for random variation/noise in \(Y\). For more details on the linear model, see chapter 4.
The parameters \(\beta_0, \dots, \beta_p\) need to be estimated with sample data to obtain estimates \(\hat\beta_0, \dots, \hat\beta_p\). The linear model can be estimated with a variety of estimators:
We should use the Ordinary Least Squares estimator (definition 4.8) when our model meets all 5 of the classical linear model assumptions. If we are not sure spherical errors (definition 4.6) is met, we should instead use OLS with robust standard errors.
For estimation with OLS (and normal standard errors), the R-code is:
For mechanical details on OLS, see here.
We should use the Ordinary Least Squares estimator (definition 4.8) with robust standard errors (theorem 4.6) if we believe our model meets all of the classical linear model assumptions except for spherical errors (definition 4.6), and instead we have conditional heteroscedasticity (?def-heteroscedasticity).
For estimation with OLS and robust standard errors in R, we should use the fixest package:
model <- fixest::feols(Y ~ X1 + X2 + X3,
data = mydata,
se = "hetero")
summary(model)For mechanical details on robust standard errors, see here.
We can use the weighted least squares estimator (definition 4.14) if we believe our model meets all of the classical linear model conditions except for spherical errors (definition 4.6), and we have conditional heteroscedasticity.
For WLS in R, we first estimate a normal linear regression with OLS to obtain the estimated residuals, then estmiate a model between residuals and \(X\), before using them as weights in a final model:
#function
wls <- function(y, x, data) {
y <- rlang::enquo(y)
x <- rlang::enquo(x)
formula <- as.formula(paste(rlang::as_label(y), "~", rlang::as_label(x)))
ols <- lm(formula, data = data) #estimate lm
res_sq <- residuals(ols)^2 #residuals
log_res_sq <- log(res_sq) #log trans for stability
formula2 <- as.formula(paste("log_res_sq", "~", rlang::as_label(x)))
var_model <- lm(formula2, data = data) # model for variance
hat_sigma <- exp(predict(var_model))
weights <- 1 / hat_sigma # weights
mod <- fixest::feols(formula, weights = weights, data = data) # WLS
return(mod)
}Once you have copied the function and ran it, a WLS model is run by:
wls(y, x1 + x2 + x3, data = mydata)For mechanical details on WLS (GLS), see here.
We should use the generalised least squares estimator if we believe our model meets all of the classical linear model conditions except for spherical errors (definition 4.6), and we have autocorrelation.
I do not recommend using GLS over OLS with robust standard errors, as the chances of something going wrong are much higher.
For (Feasible) GLS in R, we can use the nlme package:
model <- nlme::gls(y ~ x1 + x2 + x3,
data = mydata,
correlation = corAR1(form = ~ time))
summary(model)We can also choose other autocorrelation structures by replacing corAR1():
corAR1(form = ~ time) #AR 1 autocorrelation
corARMA(p = 1, q = 1, form ~ time) #ARMA(p, q) autocorrelation
corExp(form = ~lat + long) #spatial exp decay autocorrelation
corGaus(form = ~lat + long) # spatial gaussian decy autocorr
corSpher(form = ~lat + long) #spatial spherical(linear) autocorrFor mechanical details on FGLS, see here.
We should use the instrumental variables estimator (definition 4.16)or 2-stage least squares (2SLS) when our model violates exogeneity (definition 4.5), and we care a lot about accurately estimating the correlation between \(X_j\) and \(Y\), and we have a suitable instrument \(Z\).
You should not use IV/2SLS unless you really only care about the accurate estimation of one \(\beta_j\), and do not care about prediction purposes.
For estimation of IV/2SLS, we should use the fixest package:
model <- fixest::feols(Y ~ 1 | X ~ Z, data = mydata, se = "hetero")
summary(model)You can replace the 1 with covariates that will be included in both stages of the regression.
For mechanical details on IV/2SLS, see here.
Interpretation of coefficient estimates are as follows (proof from theorem 4.1). Remember that these interpretations are not causal unless our \(X_j\) has been randomly assigned and meets the condition of independence (definition 1.6):
| Continuous \(X_{j}\) | Binary \(X_{j}\) | |
| \(\hat\beta_j\) | For every one unit increase in \(X_{j}\), there is an expected \(\hat\beta_j\) unit change in \(Y\), holding all other explanatory variables constant. | There is a \(\hat\beta_j\) unit difference in \(Y_i\) between category \(X_{j} = 1\) and category \(X_{j} = 0\), holding all other explanatory variables constant. |
| \(\hat\beta_0\) | When all explanatory variables equal 0, the expected value of \(Y\) is \(\hat\beta_0\). | For category \(X_{j} = 0\), the expected value of \(Y\) is \(\hat\beta_0\) (when all other explanatory variables equal 0). |
To predict new values of \(Y\) for new observations \(t\), we use our fitted values equation:
\[ \hat Y_t = \hat\beta_0 + \hat\beta_1 X_{t1} + \dots + \hat\beta_p X_{tp} \]
In R, we can use the predict command:
my_predictions <- predict(model, newdata = my_new_data)Where my_new_data is a data frame with \(X_1, \dots, X_p\) values for all the new observations we want to predict \(\hat Y\) for. If you only want to predict for your existing data on which the model was fitted on, you do not need the newdata argument.
Hypothesis testing of each coefficient with a t-test (and the p-values) is provided in most regression outputs. The mechanics of t-test were provided in chapter 4.
- If our coefficient \(\hat\beta_j\) is statistically significant, then we can reject that there is no relationship between \(X_j\) and \(Y\), and conclude there is a relationship between \(X_j\) and \(Y\).
- If our coefficient \(\hat\beta_j\) is not statistically significant, then we cannot reject there is no relationship between \(X_j\) and \(Y\).
R-Squared is provided in the regression output, and was discussed in definition 4.11.
R-Squared is the percentage of variation in \(Y\) our model explains. It is always between 0 and 1. R-squared never decreases as we add more explanatory variables - so it is important not to overly focus on it.
Adjusted R-Squared is similar but penalises models for having too many explanatory variables.
F-tests can test the significance of multiple parameters at once. The mechanics of F-test were provided in chapter 4.
To conduct a f-test in R, we do the following (where model 0 is the null model, and model 1 is the alternative model):
anova(model0, model1)If the test is statistically significant, then all the additional parameters in model 1 are jointly significant. If the test is not statistically significant, then all the additional parameters in model 1 are not jointly significant.
For more information on categorical \(X\), interactions, polynomial and logarithmic transformations, see the final section on model specification.
Linear Probability Model
The linear probability model is for binary \(Y\) (bernoulli distribution). It should generally only be used for interpreting relationships between \(X_j\) and binary \(Y\), and not for prediction (use binomial logistic instead). The outcome of interest is the \(\E(Y_t | \set X_t) = \P (Y_t = 1 | \set X_t)\), which we denote as \(\pr_t\).
\[ \pr_t = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \dots + \beta_p X_{tp} \]
Where \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the parameters that explain the relationship between \(X_j\) and \(Y\), that need to be estimated with sample data with one of two estimators:
We should use the Ordinary Least Squares estimator (definition 4.8) with robust standard errors (theorem 4.6) for the linear probability model, because \(Y\) is distributed by the bernoulli distribution, which implies heteroscedasticity.
For estimation with OLS and robust standard errors in R, we should use the fixest package:
model <- fixest::feols(Y ~ X1 + X2 + X3,
data = mydata,
se = "hetero")
summary(model)For mechanical details on robust standard errors, see here.
We should use the instrumental variables estimator (definition 4.16) or 2-stage least squares (2SLS) when our model violates exogeneity (definition 4.5), and we care a lot about accurately estimating the correlation between \(X_j\) and \(Y\), and we have a suitable instrument \(Z\).
You should not use IV/2SLS unless you really only care about the accurate estimation of one \(\beta_j\), and do not care about prediction purposes.
For estimation of IV/2SLS, we should use the fixest package:
model <- fixest::feols(Y ~ 1 | D ~ Z,
data = mydata,
se = "hetero")
summary(model)For mechanical details on IV/2SLS, see here.
Interpretation of coefficient estimates are as follows (proof from theorem 4.1). Remember that these interpretations are not causal unless our \(X_j\) has been randomly assigned and meets the condition of independence (definition 1.6):
| Continuous \(X_{j}\) | Binary \(X_{j}\) | |
| \(\hat\beta_j\) | For every one unit increase in \(X_{j}\), there is an expected \(\hat\beta_j \times 100\) percentage point change in the probability of a unit being in category \(Y_t=1\), holding all other explanatory variables constant. | There is a \(\hat\beta_j\times 100\) percentage point difference in the probability of a unit being in category \(Y_t=1\) between category \(X_{j} = 1\) and category \(X_{j} = 0\), holding all other explanatory variables constant. |
| \(\widehat{\beta_0}\) | When all explanatory variables equal 0, the expected probability of a unit being in category \(Y_t=1\) is \(\hat\beta_0 \times 100\) | For category \(X_{j} = 0\), the expected probability of a unit being in category \(Y_t=1\) is \(\hat\beta_j \times 100\) (when all other explanatory variables equal 0). |
You should not use the linear probability model for prediction. This is because the linear probability model can produce probabilities \(\pr_t\) that are outside of the range 0 and 1. This violates the axioms of probability, which state any event must have a probability between 0 and 1.
If you are interested in prediction, see the binomial logistic model, which ensures that probabilities \(\pr_t\) are always between 0 and 1.
The linear probability model is only useful when we are considering the relationship between \(X_j\) and \(Y\), since the linear nature of the model makes interpreting parameters \(\beta_j\) simple.
Hypothesis testing of each coefficient with a t-test (and the p-values) is provided in most regression outputs. The mechanics of t-test were provided in chapter 4.
- If our coefficient \(\hat\beta_j\) is statistically significant, then we can reject that there is no relationship between \(X_j\) and \(Y\), and conclude there is a relationship between \(X_j\) and \(Y\).
- If our coefficient \(\hat\beta_j\) is not statistically significant, then we cannot reject there is no relationship between \(X_j\) and \(Y\).
R-Squared is provided in the regression output, and was discussed in definition 4.11.
R-Squared is the percentage of variation in \(Y\) our model explains. It is always between 0 and 1. R-squared never decreases as we add more explanatory variables - so it is important not to overly focus on it.
Adjusted R-Squared is similar but penalises models for having too many explanatory variables.
F-tests can test the significance of multiple parameters at once. The mechanics of F-test were provided in chapter 4.
To conduct a f-test in R, we do the following (where model 0 is the null model, and model 1 is the alternative model):
anova(model0, model1)If the test is statistically significant, then all the additional parameters in model 1 are jointly significant. If the test is not statistically significant, then all the additional parameters in model 1 are not jointly significant.
Binomial Logistic Regression
The linear probability model is for binary \(Y\) (bernoulli distribution). The outcome of interest is the \(\E(Y_t | \set X_t) = \P (Y_t = 1 | \set X_t)\), which we denote as \(\pr_t\). The logistic model applies a link function \(g(\cdot)\) to \(\pr_t\) to ensure that predicted probabilities are always between 0 and 1. The model is specified as:
\[ \log\left(\frac{\pr_t}{1 - \pr_t}\right) = \beta_0 + \beta_1X_{t1} + \beta_2X_{t2} + \dots + \beta_pX_{tp} \tag{7.1}\]
Using the property of logarithms, we can rewrite the model in respect to \(\pr_t\):
\[ \pr_t = \frac{e^{\beta_0 + \beta_1X_{t1} + \beta_2X_{t2} + \dots + \beta_pX_{tp}}}{1+e^{\beta_0 + \beta_1X_{t1} + \beta_2X_{t2} + \dots + \beta_pX_{tp}}} \]
Where \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the parameters that explain the relationship between \(X_j\) and \(Y\), that need to be estimated with sample data. The model is always estimated with the maximum likelihood estimator.
model <- glm(Y ~ X1 + X2 + X3,
data = mydata,
family = "binomial")
summary(model)Details on the maximum likelihood estimator are provided here.
Interpretation of coefficient estimates are very difficult - instead, we will interpret odds ratios, which are \(e^{\hat\beta_j}\). Remember that these interpretations are not causal unless our \(X_j\) has been randomly assigned and meets the condition of independence (definition 1.6):
Odds of an event \(A\) is the probability of \(A\) occuring divided by the probability of event \(A\) not occuring. We can apply the same logic to \(\P(Y_t = 1) = \pr_t\):
\[ \mathrm{odds}_A = \frac{\P(A)}{1 - \P(A)} \quad \implies \quad \mathrm{odds}_{Y_t = 1} = \frac{\pr_t}{1-\pr_t} \]
From eq. 7.1, if we exponent both sides, we can get the odds of \(Y_t = 1\) from the logistic regression:
\[ \frac{\pr_t}{1-\pr_t} = e^{\beta_0 + \beta_1X_{t1} + \beta_2X_{t2} + \dots + \beta_pX_{tp}} \]
An odds ratio is a ratio of two odds. For the odds of event \(A\) and \(B\), the odds ratio is:
\[ OR = \frac{\mathrm{odds}_A}{\mathrm{odds}_B} = \frac{\P A / ( 1 - \P A)}{\P B / (1 - \P B)} \]
We can apply the same to the logistic regression. We can find the odds of \(\pr_t | X_j = x+1\) and \(\pr_t | X_j = x\):
\[ OR = \frac{\mathrm{odds}_{\pr_t | X_j = x+1}}{\mathrm{odds}_{\pr_t | X_j = x}} = e^{\beta_j} \]
Thus, \(e^{\beta_j}\) is the multiplicative change in the odds of event \(Y_t = 1\), for every one unit increase in \(X_j\).
| Continuous \(X_{j}\) | Binary \(X_{j}\) | |
| \(\hat\beta_j\) | For every one unit increase in \(X_{j}\), there is an expected \(\times e^{\hat\beta_j}\) multiplicative change in the odds of a unit being in category \(Y_t=1\), holding all other explanatory variables constant. | There is a \(\times e^{\hat\beta_j}\) multiplicative difference in the odds of a unit being in category \(Y_t=1\) between category \(X_{j} = 1\) and category \(X_{j} = 0\), holding all other explanatory variables constant. |
| \(\hat\beta_0\) | The odds of event \(Y_t = 1\) when all explanatory variables equal 0 is \(e^{\hat\beta_0}\). | The odds of event \(Y_t = 1\) for category \(X_j = 0\) is \(e^{\hat\beta_0}\), when all other explanatory variables equal 0. |
To predict probabilities \(\pr_t = \P(Y_t = 1)\) for new observations \(t\), we use our fitted values equation:
\[ \hat\pr_t = \frac{e^{\hat\beta_0 + \hat\beta_1X_{t1} + \hat\beta_2X_{t2} + \dots + \hat\beta_pX_{tp}}}{1+e^{\hat\beta_0 + \hat\beta_1X_{t1} + \hat\beta_2X_{t2} + \dots + \hat\beta_pX_{tp}}} \]
In R, we can use the predict command:
my_predictions <- predict(model,
newdata = my_new_data,
type = "response")Where my_new_data is a data frame with \(X_1, \dots, X_p\) values for all the new observations we want to predict \(\hat Y\) for. If you only want to predict for your existing data on which the model was fitted on, you do not need the newdata argument.
We can also do classification, which is not about predicting the probability \(\pr_t\), but instead predicting if \(\hat Y = 1\) or \(\hat Y = 0\). We generally assign units \(\pr_t > 0.5\) to \(\hat Y = 1\), and others to \(\hat Y = 0\).
Hypothesis testing of each coefficient with a wald test (and the p-values) is provided in most regression outputs. The mechanics of wald test were provided in chapter 3.
- If our coefficient \(\hat\beta_j\) is statistically significant, then we can reject that there is no relationship between \(X_j\) and \(Y\), and conclude there is a relationship between \(X_j\) and \(Y\).
- If our coefficient \(\hat\beta_j\) is not statistically significant, then we cannot reject there is no relationship between \(X_j\) and \(Y\).
Likelihood Ratio tests can test the significance of multiple parameters at once. The mechanics of Likelihood Ratio test were provided in chapter 3.
To conduct a Likelihood Ratio test in R, we do the following (where model 0 is the null model, and model 1 is the alternative model):
anova(model0, model1, test = "LRT")If the test is statistically significant, then all the additional parameters in model 1 are jointly significant. If the test is not statistically significant, then all the additional parameters in model 1 are not jointly significant.
AIC and BIC are model fit metrics (information criterion statistics) that are provided in the regression output, and was discussed in chapter 3.
AIC and BIC can be used to compare how good different models are, which allows us to choose which model we want to use. Both metrics reward better explanatory power, and punish complexity (more parameters). Lower values of AIC and BIC indicate better model fits.
The BIC tends to penalise extra parameters more strongly than AIC. Generally, when comparing two models, AIC and BIC will agree.