Least Squares Theory

In the previous chapter, we talked about the Maximum Likelihood Estimator. However, MLE can be computationally difficult to calculate, often without analytical solutions, requiring iterative algorithms.

In this chapter, we introduce a family of “least squares” estimators that, for a specific type of model (the linear model), have closed-form solutions derivable through minimisation. First, we introduce the classical linear model and its assumptions. Then, we discuss the most common least squares estimator: ordinary least squares (OLS). Finally, we discuss extensions to OLS (generalised least squares, instrumental variables) for when the assumptions needed for OLS are violated.


The Classical Linear Model

A class of estimators called least squares estimators is another way to estimate population parameters \(\theta\) besides MLE. However, this class of estimators can only be applied in one specific model: the classical linear model.

Let us say there are individuals \(t = 1, 2, \dots, n\) in the population. Each individual’s \(Y\) value is determined by one of a set of random variables \(Y_1, \dots, Y_n\). Any individual random variable \(Y_t\) is a data generating process defined as \(Y_t \sim \set D(\mu_Y, \sigma^2_Y)\), where \(\set D\) represents any distributional form, \(\mu_Y\) is the mean of the random variable, and \(\sigma^2_Y\) is the variance of the random variable.

The classical linear model is a specification that the mean \(\mu_Y\) of \(Y_t\) is linearly determined by a set of explanatory variables \(\set X = \{X_1, \dots, X_p\}\):

\[ \E(Y_t|\set X_t) = \beta_0 + \beta_1 X_{1t} + \beta_2 X_{2t} + \dots + \beta_p X_{pt} \]

Where \(\beta_0, \dots, \beta_p\) are a set of population parameters (that need to be estimated) that determine how \(\mu_Y\) changes with respect to the explanatory variables \(\set X\).

You will frequently see the linear model represented in another form:

\[ Y_t = \underbrace{\beta_0 + \beta_1 X_{1t} + \dots + \beta_p X_{pt}}_{\E(Y_t | \set X_t)} + \eps_t, \quad \eps_t \sim \set D(\mu_{\eps_t} = 0, \sigma^2_{\eps_t}) \]

Where the error term \(\eps_t\) represents the variance/randomness in our data generating process, and the rest of the model represents \(\mu_Y\).

We know that this data generating process applies for random variables \(Y_1, Y_2, \dots, Y_n\). To represent all random variables together, we use the matrix representation of the linear model:

\[ \b y = \b{X\beta} + \b\eps \iff \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n\end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \dots & \vdots \\ 1 & x_{n1} & \dots & x_{np}\end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{pmatrix} + \begin{pmatrix} \eps_1 \\ \eps_2 \\ \vdots \\ \eps_n \end{pmatrix} \]
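To make the matrix form concrete, here is a minimal sketch that simulates data from a linear model and assembles \(\b X\), \(\b\beta\), \(\b\eps\) and \(\b y\) with NumPy. The sample size, the parameter values, and the choice of normally distributed regressors and errors are all assumptions made for illustration; the model itself allows any distribution \(\set D\) with mean zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n, p = 200, 2                      # hypothetical sample size and number of regressors
beta = np.array([1.0, 2.0, -0.5])  # hypothetical (beta_0, beta_1, beta_2)

# Design matrix: a leading column of ones for the intercept, then p regressors.
regressors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), regressors])

# Errors with mean zero and constant variance (normality is just one choice of D).
eps = rng.normal(loc=0.0, scale=1.0, size=n)

# The model in matrix form: y = X beta + eps.
y = X @ beta + eps

print(X.shape, beta.shape, y.shape)  # (200, 3) (3,) (200,)
```

Note that \(\b X\) has \(p + 1\) columns because of the intercept column, matching the \(p + 1\) entries of \(\b\beta\).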

The classical linear model has a set of five assumptions: linearity, i.i.d., no perfect multicollinearity, strict exogeneity, and spherical errors.


Ordinary Least Squares

OLS is an estimation procedure that chooses the values \(\hat{\b\beta}\) that minimise the sum of squared residuals (SSR), also called the sum of squared errors. This is the sum of the squared differences between \(\b y\) and the predicted values \(\hat{\b y}\), where \(\hat{\b y}\) is defined as \(\hat{\b y} = \b X \hat{\b\beta}\).

Definition 4.7 (Sum of Squared Residuals) We will define the SSR by function \(S(\hat{\b\beta})\):

\[ S(\hat{\b\beta}) = \sum\limits_{i=1}^n (Y_i - \hat Y_i)^2 \ = \ (\b y - \hat{\b y})^\top (\b y - \hat{\b y}) \]

One common question is why we square the residuals (the differences between \(\b y\) and \(\hat{\b y}\)). The main reason is that squaring removes the sign of each residual, so positive and negative residuals cannot cancel each other out. We do not care about the direction of residuals/errors, only the magnitude.

The reason we choose to square the residuals, rather than take their absolute value, is a variety of useful properties of OLS (unbiasedness, variance, efficiency) that we will explore throughout this chapter.

We know that predicted \(\hat{\b y} = \b X \hat{\b \beta}\). Thus, let us plug that into \(S(\hat{\b\beta})\), and simplify to get:

\[ \begin{align} S(\hat{\b\beta}) & = (\b y - \b X \hat{\b\beta})^\top (\b y - \b X \hat{\b\beta}) \\ & = \b y^\top \b y - \hat{\b\beta}^\top \b X^\top \b y - \b y^\top \b X \hat{\b\beta} + \hat{\b\beta}^\top \b{X^\top X} \hat{\b\beta} \\ & = \b y^\top \b y - 2 \hat{\b\beta}^\top \b{X^\top y} + \hat{\b\beta}^\top \b{X^\top X} \hat{\b\beta} \end{align} \]

Now, we want to minimise \(S\) with respect to \(\hat{\b\beta}\), so let us take the gradient of function \(S\) with respect to \(\hat{\b\beta}\), and set it equal to 0. Then, we can solve for \(\hat{\b\beta}\) to get our parameter estimates:

\[ \begin{align} \frac{\partial S}{\partial \hat{\b\beta}} = -2 \b{X^\top y} + 2 & \b{X^\top X} \hat{\b\beta} = 0 \\ 2 \b{X^\top X}\hat{\b\beta} & = 2 \b{X^\top y} \\ \hat{\b\beta} & = (2 \b{X^\top X})^{-1}2 \b{X^\top y} \\ \hat{\b\beta} & = (2^{-1})2(\b{X^\top X})^{-1}\b{X^\top y} \\ \hat{\b\beta} & = (\b{X^\top X})^{-1}\b{X^\top y} \end{align} \]

Definition 4.8 (OLS Estimate) The OLS estimate of parameters \(\b\beta\) is:

\[ \hat{\b\beta}_{\mathrm{OLS}} = (\b{X^\top X})^{-1}\b{X^\top y} \]
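As a small numerical check of this closed-form solution, the sketch below solves the normal equations \(\b{X^\top X}\hat{\b\beta} = \b{X^\top y}\) on a tiny made-up data set and compares the result with NumPy's built-in least-squares solver. The helper name `ols` and the data are hypothetical, chosen only for illustration.

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations X'X b = X'y for b."""
    # np.linalg.solve avoids forming the explicit inverse, which is
    # numerically safer than computing inv(X'X) @ X'y directly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny made-up data set: an intercept column plus one regressor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

beta_hat = ols(X, y)
print(beta_hat)  # approximately [1.1 1.1]

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X'y + 2 X'X b should vanish at the minimiser.
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat
```

Solving the linear system rather than inverting \(\b{X^\top X}\) is a design choice for numerical stability; both give the same estimate when \(\b{X^\top X}\) is well conditioned.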

For simple linear regression, our sum of squared errors (definition 4.7) is:

\[ S(\hat\beta_0, \hat\beta_1) = \sum\limits_{t=1}^n (Y_t - \hat\beta_0 - \hat\beta_1 X_t)^2 \]

Our first order conditions, obtained by taking the partial derivatives with respect to \(\hat\beta_0\) and \(\hat\beta_1\), are:

\[ \begin{align} & \frac{\partial S}{\partial \hat\beta_0} = -2\sum\limits_{t=1}^n (Y_t -\hat\beta_0 - \hat\beta_1 X_t) = 0 \\ & \frac{\partial S}{\partial \hat\beta_1} = -2\sum\limits_{t=1}^n X_t (Y_t -\hat\beta_0 - \hat\beta_1 X_t) = 0 \end{align} \]

And the final solutions for \(\hat\beta_0\) and \(\hat\beta_1\) after solving this system of equations is:

\[ \begin{align} & \hat\beta_0 = \bar Y - \hat\beta_1 \bar X \\ & \hat\beta_1 = \frac{\sum (X_t -\bar X)(Y_t -\bar Y)}{\sum (X_t - \bar X)^2} = \frac{Cov(X, Y)}{\V X} \end{align} \]
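The simple regression solution can be verified on a small example. The sketch below uses made-up data to compute the slope from the sums of deviations and again as the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\). It then checks that both first order conditions hold at the solution.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the closed-form solution.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# The same slope as sample covariance over sample variance of x.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
beta1_ratio = cov_xy / var_x

# First order conditions: residuals sum to zero and are orthogonal to x.
residuals = y - beta0_hat - beta1_hat * x
foc_intercept = residuals.sum()
foc_slope = (x * residuals).sum()
```

The \(n - 1\) divisors in the covariance and variance cancel, which is why the ratio matches the sums-of-deviations formula exactly.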

OLS has some useful properties, which we explore below.


OLS Estimator Properties

We know that an estimator has two finite sample properties: unbiasedness (definition 2.4) and variance (definition 2.5), and has the asymptotic property of consistency (definition 2.6). Before we explore these properties, let us first transform our OLS estimates (definition 4.8) into a form more useful for showing these properties of OLS.

Let us start with our OLS solution (definition 4.8), and plug in our original model \(\b y = \b{X\beta} + \b\eps\) into where \(\b y\) is in the OLS solution:

\[ \begin{align} \hat{\b\beta} & = (\b{X^\top X})^{-1} \b{X^\top y} \\ & = (\b{X^\top X})^{-1} \b X^\top (\b{X\beta} + \b\eps) \\ & = \underbrace{(\b{X^\top X})^{-1}\b{X^\top X}}_{\text{inverses cancel}}\b\beta + (\b{X^\top X})^{-1} \b{X^\top \eps} \\ & = \b\beta + (\b{X^\top X})^{-1}\b{X^\top \eps} \end{align} \tag{4.2}\]
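Equation 4.2 is an exact algebraic identity, so it can be checked in a simulation where the true \(\b\beta\) and \(\b\eps\) are known. The following sketch assumes hypothetical parameter values and normally distributed data. In practice \(\b\eps\) is unobserved, so this decomposition is a tool for proofs rather than for computation.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n = 50
beta = np.array([0.5, 2.0])  # hypothetical true parameters
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
y = X @ beta + eps

# OLS estimate from the data.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sampling error term (X'X)^{-1} X' eps from eq. 4.2.
sampling_error = np.linalg.solve(X.T @ X, X.T @ eps)

print(np.allclose(beta_hat, beta + sampling_error))  # True
```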

Now, we are ready to explore the properties of the ordinary least squares estimator: unbiasedness, variance, and consistency.


Gauss-Markov Theorem

You might ask, what is so special about Ordinary Least Squares, and why should we use this estimator? The answer lies in the Gauss-Markov Theorem.

Theorem 4.5 (Gauss-Markov) If the assumptions of linearity, i.i.d., no perfect multicollinearity, strict exogeneity, and spherical errors are all met, then the Ordinary Least Squares estimator is the best linear unbiased estimator (BLUE): the estimator with the lowest variance among all linear unbiased estimators.

Formally, if \(\hat{\b\beta}\) is the OLS estimator, and \(\tilde{\b\beta}\) is any other linear unbiased estimator, then

\[ \V(\hat{\b\beta}|\b X) ≤ \V(\tilde{\b\beta} | \b X) \]


Any linear estimator \(\tilde{\b\beta}\) must be in the form \(\tilde{\b\beta} = \b{Cy}\), where \(\b C\) is some linear mapping. For example, OLS can be written as \(\hat{\b\beta} = \b{Cy}\) with \(\b C = (\b{X^\top X})^{-1}\b X^\top\). Before we prove the Gauss-Markov theorem, we need a lemma about any unbiased linear estimator.

Lemma 4.1 For any linear estimator \(\tilde{\b\beta} = \b{Cy}\) to be unbiased, \(\b{CX} = \b I\).

Proof: Let us start off with our linear estimator \(\tilde{\b\beta} = \b{Cy}\), and plug in the true linear model \(\b y = \b{X\beta} + \b\eps\) into our linear estimator:

\[ \tilde{\b\beta} = \b C (\b{X\beta} + \b\eps) = \b{CX\beta} + \b{C\eps} \]

Now, let us find the expected value of this estimator conditional on \(\b X\). Remember that the expected values of constants (like \(\b C\), \(\b \beta\), and \(\b X\) since we are conditioning on \(\b X\)) are the constants themselves.

\[ \begin{align} \E(\tilde{\b\beta}|\b X) & = \E(\b{CX\beta} + \b{C\eps} | \b X) \\ & = \b{CX\beta} + \b C \E(\b\eps| \b X) \end{align} \]

From the strict exogeneity assumption (definition 4.4), we know \(\E(\b\eps | \b X) = 0\), so we can simplify to

\[ \E(\tilde{\b\beta}|\b X) = \b{CX\beta} \]

And using the law of iterated expectations (theorem 1.3), we can find \(\E\tilde{\b\beta}\):

\[ \E\tilde{\b\beta} = \E[\E(\tilde{\b\beta}|\b X)] = \E[\b{CX\beta}] = \b{CX\beta} \]

For unbiasedness (definition 2.4), we know \(\E\tilde{\b\beta} = \b\beta\). The only way \(\b{CX\beta}\) will equal \(\b\beta\) is if \(\b{CX} = \b I\). Thus, for any linear unbiased estimator, the lemma \(\b{CX} = \b I\) must hold.
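The condition \(\b{CX} = \b I\) is easy to check numerically for specific choices of \(\b C\). The sketch below is an illustration under assumed inputs: it builds the OLS weights \(\b C = (\b{X^\top X})^{-1}\b X^\top\), a weighted alternative with arbitrary positive weights (hypothetical), and a shrunken version of OLS that is linear but violates the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
I = np.eye(X.shape[1])

# OLS weights: C = (X'X)^{-1} X'.
C_ols = np.linalg.solve(X.T @ X, X.T)

# Another linear estimator: weighted least squares with arbitrary
# positive weights (hypothetical, chosen only for illustration).
W = np.diag(rng.uniform(0.5, 2.0, size=n))
C_wls = np.linalg.solve(X.T @ W @ X, X.T @ W)

# A linear estimator that is biased: OLS shrunk by 10%.
C_shrunk = 0.9 * C_ols

print(np.allclose(C_ols @ X, I))     # True
print(np.allclose(C_wls @ X, I))     # True
print(np.allclose(C_shrunk @ X, I))  # False, CX = 0.9 I
```

The first two estimators satisfy the lemma and are therefore unbiased under strict exogeneity. The third has \(\E(\tilde{\b\beta}|\b X) = 0.9\,\b\beta\).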

With this lemma, now let us prove Gauss-Markov. First, let us calculate the variance of unbiased linear estimator \(\tilde{\b\beta}\):

\[ \begin{align} \V(\tilde{\b\beta} | \b X) & = \V(\b{Cy} | \b X) \\ & = \V(\b C( \b{X\beta} + \b \eps)| \b X) \\ & = \V(\b{CX\beta} + \b{C\eps} | \b X) \end{align} \]

And since we know from Lemma 4.1 that \(\b{CX = I}\), we can get

\[ \V(\tilde{\b\beta} | \b X) = \V(\b\beta + \b{C\eps} | \b X) \]

We know that \(\b\beta\) is a vector of fixed constants (the true population values). We also know \(\b C\) is some fixed constant matrix (that depends on \(\b X\), but we are conditioning on \(\b X\)). Thus, we can use theorem 1.2 to rewrite the above as

\[ \V(\tilde{\b\beta} | \b X) = \b C\V(\b\eps | \b X) \b C^\top \]

Now, according to the assumption of spherical errors (definition 4.6), we know that \(\V(\b \eps| \b X) = \sigma^2 \b I_n\). Thus, let us plug that into our equation to get

\[ \begin{align} \V(\tilde{\b\beta} | \b X) & = \b C \sigma^2 \b I_n \b C^\top \\ & = \sigma^2 \b{CC^\top} \end{align} \tag{4.3}\]

Now we have the variance of estimator \(\tilde{\b\beta}\). To prove Gauss-Markov, we need to show that the variance of \(\tilde{\b\beta}\) is at least as large as the variance of \(\hat{\b\beta}\). For this to be true,

\[ \V(\tilde{\b\beta}|\b X) - \V(\hat{\b\beta}| \b X) ≥ 0 \]

We can plug in the variance of \(\tilde{\b\beta}\) from eq. 4.3, and the variance of OLS \(\hat{\b\beta}\) from theorem 4.3:

\[ \begin{align} \sigma^2 \b{CC^\top} - \sigma^2 (\b{X^\top X})^{-1} & ≥ 0 \\ \sigma^2 (\b{CC^\top} - (\b{X^\top X})^{-1}) & ≥ 0 \end{align} \]

We know from Lemma 4.1 that \(\b{CX} = \b I\), which through the properties of transposes, also implies that \(\b{X^\top C^\top} = (\b{CX})^\top = \b I\). Multiplying by \(\b I\) doesn’t change anything, so we can insert a \(\b{CX}\) and \(\b{X^\top C^\top}\) into our equation above to get

\[ \sigma^2 (\b{CC^\top} - \b{CX} (\b{X^\top X})^{-1}\b{X^\top C^\top}) ≥ 0 \]

Factoring out \(\b C\) and \(\b C^\top\), and remembering our residual maker \(\b M\) (definition 4.10),

\[ \begin{align} \sigma^2 \b C(\b I - \b X(\b{X^\top X})^{-1}\b X^\top) \b C^\top & ≥ 0 \\ \sigma^2 \b{CMC^\top} & ≥ 0 \end{align} \]

We know \(\sigma^2\), the variance of the error term, must be positive. \(\b{CMC^\top}\) is also a positive semi-definite matrix (behaves like a non-negative number). The proof is provided below.

To show \(\b{CMC^\top}\) is positive semi-definite, the following must be true for every vector \(\b z\):

\[ \b{z^\top CMC^\top z} ≥ 0 \]

Remember from definition 4.10 that \(\b M\) is symmetric and idempotent. This implies that \(\b M = \b{MM} = \b{M M^\top}\). Thus, plugging this in, we get

\[ \underbrace{\b{z^\top CM}}_{\b w^\top} \underbrace{\b{M^\top C^\top z}}_{\b w} = \b{w^\top w} = \sum\limits_{i=1}^n w_i^2 ≥ 0 \]

Which is true since the square of any number cannot be negative. Thus, \(\b{CMC^\top}\) is positive semi-definite, and behaves like a non-negative number.
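The key steps of the proof can be checked numerically. The sketch below is an illustration with assumed inputs: it takes a linear unbiased estimator other than OLS (weighted least squares with arbitrary hypothetical weights) and confirms that \(\b{CC^\top} - (\b{X^\top X})^{-1}\) equals \(\b{CMC^\top}\). It also confirms that the quadratic form equals \(\b{w^\top w}\) and that the eigenvalues are non-negative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# A linear unbiased estimator other than OLS (hypothetical weights).
W = np.diag(rng.uniform(0.5, 2.0, size=n))
C = np.linalg.solve(X.T @ W @ X, X.T @ W)

XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T  # residual maker

# Variance gap divided by sigma^2, written two ways.
lhs = C @ C.T - XtX_inv
rhs = C @ M @ C.T

# Quadratic form z' C M C' z for an arbitrary z, and the vector w = M'C'z.
z = rng.normal(size=X.shape[1])
w = M.T @ C.T @ z
quad_form = z @ rhs @ z

min_eigenvalue = np.linalg.eigvalsh(rhs).min()
```

Here `lhs` and `rhs` agree because \(\b{CX} = \b I\) for this \(\b C\), and `quad_form` matches `w @ w`, the sum of squares in the last step of the proof.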

This property means that, when the assumptions above hold, no other linear unbiased estimator has lower variance than OLS. This makes it very popular in statistics (especially considering many statistical models are linear).


OLS and Non-Spherical Errors

For the classical linear model, one of the assumptions was spherical errors (definition 4.6). This was an assumption made on the variance-covariance matrix of error term \(\eps_t\).

The spherical errors assumption can thus be violated in two ways. The first is conditional heteroscedasticity, where homoscedasticity is violated but the no-autocorrelation assumption still holds. The variance-covariance matrix of errors will take the form:

\[ \V(\b\eps|\b X) = \b\Omega= \begin{pmatrix} \sigma^2_1 & 0 & 0 & \dots \\ 0 & \sigma^2_2 & 0 & \dots \\ 0 & 0& \sigma^2_3 & \vdots \\ \vdots & \vdots & \dots & \ddots \end{pmatrix} \]

One way to check this is a residual plot of OLS residuals \(\hat\eps_i\) against some explanatory variable \(X\). Under homoscedasticity, the variance of the residuals (how spread out they are vertically) is constant for any value of \(X\).

Under heteroscedasticity, the residual variance is smaller for some \(X\) values, and larger for other \(X\) values. If you see a pattern in the spread of your residual plot, it is likely heteroscedasticity.

The second way spherical errors can be violated is with autocorrelation, where the off-diagonal elements of the variance-covariance matrix are non-zero (and homoscedasticity may be violated as well).

What is the impact of violating spherical errors?

  1. OLS estimates remain unbiased. This is because our OLS unbiasedness proof (theorem 4.2) does not depend on the spherical errors assumption.
  2. Our derived OLS variance is incorrect. This is because our variance formula (theorem 4.3) depends on the spherical errors assumption.
  3. OLS is no longer the best linear unbiased estimator - the linear unbiased estimator with the lowest variance. This is because the Gauss-Markov Theorem (theorem 4.5) depends on the spherical errors assumption. Thus, there are other linear unbiased estimators with lower variance.
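Points 2 and 3 can be illustrated with an exact calculation, using the result \(\V(\tilde{\b\beta}|\b X) = \b C \V(\b\eps|\b X) \b C^\top\) from the Gauss-Markov proof. The sketch below assumes a hypothetical heteroscedastic error covariance in which the error variance grows with \(x_t^2\). The competing estimator, which weights observations by the inverse error variance, is one example of a linear unbiased estimator that beats OLS in this setting.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed

n = 60
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

# Assumed (hypothetical) error covariance: Var(eps_t | X) = x_t^2, no autocorrelation.
Omega = np.diag(x ** 2)

# OLS weights, and the variance OLS actually has under Omega: C Omega C'.
C_ols = np.linalg.solve(X.T @ X, X.T)
var_ols_true = C_ols @ Omega @ C_ols.T

# The spherical-errors formula sigma^2 (X'X)^{-1}, using the average error
# variance as sigma^2, no longer matches the true variance.
sigma2_avg = np.trace(Omega) / n
var_ols_naive = sigma2_avg * np.linalg.inv(X.T @ X)

# A competing linear unbiased estimator that weights by Omega^{-1}.
Omega_inv = np.linalg.inv(Omega)
C_w = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv)
var_weighted = C_w @ Omega @ C_w.T

# OLS variance minus the competitor's variance is positive semi-definite.
gap = var_ols_true - var_weighted
print(np.linalg.eigvalsh(gap))  # eigenvalues are non-negative up to rounding
```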

Since OLS remains unbiased, we can still use OLS as an estimator. We just have to correct our OLS variance calculations to account for the fact that the spherical errors assumption is not met. There are three different variance formulas used for different forms of non-spherical errors.


Generalised Least Squares

We mentioned that if spherical errors (definition 4.6) is violated, OLS is no longer the linear unbiased estimator with the least variance. Instead, another estimator, the Generalised Least Squares estimator, is the best linear unbiased estimator.

In the generalised least squares estimator, we assume that the variance-covariance matrix of the errors is

\[ \V(\b\eps | \b X) = \E(\b{\eps\eps^\top} | \b X) = \sigma^2 \b\Omega \tag{4.4}\]

Where \(\sigma^2\) is an unknown scalar constant, but \(\b\Omega\) is a known matrix that gives the population variance-covariance matrix of errors up to the scalar \(\sigma^2\). The variance is equivalent to \(\E(\b{\eps\eps^\top} | \b X)\) because we assume by strict exogeneity (definition 4.4) that \(\E(\b\eps | \b X) = 0\).

Let us define a matrix \(\b\Omega^{-1/2}\), which will be the inverse of the square root of \(\b\Omega\). This means that the following should be true:

\[ \b\Omega^{-1/2} \ \b\Omega \ {\b\Omega^{-1/2}}^\top = \b I \tag{4.5}\]
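A matrix satisfying eq. 4.5 is not unique, and there is more than one way to construct it. The sketch below shows two common constructions for an assumed positive definite \(\b\Omega\) (an AR(1)-style correlation matrix with a hypothetical \(\rho = 0.6\)): the inverse of the Cholesky factor, and the symmetric inverse square root from the eigendecomposition. Both are illustrations rather than the construction this chapter necessarily has in mind.

```python
import numpy as np

# Hypothetical positive definite Omega: correlations rho^|i-j| with rho = 0.6.
n, rho = 5, 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Option 1: Cholesky factor Omega = L L', then take Omega^{-1/2} = L^{-1}.
L = np.linalg.cholesky(Omega)
A_chol = np.linalg.inv(L)

# Option 2: symmetric inverse square root from the eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(Omega)
A_sym = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Both satisfy eq. 4.5: A Omega A' = I.
check_chol = A_chol @ Omega @ A_chol.T
check_sym = A_sym @ Omega @ A_sym.T

# In both cases A'A recovers the inverse of Omega.
Omega_inv = np.linalg.inv(Omega)
```

The Cholesky version is generally not symmetric, which is why the transpose in eq. 4.5 matters. The eigendecomposition version is symmetric, so its transpose can be dropped.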

We pre-multiply all terms of the model \(\b y = \b{X\beta} + \b\eps\) by \(\b\Omega^{-1/2}\) to get a transformed model

\[ \underbrace{\b\Omega^{-1/2} \b y}_{\b y^*} = \underbrace{\b\Omega^{-1/2} \b X}_{\b X^*} \b \beta + \underbrace{\b\Omega^{-1/2} \b \eps}_{\b \eps^*} \tag{4.6}\]

This transformed model meets spherical errors, which we can prove by plugging in the definition of \(\b\eps^*\) from above, and the definition of \(\E(\b{\eps\eps^\top})\) from eq. 4.4:

\[ \begin{align} \V (\b\eps^* | \b X) & = \E(\b\eps^* \b\eps^{*\top}) \\ & = \E(\b\Omega^{-1/2} \b \eps \b\eps^\top {\b\Omega^{-1/2}}^\top) \\ & = \b\Omega^{-1/2} \E(\b{\eps \eps^\top}) {\b\Omega^{-1/2}}^\top \\ & = \b\Omega^{-1/2} \sigma^2 \b\Omega {\b\Omega^{-1/2}}^\top \end{align} \]

And by moving scalar \(\sigma^2\) to the front, and using the property from eq. 4.5, we get:

\[ \V (\b\eps^* | \b X) = \sigma^2 \underbrace{\b\Omega^{-1/2} \b\Omega {\b\Omega^{-1/2}}^\top}_{\b I} = \sigma^2 \b I \]

This proves that the transformed model meets the spherical errors assumption (definition 4.6). Thus, we can use OLS on this transformed model, and it will be the best linear unbiased estimator. Our OLS estimator (definition 4.8) of the transformed model will be:

\[ \hat{\b\beta} = (\b X^{*\top} \b X^*)^{-1} \b X^{*\top} \b y^* \]

And if we plug in our definitions of \(\b y^*\) and \(\b X^*\) from eq. 4.6, we can get

\[ \hat{\b\beta} = \left[(\b\Omega^{-1/2} \b X)^\top (\b\Omega^{-1/2} \b X) \right]^{-1} (\b\Omega^{-1/2} \b X)^\top (\b\Omega^{-1/2} \b y) \]

And using the properties of matrix transposes, and that \({\b\Omega^{-1/2}}^\top \b\Omega^{-1/2} = \b\Omega^{-1}\) (which follows from eq. 4.5), we can get

\[ \begin{align} \hat{\b\beta} & = [\b X^\top {\b\Omega^{-1/2}}^\top \b\Omega^{-1/2} \b X]^{-1} \b X^\top {\b\Omega^{-1/2}}^\top \b\Omega^{-1/2} \b y \\ & = (\b X^\top \b\Omega^{-1} \b X)^{-1} \b X^\top \b\Omega^{-1} \b y \end{align} \]

Definition 4.14 (Generalised Least Squares Estimator) The GLS estimator is

\[ \hat{\b\beta} = (\b X^\top \b\Omega^{-1} \b X)^{-1} \b X^\top \b\Omega^{-1} \b y \]

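Where \(\b\Omega\) is the population variance-covariance matrix of errors, \(\V(\b\eps|\b X) = \b\Omega\) (with any scalar factor \(\sigma^2\) from eq. 4.4 absorbed into \(\b\Omega\)). The variance is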

\[ \V\hat{\b\beta} = (\b X^\top \b\Omega^{-1} \b X)^{-1} \]
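The equivalence between the GLS formula and OLS on the transformed model can be confirmed numerically. The sketch below assumes a known, hypothetical \(\b\Omega\) (heteroscedastic errors with variance \(x_t^2\)) and uses the inverse Cholesky factor as \(\b\Omega^{-1/2}\), which is one of several valid choices. All parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

n = 40
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])  # hypothetical true parameters

# Assumed known error covariance: Var(eps_t | X) = x_t^2, no autocorrelation.
Omega = np.diag(x ** 2)
eps = rng.normal(size=n) * x
y = X @ beta + eps

# GLS directly from the formula in definition 4.14.
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# GLS as OLS on the transformed model y* = X* beta + eps*.
A = np.linalg.inv(np.linalg.cholesky(Omega))  # one choice of Omega^{-1/2}
X_star, y_star = A @ X, A @ y
beta_transformed = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Variance of the GLS estimator, treating Omega as the full error covariance.
var_gls = np.linalg.inv(X.T @ Omega_inv @ X)

print(np.allclose(beta_gls, beta_transformed))  # True
```

With a diagonal \(\b\Omega\), the transformation simply divides each observation by its error standard deviation, so the transformed regression is a weighted regression.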


The obvious issue is that we do not generally know the form of \(\b\Omega\). This means the theoretical GLS estimator is often not feasible. Instead, we will use another estimator, called the Feasible Generalised Least Squares (FGLS) estimator.

The only time GLS is feasible is when we are confident we have a specific form of autocorrelation (such as AR(1) or MA(1), which we will cover in the stochastic processes chapter). This is because the variance-covariance matrix is known for these processes, so we can directly use it in GLS.


Feasible Generalised Least Squares

The issue with GLS is that we do not know the form of \(\b\Omega\). Thus, the feasible generalised least squares estimator first estimates \(\b\Omega\) with \(\hat{\b\Omega}\), and then uses \(\hat{\b\Omega}\) in the GLS estimator.

There are a few ways we can go about doing this, including the Cochrane-Orcutt estimator, the Weighted Least Squares estimator, and the 2-stage GLS estimator:


Instrumental Variables Estimator

One of the assumptions in the classical model is exogeneity (definition 4.4). This assumption is critical in the proofs of OLS unbiasedness and asymptotic consistency. This implies that when exogeneity is violated, our estimates \(\hat{\b\beta}\) become unreliable.

The instrumental variables estimator is a solution to this issue. The idea is to find a third variable (or more) \(Z\), that does meet this condition of exogeneity:

\[ \E(\b{Z^\top \eps}) = 0 \tag{4.8}\]

We will have no endogeneity if \(Z\) is not correlated with the error term. We then use these instruments \(Z\) to predict \(X\), which will get us the parts of \(X\) that are explained by \(Z\) (and thus, uncorrelated with the error term). Then, we can use that exogenous part of \(X\) to estimate the relationship with \(Y\). However, this hinges on \(Z\) meeting that moment condition.

Definition 4.15 (Assumptions of Instruments) For instrument(s) \(Z\) to meet the moment condition \(\E(\b{Z^\top \eps}) = 0\) and be useful for estimation, the following must be true:

  1. \(Z\) must be exogenous/ignorable, i.e. \(Cov(Z, \eps) = 0\).
  2. \(Z\) must be relevant, i.e. \(Cov(Z, X) ≠ 0\).
  3. \(Z\) must meet the exclusion restriction (which is implied by exogeneity). This means that \(Z\) cannot have an independent effect on \(Y\), outside of its impact on \(Y\) through \(X\).
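To see why these conditions matter, the following simulation sketch builds a regressor that is correlated with the error term and an instrument that is relevant and exogenous by construction. It assumes the standard just-identified IV estimator \((\b{Z^\top X})^{-1}\b{Z^\top y}\), which has not been derived at this point in the chapter. The data generating process, parameter values, and sample size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed

n = 200_000
z = rng.normal(size=n)          # instrument
u = rng.normal(size=n)          # unobserved factor driving both x and the error
x = z + u                       # relevance: Cov(z, x) = 1
eps = u + rng.normal(size=n)    # endogeneity: Cov(x, eps) = 1, but Cov(z, eps) = 0
y = 1.0 + 2.0 * x + eps         # hypothetical true slope is 2

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS is inconsistent here because x is correlated with eps.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Just-identified IV estimator (assumed form): (Z'X)^{-1} Z'y.
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print(beta_ols[1], beta_iv[1])  # OLS slope is pulled away from 2; IV slope is close to 2
```

In this design the OLS slope converges to \(2 + Cov(X, \eps)/\V X = 2.5\), while the IV slope converges to the true value of 2.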


Let us derive the IV estimator (and an alternative IV estimator called 2SLS), and explore the asymptotic properties of this estimator.


Statistical Inference

Standard errors are, by definition, the square root of the variance of the estimator, which we derived for both OLS under spherical errors and OLS under non-spherical errors.

There is an issue though: \(\sigma^2\) is the population variance of error term \(\eps_i\), and appears in the OLS variance under spherical errors. But we don’t know this population value. Thus, we will need an estimator \(s^2\) that will estimate \(\sigma^2\):

\[ s^2 = \frac{\b{\hat\eps^\top \hat\eps}}{n - p-1} = \frac{\sum_{t=1}^n \hat\eps_t^2}{n-p-1} \]

Where the residuals \(\hat{\b\eps}\) are equal to \(\b y - \hat{\b y}\), and can be calculated with residual maker \(\b M\) as shown in eq. 4.1. \(n\) is the size of our sample, and \(p\) is the number of explanatory variables we have. We will not prove it here, but this is an unbiased estimator of \(\sigma^2\).
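As a worked example on a tiny made-up data set, the sketch below computes \(s^2\), plugs it into the spherical-errors variance \(s^2 (\b{X^\top X})^{-1}\), and takes square roots of the diagonal to get standard errors. The t-statistics shown are for the hypothetical null \(\beta_j = 0\).

```python
import numpy as np

# Made-up data: an intercept column plus one explanatory variable (p = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])
n, k = X.shape          # k = p + 1 columns including the intercept

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Unbiased estimator of sigma^2: SSR / (n - p - 1).
s2 = residuals @ residuals / (n - k)

# Estimated variance of the OLS estimator under spherical errors.
var_beta_hat = s2 * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(var_beta_hat))

# t-statistics for the null hypothesis beta_j = 0.
t_stats = beta_hat / std_errors

print(s2)          # approximately 1.35
print(std_errors)  # approximately [0.972 0.520]
```

These t-statistics would be compared against a t-distribution with \(n - p - 1\) degrees of freedom.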

For OLS variance in conditional heteroscedasticity (robust), we have the unknown population term \(\sigma^2_i\), which we estimate with \(s^2_i\):

\[ \sigma^2_i \approx s^2_i = \hat\eps_i^2 \]

However, using the estimates \(s^2\) and \(s_i^2\) has an implication: like every estimator, they have their own variance and uncertainty.

Under the central limit theorem (theorem 2.2), our standardised sampling distribution of \(\hat\beta_j\) should be normally distributed. However, because we are estimating \(\sigma^2\) with \(s^2\), this uncertainty in estimates \(s^2\) means we cannot use the normal distribution as given by the central limit theorem. Instead, we use a t-distribution to account for the uncertainty.

Once we have our correct standard errors, we can conduct hypothesis testing. There are two main hypothesis tests: the t-test for single parameters, and the F-test for multiple parameters or comparing models: