Multivariate Methods

So far, we have focused on models with one outcome variable \(Y\). However, in many statistical situations, we have multiple outcome variables.

In this chapter, we start by discussing unsupervised learning methods, including principal components analysis and cluster analysis. Then, we discuss a series of latent variable models, including factor analysis, item response theory, and latent class models.


Overview

In statistics, we often want to measure concepts, but some concepts are not directly observable. For example, we cannot directly measure how happy someone is, or the quality of life in a country. These concepts do, however, cause certain observable indicator variables to change. For example, if quality of life in a country is higher, we might expect that country to exhibit higher salaries, better work-life balance, better health care and education, and so on.

The observed indicator variables \(x_1, \dots, x_p\) are often not of much interest in themselves. Combined, however, they can be summarised in fewer dimensions or used as measures of a concept we care about. The main multivariate approaches include:

| Method | Uses |
|---|---|
| Principal Components Analysis | For dimension reduction and interpreting the main drivers of variation in observed variables. |
| Cluster Analysis | For understanding hidden patterns and structures within our observed variables. |
| Factor Analysis | For measuring continuous latent variables with continuous observed variables. |
| Item Response Theory | For measuring continuous latent variables with binary/categorical observed variables. |
| Latent Class Models | For measuring categorical latent variables with categorical observed variables. |
| Structural Equation Models | To link latent variable models together through larger models of relationships. |


Principal Components Analysis

Principal components analysis (PCA) combines multiple observed variables into fewer variables, a process called dimension reduction. We start with a set of observed variables \(\b x_t = (x_1, x_2, \dots, x_p)_t\) for each observation \(t\). Each observed variable \(x_i\) has a variance \(\V x_i\), and their total variance is \(\V x_1 + \dots + \V x_p\).

PCA takes these \(p\) original variables \(x_1, \dots, x_p\) and calculates a set of \(p\) new variables \(y_1, \dots, y_p\), called principal components. Each principal component \(y_j\) is a linear combination of the original variables:

\[ y_j = a_{1j}x_1 + a_{2j}x_2 + \dots + a_{pj}x_p \]

Together, the principal components have the same total variance as the original variables: \(\sum_j \V y_j = \sum_i \V x_i\). The principal components therefore carry the same information (variation) as the original variables, but distribute it differently across components. By convention, the components are ordered so that \(y_1\) has the largest variance, which is what makes it possible to keep only the first few. Each principal component is uncorrelated with every other principal component, so each one conveys a distinct aspect of the data.
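The two properties above (equal total variance, uncorrelated components) can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes the weights \(a_{ij}\) are obtained from an eigendecomposition of the sample covariance matrix, which is one standard way to compute PCA, and it uses simulated data. The names `mixing`, `X`, `A` and `Y` are illustrative and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 observations of p = 3 correlated variables.
n, p = 500, 3
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(n, p)) @ mixing

# Centre the variables, then eigendecompose the sample covariance matrix.
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order

# Reorder so that the first component has the largest variance.
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], eigvecs[:, order]

# Column j of A holds the weights a_{1j}, ..., a_{pj}; Y holds the component scores.
Y = Xc @ A

total_var_x = np.trace(S)
total_var_y = Y.var(axis=0, ddof=1).sum()
cov_y = np.cov(Y, rowvar=False)
off_diag = cov_y - np.diag(np.diag(cov_y))

print("total variance of x:", total_var_x)
print("total variance of y:", total_var_y)
print("share of variance per component:", eigvals / eigvals.sum())
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

The share of variance per component is what we would look at when deciding how many components to keep for dimension reduction.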

We can interpret our PCA in 3 ways:


Cluster Analysis


Factor Analysis

Latent variables \(\xi\) (also called factors) are variables that we cannot directly measure. However, these latent variables \(\xi\) can be measured through observed outcome variables \(Y_1, \dots, Y_p\), called items. Factor analysis assumes that we have a set of continuous observed items \(Y_1, Y_2, \dots, Y_p\) that are all the result of some continuous latent factor \(\xi\).

The latent factor \(\xi\) is assumed to be distributed \(\xi \sim \mathcal N(\kappa = 0, \ \phi = 1)\), that is, with mean \(\kappa = 0\) and variance \(\phi = 1\). We assume that each item \(Y_i\) is normally distributed and is related to the latent factor \(\xi\) by a linear model:

\[ Y_i = \tau_i + \lambda_i\xi + \delta_i, \quad \delta_i \sim \mathcal N(0, \theta_{ii}) \]

\(\tau_i\) is the intercept. \(\lambda_i\) is the slope, called the factor loading, which determines the relationship (covariance) between the factor \(\xi\) and a specific item \(Y_i\). \(\delta_i\) is the error term, also called the unique factor: the part of the item not explained by the factor.

We make a few assumptions on this linear model above.

  1. The error terms \(\delta_i\) of the regression models between \(\xi\) and \(Y_1, \dots, Y_p\) are normally distributed with a mean of 0: \(\delta_i \sim \mathcal N(0, \theta_{ii})\).
  2. Error terms \(\delta_1, \dots, \delta_p\) of each model \(i\) are uncorrelated with each other. This implies that correlations between \(Y_1, \dots, Y_p\) are entirely explained by the latent factor \(\xi\).
  3. Factor \(\xi\) is uncorrelated with the error term \(\delta_i\) (exogeneity).
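Taken together with \(\V \xi = 1\), these assumptions imply that the covariance between two different items is \(\lambda_i\lambda_j\) and the variance of item \(Y_i\) is \(\lambda_i^2 + \theta_{ii}\). The sketch below simulates from the one-factor model and compares the sample covariance matrix with this implied one. The parameter values are made up for illustration, and the simulation only demonstrates the model's implications; it is not an estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values for p = 3 items (not taken from the text).
tau = np.array([1.0, 2.0, 0.5])       # intercepts tau_i
lam = np.array([0.8, 0.6, 0.4])       # factor loadings lambda_i
theta = np.array([0.36, 0.64, 0.84])  # unique variances theta_ii

n = 200_000
xi = rng.normal(0.0, 1.0, size=n)                     # xi ~ N(0, 1)
delta = rng.normal(0.0, np.sqrt(theta), size=(n, 3))  # mutually uncorrelated errors
Y = tau + np.outer(xi, lam) + delta                   # Y_i = tau_i + lambda_i * xi + delta_i

# Covariance matrix implied by the model: lambda lambda' + diag(theta).
implied = np.outer(lam, lam) + np.diag(theta)
sample = np.cov(Y, rowvar=False)

print("implied covariance:\n", np.round(implied, 3))
print("sample covariance:\n", np.round(sample, 3))
```

The off-diagonal entries of the two matrices should agree up to sampling noise, which illustrates the second assumption: all correlation between items runs through \(\xi\).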

The estimation of this model involves maximum likelihood estimation:

We can interpret our factor analysis models in a few ways:


Item Response Theory

Latent variables \(\xi\) (also called factors) are variables that we cannot directly measure. However, these latent variables \(\xi\) can be measured through observed outcome variables \(Y_1, \dots, Y_p\), called items. Item response theory (IRT) models, also called latent trait models, assume that we have at least 3 binary observed items \(Y_1, \dots, Y_p\) and one continuous latent factor \(\xi\).

We assume the factor \(\xi\) to be normally distributed, \(\xi \sim \mathcal N(\kappa = 0, \phi = 1)\). We assume that the relationship between an observed item \(Y_i\) and the latent factor \(\xi\) takes the form of a binary logistic regression:

\[ \log\left(\frac{\pr_i(\xi)}{1 - \pr_i(\xi)}\right) = \tau_i + \lambda_i \xi,\quad \pr_i(\xi) = \P(Y_i = 1|\xi) \]

The intercept parameter \(\tau_i\) is known as the difficulty parameter. It is the log-odds of an item \(Y_i\) equalling 1 when the factor \(\xi = 0\). The coefficient \(\lambda_i\) is the factor loading, also known as the discrimination parameter. It describes how strongly the item \(Y_i\) is related to \(\xi\).

We can take the above equation, exponentiate both sides, and solve for \(\pr_i(\xi)\) to get:

\[ \P(Y_i = 1|\xi) = \pr_i(\xi) = \frac{e^{\tau_i + \lambda_i\xi}}{1+e^{\tau_i + \lambda_i\xi}} \]

This gives the fitted probability that \(Y_i = 1\) at each value of \(\xi\). Plotted against \(\xi\), these fitted probabilities are called item response curves.
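As a small illustration of the formula for \(\pr_i(\xi)\), the sketch below evaluates item response curves on a grid of \(\xi\) values. The two items and their parameter values are hypothetical, and the function name `item_response_probability` is ours rather than part of any library.

```python
import math


def item_response_probability(xi: float, tau: float, lam: float) -> float:
    """P(Y_i = 1 | xi) when the log-odds equal tau + lam * xi."""
    # 1 / (1 + e^{-z}) is algebraically the same as e^{z} / (1 + e^{z}).
    return 1.0 / (1.0 + math.exp(-(tau + lam * xi)))


# Hypothetical items, given as (tau_i, lambda_i).
items = {"item 1": (0.0, 1.0), "item 2": (-1.0, 2.0)}
grid = (-2.0, -1.0, 0.0, 1.0, 2.0)

for name, (tau, lam) in items.items():
    curve = [item_response_probability(xi, tau, lam) for xi in grid]
    print(name, [round(p, 3) for p in curve])
```

An item with a larger \(\lambda_i\) has a steeper curve, so it separates respondents with low and high \(\xi\) more sharply.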

IRT can also be applied to items with three or more categories, although this is quite rare. In that case, the binary logistic model is replaced by an ordinal logistic regression model (with cumulative probabilities) or a multinomial logistic regression model.

The estimation of this model involves maximum likelihood estimation:

We can interpret our item response theory model in a few ways:


Latent Class Models

Latent variables \(\xi\) (also called factors) are variables that we cannot directly measure. However, these latent variables \(\xi\) can be measured through observed outcome variables \(Y_1, \dots, Y_p\), called items. Latent Class Models assume that we have at least 3 categorical observed items \(Y_1, \dots, Y_p\). We have one categorical latent factor \(\xi\).

We assume the items \(Y_1, \dots, Y_p\) are observed categorical items, with each item \(Y_i\) having \(K_i\) categories. Let the factor \(\xi\) be categorical with \(C\) categories/classes, where \(C\) is chosen by the user.

Our parameter of interest is the item response probability, which is the probability that item \(Y_i\) equals category \(k\), given that \(\xi\) equals category \(c\):

\[ \pr_{ikc} = \P(Y_i = k|\xi = c) \]

The model that describes the relationship between item \(Y_i\) and factor \(\xi\) is given by:

\[ \log\left(\frac{\pr_{ikc}}{\pr_{i1c}}\right) = \tau_{ik} + \sum\limits_{d=2}^C \lambda_{ikd}D_d \]

where \(\lambda_{ik1} = 0\), and \(D_d\) are dummy variables for latent classes/categories \(d = 2, \dots, C\) of \(\xi\). Unlike in factor analysis and item response theory, we cannot assume \(\xi\) is normally distributed, since it is categorical. Instead, we assume \(\xi\) is categorical with probabilities \(\alpha_c = \P(\xi = c)\).
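To see how the parameters map to item response probabilities, note that for class \(c\) the log-odds of category \(k\) against category 1 are \(\tau_{ik} + \lambda_{ikc}\), and that the baseline category \(k = 1\) has log-odds 0 against itself. The sketch below inverts the model for a single item. The parameter values are invented for illustration, and the function name and nested-list layout are assumptions of this example.

```python
import math


def item_response_probabilities(tau, lam):
    """Return probs[c][k] = P(Y_i = k + 1 | xi = c + 1) for one item.

    tau: list of K intercepts, with tau[0] = 0 for the baseline category k = 1.
    lam: K x C nested list lam[k][c]; row 0 (baseline category) and
         column 0 (latent class 1) are zero.
    """
    n_categories, n_classes = len(tau), len(lam[0])
    probs = []
    for c in range(n_classes):
        # Log-odds of each category against category 1, within class c.
        eta = [tau[k] + lam[k][c] for k in range(n_categories)]
        denom = sum(math.exp(e) for e in eta)
        probs.append([math.exp(e) / denom for e in eta])
    return probs


# Hypothetical item with K = 3 categories and C = 2 latent classes.
tau = [0.0, 0.5, -0.5]
lam = [[0.0, 0.0],
       [0.0, 1.0],
       [0.0, 2.0]]

pi = item_response_probabilities(tau, lam)
for c, row in enumerate(pi, start=1):
    print(f"class {c}:", [round(p, 3) for p in row])
```

Within each class the probabilities sum to 1 across categories, and comparing rows shows how the classes differ in their typical responses to the item.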

We can interpret our latent class models in a few ways:


Structural Equation Modelling