Quasi-Experimental Methods

In chapter 5, we introduced the frameworks of causal inference and showed how randomisation can establish causality.

However, in the social sciences, we cannot always run randomised experiments where we control the assignment of treatment. In fact, in most scenarios, we have to rely on observational data. In this chapter, we introduce methods to identify causal effects when randomisation is not possible.


Overview

In chapter 5, we discussed how randomised experiments can establish causality. However, in the social sciences, randomised experiments where researchers control treatment assignment are not always possible to implement. Sometimes even with randomisation, something goes wrong, and we need another way to establish causality.

As a result, a series of quasi-experimental designs has been developed to estimate causal effects. These designs vary in credibility, and each can generally only be implemented in scenarios where the real world aligns with the specific design. The main designs, in order of credibility, are:

| Design | When to Use | Estimands |
| --- | --- | --- |
| Non-Compliance Designs | When we have a randomised experiment, but some individuals do not comply with the treatment assigned to them. | LATE, ITT |
| Regression Discontinuity | When treatment is assigned in the real world by a cutoff value of some variable. | LATE |
| Examiner Instruments | When treatment assignment is influenced by the quasi-random assignment of individuals to decision makers (examiners), such as judges, doctors, or caseworkers. | LATE |
| Shift-Share Instruments | When treatment is assigned based on some exposure (share) to some exogenous/random shock (shift). | LATE |
| Differences-in-Differences | When the timing of treatment implementation varies between areas/units. | ATT |
| Selection on Observables | When treatment is believed to be assigned based on a set of variables (covariates) that we can observe. | ATE, ATT |

We will explore each of these designs in more detail below.


Non-Compliance Designs

When we assign individuals to treatment/control in randomised experiments, we often cannot guarantee that individuals will actually follow through with treatment. Let us assume an encouragement \(Z_i \in \{0, 1\}\), which is our treatment assignment. Then, we have the treatment variable \(D_i \in \{0,1\}\), which records whether unit \(i\) actually took the treatment. Given this framework, we can divide all units \(i\) into 4 categories:

  1. Compliers: People who comply with their encouragement \(Z_i\), so \(D_i = Z_i\).
  2. Always-takers: People who always take the treatment, no matter what their encouragement \(Z_i\) is.
  3. Never-takers: People who never take the treatment, no matter what their encouragement \(Z_i\) is.
  4. Defiers: People who do the opposite of their encouragement \(Z_i\), so always \(D_i ≠ Z_i\).

We can show which of the 4 types of people are consistent with each observed combination of \(Z_i\) and \(D_i\) in a table, called the principal strata:

| | \(Z_i = 1\) | \(Z_i = 0\) |
| --- | --- | --- |
| \(D_i = 1\) | Complier/Always-Taker | Defier/Always-Taker |
| \(D_i = 0\) | Defier/Never-Taker | Complier/Never-Taker |

The idea of the non-compliance designs is to use our encouragement/treatment assignment \(Z\) as an instrument for \(D\) - actually taking the treatment.
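With a binary encouragement and a binary treatment, the instrumental-variables estimand takes a simple ratio form. As a sketch, writing \(Y_i\) for the observed outcome and \(\tau_{\text{LATE}}\) for the estimand (this notation is assumed here, not taken from earlier chapters), and assuming the usual instrument conditions hold with no defiers:

\[ \tau_{\text{LATE}} = \frac{\E(Y_i \mid Z_i = 1) - \E(Y_i \mid Z_i = 0)}{\E(D_i \mid Z_i = 1) - \E(D_i \mid Z_i = 0)} = \frac{\text{ITT}}{\text{share of compliers}} \]

The numerator is the effect of being assigned to treatment (the ITT), and the denominator is the effect of assignment on take-up, which equals the share of compliers when there are no defiers.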

To estimate the LATE in a non-compliance design, we typically use the 2-stage least squares estimator, as was detailed previously.

late <- fixest::feols(Y ~ 1 | D ~ Z, data = mydata, se = "hetero")

The 2SLS estimator (like the IV estimator more generally) is biased in small samples but consistent as the sample grows, so we should be more careful when dealing with small samples.
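To make the logic concrete, the following is a minimal simulation sketch in base R. Every number and variable name in it is made up for illustration: 60% compliers, 20% always-takers, 20% never-takers, no defiers, and a treatment effect of 2. With one binary instrument and no covariates, dividing the ITT by the first stage gives the same point estimate as 2SLS.

```r
# Simulated encouragement design (all names and numbers are illustrative).
set.seed(1)
n <- 20000

# Principal strata: 60% compliers, 20% always-takers, 20% never-takers.
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.6, 0.2, 0.2))

Z <- rbinom(n, 1, 0.5)                       # randomised encouragement
D <- ifelse(type == "always", 1L,
     ifelse(type == "never", 0L, Z))         # compliers follow Z

# The effect of D is 2. Always-takers have a higher baseline outcome,
# which is what biases a naive treated-vs-untreated comparison.
Y <- 1 + 2 * D + 1.5 * (type == "always") + rnorm(n)

itt         <- mean(Y[Z == 1]) - mean(Y[Z == 0])   # effect of assignment
first_stage <- mean(D[Z == 1]) - mean(D[Z == 0])   # estimated share of compliers
wald        <- itt / first_stage                   # LATE estimate

naive <- mean(Y[D == 1]) - mean(Y[D == 0])         # confounded by type
round(c(ITT = itt, first_stage = first_stage, LATE = wald, naive = naive), 2)
```

In this setup the ratio should land close to 2, while the naive comparison overstates the effect because always-takers are over-represented among the treated.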

When interpreting the LATE, we must be careful. The LATE is the causal effect of taking the treatment for compliers only. It tells us nothing about non-compliers, so we must be careful about generalising. We also generally cannot identify who the compliers are, and a different instrument \(Z\) can produce a different set of compliers.


Regression Discontinuity

Regression discontinuity designs are used when treatment is assigned based on some cutoff. For example, perhaps students only get scholarships if they score above a certain threshold, or people become eligible for a benefit once they pass a certain age.

We have some treatment \(D_t\). There is some forcing variable \(X_t\) that perfectly determines \(D_t\) at some cutoff point \(X_t = c\).

\[ D_t = \begin{cases} 1 & \text{if } X_t > c \\ 0 & \text{if } X_t ≤ c \end{cases} \]

The idea is that right below and above the cutoff, individuals \(t\) are very similar. Thus, we have quasi-random variation and comparable treatment-control groups at the cutoff which we can use to find the treatment effect at the cutoff point.

To estimate a regression discontinuity, we model the potential outcomes \(\pt\) and \(\pc\) on either side of the cutoff, and find the “discontinuity” (the jump in the expected outcome) at the cutoff.
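One common way to do this is a local linear regression: keep only observations within some bandwidth of the cutoff and fit a separate line on each side. The sketch below uses simulated data in base R. The bandwidth `h`, the true jump of 3, and all variable names are assumptions for illustration. In applied work the bandwidth is usually chosen by a data-driven procedure rather than by hand.

```r
# Simulated sharp regression discontinuity (illustrative values only).
set.seed(2)
n      <- 5000
cutoff <- 0                               # the cutoff c
X <- runif(n, -1, 1)                      # forcing variable
D <- as.integer(X > cutoff)               # treatment is determined by the cutoff
Y <- 0.5 + 1.2 * X + 3 * D + rnorm(n)     # true jump at the cutoff is 3

# Local linear regression: keep observations within bandwidth h of the cutoff
# and allow a different slope on each side. The coefficient on D is the jump.
h     <- 0.5                              # ad hoc bandwidth, for illustration
local <- data.frame(Y, D, Xc = X - cutoff)[abs(X - cutoff) <= h, ]
fit   <- lm(Y ~ D * Xc, data = local)
rd_estimate <- unname(coef(fit)["D"])
rd_estimate
```

Centring the forcing variable at the cutoff (`Xc`) is what makes the coefficient on `D` the estimated jump exactly at \(X_t = c\).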

An extension to the regression discontinuity design is the fuzzy regression discontinuity design, used when treatment is assigned based on some cutoff but there is some non-compliance. People above the cutoff are more likely to be treated than people below it, but compliance is not perfect.
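In the fuzzy case, the cutoff acts like an encouragement, so the estimand is again a ratio: the jump in the outcome at the cutoff divided by the jump in the probability of treatment at the cutoff. As a sketch, writing \(Y_t\) for the observed outcome and \(\tau_{\text{FRD}}\) for the estimand (both notations are assumed here):

\[ \tau_{\text{FRD}} = \frac{\lim_{x \downarrow c} \E(Y_t \mid X_t = x) - \lim_{x \uparrow c} \E(Y_t \mid X_t = x)}{\lim_{x \downarrow c} \E(D_t \mid X_t = x) - \lim_{x \uparrow c} \E(D_t \mid X_t = x)} \]

Under the usual continuity and monotonicity assumptions, this is typically interpreted as a LATE for compliers at the cutoff. In the sharp design the denominator equals 1, and the expression reduces to the jump in the outcome.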


Examiner Designs

Examiner designs are used in settings where individuals \(t\) are assigned to evaluators/examiners, who have some discretion in assigning treatment.

The classic example is judges and sentencing. We want to study the effect of incarceration on an outcome. Individuals prosecuted for a crime are first randomly assigned to courtrooms, and those courtrooms decide whether these individuals will be incarcerated. However, courtrooms differ in their propensity to incarcerate defendants. Other common set-ups include asylum decisions assigned to officers, healthcare diagnoses assigned to doctors, and so on.

More generally, we have \(n\) units, and \(K\) examiners \(1, \dots, K\) who have control over treatment status \(D_t\). Each unit \(t\) is assigned to an examiner \(k\) in a known way. The examiner to which unit \(t\) is assigned is recorded in a categorical variable \(Q_t\), with \(Q_t = k\) if unit \(t\) is assigned to examiner \(k\).

Our assumptions for the examiner design are as follows:

  1. Relevance: Examiners \(k\) should differ in their propensity to assign treatment \(D\).
  2. Exogeneity/Ignorability: Assignment to examiners should be as-if random. There should be no backdoor paths between assignment to examiner and \(Y\).
  3. Exclusion Restriction: There is no direct relationship between assignment to examiner and \(Y\) that does not run through \(D\). Violations of the exclusion restriction can actually be tolerated, as long as they occur randomly.
  4. Monotonicity: Examiner behaviour must be ordered. This means that if examiner \(k\) has some tendency, they should apply it to all subjects \(t\). For example, if \(k\) is more likely than another examiner to assign \(D\), they must be more likely to do so for every unit \(t\) (this is an issue if an examiner has racial or gender biases, for example).

If these assumptions are met, we can use two estimation methods for our instrument, which is the propensity of examiner \(k\) to assign treatment \(D_t = 1\).
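One widely used way to construct such a propensity instrument is the leave-one-out mean: for each unit, the treatment rate of its examiner across all of that examiner's other cases. It is shown here as an illustrative sketch only, and is not necessarily one of the two methods referred to above. The simulation is in base R, and all names and numbers (50 examiners, a true effect of -1.5) are assumptions.

```r
# Simulated examiner design (illustrative values only).
set.seed(3)
n <- 20000
K <- 50
Q <- sample(K, n, replace = TRUE)            # as-if random assignment to examiners
leniency <- runif(K, 0.2, 0.8)               # each examiner's propensity to treat
U <- rnorm(n)                                # unobserved confounder

# Threshold rule: every examiner ranks cases by U in the same way and only
# the threshold differs, so monotonicity holds by construction.
D <- as.integer(pnorm(U) < leniency[Q])
Y <- 1 - 1.5 * D + U + rnorm(n)              # true effect of D is -1.5

# Leave-one-out propensity: the examiner's treatment rate excluding unit t.
sum_D <- ave(D, Q, FUN = sum)
n_Q   <- ave(D, Q, FUN = length)
Z_loo <- (sum_D - D) / (n_Q - 1)

# IV estimate with a single instrument: cov(Y, Z) / cov(D, Z).
iv  <- cov(Y, Z_loo) / cov(D, Z_loo)
ols <- unname(coef(lm(Y ~ D))["D"])          # biased by U
c(IV = iv, OLS = ols)
```

Leaving out the unit's own decision avoids a mechanical correlation between the instrument and that unit's unobservables. In this simulation the IV estimate should sit near -1.5, while OLS is pushed further from zero by the confounder.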


Shift-Share Instruments

Shift-share instruments, also called Bartik instruments, are a way to explore the impacts of exposure (shares) to exogenous, as-if random shocks (shifts). Let us say we are interested in the effect of some variable \(X\) on an outcome \(Y\) in some city/region/market \(\ell\).

\[ Y_\ell = \alpha + \beta X_\ell + \eps_\ell \]

Here we assume there is endogeneity (confounding), such that \(\E(X_\ell \eps_\ell) ≠ 0\). Now, imagine we have a second dimension of observations called types \(k = 1, \dots, K\). That means every area/region \(\ell\) can be observed for all types \(k\), so we have \(Y_{\ell k}, X_{\ell k}\). For example, \(Y_{\ell k}\) could be the observed value in city \(\ell\) for industry/group \(k\).

Given \(Y_{\ell k}, X_{\ell k}\), we have two potential sources of variation:

  1. Shares: A unit- and type-varying variable \(Z_{\ell k}\), which varies across both types \(k\) and cities/regions \(\ell\).
  2. Shifts: A type-varying change variable \(G_k\), which varies across types \(k\) but applies to all cities/regions \(\ell\) equally. This is a shock that affects all cities/regions \(\ell\).

Interacting the shifts and shares and summing over types gives \(S_\ell = \sum_k Z_{\ell k} \cdot G_k\), called a shift-share, which only varies at the city/region level \(\ell\). Each share measures how exposed region \(\ell\) is to the shift for type \(k\). Our idea is to use \(S_\ell\) to instrument for \(X_\ell\) from the original endogenous model.
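As a toy illustration of the construction, the shift-share is just a matrix of shares multiplied by a vector of shifts. The three regions, three types, and all numbers below are made up.

```r
# Toy shift-share construction (all numbers are made up for illustration).
# Rows are regions l, columns are types k.
Z <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.1, 0.8,
              0.4, 0.4, 0.2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("region_1", "region_2", "region_3"),
                            c("type_1", "type_2", "type_3")))
G <- c(type_1 = 10, type_2 = -5, type_3 = 2)   # shifts, common to all regions

# S_l = sum_k Z_lk * G_k, one value per region.
S <- drop(Z %*% G)
S
```

Here `S` is 3.9 for region_1, 2.1 for region_2, and 2.4 for region_3. Region 1 gets the largest value because half of its exposure is to the type with the largest shock.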

Fouka, Mazumder, and Tabellini (2022) want to study how the migration of new minority groups affects the social position of pre-existing minority groups. Specifically, they want to study how the Great Migration of Black Americans from the south to northern states affected white Americans’ views of previous European immigrants.

In this example, \(\ell\) indexes cities, and \(k\) indexes the southern states in which Black migrants were born. The shares \(Z_{\ell k}\) are the number of Black migrants born in state \(k\) living in city \(\ell\) prior to the shock. The shift (shock) \(G_k\) is the number of Black people born in state \(k\) who left that state during the Great Migration.

The shift-share \(S_\ell\) is the exposure \(Z_{\ell k}\), the pre-shock number of Black migrants from state \(k\) living in city \(\ell\), multiplied by \(G_k\), the number of Black people who left state \(k\), and summed over all states \(k\). The expectation is that Black migrants are more likely to move to a city \(\ell\) that already has a strong Black community from their home state \(k\).

This \(S_\ell\) captures the predicted level of Black migration into a metropolitan statistical area (MSA), based on the timing of out-migration \(G_k\) from different states \(k\). Thus, \(S_\ell\) (as the predicted number of Black migrants) should be correlated with \(X_\ell\), the number of Black migrants in city \(\ell\).

There are two perspectives on the assumptions for shift-share designs: making assumptions on the shares (which is more common when there are fewer types \(k\)), or making assumptions on the shifts.

To estimate, we use the two-stage-least-squares estimator, as was detailed previously.

late <- fixest::feols(Y ~ 1 | X ~ S, data = mydata, se = "hetero")

The 2SLS estimator (like the IV estimator more generally) is biased in small samples but consistent as the sample grows, so we should be more careful when dealing with small samples.


Differences-in-Differences


Selection on Observables

Assuming we meet the assumptions, there are multiple estimators, including regression, matching, propensity score matching, and weighting.
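As an illustrative sketch of two of these estimators, the base R simulation below compares a naive difference in means with regression adjustment and inverse probability weighting. It assumes a single observed confounder `X`, a correctly specified model, and a true effect of 2. All names and numbers are made up.

```r
# Simulated selection on observables (illustrative values only).
set.seed(4)
n <- 20000
X <- rnorm(n)                                   # observed confounder
D <- rbinom(n, 1, plogis(0.8 * X))              # treatment depends on X only
Y <- 1 + 2 * D + 1.5 * X + rnorm(n)             # true effect of D is 2

naive <- mean(Y[D == 1]) - mean(Y[D == 0])      # confounded by X

# Regression adjustment: control for X linearly.
reg <- unname(coef(lm(Y ~ D + X))["D"])

# Inverse probability weighting with an estimated propensity score.
ps  <- fitted(glm(D ~ X, family = binomial))
ipw <- weighted.mean(Y[D == 1], 1 / ps[D == 1]) -
       weighted.mean(Y[D == 0], 1 / (1 - ps[D == 0]))

round(c(naive = naive, regression = reg, IPW = ipw), 2)
```

Both adjusted estimates should be close to 2 here, but only because the simulation guarantees that `X` is the sole confounder and that it is observed. With real data that condition cannot be checked directly.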

We should generally be careful with selection on observables, as it is considered to be the least robust of the quasi-experimental designs. This is particularly the case in the social sciences, where there are often many unobserved confounders that are nearly impossible to control for.