
Introduction

This course introduces students to basic microeconometric methods. The objective is to learn how to make and evaluate causal claims. By the end of the course, students should be able to apply each of the methods discussed and critically evaluate research based on them.

This first lecture covers some basic features of the course. We discuss the core references, the tooling, and the student projects, and we illustrate the basics of the potential outcome model and causal graphs.

Causal questions

What is the causal effect of …

neighborhood of residence on educational performance, deviance, and youth development?
school vouchers on learning?
charter schools on learning?
worker training on earnings?
…

What causal question brought you here?

Core reference

The whole course is built on the following textbook:

Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, England: Cambridge University Press.

This is a rather non-standard textbook in economics. However, I very much enjoy working with it as it provides a coherent conceptual framework for a host of different methods for causal analysis. It then clearly delineates the special cases that allow the application of particular methods. We will follow their lead and structure our thinking around the counterfactual approach to causal analysis and its two key ingredients: the potential outcome model and directed graphs.

It is also one of the few textbooks that include extensive simulation studies to convey the economic assumptions required to apply certain estimation strategies.

It is not very technical at all, so we will also need to draw on more conventional resources to fill this gap.

Wooldridge, J. M. (2001). *Econometric analysis of cross section and panel data*. Cambridge, MA: The MIT Press.
Angrist, J. D., & Pischke, J.-S. (2009). *Mostly harmless econometrics: An empiricist's companion*. Princeton, NJ: Princeton University Press.
Frölich, M., and Sperlich, S. (2019). *Impact evaluation: Treatment effects and causal analysis*. Cambridge, England: Cambridge University Press.

Focusing on the conceptual framework as much as we do in the class has its cost. We might not get to discuss all the approaches you might be particularly interested in. However, my goal is that all of you can draw on this framework later on to think about your econometric problem in a structured way. This then enables you to choose the right approach for the analysis and study it in more detail on your own.


Combining this counterfactual approach to causal analysis with sufficient domain-expertise will allow you to leave the valley of despair.

Lectures

We follow the general structure of Morgan & Winship (2007).

Counterfactuals, potential outcomes and causal graphs
Estimating causal effects by conditioning on observables
- regression, matching, …
Estimating causal effects by other means
- instrumental variables, mechanism-based estimation, regression discontinuity design, …

Tooling

We will use open-source software and some of the tools building on it extensively throughout the course.

We will briefly discuss each of these components over the next week. By the end of the term, you will hopefully have a good sense of how we combine all of them to produce sound empirical research. Transparency and reproducibility are the absolute minimum of sound data science, and both can be achieved using the kind of tools we use in class.

Compared to other classes on the topic, we will do quite a bit of programming in class, and I think I have a good reason for doing so. From my own experience in learning and teaching the material, there is no better way to understand the potential and limitations of the approaches we discuss than to implement them in a simulation setup where we have full control of the underlying data generating process.

To cite Richard Feynman: What I cannot create, I do not understand.
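
As a minimal sketch of what such a simulation can look like, the example below generates both potential outcomes for every individual, so the true average effect is known by construction, and then compares it with the naive difference in observed means under random assignment. All parameter values (sample size, an effect of 1,000, the noise scales) and the function name `simulate_experiment` are illustrative assumptions, not part of the course material.

```python
import numpy as np


def simulate_experiment(num_agents=10_000, effect=1_000.0, seed=123):
    """Simulate potential outcomes with random assignment (illustrative values)."""
    rng = np.random.default_rng(seed)

    # Both potential outcomes exist for everyone inside the simulation.
    y_0 = rng.normal(loc=5_000.0, scale=2_000.0, size=num_agents)
    y_1 = y_0 + effect + rng.normal(scale=500.0, size=num_agents)

    # Treatment is assigned by a coin flip, independent of (y_0, y_1).
    d = rng.integers(0, 2, size=num_agents)

    # Only one potential outcome is observed per individual.
    y = np.where(d == 1, y_1, y_0)

    true_ate = np.mean(y_1 - y_0)
    naive = y[d == 1].mean() - y[d == 0].mean()
    return true_ate, naive


true_ate, naive = simulate_experiment()
print(f"true ATE: {true_ate:.2f}, naive difference in means: {naive:.2f}")
```

Because assignment is random here, the two numbers should be close. Replacing the coin flip with an assignment rule that depends on `y_0` or `y_1` is one way to see the naive comparison break down.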

However, students often differ widely in their prior programming experience, and some feel intimidated by the need not only to learn the material we discuss in class but also to catch up on the programming. To address this valid concern, we started several accompanying initiatives that will get you up to speed, such as additional workshops, help desks, etc. Make sure to join our Q&A channels in Zulip and attend our Computing Primer.

Problem sets

Thanks to Mila Kiseleva, Tim Mensinger, and Sebastian Gsell we now have four problem sets available on our website.

Potential outcome model
Matching
Regression-discontinuity design
Generalized Roy model

Just like the course as a whole, they require you not only to further digest the material but also to do some programming. They are available on our course website and we will discuss them in due course.

Projects

Applying methods from data science and understanding their potential and limitations is only possible when bringing them to bear on one’s own research project. So we will work on student projects during the course. More details are available here.

Data sources

Throughout the course, we will use several data sets that commonly serve as teaching examples. We collected them from several textbooks, and they are available in a central place in our online repository here.

Potential outcome model

The potential outcome model serves several purposes:

help stipulate assumptions
evaluate alternative data analysis techniques
think carefully about the process of causal exposure

Basic setup

There are four simple variables:

$D$, treatment
$Y$, observed outcome
$Y_1$, outcome in the treatment state
$Y_0$, outcome in the no-treatment state
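
The observed outcome is linked to the two potential outcomes through treatment status. In the standard notation of the potential outcome model, this can be written as follows, where the symbol $\delta$ for the individual-level effect is introduced here for illustration:

```latex
\begin{align*}
Y &= D \, Y_1 + (1 - D) \, Y_0 \\
\delta &= Y_1 - Y_0
\end{align*}
```

For any individual we observe $Y_1$ if $D = 1$ and $Y_0$ if $D = 0$, so $\delta$ is never observed directly.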

Examples

economics of education
health economics
industrial organization
$...$

Exploration

We will use our first dataset to illustrate the basic problems of causal analysis. We will use the original data from the article below:

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4), 604-620.

He summarizes the basic setup as follows:

The National Supported Work Demonstration (NSW) was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. Unlike other federally sponsored employment programs, the NSW program assigned qualified applicants randomly. Those assigned to the treatment group received all the benefits of the NSW program, while those assigned to the control group were left to fend for themselves.

What is the effect of the program?

We will have a quick look at a subset of the data to illustrate the fundamental problem of evaluation, i.e. we only observe one of the potential outcomes depending on the treatment status but never both.

[1]:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# We collected a host of data from two other influential textbooks.
df = pd.read_csv("../../datasets/processed/dehejia_waba/nsw_lalonde.csv")
df.index.set_names("Individual", inplace=True)

[2]:

df.describe()

[2]:

	treat	age	education	black	hispanic	married	nodegree	re75	re78
count	722.000000	722.000000	722.000000	722.000000	722.000000	722.000000	722.000000	722.000000	722.000000
mean	0.411357	24.520776	10.267313	0.800554	0.105263	0.162050	0.779778	3042.896575	5454.635848
std	0.492421	6.625947	1.704774	0.399861	0.307105	0.368752	0.414683	5066.143366	6252.943422
min	0.000000	17.000000	3.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	19.000000	9.000000	1.000000	0.000000	0.000000	1.000000	0.000000	0.000000
50%	0.000000	23.000000	10.000000	1.000000	0.000000	0.000000	1.000000	936.307950	3951.889000
75%	1.000000	27.000000	11.000000	1.000000	0.000000	0.000000	1.000000	3993.207000	8772.004250
max	1.000000	55.000000	16.000000	1.000000	1.000000	1.000000	1.000000	37431.660000	60307.930000

[3]:

# It is important to check for missing values first.
for column in df.columns:
    assert not df[column].isna().any()

Note that this lecture, just like all other lectures, is available online as an interactive notebook, so you can easily continue working on it and take your exploration in another direction.

There are numerous discrete variables in this dataset describing the individual’s background. What do their distributions look like?

[4]:

columns_background = [
    "treat",
    "age",
    "education",
    "black",
    "hispanic",
    "married",
    "nodegree",
]
for column in columns_background:
    sns.countplot(x=df[column], color="#1f77b4")
    plt.show()

../../_images/lectures_introduction_notebook_21_0.png

../../_images/lectures_introduction_notebook_21_1.png

../../_images/lectures_introduction_notebook_21_2.png

../../_images/lectures_introduction_notebook_21_3.png

../../_images/lectures_introduction_notebook_21_4.png

../../_images/lectures_introduction_notebook_21_5.png

../../_images/lectures_introduction_notebook_21_6.png

How about the continuous earnings variables?

[5]:

columns_outcome = ["re75", "re78"]
for column in columns_outcome:

    earnings = df[column]

    # We drop all earnings at zero.
    earnings = earnings.loc[earnings > 0]

    ax = sns.histplot(earnings)
    ax.set_xlim([0, None])

    plt.show()

../../_images/lectures_introduction_notebook_23_0.png

../../_images/lectures_introduction_notebook_23_1.png

We work under the assumption that the data is generated by an experiment. Let’s make sure by checking the distribution of the background variables by treatment status.

[6]:

info = ["count", "mean", "std"]
for column in columns_background:
    print("\n\n", column.capitalize())
    print(df.groupby("treat")[column].describe()[info])



 Treat
       count  mean  std
treat
0      425.0   0.0  0.0
1      297.0   1.0  0.0


 Age
       count       mean       std
treat
0      425.0  24.447059  6.590276
1      297.0  24.626263  6.686391


 Education
       count       mean       std
treat
0      425.0  10.188235  1.618686
1      297.0  10.380471  1.817712


 Black
       count      mean       std
treat
0      425.0  0.800000  0.400471
1      297.0  0.801347  0.399660


 Hispanic
       count      mean       std
treat
0      425.0  0.112941  0.316894
1      297.0  0.094276  0.292706


 Married
       count      mean       std
treat
0      425.0  0.157647  0.364839
1      297.0  0.168350  0.374808


 Nodegree
       count      mean       std
treat
0      425.0  0.814118  0.389470
1      297.0  0.730640  0.444376
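
Raw group means are hard to compare across variables with different scales. One common complement, sketched here as an assumption about how one might extend the check rather than as part of the original analysis, is the standardized mean difference: the difference in group means divided by the pooled standard deviation. The helper name `standardized_mean_difference` is hypothetical, and the example plugs in the group statistics for `age` and `nodegree` printed above.

```python
import numpy as np


def standardized_mean_difference(mean_treated, mean_control, std_treated, std_control):
    """Difference in means scaled by the pooled standard deviation."""
    pooled_std = np.sqrt((std_treated**2 + std_control**2) / 2)
    return (mean_treated - mean_control) / pooled_std


# Group statistics copied from the output above, ordered as
# (mean treated, mean control, std treated, std control).
stats = {
    "age": (24.626263, 24.447059, 6.686391, 6.590276),
    "nodegree": (0.730640, 0.814118, 0.444376, 0.389470),
}

smd = {name: standardized_mean_difference(*values) for name, values in stats.items()}
for name, value in smd.items():
    print(f"{name:>10}: {value: .3f}")
```

With these inputs the imbalance in `nodegree` is visibly larger than the one in `age`. What size counts as a concern is a matter of convention, and thresholds quoted in the literature are rules of thumb.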

What is the data that corresponds to $(Y, Y_1, Y_0, D)$?

[7]:

# We first create a boolean indicator for treatment status.
is_treated = df["treat"] == 1

df["Y"] = df["re78"]
df["Y_0"] = df.loc[~is_treated, "re78"]
df["Y_1"] = df.loc[is_treated, "re78"]

df["D"] = np.nan
df.loc[~is_treated, "D"] = 0
df.loc[is_treated, "D"] = 1

df[["Y", "Y_1", "Y_0", "D"]].sample(10)

[7]:

	Y	Y_1	Y_0	D
Individual
479	6930.336	NaN	6930.336	0.0
94	3881.284	3881.284	NaN	1.0
146	3075.862	3075.862	NaN	1.0
407	20893.110	NaN	20893.110	0.0
269	12590.710	12590.710	NaN	1.0
8	2164.022	2164.022	NaN	1.0
592	0.000	NaN	0.000	0.0
260	0.000	0.000	NaN	1.0
421	3931.238	NaN	3931.238	0.0
35	0.000	0.000	NaN	1.0
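
The same construction can be written more compactly with `Series.where`, which keeps a value where a condition holds and inserts `NaN` elsewhere. The sketch below uses a small toy frame with made-up earnings instead of the NSW data, so the name `toy` and its values are purely illustrative. It also checks that exactly one potential outcome is observed per individual.

```python
import pandas as pd

# Toy data with made-up earnings; the notebook itself uses `re78` and `treat`.
toy = pd.DataFrame({"treat": [1, 0, 1, 0], "re78": [3881.0, 6930.0, 0.0, 2164.0]})

toy["D"] = toy["treat"].astype(float)
toy["Y"] = toy["re78"]

# `where` keeps values where the condition holds and inserts NaN elsewhere.
toy["Y_1"] = toy["Y"].where(toy["D"] == 1)
toy["Y_0"] = toy["Y"].where(toy["D"] == 0)

# Exactly one potential outcome is observed for each individual.
observed_count = toy[["Y_1", "Y_0"]].notna().sum(axis=1)
assert (observed_count == 1).all()

# The observed outcome equals whichever potential outcome is available.
reconstructed = toy["Y_1"].fillna(toy["Y_0"])
assert (reconstructed == toy["Y"]).all()

print(toy[["Y", "Y_1", "Y_0", "D"]])
```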

Let us get a basic impression of what the distribution of earnings looks like by treatment status.

[8]:

df.groupby("D")["re78"].describe()

[8]:

	count	mean	std	min	25%	50%	75%	max
D
0.0	425.0	5090.048302	5718.088763	0.0	0.0000	3746.701	8329.823	39483.53
1.0	297.0	5976.352033	6923.796427	0.0	549.2984	4232.309	9381.295	60307.93

[9]:

ax = sns.histplot(df.loc[~is_treated, "Y"], label="untreated")
ax = sns.histplot(df.loc[is_treated, "Y"], label="treated")
ax.set_xlim(0, None)
ax.legend()

[9]:

<matplotlib.legend.Legend at 0x7fec7859b0d0>

../../_images/lectures_introduction_notebook_30_1.png

We are now ready to reproduce one of the key findings from this article. What is the difference in earnings in 1978 between those who participated in the program and those who did not?

[10]:

stat = df.loc[is_treated, "Y"].mean() - df.loc[~is_treated, "Y"].mean()
f"{stat:.2f}"

[10]:

'886.30'

Earnings are $886.30 higher among those who participated in the treatment than among those who did not. Can we say even more?
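
One thing we can add is a sense of sampling uncertainty. The sketch below uses only the group counts, means, and standard deviations of `re78` reported in the summary table above. It computes a standard error for the difference in means that allows for unequal variances, along with an approximate 95% interval based on the normal critical value 1.96. This is an illustrative back-of-the-envelope calculation, not the procedure used in the article.

```python
import math

# Group statistics for re78 copied from the table above (D = 1 and D = 0).
n_1, mean_1, std_1 = 297, 5976.352033, 6923.796427
n_0, mean_0, std_0 = 425, 5090.048302, 5718.088763

difference = mean_1 - mean_0

# Standard error of a difference in means with unequal group variances.
standard_error = math.sqrt(std_1**2 / n_1 + std_0**2 / n_0)

# Approximate 95% interval using the normal critical value, which is an
# assumption that relies on the fairly large group sizes.
lower = difference - 1.96 * standard_error
upper = difference + 1.96 * standard_error

print(f"difference: {difference:.2f}")
print(f"standard error: {standard_error:.2f}")
print(f"approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```

With these inputs the interval includes zero, so the point estimate alone does not settle whether the program raised earnings in this subset of the data.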

References

Here are some further references for the potential outcome model.

Heckman, J. J., and Vytlacil, E. J. (2007a). *Econometric evaluation of social programs, part I: Causal effects, structural models and econometric policy evaluation*. In J. J. Heckman, and E. E. Leamer (Eds.), Handbook of Econometrics (Vol. 6B, pp. 4779–4874). Amsterdam, Netherlands: Elsevier Science.
Imbens, G. W., and Rubin, D. B. (2015). *Causal inference for statistics, social, and biomedical sciences: An introduction*. Cambridge, England: Cambridge University Press.
Rosenbaum, P. R. (2017). *Observation and experiment: An introduction to causal inference*. Cambridge, MA: Harvard University Press.

Causal graphs

One unique feature of our core textbook is the heavy use of causal graphs to investigate and assess the validity of different estimation strategies. There are three general strategies to estimate causal effects and their applicability depends on the exact structure of the causal graph.

condition on variables, i.e. matching and regression-based estimation
exogenous variation, i.e. instrumental variables estimation
establish an exhaustive and isolated mechanism, i.e. structural estimation
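
To make the first strategy concrete, here is a minimal simulation sketch, assuming a simple graph in which a single observed variable X affects both D and Y. The coefficients, the linear functional form, and the helper name `ols_slope_on_d` are illustrative assumptions. A regression of Y on D alone picks up the back-door path through X, while adding X as a regressor recovers the effect built into the simulation.

```python
import numpy as np

rng = np.random.default_rng(42)
num_agents = 20_000
true_effect = 2.0

# Assumed graph: X -> D, X -> Y, and D -> Y.
x = rng.normal(size=num_agents)
d = (x + rng.normal(size=num_agents) > 0).astype(float)
y = true_effect * d + 3.0 * x + rng.normal(size=num_agents)


def ols_slope_on_d(outcome, *regressors):
    """Return the OLS coefficient on the first regressor (intercept included)."""
    design = np.column_stack([np.ones_like(outcome), *regressors])
    coefficients, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefficients[1]


naive = ols_slope_on_d(y, d)
conditional = ols_slope_on_d(y, d, x)
print(f"naive: {naive:.2f}, conditioning on X: {conditional:.2f}, truth: {true_effect}")
```

The naive estimate is far above the true effect because individuals with high X are both more likely to be treated and have higher outcomes. Conditioning on X works here only because X is observed and is the sole common cause in this assumed graph.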

Here are some examples of what to expect.

[Figures: three example causal graphs]

The key message for now:

There is often more than one way to estimate a causal effect, and the alternatives place differing demands on knowledge and observability.

Pearl (2009) is the seminal reference on the use of graphs to represent general causal relationships.

References

Huntington-Klein, N., Arenas, A., Beam, E., Bertoni, M., Bloem, J., Burli, P., Chen, N., Grieco, P., Ekpe, G., Pugatch, T., Saavedra, M., Stopnitzky, Y. (2021). The influence of hidden researcher decisions in applied microeconomics, Economic Inquiry, 59, 944–960.
Pearl, J. (2009). Causality (2nd ed.). Cambridge, England: Cambridge University Press.
Pearl, J., and Mackenzie, D. (2018). The book of why: The new science of cause and effect. New York, NY: Basic Books.
Pearl, J., Glymour, M., and Jewell, N. P. (2016). Causal inference in statistics: A primer. Chichester, UK: Wiley.
Spiegelhalter, D. (2021). The Art of Statistics: Learning from Data. New York: Hachette Book Group.

Resources

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4), 604-620.