Note about fixed effects in difference-in-differences with Python and R
Author
Ville Voutilainen
See the did repo for updated notebooks and source code.
In this note, we cover how to safely include fixed effects (FEs) in the difference-in-differences (DiD) model using Python statsmodels/patsy and R regression formulas.
TLDR:
It seems that in order to run the DiD specification safely in Python or R, we need to dummy-code fixed effects by hand. The construction of dummies from categorical fields cannot be left to statsmodels/patsy or to R formulas.
Imports
Code
# Python dependencies
import warnings
warnings.filterwarnings('ignore')

import numpy as np
np.random.seed(1337)
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import patsy

# Local helpers
from did_helpers import simulate_did_data, plot_panel_data

# R interface
import rpy2
%load_ext rpy2.ipython
Code
%%R
library(dagitty)
Simulate data
Simulate panel data from the data-generating process described in the notebook did_simulated_datasets.ipynb.
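A minimal sketch of the simulation call, assuming simulate_did_data requires no arguments and returns a dict with an "observed" DataFrame holding the columns Y, t, treatment_group, and time_group used below (the actual signature lives in did_helpers and may differ):

Code

# Hypothetical usage sketch; see did_helpers for the actual arguments
data = simulate_did_data()
plot_panel_data(data)  # assumed to accept the same dict
data["observed"].head()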
Our difference-in-differences design can be illustrated with the following directed acyclic graph (DAG). We can use a DiD estimator to account for the group-specific confounding \(\gamma_j\) and obtain an estimate for the ATT (average treatment effect on the treated) estimand.
DiD specifications with different fixed effects encodings
Vanilla DiD using dummy coding
When treatment_group, time_group, and their interaction are coded as dummy variables, we get the correct DiD estimate without multicollinearity issues. That is, this works fine!
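As a minimal sketch of what the hand-coded specification looks like (the function and dummy names D_treat, D_post, and D_treat_post are my own, not necessarily those used in the notebook):

Code

def vanilla_did_with_dummies(data):
    df = data["observed"].copy()
    # Hand-coded 0/1 dummies: nothing is left for patsy to expand or drop
    df["D_treat"] = (df["treatment_group"] == "treatment").astype(int)
    df["D_post"] = (df["time_group"] == "post").astype(int)
    df["D_treat_post"] = df["D_treat"] * df["D_post"]
    # The coefficient on D_treat_post is the DiD estimate of the ATT
    res = smf.ols("Y ~ D_treat + D_post + D_treat_post", data=df).fit()
    print(res.summary())
    return res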
DiD with time FEs using formula and categoricals for all terms
Now try coding all terms as string variables in the regression formula.
This is where we run into trouble! Patsy/statsmodels does not realize that the level treatment_group[control]:time_group[T.post], formed from the interaction term, is redundant and should be dropped. This leads to multicollinearity issues with the time FEs (see How to Test for Multicollinearity in Python).
Why this happens is unclear to me, especially since things work just fine under the vanilla DiD using formulas.
The problem seems to be that neither statsmodels/patsy nor R formulas are smart enough to remove multicollinear columns automatically. That is, from the arising multicollinearity issue alone, they do not understand that t absorbs time_group (that is, neither understands that t and time_group are essentially the “same thing”). As per the patsy documentation under redundancy and categorical factors: “We’re only worried here about “structural redundancies”, those which occur inevitably no matter what the particular values in your data set are. If you enter two different factors x1 and x2, but set them to be numerically equal, then Patsy will indeed produce a design matrix that isn’t full rank. Avoiding that is your problem.”
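The quoted caveat is easy to demonstrate on toy data. Below is a small sketch (the factors x1 and x2 and their values are made up for illustration): two structurally distinct factors that happen to take identical values yield a rank-deficient design matrix, and patsy does not intervene.

Code

# Two distinct factors with identical values: a data-dependent (not
# structural) redundancy, so patsy builds a rank-deficient design matrix
toy = pd.DataFrame({
    "x1": ["a", "b", "a", "b"],
    "x2": ["a", "b", "a", "b"],
    "y": [1.0, 2.0, 3.0, 4.0],
})
X = patsy.dmatrices("y ~ x1 + x2", toy, return_type="dataframe")[1]
print(np.linalg.matrix_rank(X.values), "<", X.shape[1])  # prints: 2 < 3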
While Python will show results for the interaction term (and the correct DiD estimate can be obtained by subtracting the second level from the first, as sketched after the Python code below), R will not even yield an estimate for the second level! It seems that Python/R regression leaves this freedom to the researcher by design, whereas Stata would automatically guide the user in a similar case.
Python
Code
def timefe_did_with_formula_2(data):
    df = data["observed"].copy()
    df["t"] = df["t"].astype(str)
    # Fit regression
    reg_str = "Y ~ -1 + t + treatment_group + treatment_group:time_group"
    res = smf.ols(reg_str, data=df).fit()
    print("Regression: {}\n".format(reg_str))
    print(res.summary())
    return df, reg_str
# Check for the multicollinearity issue, using the df and reg_str
# returned by the function above
df, reg_str = timefe_did_with_formula_2(data)
X = patsy.dmatrices(reg_str, df, return_type="dataframe")[1]
print(pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
}))
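As noted above, the correct DiD estimate can still be recovered from the Python fit by differencing the two interaction levels. A hedged sketch, reusing df and reg_str from the cell above and patsy's default coefficient labels (check res.params.index for the exact names):

Code

# Refit, then difference the two interaction levels: the difference is
# identified even though the individual coefficients are not
res = smf.ols(reg_str, data=df).fit()
b = res.params
did_estimate = (b["treatment_group[treatment]:time_group[T.post]"]
                - b["treatment_group[control]:time_group[T.post]"])
print(did_estimate)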
R

Call:
lm(formula = reg_str, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.7139 -0.6574 0.0146 0.6663 3.8942
Coefficients: (1 not defined because of singularities)
                                         Estimate Std. Error t value Pr(>|t|)    
t0                                        1.00591    0.04407   22.82   <2e-16 ***
t1                                        2.02139    0.04407   45.87   <2e-16 ***
t10                                       8.93838    0.05487  162.91   <2e-16 ***
t11                                       9.97461    0.05487  181.79   <2e-16 ***
t2                                        3.01118    0.04407   68.33   <2e-16 ***
t3                                        4.04414    0.04407   91.76   <2e-16 ***
t4                                        4.98520    0.04407  113.12   <2e-16 ***
t5                                        5.98744    0.04407  135.86   <2e-16 ***
t6                                        4.97897    0.05487   90.75   <2e-16 ***
t7                                        5.98132    0.05487  109.01   <2e-16 ***
t8                                        7.01107    0.05487  127.78   <2e-16 ***
t9                                        7.93908    0.05487  144.70   <2e-16 ***
treatment_grouptreatment                  0.99386    0.03318   29.95   <2e-16 ***
treatment_groupcontrol:time_grouppost     2.05414    0.04693   43.77   <2e-16 ***
treatment_grouptreatment:time_grouppost        NA         NA      NA       NA    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9951 on 7186 degrees of freedom
Multiple R-squared: 0.9812, Adjusted R-squared: 0.9812
F-statistic: 2.677e+04 on 14 and 7186 DF, p-value: < 2.2e-16
Further checks
Below are some alternative ways of specifying the terms as categorical variables, but the same problem persists: statsmodels/patsy does not know that the unit/time FEs are related to the interaction term and is not smart enough to infer it from multicollinearity alone.
Code
def alternative_did_specs(data):
    df = data["observed"].copy()
    # Code variables as categoricals
    df["treatment_time_group"] = df["treatment_group"] + "_" + df["time_group"]
    df["t"] = pd.Categorical(df["t"], ordered=True)
    df["treatment_group"] = pd.Categorical(
        df["treatment_group"], categories=["control", "treatment"], ordered=True
    )
    df["treatment_time_group"] = pd.Categorical(df["treatment_time_group"], ordered=False)

    # Alternative 1
    reg_str = "Y ~ -1 + t + treatment_group + treatment_group:time_group"
    res = smf.ols(reg_str, data=df).fit()
    print("Regression: {}\n".format(reg_str))
    print(res.summary())
    print("\n")

    # Alternative 2
    reg_str = "Y ~ -1 + t + treatment_group + treatment_time_group"
    res = smf.ols(reg_str, data=df).fit()
    print("Regression: {}\n".format(reg_str))
    print(res.summary())
    print("\n")
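Finally, following the TLDR, the safe route is to build the fixed-effect dummies by hand so that no categorical expansion is left to patsy. A hedged sketch (the function and column names are my own):

Code

def timefe_did_with_manual_dummies(data):
    df = data["observed"].copy()
    # One 0/1 column per period; drop one level to avoid the dummy trap
    time_fes = pd.get_dummies(df["t"], prefix="t", drop_first=True, dtype=int)
    df = pd.concat([df, time_fes], axis=1)
    df["D_treat"] = (df["treatment_group"] == "treatment").astype(int)
    df["D_treat_post"] = df["D_treat"] * (df["time_group"] == "post").astype(int)
    # D_post is absorbed by the time FEs and is therefore omitted;
    # the coefficient on D_treat_post is the DiD estimate
    reg_str = "Y ~ {} + D_treat + D_treat_post".format(" + ".join(time_fes.columns))
    res = smf.ols(reg_str, data=df).fit()
    print(res.summary())
    return res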