Marc Bellemare asks whether splitting your sample by an observed covariate is a reasonable approach for estimating heterogeneity in treatment effects:

To get a treatment heterogeneity, wouldn’t it be better to maintain your sample as is, but to interact your treatment (i.e., land title, college degree, etc.) with groups (i.e., small and large plots, race, etc.), going so far as to omitting the constant in order to be able to retain each group

In general, selection on observables will not cause bias in OLS estimates, so this approach is fine. You can prove this formally by showing that your treatment variable of interest is uncorrelated with the error term in the selected sample – see page 7 of these slides for a sketch of that proof. However, I don’t find that proof very useful for building intuition about why this is the case, so here is a brief proof-by-Stata:

clear all

set seed 12345

*set up matrix of correlations between variables

matrix C = (1, .75, 0 \ .75, 1, 0 \ 0, 0, 1)

*simulate the data generating process – correlations between RHS variables

drawnorm T z u, n(1000) corr(C)

*generate y using our RHS variables

*T is the variable of interest

*z is an observed variable that changes how T affects y

gen y=1+2*T+0.3*z+u if z>0

*different treatment effect for the z<=0 group

replace y=1+1*T+0.3*z+u if z<=0

reg y T z

reg y T z if z>0

reg y T z if z<0

So we get unbiased estimates of the average treatment effect and of the conditional treatment effects given z>0 and z<0.
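For readers without Stata, here is a rough numpy-only sketch of the same exercise (the specific coefficients, and the sample size, are my own illustrative choices):

```python
# numpy-only sketch of the simulation above; n is large so the point
# estimates sit visibly close to the true values.
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000

# T and z correlated at .75; the error u is independent of both
cov = np.array([[1, .75, 0], [.75, 1, 0], [0, 0, 1]])
T, z, u = rng.multivariate_normal(np.zeros(3), cov, size=n).T

# treatment effect of 2 when z > 0 and 1 otherwise (illustrative numbers)
y = 1 + np.where(z > 0, 2, 1) * T + 0.3 * z + u

def t_coef(mask):
    """Coefficient on T from OLS of y on a constant, T, and z within mask."""
    X = np.column_stack([np.ones(mask.sum()), T[mask], z[mask]])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta[1]

b_above = t_coef(z > 0)   # close to 2
b_below = t_coef(z < 0)   # close to 1
print(b_above, b_below)   # selecting the sample on z causes no bias
```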

You can also use this approach to see that, for your point estimates, it doesn’t matter whether you split the sample or instead estimate the heterogeneous treatment effects using a dummy variable interacted with the treatment. That is, it doesn’t matter provided you estimate a fully-saturated model – you have to interact the dummy with all your RHS variables (including the constant), not just the treatment:

gen z_above_0 = z>0

reg y i.z_above_0##c.T i.z_above_0##c.z

*for comparison purposes, make T*below & T*above

gen T_z_above_0 = T*z_above_0

gen T_z_below_0 = T*(1-z_above_0)

reg y T_z_above_0 T_z_below_0 z_above_0 i.z_above_0##c.z

If you run the code yourself and mess with the seed value for the RNG, you can confirm that this method mechanically generates point estimates identical to the split-sample approach. However, the saturated approach assumes a common error variance across the whole sample, so it will *not* give you the same standard errors. Again, if you run it you can see that the point estimates match while the standard errors differ.
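The algebra behind the identical point estimates is easy to verify directly. In the saturated design matrix, every column for one group is zero wherever the other group’s columns are nonzero, so the normal equations decouple into the two split-sample regressions. A minimal numpy sketch of this (my own setup, not code from the post):

```python
# Check that a fully saturated interaction model reproduces split-sample
# point estimates exactly.
import numpy as np

rng = np.random.default_rng(12345)
n = 2000
cov = np.array([[1, .75, 0], [.75, 1, 0], [0, 0, 1]])
T, z, u = rng.multivariate_normal(np.zeros(3), cov, size=n).T
y = 1 + np.where(z > 0, 2, 1) * T + 0.3 * z + u  # illustrative coefficients
d = (z > 0).astype(float)

def ols(X, yv):
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return beta

# split-sample regressions of y on a constant, T, and z
above = d == 1
hi = ols(np.column_stack([np.ones(above.sum()), T[above], z[above]]), y[above])
lo = ols(np.column_stack([np.ones((~above).sum()), T[~above], z[~above]]), y[~above])

# fully saturated model: the dummy interacts with EVERY regressor, constant included
X = np.column_stack([1 - d, (1 - d) * T, (1 - d) * z, d, d * T, d * z])
b = ols(X, y)

# cross-products between the two groups' columns are all zero, so the
# estimates match to machine precision
match = np.allclose([lo[1], hi[1]], [b[1], b[4]])
print(match)  # True
```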

One of the commenters on Marc’s blog pointed out that a case where this is definitely problematic is if we select our sample on a dependent variable. Suppose we have heterogeneity by unobserved characteristics u, and we try to get at this by splitting the sample using values of the outcome:

*now look at heterogeneity by unobserved variable, u

gen y2=1+2*T+0.3*z+u if u>0

*different treatment effect for the u<=0 group

replace y2=1+1*T+0.3*z+u if u<=0

reg y2 T z

sum y2, d

*try splitting the sample by y2

local y2_high = r(p75)

reg y2 T z if y2>`y2_high'

reg y2 T z if y2<`y2_high'

The two separate regressions now each generate biased estimates of the mean treatment effect, and the CIs also don’t include the heterogeneous treatment effects by u. In other words, catastrophe. This is also something we can prove in general (page 15 of the slides linked above) – T is not independent of u. This just reinforces the maxim that selection on X is okay, whereas selection on y is a big problem.
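This failure mode is also easy to reproduce numerically. A numpy sketch of the same idea (again with my own illustrative coefficients), selecting the estimation sample on the outcome:

```python
# Splitting the sample on the OUTCOME biases the estimates: conditioning on
# y2 induces correlation between T and the error u (collider bias).
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000
cov = np.array([[1, .75, 0], [.75, 1, 0], [0, 0, 1]])
T, z, u = rng.multivariate_normal(np.zeros(3), cov, size=n).T

# treatment effect of 2 when u > 0 and 1 otherwise (illustrative numbers)
y2 = 1 + np.where(u > 0, 2, 1) * T + 0.3 * z + u

def t_coef(mask):
    """Coefficient on T from OLS of y2 on a constant, T, and z within mask."""
    X = np.column_stack([np.ones(mask.sum()), T[mask], z[mask]])
    beta, *_ = np.linalg.lstsq(X, y2[mask], rcond=None)
    return beta[1]

# split at the 75th percentile of the outcome, as in the Stata code
cut = np.quantile(y2, 0.75)
b_high = t_coef(y2 > cut)   # mostly u > 0 obs, yet the estimate sits well below 2
b_low = t_coef(y2 <= cut)
print(b_high, b_low)
```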
