
Included-variable bias

If you have two regression models E[Y|X] = β0 + β1X and E[Y|X,Z] = γ0 + γ1X + γ2Z, then typically γ1 ≠ β1, because they are different things1
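A quick simulation makes the difference concrete (the coefficients and correlation below are made-up numbers, not from the post): when Z is correlated with X, the slope on X in the short model is not the slope on X in the long model.

```python
import numpy as np

# Simulated example (hypothetical numbers): Z correlated with X,
# so the two models give different coefficients on X.
rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)                   # Z correlated with X
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # gamma1 = 2, gamma2 = 3

# Short model: E[Y|X] = beta0 + beta1*X
beta = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]

# Long model: E[Y|X,Z] = gamma0 + gamma1*X + gamma2*Z
gamma = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0]

print(beta[1])   # close to 2 + 3*0.8 = 4.4
print(gamma[1])  # close to 2
```

Both fits are unbiased for their own estimand; they just have different estimands.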

A common name for this phenomenon is omitted-variable bias. That’s an unfortunate name, because it implies a direction in a situation that’s completely symmetric. Yes, β̂1 is biased for γ1, but γ̂1 is equally biased for β1.

The idea that γ1 is somehow natural and β1 is wrong comes from the gold-standard2 way of thinking about regression model choice: that there is a true model, defined by having all its coefficients non-zero, and that your job is to find it. From this point of view, either γ2 = 0, so β1 is preferred but β1 = γ1 anyway, or γ2 ≠ 0, so γ1 is preferred.

If you want β1 then γ̂1 has included-variable bias. If you want γ1 then β̂1 has omitted-variable bias. Or you can stop trying to think of the β and γ as being estimates of the same things and just talk about which one you actually want to estimate.
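The two estimands are linked by standard least-squares algebra (this identity is not from the post itself): the short-model slope satisfies β̂1 = γ̂1 + γ̂2·δ̂, where δ̂ is the fitted slope of Z on X. A sketch with simulated data:

```python
import numpy as np

# Check the exact in-sample OLS identity beta1_hat = gamma1_hat + gamma2_hat * delta_hat,
# where delta_hat is the fitted slope of Z on X. (Hypothetical numbers.)
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

def ols(design, resp):
    """OLS coefficients of resp on the columns of design."""
    return np.linalg.lstsq(design, resp, rcond=None)[0]

ones = np.ones(n)
beta1 = ols(np.column_stack([ones, x]), y)[1]
_, gamma1, gamma2 = ols(np.column_stack([ones, x, z]), y)
delta1 = ols(np.column_stack([ones, x]), z)[1]

# Holds exactly for the fitted coefficients, not just on average
print(beta1, gamma1 + gamma2 * delta1)
```

So neither coefficient is a noisy version of the other; they differ by a deterministic function of how Z tracks X.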


  1. Also γ2 ≠ β1, but that doesn’t tend to cause as much confusion.↩︎

  2. ie, old and wrong↩︎