The VGAM package’s vglm() function, like the survey package’s svymle() function, allows for maximum likelihood fitting where linear predictors are added to one or more parameters of a distribution, but vglm() is a lot faster and has many distributions already built in. So, I stuck a complex-sampling interface on the front of it, and made svyVGAM. It’s on GitHub at the moment; I’m hoping to get it on CRAN soon.
The idea of vglm() is that we have a parametric family \(f_\theta(y)\) for the outcome in a regression, and we want to attach linear predictors to one or more components of \(\theta\). For example, if \(f_\theta(y)\) is the density for \(N(\mu,\sigma^2)\), we would have parameters \(\theta=(\mu,\sigma^2)\) and would model \(\mu=x\beta\) to get the parametric view of the standard linear model, and just model \(\sigma^2\) as a single parameter. Or if \(f_\theta(y)\) is a zero-inflated Poisson model with Poisson mean \(\mu\) and proportion \(\pi\) of structural zeroes, we might have \(\theta=(\mathrm{logit}\,\pi, \log \mu)\), and have two models \(\log\mu=x\beta\) and \(\mathrm{logit}\,\pi=z\gamma\). In those two models the parameters \(\theta\) of the response distribution were directly modelled with the linear predictors, but we can also allow a link function. The zero-inflated Poisson could have been parametrised by \(\theta=(\pi,\mu)\) and we would want an intermediate parameter \(\eta=(\mathrm{logit}\,\pi, \log \mu)\) for the linear predictor.
I will write \(\beta\) for the regression parameters, \(\theta\) for the base parameters of the response distribution, and \(\eta\) for the linear predictors. So, for example, in a classical linear model there would be two parameters \(\theta\): the mean (\(\theta_1\)) and variance (\(\theta_2\)). The mean would have a set of regression parameters and the variance would have a single parameter. Collectively, these would be \(\beta\) (maybe \(\beta_{11}\dots\beta_{1p}\) and \(\beta_{21}\)), and the two combinations that are plugged in as \(\theta\) would be called \(\eta_1\) and \(\eta_2\). The big advantage of VGAM is that it does a lot of the work for the user: while the user can add new families, there are many pre-prepared ones, and there are built-in ways to constrain parameters to be equal or related in some other way. For example, a proportional odds model comes from constraining the slopes in a set of logistic regression models to be equal.
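As a minimal sketch of this setup (using VGAM’s uninormal() family, whose default zero = "sd" argument makes the standard-deviation predictor intercept-only, matching the classical linear model described above):

```r
library(VGAM)

# Simulated classical linear model: mean depends on x, constant sd
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5)

# uninormal() has two linear predictors: eta_1 for the mean and
# eta_2 = loglink(sd); by default the sd predictor gets only an
# intercept, so x enters only the mean model
fit <- vglm(y ~ x, uninormal())
coef(fit)
```

The coefficient names carry :1/:2 suffixes indicating which linear predictor each intercept belongs to, with the slope for x appearing only in the mean predictor.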
To provide survey versions of vglm(), we need to (a) get design-consistent point estimates out of vglm(), and (b) construct design-based standard errors for the fit. The first is easy: vglm() accepts frequency weights, which are equivalent to sampling weights for point estimation with independent observations.
The second can be done in two ways: resampling (which is straightforward, if potentially slow), and linearisation. Linearisation requires computing the influence functions of the parameters
\[h_i(\beta) = -\widehat{\cal I}^{-1}_w U_i(\beta)\]
where \(\widehat{\cal I}_w\) is the weighted estimate of the population Fisher information, \(U_i=\partial_\beta \ell_i(\beta)\) is the loglikelihood contribution of the \(i\)th observation, and \(w_i\) is its weight. The influence functions have the property
\[\hat\beta-\beta_0 = \sum_i w_i h_i(\beta_0)+o_p(\|\hat\beta-\beta_0\|)\]
so that the variance of \(\hat\beta\) is asymptotically the variance of the population total of the influence functions.
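A quick illustration with the api data shipped with survey, for the simplest case of a population mean, where the influence functions are \(h_i=(y_i-\hat\mu)/\hat N\):

```r
library(survey)
data(api)
des <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)

# Influence functions for the estimated mean of api00:
# h_i = (y_i - muhat) / Nhat
muhat <- coef(svymean(~api00, des))
Nhat  <- sum(weights(des))
infl  <- (apiclus1$api00 - muhat) / Nhat

# The SE of the mean is the SE of the estimated population total
# of the influence functions; the two calls below agree
SE(svytotal(infl, des))
SE(svymean(~api00, des))
```

This is exactly the svytotal() route mentioned below; svyrecvar() does the same computation without the overhead of building a svystat object.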
The survey package provides a function svyrecvar() to estimate standard errors given the influence functions, or (a bit less efficiently) you can just call svytotal().
Resampling
A design object of class svyrep.design contains sets of replicate weights analogous to jackknife or bootstrap replicates. We need to call vglm() with each set of weights; it helps to specify the full-sample estimates as starting values.
One complication is that sets of replicate weights will typically include some zeroes, which vglm() does not allow (e.g., a bootstrap or jackknife resample will omit some observations). We set these to \(10^{-9}\) times the maximum weight, which has the desired effect that the observations are present in the fit but with (effectively) zero weight.
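The fix takes one line; here repwt is a hypothetical vector holding one set of replicate weights:

```r
# One set of replicate weights: the zeroes are observations omitted
# from this particular jackknife/bootstrap replicate
repwt <- c(0, 1.5, 2.0, 0, 1.2)

# vglm() rejects zero weights, so replace them with something
# negligible: those observations stay in the fit but carry
# (effectively) no weight
repwt[repwt == 0] <- 1e-9 * max(repwt)
```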
Linearisation
The cov.unscaled slot of a summary.vglm object contains the inverse of the estimated population Fisher information, \(\widehat{\cal I}^{-1}_w\). The vglm object provides \(\partial_\eta\ell_i(\eta)\) for the linear predictors of the base parameters \(\theta\), and also the model matrices that specify \(\partial_\beta\eta\), so we can construct \(U_i\). We need to take care with the constraints, which can cause a coefficient \(\beta\) to appear in more than one linear predictor.
Suppose \(\beta_x\) appears in both \(\eta_1\) and \(\eta_2\), with \(x\) values \(x_1\) and \(x_2\). The chain rule tells us \[\partial_{\beta_x}\ell_i =\partial_{\eta_1}\ell_i\partial_{\beta_x}\eta_1 + \partial_{\eta_2}\ell_i\partial_{\beta_x}\eta_2 = (\partial_{\eta_1}\ell_i) x_{1i}+ (\partial_{\eta_2}\ell_i) x_{2i} \] We might have \(x_1\equiv x_2\,(=x)\), but that just means \[\partial_{\beta_x}\ell_i = (\partial_{\eta_1}\ell_i) x_{i}+ (\partial_{\eta_2}\ell_i) x_{i} \]
The constraint matrix \(C\) consists of 1s and 0s. If there are \(p\) parameters in \(M\) equations the matrix is \(M\times p\), with \(C_{jk}=1\) if parameter \(k\) is in linear predictor \(j\). In the default, unconstrained setup, the constraint matrix consists of an \(M\times M\) identity matrix for each parameter, pasted columnwise to give an \(M\times pM\) matrix. In the proportional odds model, as another example, there are separate intercepts for each linear predictor but the other parameters all appear in every linear predictor. The first \(M\times M\) block is the identity, and the rest of the matrix is a column of 1s for each predictor variable. Another way to say this is that \(C_{jk}=\partial_{ (\beta_kx_k)}\eta_j\).
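You can inspect these matrices directly with VGAM’s constraints() accessor. A proportional-odds fit to the pneumo data that ships with VGAM (this follows the propodds() help page) has \(M=2\) linear predictors:

```r
library(VGAM)

# Proportional odds: separate intercepts, one shared slope for `let`
fit <- vglm(cbind(normal, mild, severe) ~ let, propodds, data = pneumo)

# One constraint matrix per term: the intercept gets the 2 x 2
# identity (separate intercepts per predictor); `let` gets a column
# of 1s, so its single coefficient appears in both predictors
constraints(fit)
```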
So, if we want \(\partial_\beta\ell_i\), the chain rule says \[\begin{eqnarray*} \frac{\partial \ell_i}{\partial \beta_j} &=& \sum_k\frac{\partial \ell_i}{\partial\eta_k} \frac{\partial \eta_k}{\partial \beta_j}\\ &=& \sum_k\frac{\partial \ell_i}{\partial \eta_k} \frac{\partial \eta_k}{\partial (x\beta)_j}\frac{\partial (x\beta)_j}{\partial \beta_j}\\ &=&\sum_k \frac{\partial \ell_i}{\partial \eta_k} C_{kj}x_{ij} \end{eqnarray*}\]
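A toy numerical check of this chain rule (all objects here are made up for illustration): with \(M=2\) linear predictors and one coefficient whose covariate \(x\) enters both, the score is the \(C\)-weighted sum of the \(\eta\)-derivatives.

```r
set.seed(2)
n <- 5
x <- rnorm(n)
deta <- matrix(rnorm(2 * n), n, 2)  # stand-in for dl_i/deta_k, k = 1, 2
C <- matrix(c(1, 1), nrow = 2)      # the coefficient appears in both etas

# dl_i/dbeta = sum_k (dl_i/deta_k) C_kj x_i
score <- drop(deta %*% C) * x
stopifnot(all.equal(score, deta[, 1] * x + deta[, 2] * x))
```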
There is one further complication. The model.matrix method for vglm objects returns a model matrix of dimension \(Mn\times p\) rather than \(n\times p\), so we need to sum over the rows for each observation, which can be identified from the row names, and then rescale. The right standardisation appears to come from defining
\[\tilde C_{kj}=\frac{C_{kj}}{\sum_k C_{kj}}\]
and then
\[\frac{\partial \ell_i}{\partial \beta_j}=\sum_k \frac{\partial \ell_i}{\partial \eta_k} \tilde C_{kj}x_{ikj}.\]
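For instance, for a proportional-odds fit to the pneumo data from VGAM’s examples, \(n=8\) and \(M=2\), so the VLM model matrix has 16 rows, with row names like 1:1 and 1:2 identifying the observation and the linear predictor:

```r
library(VGAM)
fit <- vglm(cbind(normal, mild, severe) ~ let, propodds, data = pneumo)

# "vlm" is the default type; the matrix has one block of rows per
# linear predictor, stacked observation by observation
mm <- model.matrix(fit, type = "vlm")
dim(mm)            # Mn x p: 16 rows (= 2 * 8), 3 columns
head(rownames(mm)) # observation:predictor labels
```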
Example: zero-inflated Poisson
The zero-inflated Poisson model is a model for count data with excess zeroes. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\mu\), then \(Y=ZP\) is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) (‘structural’ zeroes) or because \(P=0\). That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on \(Z\) and a Poisson regression structure on \(P\).
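A small simulation makes the mixture concrete (the names here are illustrative, not from any package):

```r
set.seed(3)
n  <- 1e5
p0 <- 0.3                    # probability of a structural zero
mu <- 2                      # Poisson mean
z  <- rbinom(n, 1, 1 - p0)   # z = 0 marks the structural zeroes
y  <- z * rpois(n, mu)       # Y = ZP

mean(y == 0)                 # observed zero fraction
p0 + (1 - p0) * exp(-mu)     # theoretical P(Y = 0), about 0.39
```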
There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example of the benefits of svyVGAM. The pscl package (Zeileis et al.) fits zero-inflated models, and so does VGAM, so we can compare the model fitted with svyVGAM to both of those and to other work-arounds.
I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200 and SXQ100: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into ‘never’ and ‘not yet’. Here’s how I created the dataset, from two NHANES files. It’s data(nhanes_sxq) in the package:
library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]
Start off by loading the packages and setting up a survey design:
library(svyVGAM)
library(pscl)
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)
First, we’ll fit the model just ignoring the survey design, using both pscl::zeroinfl and VGAM::vglm. These models use the same variables in a logistic regression for \(Z\) and a Poisson regression for \(P\). In VGAM you would make the models different by constraining coefficients to be zero in one of them; in pscl you would specify different models before and after the |.
unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0204 -0.9433 -0.7871 0.1503 29.2567
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6666228 0.0506660 32.894 < 2e-16 ***
## RIDAGEYR -0.0055102 0.0008969 -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394019 0.0779480 -0.505 0.613
## factor(RIDRETH1)3 0.6508821 0.0345734 18.826 < 2e-16 ***
## factor(RIDRETH1)4 0.6675311 0.0365963 18.240 < 2e-16 ***
## factor(RIDRETH1)5 0.5642954 0.0594928 9.485 < 2e-16 ***
## DMDEDUC 0.0094256 0.0135180 0.697 0.486
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.188125 0.187079 1.006 0.31461
## RIDAGEYR -0.002938 0.003629 -0.810 0.41823
## factor(RIDRETH1)2 -0.079636 0.242311 -0.329 0.74242
## factor(RIDRETH1)3 0.118369 0.116120 1.019 0.30803
## factor(RIDRETH1)4 0.143300 0.127764 1.122 0.26203
## factor(RIDRETH1)5 0.259515 0.223032 1.164 0.24460
## DMDEDUC -0.148881 0.053337 -2.791 0.00525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9518 on 14 Df
vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.188125349 1.666622759 -0.002937819 -0.005510160
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.079635992 -0.039401949 0.118369301 0.650882145
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## 0.143300364 0.667531080 0.259515415 0.564295398
## DMDEDUC:1 DMDEDUC:2
## -0.148881313 0.009425589
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556
A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.
nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)
wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(wt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5864 -0.8418 -0.5430 0.1324 31.9106
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8340681 0.0614994 29.823 < 2e-16 ***
## RIDAGEYR -0.0073881 0.0009059 -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073312 0.0853535 -1.257 0.2086
## factor(RIDRETH1)3 0.6551367 0.0481679 13.601 < 2e-16 ***
## factor(RIDRETH1)4 0.6358148 0.0529173 12.015 < 2e-16 ***
## factor(RIDRETH1)5 0.4774124 0.0666607 7.162 7.96e-13 ***
## DMDEDUC -0.0237389 0.0143070 -1.659 0.0971 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.660504 0.214458 3.080 0.002071 **
## RIDAGEYR -0.007833 0.003673 -2.133 0.032959 *
## factor(RIDRETH1)2 -0.116789 0.252451 -0.463 0.643636
## factor(RIDRETH1)3 0.101971 0.151531 0.673 0.500987
## factor(RIDRETH1)4 -0.160804 0.181429 -0.886 0.375444
## factor(RIDRETH1)5 0.106779 0.230339 0.464 0.642954
## DMDEDUC -0.202397 0.057624 -3.512 0.000444 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, weights = scaledwt,
## crit = "coef")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(pstr0) -1.828 -0.9335 -0.365675 0.8834927 1.852
## loglink(lambda) -5.851 -1.2771 -0.002727 -0.0003665 65.774
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.6605042 0.2144354 3.080 0.002069 **
## (Intercept):2 1.8340681 0.0614568 29.843 < 2e-16 ***
## RIDAGEYR:1 -0.0078333 0.0036726 -2.133 0.032934 *
## RIDAGEYR:2 -0.0073881 0.0008995 -8.214 < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894 0.2527278 -0.462 0.643999
## factor(RIDRETH1)2:2 -0.1073312 0.0847641 -1.266 0.205430
## factor(RIDRETH1)3:1 0.1019712 0.1515002 0.673 0.500899
## factor(RIDRETH1)3:2 0.6551367 0.0481359 13.610 < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040 0.1814098 -0.886 0.375395
## factor(RIDRETH1)4:2 0.6358147 0.0529738 12.002 < 2e-16 ***
## factor(RIDRETH1)5:1 0.1067789 0.2303235 0.464 0.642931
## factor(RIDRETH1)5:2 0.4774124 0.0663590 7.194 6.27e-13 ***
## DMDEDUC:1 -0.2023967 0.0576221 -3.512 0.000444 ***
## DMDEDUC:2 -0.0237389 0.0146964 -1.615 0.106249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
##
## Log-likelihood: -9765.52 on 5036 degrees of freedom
##
## Number of Fisher scoring iterations: 8
##
## No Hauck-Donner effect found in any of the estimates
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote(
coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
))
rep1
## theta SE
## count_(Intercept) 1.8335175 0.1350
## count_RIDAGEYR -0.0073709 0.0028
## count_factor(RIDRETH1)2 -0.1071380 0.2471
## count_factor(RIDRETH1)3 0.6552029 0.1858
## count_factor(RIDRETH1)4 0.6361156 0.1438
## count_factor(RIDRETH1)5 0.4769791 0.2501
## count_DMDEDUC -0.0238310 0.0797
## zero_(Intercept) 0.6606269 0.2582
## zero_RIDAGEYR -0.0078221 0.0039
## zero_factor(RIDRETH1)2 -0.1156275 0.2854
## zero_factor(RIDRETH1)3 0.1015741 0.0984
## zero_factor(RIDRETH1)4 -0.1620363 0.0859
## zero_factor(RIDRETH1)5 0.1065392 0.2120
## zero_DMDEDUC -0.2025776 0.0586
repv = withReplicates(repdes, quote(
coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
))
repv
## theta SE
## (Intercept):1 0.6605042 0.2582
## (Intercept):2 1.8340681 0.1350
## RIDAGEYR:1 -0.0078333 0.0039
## RIDAGEYR:2 -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1 0.1019712 0.0983
## factor(RIDRETH1)3:2 0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2 0.6358147 0.1438
## factor(RIDRETH1)5:1 0.1067789 0.2120
## factor(RIDRETH1)5:2 0.4774124 0.2501
## DMDEDUC:1 -0.2023967 0.0586
## DMDEDUC:2 -0.0237389 0.0797
And now with svyVGAM::svy_vglm by linearisation:
library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
##
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights,
## crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.660504243 1.834068104 -0.007833317 -0.007388071
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.116789371 -0.107331190 0.101971159 0.655136725
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## -0.160804047 0.635814748 0.106778915 0.477412443
## DMDEDUC:1 DMDEDUC:2
## -0.202396659 -0.023738881
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966
and with replicate weights:
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")
## Call: as.svrepdesign.default(des, type = "Fay", fay.rho = 0.2)
## Fay's variance method (rho= 0.2 ) with 16 replicates.
##
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights,
## crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.660504243 1.834068104 -0.007833317 -0.007388071
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.116789371 -0.107331190 0.101971159 0.655136725
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## -0.160804047 0.635814748 0.106778915 0.477412443
## DMDEDUC:1 DMDEDUC:2
## -0.202396659 -0.023738881
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9765.52