Method Article

A Method to adjust for measurement error in multiple exposure variables measured with correlated errors in the absence of an internal validation study

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 18 Dec 2020

Abstract

Difficulty in obtaining the correct measurement for an individual’s long-term exposure is a major challenge in epidemiological studies that investigate the association between exposures and health outcomes. Measurement error in an exposure biases the association between the exposure and a disease outcome. Usually, an internal validation study is required to adjust for exposure measurement error; adjustment is challenging when such a study is not available. We propose a general method for adjusting for measurement error where multiple exposures are measured with correlated errors (a multivariate method) and illustrate the method using real data. We compare the results from the multivariate method with those obtained using a method that ignores measurement error (the naive method) and a method that ignores correlations between the errors and between the true exposures (the univariate method). We find that ignoring measurement error leads to bias and underestimates the standard error. A sensitivity analysis shows that the magnitude of adjustment in the multivariate method is sensitive to the magnitude of measurement error and to the sign and magnitude of the correlation between the errors. We conclude that the multivariate method can be used to adjust for bias in the outcome-exposure association in a case where multiple exposures are measured with correlated errors in the absence of an internal validation study. The method is also useful in conducting a sensitivity analysis on the magnitude of measurement error and the sign of the error correlation.

Keywords

Measurement error, Internal validation study, Attenuation, Bias, Questionnaire data, Sensitivity analysis, Error correlation

Abbreviations

HIV: Human immunodeficiency virus; HBCT: Home-based HIV counseling and testing; HSRC: Human Sciences Research Council; NCD: Non-communicable disease; BMI: Body mass index; kg: kilogram; m²: metre squared; g: gram; MCMC: Markov Chain Monte Carlo; CI: Credible interval; JAGS: Just Another Gibbs Sampler; BUGS: Bayesian inference Using Gibbs Sampling; ACF: Autocorrelation function

Introduction

Difficulty in obtaining correct measurements of an individual’s long-term exposure is a major challenge in epidemiological studies that investigate the association between a continuous exposure and a health outcome. For instance, several studies estimated the correlations between self-reported intake from a questionnaire and the true long-term intake values to be less than 0.82 for fruits and about 0.72 for vegetables1–5, which implies that some of the variation in the dietary intake measurements is due to random error. Because of this random error, the estimated association between dietary intakes and health outcomes may be biased. The effect of measurement error can be quantified using either (i) the attenuation factor, which quantifies the bias in the association, or (ii) the validity coefficient, i.e., the correlation coefficient between the true and the observed exposure, which quantifies the loss of statistical power to detect a significant association6.

Validation studies are used to assess the accuracy of the dietary questionnaire6–12. A validation study comprises a small number of individuals whose dietary intakes are measured repeatedly using an unbiased instrument13. There are two types of validation studies: external and internal. An internal validation study is conducted on a subset of individuals from the main study, whereas an external validation study is carried out on a group of subjects who are not part of the main study but who have similar characteristics to individuals in the main study. Validation studies are often expensive to conduct and, in some cases, not feasible. Several methods have been proposed to handle measurement error in the absence of internal validation data14–18.

Agogo et al.14 conducted a sensitivity analysis to investigate the effect of the magnitude of the correlation between errors in the covariates of interest and found that the magnitude of measurement error adjustment is sensitive to the assumed measurement error structure. Dellaportas and Stephens15 presented a Bayesian method for the analysis of non-linear errors-in-variables models in which prior knowledge of the unknown true covariate is incorporated. Huang et al.16 proposed quantile regression-based non-linear mixed-effects joint models for longitudinal data that, under the Bayesian framework, simultaneously account for a response with non-central location and for covariates with non-normality and measurement error. Lin17 proposed a Bayesian semi-parametric accelerated failure time model to analyze censored survival data with covariate measurement error and evaluated the method using an intensive simulation study. Muff et al.18 introduced a Bayesian method to handle a mixture of classical and Berkson measurement errors in a single explanatory variable and illustrated the method with a study of cardiovascular disease mortality.

The majority of these authors considered a case where one exposure is measured with error (hereafter, a univariate case). In a univariate method, the bias in the association between an outcome and the exposure is adjusted by dividing the unadjusted association estimate by the attenuation factor19. The attenuation factor is the ratio of the variance of the true exposure to the variance of the observed exposure. This method ignores correlations between the errors, which can lead to substantial bias. In this study, we propose a general method for adjusting for measurement error where multiple exposures are measured with correlated errors in the absence of an internal validation study (hereafter, a multivariate method). We use real data to illustrate the method for a case where three exposures are measured with correlated errors (hereafter, the trivariate method) under a linear regression model and demonstrate the implementation of this method using R software20. Specifically, we use a subset of data from a home-based HIV counseling and testing study that was done in rural and peri-urban communities in KwaZulu-Natal Province, South Africa21. We compare the results obtained using a method that ignores both the measurement error and the correlation between the errors (hereafter, a naive method) with those obtained using the univariate and multivariate methods. Moreover, we conduct a sensitivity analysis to investigate how the coefficient estimates of the parameters of interest are influenced by (1) a change in the level of uncertainty assumed for the limits of the validity coefficients and (2) varying the correlation between errors in the measured exposures.

The remaining sections of this paper are organized as follows. In section 2, we discuss materials and methods used in this study. We present the results of the study in section 3. Finally, we provide a discussion and conclusion in section 4.

Methods

Data and study design

In this work, we use a subset of data from a home-based HIV counseling and testing (HBCT) study that was conducted in rural and peri-urban communities in KwaZulu-Natal Province, South Africa, between November 2011 and June 201221. The data were obtained from the Human Sciences Research Council (HSRC) of South Africa21. This study was conducted to provide a better understanding of the complexity, severity, and prevalence of non-communicable diseases (NCDs) in a community known to have one of the highest rates of HIV incidence and prevalence in the world21.

Home-based HIV counseling and testing is a cross-sectional, single-site study in South Africa that aims to increase engagement in HIV care by integrating NCDs screening with community-based HIV testing22. A random sampling approach was used, where 587 participants over the age of 18 were selected from 50,000 people living in the Mpumuza suburb21. Anthropometric and biological measures were collected in the survey with the purpose of establishing the prevalence of a range of NCDs and associated risk factors. Eligible individuals participated in a face-to-face interview, physical, psychological and clinical examinations. Persons younger than 18 years living in Mpumuza and all household members not previously enrolled, and members unable to give written consent were excluded from the study. Mobile phones were used for data collection to increase efficiency in data capture and analysis21.

In our study, we used a subset of the data consisting of 76 individuals who self-reported the number of cigarettes smoked and their fruit and vegetable consumption. We use the dataset to illustrate the multivariate method in modeling the association between body mass index (BMI) and three exposures (smoking, fruit intake, and vegetable intake). BMI was measured in kg/m², while smoking was measured as the average number of cigarettes smoked per day. Initially, fruit and vegetable intakes were measured as the number of servings consumed per day. It is often assumed that a standard portion of fruit or vegetable weighs about 80 g5. Therefore, for this study, we converted the number of servings to grams per day (g/day) by multiplying the reported number of servings by 80 g. The subset has the following three properties that make it suitable for use in this work: (1) measurement error in the recorded number of cigarettes smoked due to possible misreporting; (2) measurement error in fruit and vegetable consumption due to recall bias and to the conversion of the number of servings into grams; and (3) measurement errors in the three exposures that are likely to be correlated; for instance, smokers are likely to over-report fruit and vegetable intakes because of their beneficial effects and to under-report the number of cigarettes they smoke because of the associated harmful effects. Epidemiologically, BMI is usually treated as a risk factor for a health outcome. However, in this study, we model BMI as an outcome, as in several other studies23–26. The subset is used only to illustrate the method and not to draw inference.

Ethical statement

Ethics approval was granted by both HSRC Research Ethics Committee (REC: 1/26/05/11) and the University of Washington Institutional Review Board (48733). Informed written consent was obtained from each participant in the study. Participants were provided with written information on the study (including the study’s background and objectives) and their rights regarding participation and withdrawal at any time.

A measurement error model for the data

A typical aim in an epidemiological study could be to investigate the association between BMI and three exposures, namely fruit intake, vegetable intake and smoking, using the multiple linear regression model

$$Y = \beta_0 + \beta_{X_1} X_1 + \beta_{X_2} X_2 + \beta_{X_3} X_3 + \epsilon, \tag{1}$$

where $Y$ denotes the BMI, $\beta_0$ is the intercept, $\beta_{X_1}$, $\beta_{X_2}$ and $\beta_{X_3}$ are the coefficients for the true long-term fruit intake ($X_1$), vegetable intake ($X_2$) and cigarette smoking ($X_3$), respectively, and $\epsilon$ is the random error term. In this study, we treat vegetable intake and cigarette smoking as confounders and assume that the main interest is in estimating $\beta_{X_1}$. In practice, the true intakes are unobservable and, therefore, the intakes recorded in self-reported questionnaires are used. Let $W_1$, $W_2$ and $W_3$ denote the measured versions of $X_1$, $X_2$ and $X_3$, respectively. Using the $W_p$ in place of the $X_p$ ($p = 1, 2, 3$) in Equation (1) yields biased estimates $\hat{\beta}_{W_1}$, $\hat{\beta}_{W_2}$ and $\hat{\beta}_{W_3}$ of $\beta_{X_1}$, $\beta_{X_2}$ and $\beta_{X_3}$, respectively. Let $\hat{\beta}_W = (\hat{\beta}_{W_1}, \hat{\beta}_{W_2}, \hat{\beta}_{W_3})^T$.

We assumed that the observed exposures are related to the true exposures with additive measurement error as

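$$W_i = \alpha_{0i} + \alpha_{1i} X_i + \epsilon_{W_i}, \quad i = 1, 2, 3, \tag{2}$$

where $\epsilon_W = (\epsilon_{W_1}, \epsilon_{W_2}, \epsilon_{W_3})$, $\epsilon_W \sim N(0, \Sigma_{\epsilon_W})$; $W = (W_1, W_2, W_3)$; $\alpha_0 = (\alpha_{01}, \alpha_{02}, \alpha_{03})$ and $\alpha_1 = (\alpha_{11}, \alpha_{12}, \alpha_{13})$, with the terms in $\alpha_0$ and $\alpha_1$ quantifying the constant bias and the proportional scaling bias, respectively. $\epsilon_W$ is a random error term, and $\epsilon_{W_i}$ is assumed to be independent of the true exposure $X_i$ and of the systematic bias components $\alpha_{0i}$ and $\alpha_{1i}$.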

Bias adjustment methods

A univariate method. In a univariate case, the bias in the association between an outcome and an exposure is adjusted by dividing the unadjusted association estimate by the attenuation factor19. Attenuation factor (λi) is defined as λi = var(Xi)/var(Wi), i.e., the ratio of the variance of the true exposure to the variance of the observed exposure, also referred to as reliability ratio. This method ignores correlations between the errors and also the correlation between the true exposures.
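The univariate correction can be illustrated with a small simulation. The sketch below is not the authors' R code; it uses only the Python standard library, and every numeric value (true slope, standard deviations, sample size, seed) is a hypothetical choice for illustration. The attenuation factor is known here only because the data are simulated.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(beta=2.0, sd_x=1.0, sd_err=1.0, n=20000, seed=1):
    """Simulate classical error W = X + e and apply the univariate correction."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sd_x) for _ in range(n)]          # true exposure
    w = [xi + rng.gauss(0.0, sd_err) for xi in x]         # observed exposure
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]     # outcome depends on X
    naive = ols_slope(w, y)                               # biased towards zero
    lam = sd_x ** 2 / (sd_x ** 2 + sd_err ** 2)           # var(X) / var(W)
    return naive, lam, naive / lam                        # adjusted = naive / lambda


naive, lam, adjusted = simulate_attenuation()
print(f"naive slope {naive:.3f}, lambda {lam:.2f}, adjusted slope {adjusted:.3f}")
```

With these hypothetical settings the naive slope should land near half the true slope and the adjusted slope near the true value, which is the attenuation the univariate method is designed to undo.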

Multivariate method. We propose and describe a general approach for handling $p$ exposures ($p \geq 2$) measured with correlated errors. For simplicity and without loss of generality, we assume that $W_i$ is measured without systematic bias (i.e., $\alpha_{0i} = 0$, $\alpha_{1i} = 1$ in Equation 2). For multiple exposures measured with correlated errors, the adjusted association estimates can be obtained by pre-multiplying the unadjusted association estimates by the inverse of the transpose of the attenuation-contamination matrix as

$$\hat{\beta}_X = (\hat{\Lambda}_p^T)^{-1} \hat{\beta}_W, \tag{3}$$

where $\hat{\beta}_X$ and $\hat{\beta}_W$ denote the vectors of adjusted and biased coefficient estimates for the $p$ exposures, respectively, and $\Lambda_p$ denotes a $p \times p$ attenuation-contamination matrix19,27. The off-diagonal elements of $\Lambda$ are known as contamination factors, while the diagonal elements are called attenuation factors14. Notably, the attenuation factor quantifies the bias in the association between an outcome and an exposure. In contrast, the contamination factor quantifies the effect of measurement error in one exposure variable on the estimate for another exposure variable. $\hat{\beta}_W$ in Equation (3) can be obtained from the observed questionnaire data.

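In the multiple exposures case, the estimate of the attenuation-contamination matrix $\hat{\Lambda}_p$ is defined as

$$\hat{\Lambda}_p = \underbrace{\begin{bmatrix} \hat{\sigma}_{X_1}^2 & \hat{\sigma}_{X_2X_1} & \cdots & \hat{\sigma}_{X_pX_1} \\ \hat{\sigma}_{X_1X_2} & \hat{\sigma}_{X_2}^2 & \cdots & \hat{\sigma}_{X_pX_2} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\sigma}_{X_1X_p} & \hat{\sigma}_{X_2X_p} & \cdots & \hat{\sigma}_{X_p}^2 \end{bmatrix}}_{\hat{\Sigma}_X^{*}} \; \underbrace{\begin{bmatrix} \hat{\sigma}_{W_1}^2 & \hat{\sigma}_{W_2W_1} & \cdots & \hat{\sigma}_{W_pW_1} \\ \hat{\sigma}_{W_1W_2} & \hat{\sigma}_{W_2}^2 & \cdots & \hat{\sigma}_{W_pW_2} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\sigma}_{W_1W_p} & \hat{\sigma}_{W_2W_p} & \cdots & \hat{\sigma}_{W_p}^2 \end{bmatrix}^{-1}}_{\hat{\Sigma}_W^{*-1}}, \tag{4}$$

where $\hat{\Sigma}_X^{*}$ is the estimate of the covariance matrix of the true exposures; $\hat{\Sigma}_W^{*-1}$ is the inverse of the estimate of the covariance matrix of the measured exposures; $\hat{\sigma}_{X_i}^2$ is the variance estimate of $X_i$ ($i = 1, 2, \ldots, p$); $\hat{\sigma}_{X_iX_j}$ ($j = 1, 2, \ldots, p$; $i \neq j$) denotes the covariance estimate between the true exposures; $\hat{\sigma}_{W_i}^2$ is the variance estimate of $W_i$; and $\hat{\sigma}_{W_iW_j}$ ($i \neq j$) is the covariance estimate between the observed exposures.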

The elements of the variance-covariance matrix of the observed exposures, $\Sigma_W$, are estimated from the observed data. The variances of the true exposures, $\sigma_{X_i}^2$, can be estimated using validity coefficients for the questionnaire. According to Kipnis et al.6, the validity coefficient is given by

$$\rho_{W_iX_i} = \frac{\operatorname{cov}(W_i, X_i)}{\sqrt{\operatorname{var}(W_i)\operatorname{var}(X_i)}} = \frac{\sigma_{X_i}}{\sigma_{W_i}}, \quad i = 1, 2, \ldots, p, \tag{5}$$

where $W_i$ is assumed to be measured with random error only and $\epsilon_{W_i}$ is assumed to be independent of $X_i$, so that $\operatorname{cov}(W_i, X_i) = \operatorname{var}(X_i)$. From Equation (5), we estimate the variance of the true exposures as

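$$\hat{\sigma}_{X_i}^2 = (\hat{\rho}_{W_iX_i} \hat{\sigma}_{W_i})^2, \tag{6}$$

by incorporating external validation information on $\rho_{W_iX_i}$. To obtain covariances between the true exposures, one of the following two approaches is used: (i) if external information about the correlation between true exposures (i.e. $\hat{\rho}_{X_iX_j}$) is available, we obtain covariances between true exposures as follows:

$$\hat{\sigma}_{X_iX_j} = \hat{\rho}_{X_iX_j} \hat{\sigma}_{X_i} \hat{\sigma}_{X_j}, \quad i \neq j, \tag{7}$$

where the $\hat{\sigma}_{X_i}$ are obtained as shown in Equation (6); (ii) if we can obtain prior information about the correlation between the errors in the observed exposures, $\hat{\rho}_{\epsilon_{W_i}\epsilon_{W_j}}$, we can solve for $\hat{\sigma}_{X_iX_j}$ by decomposing the covariance of the observed exposures into the unknown covariance between true exposures and the unknown covariance between errors as follows:

$$\hat{\sigma}_{W_iW_j} = \hat{\sigma}_{X_iX_j} + \hat{\sigma}_{\epsilon_{W_i}\epsilon_{W_j}} + \underbrace{\hat{\sigma}_{X_i\epsilon_{W_j}}}_{0} + \underbrace{\hat{\sigma}_{X_j\epsilon_{W_i}}}_{0} = \hat{\sigma}_{X_iX_j} + \hat{\rho}_{\epsilon_{W_i}\epsilon_{W_j}} \hat{\sigma}_{\epsilon_{W_i}} \hat{\sigma}_{\epsilon_{W_j}}, \tag{8}$$

where $X_i$ and $\epsilon_{W_j}$, and $X_j$ and $\epsilon_{W_i}$, are assumed to be uncorrelated.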

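From Equation (2) and Equation (6), the estimate of the error variance $\hat{\sigma}_{\epsilon_{W_i}}^2$ is

$$\hat{\sigma}_{\epsilon_{W_i}}^2 = \hat{\sigma}_{W_i}^2 - \underbrace{\hat{\sigma}_{W_i}^2 \hat{\rho}_{W_iX_i}^2}_{\hat{\sigma}_{X_i}^2} = \hat{\sigma}_{W_i}^2 (1 - \hat{\rho}_{W_iX_i}^2). \tag{9}$$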

See Appendix B of the extended data28 for the proof.
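For orientation, the error-variance result follows in two lines from the stated assumptions. The sketch below is a brief outline under those assumptions (no systematic bias, and error independent of the true exposure); it is not a reproduction of the proof in Appendix B.

```latex
\begin{aligned}
\operatorname{var}(W_i) &= \operatorname{var}(X_i + \epsilon_{W_i})
  = \sigma_{X_i}^2 + \sigma_{\epsilon_{W_i}}^2
  && \text{(}\alpha_{0i} = 0,\ \alpha_{1i} = 1,\ X_i \text{ independent of } \epsilon_{W_i}\text{)} \\
\sigma_{\epsilon_{W_i}}^2 &= \sigma_{W_i}^2 - \sigma_{X_i}^2
  = \sigma_{W_i}^2 - \rho_{W_iX_i}^2 \sigma_{W_i}^2
  = \sigma_{W_i}^2 \left(1 - \rho_{W_iX_i}^2\right)
  && \text{(using } \sigma_{X_i}^2 = \rho_{W_iX_i}^2 \sigma_{W_i}^2 \text{ from Equation (6))}
\end{aligned}
```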

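From Equations (8) and (9), the covariances between the true exposures are given by

$$\hat{\sigma}_{X_iX_j} = \hat{\sigma}_{W_iW_j} - \hat{\rho}_{\epsilon_{W_i}\epsilon_{W_j}} \hat{\sigma}_{W_i} \hat{\sigma}_{W_j} \sqrt{(1 - \hat{\rho}_{W_iX_i}^2)(1 - \hat{\rho}_{W_jX_j}^2)}, \quad i \neq j. \tag{10}$$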

Using the observed data and external information, we can determine all the terms required to estimate the attenuation-contamination matrix, Λ, as shown in Equation (4) and adjust for the bias in the association between the exposures measured with error and the outcome using Equation (3).
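The full chain from Equations (6) and (10) through Equations (4) and (3) can be sketched for point estimates as follows. This is an illustrative Python sketch using only the standard library, not the authors' R implementation; the helper names (`true_covariance`, `adjust_coefficients`) and all input values are hypothetical, and the Bayesian propagation of uncertainty described later is not included.

```python
import math


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def true_covariance(sigma_w, rho_wx, rho_err):
    """Covariance of the true exposures via Equations (6) and (10)."""
    p = len(sigma_w)
    sd_w = [math.sqrt(sigma_w[i][i]) for i in range(p)]
    sigma_x = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                sigma_x[i][i] = (rho_wx[i] * sd_w[i]) ** 2            # Equation (6)
            else:
                err_cov = (rho_err[i][j] * sd_w[i] * sd_w[j]
                           * math.sqrt((1 - rho_wx[i] ** 2) * (1 - rho_wx[j] ** 2)))
                sigma_x[i][j] = sigma_w[i][j] - err_cov               # Equation (10)
    return sigma_x


def adjust_coefficients(beta_w, sigma_w, rho_wx, rho_err):
    """Adjusted coefficients via Equations (4) and (3)."""
    sigma_x = true_covariance(sigma_w, rho_wx, rho_err)
    lam = matmul(sigma_x, inverse(sigma_w))                           # Equation (4)
    lam_t_inv = inverse(transpose(lam))
    return [sum(lam_t_inv[i][j] * beta_w[j] for j in range(len(beta_w)))
            for i in range(len(beta_w))]                              # Equation (3)


# Hypothetical inputs for three exposures (not the study data).
sigma_w = [[1.0, 0.3, -0.2], [0.3, 1.0, -0.1], [-0.2, -0.1, 1.0]]
rho_wx = [0.6, 0.5, 0.55]                       # validity coefficients
rho_err = [[1.0, 0.2, -0.1], [0.2, 1.0, -0.1], [-0.1, -0.1, 1.0]]
beta_w = [0.01, 0.01, -0.25]                    # naive coefficient estimates
beta_x = adjust_coefficients(beta_w, sigma_w, rho_wx, rho_err)
print(beta_x)
```

When the observed covariances and the error correlations are all zero, the attenuation-contamination matrix is diagonal and the sketch reduces to the univariate correction, dividing each naive coefficient by the squared validity coefficient.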

Illustration of the multivariate method using the study data

We illustrate a method that accounts for uncertainty in the validity measures attributable to heterogeneity in the study populations and in parameter estimation. The proposed Bayesian method applies a Markov Chain Monte Carlo (MCMC) estimation approach to combine observed self-reported data and external validation data when adjusting for measurement error in three exposures measured with correlated errors. MCMC is a class of algorithms that sample from posterior distributions by traversing the parameter space29. The posterior distribution is obtained by updating the prior distribution with the observed data. The steps for implementing the trivariate method are described below.

First, we obtained external information on validity coefficients and generated validity coefficients for use by interpreting the lower and upper limits obtained from the literature as the limits of the 95% credible interval (CI) of the distribution of possible values. Because the distribution of validity coefficients is skewed, Fisher’s transformation was used to generate the validity coefficients, as explained in the next section.

Second, for the observed exposures, we estimated the posterior distribution of the covariance matrix ($\Sigma_W$). The exposures were assumed to follow a multivariate normal distribution with mean $\mu_W$ and covariance $\Sigma_W$, i.e., $W \sim N_3(\mu_W, \Sigma_W)$. We assumed a weakly informative multivariate normal prior for $\mu_W$, namely $\mu_W \sim N_3(0, 10^6 I_3)$, where $I_3$ is a $3 \times 3$ identity matrix. In a multivariate normal distribution, $\Sigma_W$ must satisfy two conditions: (1) it must be positive definite (i.e. $a^T \Sigma_W a > 0$ for all non-zero vectors $a$) and (2) it must be symmetric. The semi-conjugate prior distribution for $\Sigma_W$, which has these two properties, is the inverse-Wishart distribution29. To minimize the influence of the prior information on the estimate of $\Sigma_W$, we used a weakly informative inverse-Wishart prior, $\Sigma_W \sim IW(I_3, v)$, where $v = 3$ is the degrees of freedom.

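Third, using the validity coefficients generated from the external data and the posterior distribution of the covariance matrix for the observed exposures, we estimated the variances of the true intakes, $\hat{\sigma}_{X_i}^2$ ($i = 1, 2, 3$), using the relationship given in Equation (6), so that

$$\begin{aligned} \hat{\sigma}_{X_1}^2 &= (\hat{\rho}_{W_1X_1} \hat{\sigma}_{W_1})^2 \\ \hat{\sigma}_{X_2}^2 &= (\hat{\rho}_{W_2X_2} \hat{\sigma}_{W_2})^2 \\ \hat{\sigma}_{X_3}^2 &= (\hat{\rho}_{W_3X_3} \hat{\sigma}_{W_3})^2. \end{aligned} \tag{11}$$

The covariances between true intakes ($\hat{\sigma}_{X_iX_j}$; $i \neq j$; $i, j = 1, 2, 3$) were estimated as

$$\begin{aligned} \hat{\sigma}_{X_1X_2} &= \hat{\sigma}_{W_1W_2} - \hat{\rho}_{\epsilon_{W_1}\epsilon_{W_2}} \hat{\sigma}_{W_1} \hat{\sigma}_{W_2} \sqrt{(1 - \hat{\rho}_{W_1X_1}^2)(1 - \hat{\rho}_{W_2X_2}^2)} \\ \hat{\sigma}_{X_1X_3} &= \hat{\sigma}_{W_1W_3} - \hat{\rho}_{\epsilon_{W_1}\epsilon_{W_3}} \hat{\sigma}_{W_1} \hat{\sigma}_{W_3} \sqrt{(1 - \hat{\rho}_{W_1X_1}^2)(1 - \hat{\rho}_{W_3X_3}^2)} \\ \hat{\sigma}_{X_2X_3} &= \hat{\sigma}_{W_2W_3} - \hat{\rho}_{\epsilon_{W_2}\epsilon_{W_3}} \hat{\sigma}_{W_2} \hat{\sigma}_{W_3} \sqrt{(1 - \hat{\rho}_{W_2X_2}^2)(1 - \hat{\rho}_{W_3X_3}^2)}, \end{aligned} \tag{12}$$

by incorporating external validation information on the correlation between the errors ($\rho_{\epsilon_{W_i}\epsilon_{W_j}}$). We generated the correlation between errors from a plausible range guided by the correlation in the observed data and by prior expert information on the most likely sign of the correlation between the exposures, as described in the next section.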

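Having obtained the covariance matrices of the true and observed exposures, we estimated the attenuation-contamination matrix ($\Lambda_3$) from their joint distribution as

$$\hat{\Lambda}_3 = \underbrace{\begin{bmatrix} \hat{\sigma}_{X_1}^2 & \hat{\sigma}_{X_1X_2} & \hat{\sigma}_{X_1X_3} \\ \hat{\sigma}_{X_1X_2} & \hat{\sigma}_{X_2}^2 & \hat{\sigma}_{X_2X_3} \\ \hat{\sigma}_{X_1X_3} & \hat{\sigma}_{X_2X_3} & \hat{\sigma}_{X_3}^2 \end{bmatrix}}_{\hat{\Sigma}_X} \; \underbrace{\begin{bmatrix} \hat{\sigma}_{W_1}^2 & \hat{\sigma}_{W_1W_2} & \hat{\sigma}_{W_1W_3} \\ \hat{\sigma}_{W_1W_2} & \hat{\sigma}_{W_2}^2 & \hat{\sigma}_{W_2W_3} \\ \hat{\sigma}_{W_1W_3} & \hat{\sigma}_{W_2W_3} & \hat{\sigma}_{W_3}^2 \end{bmatrix}^{-1}}_{\hat{\Sigma}_W^{-1}}, \tag{13}$$

where $\hat{\Sigma}_X$ is the estimate of the covariance matrix of the three true exposures; $\hat{\Sigma}_W^{-1}$ is the inverse of the estimate of the covariance matrix of the three measured-with-error exposures; $\hat{\sigma}_{X_i}^2$ is the variance estimate of $X_i$ ($i = 1, 2, 3$); $\hat{\sigma}_{X_iX_j}$ ($i \neq j$) denotes the covariance estimate between the true exposures; $\hat{\sigma}_{W_i}^2$ is the variance estimate of $W_i$; and $\hat{\sigma}_{W_iW_j}$ ($i \neq j$) is the covariance estimate between the observed exposures.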

Lastly, we fitted a Bayesian multiple linear regression model (hereafter, the naive method) to obtain the posterior distributions of the unadjusted coefficient estimates $\hat{\beta}_W = (\hat{\beta}_{W_1}, \hat{\beta}_{W_2}, \hat{\beta}_{W_3})^T$. In the naive model, we assumed weakly informative independent normal priors for the unadjusted coefficients by choosing a very small precision (large variance), $\beta_{W_i} \sim N(0, 10^6)$. The adjusted coefficient estimates $\hat{\beta}_X$ were then obtained from the joint posterior distribution of $\hat{\Lambda}_3$ and $\hat{\beta}_W$ as

$$\hat{\beta}_X = (\hat{\Lambda}_3^T)^{-1} \hat{\beta}_W. \tag{14}$$

Software implementation of the trivariate method

We implemented the trivariate method in R version 3.6.3 using the rjags (version 4-10), coda (version 0.19-3), MCMCpack (version 1.4-9), and mvtnorm (version 1.1-1) packages. To facilitate Bayesian estimation of the covariance matrix of the observed exposures ($\Sigma_W$), the rjags package was used to provide an interface from R to the JAGS library30. JAGS is a Gibbs sampler that uses MCMC to draw dependent samples from the posterior distribution of the parameters31. The Bayesian estimation of $\Sigma_W$ proceeded in the following steps: (1) defining a model for $\Sigma_W$ in the BUGS (Bayesian inference Using Gibbs Sampling) language in a stand-alone file, (2) reading the model file using the jags.model function, (3) updating the model using the update method for jags objects and (4) extracting the posterior samples of the model using the coda.samples function.

The MCMCregress function from the MCMCpack package was used to generate a posterior sample from the naive linear regression model32. Convergence of all model parameters was assessed using trace plots and autocorrelation function (ACF) plots from the coda package33. See extended data: Appendix C28 for the convergence diagnostics results. For each model, the burn-in was set to 2,000 iterations, and 10,000 MCMC iterations were run after the burn-in. Every sample was kept in the MCMC simulations by using a thinning interval of 1. When compiling a JAGS model, an initial sampling step may be needed during which the samplers adapt their behaviour to maximize their performance34. Therefore, the number of adaptation iterations in the JAGS model was set to 500. The results are presented in terms of density plots, posterior means and posterior medians. We compared the results obtained under the naive, univariate, and trivariate methods. The R code used for the analysis is presented in the extended data28.

External information on the validity coefficient and error correlations for the study data

External information on the validity coefficients and error correlations for fruit intake, vegetable intake, and cigarette smoking was obtained from the literature. According to Kaaks et al.1, the validity coefficient of self-reported fruit intake ranged from 0.33 to 0.79, while that of vegetable intake ranged from 0.30 to 0.60. A meta-analysis on the validity of questionnaires assessing fruit and vegetable consumption by Collese et al.2 reported validity coefficients of 0.26 for vegetables and 0.49 for fruits. Other similar validation studies reported validity coefficients in the aforementioned ranges for fruits and vegetables3,4,35. Therefore, based on this information, we considered a range of 0.3 to 0.8 for fruits and a range of 0.25 to 0.7 for vegetables.

In the Scottish Heart Health Study of 2,849 men and 2,900 women36, the correlation between the self-reported number of cigarettes and biochemical measures was reported to be between 0.67 and 0.72. In a study on the validation of self-reported smoking by analysis of hair for nicotine and cotinine37, the validity coefficient between the number of cigarettes smoked per day and nicotine/cotinine levels in hair and plasma was found to be between 0.48 and 0.63, while the correlation between the average number of cigarettes smoked and carboxyhemoglobin was 0.70. In a follow-up study examining the relationships among self-reported cigarette consumption, exhaled carbon monoxide, and urinary cotinine/creatinine ratio in pregnant women38, a validity coefficient in the range of 0.61 to 0.70 was reported. A study by Stram et al.39 found the correlation between the self-reported number of cigarettes smoked and the true lung dose to be between 0.40 and 0.70, and this range was consistent with the findings from the previously discussed validation studies. Based on this information, we considered a validity coefficient range of 0.40 to 0.70 for cigarette smoking.

We generated the correlation between errors from plausible ranges that were determined based on the correlation in the observed data and the most probable sign of the correlation among fruits, vegetables, and cigarettes as explained below:

a. Since the correlation coefficient between fruit and vegetable intake in the observed data was positive, we also assumed the error correlation between fruit and vegetables to be mostly positive;

b. An investigation on the correlation coefficient between cigarette smoking and fruits/vegetable intake in the observed data showed a negative correlation coefficient. Based on this and the fact that persons who tend to overstate fruit and vegetable consumption are likely to understate the number of cigarettes smoked, we assumed the error correlation to be mostly negative.

We obtained the upper limits of error correlations by assuming that the error covariance equals the covariance in the observed data and set the lower limit of the error correlation to zero, based on the assumption that the covariance in the observed data equals the covariance between the true intakes14.
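One way to read this rule, written out under the assumptions of Equations (8) and (9), is sketched below. It is an interpretation rather than a formula stated by the authors: setting the covariance between the true exposures to zero in Equation (8) attributes all of the observed covariance to the errors, which gives the largest error correlation (in magnitude) compatible with the observed data; setting the error covariance to zero gives the other limit.

```latex
\begin{aligned}
\text{Upper limit (all observed covariance due to errors, } \hat{\sigma}_{X_iX_j} = 0\text{):}\quad
\hat{\rho}_{\epsilon_{W_i}\epsilon_{W_j}}
  &= \frac{\hat{\sigma}_{W_iW_j}}{\hat{\sigma}_{\epsilon_{W_i}} \hat{\sigma}_{\epsilon_{W_j}}}
   = \frac{\hat{\sigma}_{W_iW_j}}
          {\hat{\sigma}_{W_i} \hat{\sigma}_{W_j}
           \sqrt{(1 - \hat{\rho}_{W_iX_i}^2)(1 - \hat{\rho}_{W_jX_j}^2)}} \\
\text{Lower limit (no error covariance, } \hat{\sigma}_{W_iW_j} = \hat{\sigma}_{X_iX_j}\text{):}\quad
\hat{\rho}_{\epsilon_{W_i}\epsilon_{W_j}} &= 0
\end{aligned}
```

Under this reading the limit carries the sign of the observed covariance, and a value exceeding 1 in magnitude would need to be truncated to remain a valid correlation.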

Estimating the distribution of ρWiXi

Using the range of plausible values obtained from external validation information, we generated the validity coefficients using the Fisher-Z transformation method by assuming that the reported lower and upper limits are 0.05 and 0.95 quantiles of the uncertainty distribution, respectively. Fisher Z-transformation is a commonly used method to transform the sampling distribution of correlation coefficients to become approximately normally distributed40,41. The procedure is as outlined below:

  • (i) Using the Fisher Z-transformation formula

    $$FZ_i = 0.5\,[\ln(1 + \rho_{W_iX_i}) - \ln(1 - \rho_{W_iX_i})], \tag{15}$$

    transform the lower ($r_l$) and upper ($r_u$) limits of the validity coefficient $\rho_{W_iX_i}$ to get the corresponding Fisher-Z transformed values $FZ_l$ and $FZ_u$, respectively.

  • (ii) Compute the mean $\mu_{Z_i}$ and the standard deviation $\sigma_{Z_i}$ of $FZ_i$ as $\mu_{Z_i} = 0.5(FZ_u + FZ_l)$ and $\sigma_{Z_i} = 0.5(FZ_u - FZ_l)/Z_{\alpha/2}$, where $Z_{\alpha/2}$ is the $100(1 - \alpha/2)\%$ quantile of a standard normal random variable.

  • (iii) Generate the $FZ_i$ as $FZ_i \sim N(\mu_{Z_i}, \sigma_{Z_i}^2)$.

  • (iv) Using the inverse of the Fisher Z-transformation, back-transform the generated $FZ_i$ to validity coefficients as

    $$\rho_{W_iX_i} = \frac{\exp(2FZ_i) - 1}{\exp(2FZ_i) + 1}. \tag{16}$$
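Steps (i) to (iv) can be sketched in a few lines. The Python code below uses only the standard library and is not the authors' R code; the function names, the seed, the number of draws and the default interval level are assumptions made for illustration, with the fruit range of 0.3 to 0.8 taken from the text.

```python
import math
import random
from statistics import NormalDist


def fisher_z(rho):
    """Equation (15): Fisher Z-transformation of a correlation."""
    return 0.5 * (math.log(1 + rho) - math.log(1 - rho))


def inverse_fisher_z(z):
    """Equation (16): back-transformation to the correlation scale."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)


def draw_validity_coefficients(lower, upper, level=0.95, n=1000, seed=2020):
    """Draw validity coefficients whose central `level` interval is (lower, upper)."""
    z_l, z_u = fisher_z(lower), fisher_z(upper)                # step (i)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)         # Z_{alpha/2}
    mu = 0.5 * (z_u + z_l)                                     # step (ii): midpoint
    sd = 0.5 * (z_u - z_l) / z_crit                            # step (ii): half-width / Z
    rng = random.Random(seed)
    return [inverse_fisher_z(rng.gauss(mu, sd)) for _ in range(n)]  # steps (iii)-(iv)


# Range for fruit intake taken from the text; other settings are illustrative.
fruit_draws = draw_validity_coefficients(0.3, 0.8)
inside = sum(0.3 <= r <= 0.8 for r in fruit_draws) / len(fruit_draws)
print(f"share of draws inside the reported range: {inside:.3f}")
```

Changing `level` corresponds to the sensitivity analysis in which the literature limits are equated to different CIs: a lower level spreads the draws more widely around the same midpoint on the Fisher-Z scale.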

Sensitivity analysis

We investigated how varying the level of uncertainty assumed for the limits of the validity coefficients reported from literature affected the estimates for fruit, vegetable, and the average number of cigarettes smoked. We also investigated how the estimates varied with the magnitude of the correlation between errors in fruit and vegetable intake, fruit and cigarette smoking, and vegetable and cigarette smoking. This helps determine the estimates’ sensitivity to various magnitudes of CI and the correlation between errors when using the multivariate method.

Results

Table 1 presents the regression coefficient estimates for fruit intake (g/day), vegetable intake (g/day), and the average number of cigarettes smoked per day obtained using the naive method and the two bias adjustment methods (i.e., the univariate and trivariate methods). The regression coefficient estimates adjusted for bias using either the univariate or the trivariate method were greater in absolute value than those obtained using the naive method. Specifically, for fruit intake and the average number of cigarettes smoked, the bias-adjusted coefficient estimates were about three times as large as the naive coefficient estimates. For vegetable intake, the bias-adjusted estimates were about four times as large as the naive regression coefficient estimate.

Table 1. Comparison of posterior Mean (Standard Deviation) and posterior Median for the estimates of fruit (g/day), vegetable (g/day) and average number of cigarettes smoked per day unadjusted for measurement error (naive estimates) and adjusted for measurement error using univariate and trivariate methods.

| Method | Fruit intake: Mean (SD) | Fruit intake: Median | Vegetable intake: Mean (SD) | Vegetable intake: Median | Smoking: Mean (SD) | Smoking: Median |
|---|---|---|---|---|---|---|
| Naive | 0.009 (0.012) | 0.009 | 0.008 (0.014) | 0.008 | -0.253 (0.640) | -0.247 |
| Univariate | 0.026 (0.036) | 0.027 | 0.031 (0.051) | 0.031 | -0.740 (1.874) | -0.721 |
| Trivariate | 0.026 (0.036) | 0.026 | 0.033 (0.051) | 0.033 | -0.714 (1.875) | -0.695 |
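The "about three times" and "about four times" statements can be checked directly from the posterior means in Table 1. The short Python sketch below only re-does that arithmetic; the dictionary layout and the name `table1_means` are illustrative choices, and the values are copied from the table.

```python
# Posterior means copied from Table 1.
table1_means = {
    "fruit": {"naive": 0.009, "univariate": 0.026, "trivariate": 0.026},
    "vegetable": {"naive": 0.008, "univariate": 0.031, "trivariate": 0.033},
    "smoking": {"naive": -0.253, "univariate": -0.740, "trivariate": -0.714},
}

# Ratio of each adjusted posterior mean to the naive posterior mean.
ratios = {
    exposure: {
        method: means[method] / means["naive"]
        for method in ("univariate", "trivariate")
    }
    for exposure, means in table1_means.items()
}

for exposure, by_method in ratios.items():
    print(exposure, {m: round(r, 2) for m, r in by_method.items()})
```

The reciprocal of each univariate ratio is the attenuation factor implied by the reported means for that exposure.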

For both fruit intake and the average number of cigarettes smoked, the univariate method gave slightly greater estimates (in absolute value) than the trivariate method, while the bias-adjusted values for vegetable intake were slightly lower in the univariate method. The variability of the regression coefficient estimate for the number of cigarettes smoked was higher than that for both fruit and vegetable intake. The variability under either the univariate or the trivariate method was also higher than under the naive method because of the uncertainty involved in adjusting for measurement error.

Figures 1 to 3 show the kernel densities of the estimates adjusted for measurement error (solid curves) and the naive estimates (dotted curves) for fruit intake, vegetable intake, and the number of cigarettes smoked, respectively. The solid vertical lines on the density plots depict the posterior means of the adjusted regression coefficients, while the dotted vertical lines show the posterior means of the naive regression coefficient estimates. The posterior means represented by the vertical lines show that the bias-adjusted regression coefficient estimates are generally higher (in absolute value) than the corresponding naive estimates.


Figure 1. Kernel densities for the distribution of adjusted for measurement error and unadjusted estimates for fruit intake.

The solid vertical lines show the posterior means of coefficient estimates adjusted for bias; the dotted vertical lines indicate the posterior means of unadjusted coefficient estimates.


Figure 2. Kernel densities for the distribution of adjusted for measurement error and unadjusted estimates for vegetable intake.


Figure 3. Kernel densities for the distribution of adjusted for measurement error and unadjusted estimates for cigarette smoking.

With the naive method, the variance of the regression coefficient for vegetable intake is underestimated more than that for fruit intake, as shown by the narrower spread between the tails of the naive density plots. Of the three exposures considered in this study, the regression coefficient variance for the average number of cigarettes smoked is the most underestimated (see Table 1 and Figures 1 to 3). In general, a comparison of the variances of the regression coefficients under the naive and trivariate methods shows that the naive method underestimates the variance of the regression coefficients.

Table 2 presents the mean (standard deviation) and the median of the estimates for fruit intake, vegetable intake, and the average number of cigarettes smoked, adjusted for measurement error using the trivariate method, when exploring the effect of the magnitude of uncertainty in the reported validity coefficients. The results show that the CI assumed for the distribution of the validity coefficient has little effect on the mean and median estimates for fruit, vegetable, and smoking. With the trivariate method, the results further show that the uncertainty of the estimates is only slightly affected by the level of uncertainty assumed for the validity coefficients. Figures 4 to 6 present the mean coefficient estimates for fruit intake, vegetable intake and the average number of cigarettes smoked, adjusted for measurement error using the trivariate method, in the sensitivity analysis that varies the magnitude of the error correlation between measurements of the exposures (see Tables D1 to D3 in the extended data for more details28).

Table 2. The mean (standard deviation) and the median of the estimates for fruit (g/day), vegetable (g/day) and average number of cigarettes smoked per day, adjusted for measurement error using the trivariate method, in the sensitivity analysis that equates the limits of literature-reported validity coefficients to different CIs.

CI (%)  Estimate for fruit intake  Estimate for vegetable intake  Estimate for smoking
        Mean (SD)      Median      Mean (SD)      Median          Mean (SD)       Median
85      0.027 (0.038)  0.027       0.032 (0.051)  0.032           -0.695 (1.839)  -0.676
90      0.026 (0.037)  0.027       0.032 (0.051)  0.032           -0.704 (1.856)  -0.685
95      0.026 (0.036)  0.026       0.033 (0.051)  0.033           -0.714 (1.875)  -0.695
99      0.025 (0.035)  0.025       0.033 (0.052)  0.033           -0.727 (1.899)  -0.708

The graphs show that varying the magnitude of the correlation between errors in any two exposures affects the estimates for all three exposures. For instance, Figure 4 shows that increasing the magnitude of the positive correlation between errors in fruit and vegetable intakes increases the mean estimates for both fruit and vegetable intake, while it decreases (in absolute value) the estimate for the average number of cigarettes smoked. Decreasing the negative correlation between errors in the measurements for fruit and cigarette smoking decreases (in absolute value) the mean estimates for both fruit and the average number of cigarettes smoked, while it increases the estimate for vegetable intake (Figure 5). Similarly, a decrease in the magnitude of the negative correlation between errors in vegetable intake and the number of cigarettes smoked decreases (in absolute value) the estimates for both vegetable intake and the average number of cigarettes smoked and increases the estimate for fruit intake (Figure 6).
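The dependence of the adjusted estimates on the error correlation can be illustrated with a minimal two-exposure sketch. It is not the authors' Bayesian trivariate implementation. It assumes a classical additive error model W = T + U with errors independent of the true exposures, so that var(U_i) = (1 - rho_i^2) var(W_i) for validity coefficient rho_i, and it uses the standard moment correction beta = (Sigma_W - Sigma_U)^(-1) Sigma_W beta_naive. The function name adjust_bivariate and all numeric inputs are hypothetical.

```python
import math


def adjust_bivariate(beta_naive, cov_w, validity, error_corr):
    """Correct two naive coefficients for correlated classical errors.

    beta_naive : (b1, b2) naive regression coefficients
    cov_w      : 2x2 covariance of the observed exposures, as nested tuples
    validity   : (rho1, rho2) correlations between observed and true exposure
    error_corr : assumed correlation between the two measurement errors
    """
    (w11, w12), (w21, w22) = cov_w
    b1, b2 = beta_naive
    # Error variances implied by the validity coefficients (assumption)
    u11 = (1.0 - validity[0] ** 2) * w11
    u22 = (1.0 - validity[1] ** 2) * w22
    u12 = error_corr * math.sqrt(u11 * u22)
    # Covariance of the true exposures: Sigma_T = Sigma_W - Sigma_U
    t11, t12, t22 = w11 - u11, w12 - u12, w22 - u22
    det = t11 * t22 - t12 * t12
    if det <= 0.0:
        raise ValueError("implied true-exposure covariance is not positive definite")
    # Right-hand side Sigma_W @ beta_naive, then solve Sigma_T @ beta = rhs
    r1 = w11 * b1 + w12 * b2
    r2 = w21 * b1 + w22 * b2
    return ((t22 * r1 - t12 * r2) / det, (t11 * r2 - t12 * r1) / det)


# Illustrative inputs only, not values from the HBCT data
cov_w = ((1.0, 0.3), (0.3, 1.0))
for rho_u in (-0.4, -0.2, 0.0, 0.2, 0.4):
    b = adjust_bivariate((0.02, 0.03), cov_w, (0.7, 0.8), rho_u)
    print(f"error correlation {rho_u:+.1f}: adjusted = ({b[0]:.4f}, {b[1]:.4f})")
```

When the observed exposures are uncorrelated and the error correlation is zero, the correction reduces to dividing each naive coefficient by its own squared validity coefficient, which corresponds to a univariate adjustment.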

Figure 4. The mean estimates for fruit (g/day), vegetable (g/day) and average number of cigarettes smoked per day, adjusted for measurement error using the trivariate method, in the sensitivity analysis that varies the magnitude of error correlation between measurements for fruit and vegetable.

Figure 5. The mean estimates for fruit (g/day), vegetable (g/day) and average number of cigarettes smoked per day, adjusted for measurement error using the trivariate method, in the sensitivity analysis that varies the magnitude of error correlation between measurements for fruit and average number of cigarettes smoked.

Figure 6. The mean estimates for fruit (g/day), vegetable (g/day) and average number of cigarettes smoked per day, adjusted for measurement error using the trivariate method, in the sensitivity analysis that varies the magnitude of error correlation between measurements for vegetable and average number of cigarettes smoked.

Discussion and conclusion

In this study, we proposed and illustrated a method that adjusts for measurement error in multiple exposures measured with correlated errors in the absence of internal validation data. The method combines external validation data from the literature with the observed self-reported data to adjust for bias in the association between the exposures and the outcome and to conduct a sensitivity analysis on the measurement error and the correlation between the errors. The advantages of the multivariate method presented in this work include the following: (1) it can be used to adjust for bias in the outcome-exposure association caused by measurement error in multiple exposures measured with correlated errors; (2) it is useful in the absence of costly internal validation data, provided that external information on the correlation between the observed and the true data, or on the error correlations of the observed data, is plausible within the study context; (3) it can be used for sensitivity analysis on the effect of uncertainty in the reported validity coefficients; (4) it can be used for sensitivity analysis on the magnitude and the direction of correlated errors; (5) it can adjust for confounding in the outcome regression model; and (6) it can be easily implemented in the freely available software R, as shown in the extended data28. Often, fruit and vegetable intakes are considered as one food group. Our study is relevant because fruit intake and vegetable intake are assessed separately as independent food groups and adjusted for correlated measurement errors.

In the HBCT study example used for illustration, the estimates for fruit intake, vegetable intake, and the average number of cigarettes smoked adjusted for bias using the trivariate method were very similar to the estimates adjusted for bias using the univariate method. The slight differences between the bias-adjusted coefficient estimates from the univariate and trivariate methods could be attributed to the weak correlations between errors assumed in this study. The sensitivity analysis on the magnitude of error correlation showed that the estimates obtained using the two methods would differ when stronger error correlations are assumed. Further, the sensitivity analysis showed that, where multiple exposures are measured with correlated errors, an increase in the magnitude of the error correlation between two exposures can increase their estimates and decrease the estimate for the other exposure. From the sensitivity analysis on the level of uncertainty, expressed through the CI assumed for the validity coefficients, we found that the assumed CI minimally influenced the estimates for the exposures. However, the CIs for the validity coefficients should be chosen reasonably, as studies have shown that uncertainty in the estimates may be affected by the level of uncertainty assigned to the validity coefficients14. Our results also show that the presence of measurement error in multiple exposures can bias the association in either direction.

This study has a few limitations: (1) for simplicity, we assumed that the exposures are measured without systematic bias, i.e., only with random errors. In practice, however, the exposures can be measured with systematic error. In such a case, the systematic error components can be incorporated in the measurement error model and in the estimation of the attenuation-contamination matrix; (2) although a measurement error structure can be multiplicative42, our study assumed an additive measurement error structure. Exposures measured with multiplicative error can be handled using our method by first converting the multiplicative structure to an additive one through a suitable transformation that linearizes the error structure; and (3) our study focused on a subset of current daily smokers, which is not representative of the HBCT cohort, and the results are therefore not generalizable.
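As a brief illustration of limitation (2), a logarithm is one transformation that linearizes a multiplicative error. The sketch below assumes a positive observed exposure $W$, a positive true exposure $T$, and a positive error $U$ independent of $T$; these symbols are notation chosen for this illustration.

```latex
\begin{align*}
W &= T \, U, \qquad T > 0,\; U > 0, \\
\log W &= \log T + \log U, \\
W^{*} &= T^{*} + U^{*}, \qquad
W^{*} = \log W,\quad T^{*} = \log T,\quad U^{*} = \log U .
\end{align*}
```

Under these assumptions, the additive adjustment would apply to the log-transformed exposure, so the adjusted coefficients refer to the log scale, and the absence of systematic bias corresponds to $E(U^{*}) = 0$ on that scale.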

From the findings of this study, we conclude that the multivariate method can be used to adjust for bias in the outcome-exposure association in a case where two or more exposures are measured with correlated errors. This is possible even in the absence of internal validation data provided that there is prior information about the validity of the data collection instruments and the magnitude of the measurement error correlation between the exposures. The method is useful in conducting a sensitivity analysis on the magnitude of measurement error and the sign of the error correlation.

Data availability

Source data

Data used in this study are made available to researchers upon registration and agreement to the terms and conditions of use on the HSRC website at http://curation.hsrc.ac.za/Dataset-565-datafiles.phtml.

Extended data

Figshare: A Method to Adjust for Measurement Error in Multiple Exposures Measured with Correlated Error in the Absence of Internal Validation Study-Supplementary materials. https://doi.org/10.6084/m9.figshare.13147970.v2 (reference 28)

The file contains the derivation of the validity coefficients, the proof for the estimate of the error variance, R code for implementing the methods, convergence diagnostics (i.e., trace plots and ACF plots for the standard deviation and naive regression coefficient estimates for fruit, vegetables and the average number of cigarettes smoked, with explanation), and the sensitivity analysis results (supporting tables) for varying the magnitude of error correlation between the exposures.

The extended data are available under the terms of the Creative Commons Zero (CC0) license.

how to cite this article
Muoka AK, Agogo GO, Ngesa OO and Mwambi HG. A Method to adjust for measurement error in multiple exposure variables measured with correlated errors in the absence of an internal validation study [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2020, 9:1486 (https://doi.org/10.12688/f1000research.27892.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - a number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Reviewer Report 25 Mar 2022
Kesaobaka Molebatsi, University of Botswana, Notwane Rd, Gaborone, Botswana 
Approved
The authors have proposed a method that accounts for measurement errors in multiple exposures that are correlated and called it a multivariate method. They have clarified the challenges of ignoring such a problem well in the absence of validation samples, ...
HOW TO CITE THIS REPORT
Molebatsi K. Reviewer Report For: A Method to adjust for measurement error in multiple exposure variables measured with correlated errors in the absence of an internal validation study [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2020, 9:1486 (https://doi.org/10.5256/f1000research.30843.r121761)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 06 May 2021
Erica Ponzi, Oslo Center for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway 
Approved with Reservations
The paper presents an application of measurement error modeling to a dataset from a home-based HIV counseling and testing (HBCT) study. It focuses on the case where multiple variables are measured with error and such errors are correlated. ...
HOW TO CITE THIS REPORT
Ponzi E. Reviewer Report For: A Method to adjust for measurement error in multiple exposure variables measured with correlated errors in the absence of an internal validation study [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2020, 9:1486 (https://doi.org/10.5256/f1000research.30843.r84307)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
