Using DHS and MICS data to complement or replace NGO baseline health data: an exploratory study

Background: Non-government organizations (NGOs) spend substantial time and resources collecting baseline data in order to plan and implement health interventions with marginalized populations. Typically interviews with households, often mothers, take over an hour, placing a burden on the respondents. Meanwhile, estimates of numerous health and social indicators in many countries already exist in publicly available datasets, such as the Demographic and Health Surveys (DHS) and the Multiple Indicator Cluster Surveys (MICS), and it is worth considering whether these could serve as estimates of baseline conditions. The objective of this study was to compare indicator estimates from non-governmental organizations (NGO) health projects’ baseline reports with estimates calculated using the Demographic and Health Surveys (DHS) or the Multiple Indicator Cluster Surveys (MICS), matching for location, year, and season of data collection. Methods: We extracted estimates of 129 indicators from 46 NGO baseline reports, 25 DHS datasets and three MICS datasets, generating 1,996 pairs of matched DHS/MICS and NGO indicators. We subtracted NGO from DHS/MICS estimates to yield difference and absolute difference, exploring differences by indicator. We partitioned variance of the differences by geographical level, year, and season using ANOVA. Results: Differences between NGO and DHS/MICS estimates were large for many indicators but 33% fell within 5% of one another. Differences were smaller for indicators with prevalence <15% or >85%. Difference between estimates increased with increasing year and geographical level differences. However, <1% of the variance of the differences was explained by year, geographical level, and season. Conclusions: There are situations where publicly available data could complement NGO baseline survey data, most importantly when the NGO has tolerance for estimates of low or unknown accuracy.


Introduction
Non-government and civil society organizations spend substantial time and resources collecting baseline data in order to plan and implement health interventions with marginalized populations, and to measure the impact of those interventions (Data for Impact, 2019). Typical methods involve baseline and endline household surveys, where the household residents are interviewed and asked a hundred or more questions about asset ownership, mother and child health, diet, health system access, and other topics of interest. The costs of these surveys vary depending on design, methods, sample size, survey length, and local context (Data for Impact, 2019), but in the authors' experience tens of thousands of dollars is typical, and in some cases, much more. Depending on the number and nature of questions, interviews can be over an hour long, placing a burden on the respondents. In addition, the accuracy of the indicator estimates in NGO-led surveys may be insufficient for project design and monitoring purposes, due to relatively small sample sizes and the inherent high variability of the indicators of interest.
Meanwhile, estimates of numerous health and social indicators in many countries already exist in publicly available datasets, such as the Demographic and Health Surveys (DHS), supported by USAID (U.S. Agency for International Development, 2018), and the Multiple Indicator Cluster Surveys (MICS), supported by UNICEF (UNICEF, 2020), and it is worth considering whether these could serve as estimates of baseline conditions. DHS/MICS provide standardized data collected using rigorous methods and large sample sizes, and datasets are available on request for free. They are designed to be representative at the national, regional and provincial level (but rarely at lower levels, such as district and village, where NGOs are working), and probably exclude homeless, institutionalized and nomadic populations (Carr-Hill, 2013). DHS/MICS are collected every three to ten years, so there may be a gap of up to ten years between DHS/MICS data collection and the baseline conditions that the NGO wants characterized. Although some indicators' descriptions have been modified and improved over time, care is taken to ensure that data are directly comparable across countries, regions and years (Hancioglu & Arnold, 2013; UNICEF, 2020; U.S. Agency for International Development, 2018). DHS/MICS surveys are adapted to specific country needs and are conducted by well-trained interviewers who have access to tools and guidelines for quality assurance throughout (UNICEF, 2020; U.S. Agency for International Development, 2018).
Using publicly available data to complement or replace NGOs' primary data collection for project baseline measures and project monitoring would save valuable resources, reducing the burden on data collectors and respondents alike. A few studies have compared estimates between DHS/MICS and NGO surveys. One found that they provided very different estimates of electricity and water access in Kenya, Tanzania, and Uganda (Carr-Hill, 2017), and a second found that DHS and an NGO-led survey provided similar estimates of several maternal and child health indicators in Rwanda (Langston et al., 2015). Other studies found that estimates of the market share of faith-based health care providers by DHS and NGO surveys in sub-Saharan Africa were within 5 to 50% of each other (Wodon et al., 2012), and the confidence intervals for the difference between Lot Quality Assurance Sampling (LQAS) and DHS district-level estimates were within +/-10% for 15 of 37 health indicators (Anoke et al., 2015). Therefore, no consensus exists on the potential for DHS/MICS to substitute for NGO surveys.
We hypothesized that publicly available data can provide estimates of baseline conditions similar to those reported in NGO baseline reports when matched as closely as possible for location, year, and season of data collection. We tested this hypothesis by comparing indicator estimates from NGO reports with estimates calculated using DHS/MICS.

Data from NGO baseline reports
We collected and retained a sample of 46 NGO baseline reports through a combination of internet search and personal contacts with Canadian and Vietnamese NGOs using the following selection criteria: i) household survey (n>100) which used valid methods and representative sampling to generate point estimates of maternal, newborn and child health indicators; ii) conducted between 2005 and 2019; iii) in a low- or middle-income country.
The baseline reports from NGOs working on maternal, newborn and child health covered 23 countries spanning South Asia (Bangladesh, India, Pakistan), Africa (Burkina Faso, Ethiopia, Ghana, Kenya, Liberia, Malawi, Mali, Mozambique, Nigeria, Senegal, South Sudan, Tanzania, Zambia), South/Central America (Bolivia, Honduras), the Caribbean (Haiti), and SE Asia (Laos, Myanmar, Philippines, Vietnam) (Table 1) (Berti, 2021). From the reports, we extracted: country name, NGO name, dates of data collection, population of study, inclusion/exclusion criteria, indicator name and definition, sample size (total and n for each indicator), and the indicator estimate (percentage and standard deviation (SD) if available).
We also retained the location of data collection (e.g. country, region, province, district, and/or village) and geographical level. These geographical levels of data aggregation were defined as: (1) the smallest geographical subdivision in a country (village, town, locality, traditional authority); (2) district or district council (larger than a village but smaller than the third level); (3) province, state, department, county or district (if it refers to a division equivalent to province or state); (4) region (combining several units of level 3); (5) country level.

Data from DHS and MICS surveys
We matched 25 DHS and 3 MICS surveys (from Vietnam, Laos, and South Sudan) with 46 NGO baseline reports (Table 1). We used the most recent DHS/MICS survey carried out prior to the NGO baseline survey, with some surveys matching more than one NGO survey.
Indicators from DHS/MICS were calculated following the methods recommended by DHS/MICS accounting for weighting and sample selection (Croft et al., 2018). Wherever possible, we used the methods employed by the NGO to create the matching DHS/MICS indicator. For instance, if the NGO baseline survey included women of reproductive age and their children aged 0-24 months living in the district of Homoine in Mozambique, we extracted the same sample from the DHS/MICS. In the absence of representative data from the same geographical level, we used DHS/MICS data from the next level up in the geopolitical hierarchy to match the lower level from the NGO. For instance, if data from the district of Homoine were not available in the DHS, we used data from the province of Inhambane (one level up).
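The fall-back rule described above (use the same geographical level if available, otherwise move up the hierarchy) can be sketched as follows. This is a minimal illustration, not the authors' code: the lookup table `dhs_estimates`, the location chain, and the function name `match_lowest_available` are hypothetical, and the numeric values are invented placeholders rather than survey results.

```python
from typing import Optional

# Hypothetical lookup of DHS/MICS estimates keyed by (geographical level, area name).
# Levels follow the coding in the text: 1 = village ... 5 = country.
# The values are invented placeholders, not survey results.
dhs_estimates = {
    (3, "Inhambane"): 61.0,
    (5, "Mozambique"): 55.0,
}

# Location chain for one NGO survey, from its own level up to the country.
ngo_location_chain = [(2, "Homoine"), (3, "Inhambane"), (5, "Mozambique")]


def match_lowest_available(chain, estimates) -> Optional[tuple]:
    """Return (level, area, estimate) for the lowest level present in `estimates`."""
    for level, area in sorted(chain):
        if (level, area) in estimates:
            return level, area, estimates[(level, area)]
    return None  # no DHS/MICS estimate at any level in the chain


matched = match_lowest_available(ngo_location_chain, dhs_estimates)
ngo_level = ngo_location_chain[0][0]
level_difference = matched[0] - ngo_level  # DHS/MICS level minus NGO level
print(matched, level_difference)
```

In this sketch no district-level estimate exists for Homoine, so the match falls back to the province of Inhambane, one level up.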

Indicators retained for analysis
We matched similar indicators from NGO baseline reports with DHS/MICS wherever available and excluded those that had no match in the DHS/MICS datasets. Table 2 provides an example of how the data were matched for the indicator "Woman received at least three antenatal care visits (ANC) during last pregnancy".
In total there were 129 indicators (Table 3) from eight main groups including child anthropometry, child diet, child health, household characteristics, household wealth, maternal characteristics, maternal health, and WASH. We excluded estimates based on fewer than ten observations (n=64), in either the DHS/MICS or NGO data, retaining a total of 1,996 pairs of NGO-DHS/MICS indicators for analyses.
After collating the data, we grouped similar indicators into 37 subgroups (Table 3) on the basis of whether they had similar definitions/concepts (e.g. stunting prevalence in different age groups). We refined the grouping by using scatterplots of the difference of estimates by year difference and geographical level difference to check if any indicators differed widely from others in the grouping. After assessing the indicators graphically, we separated "Diarrhea in the last two weeks: 0-5m" from the same indicator for other age groups since the differences of estimates were closer to zero for this age group than the others. We also separated "Household has a car" from the subgroup "Household has agricultural land/bike/ phone" since car ownership was much lower than ownership of other assets.

NGO versus DHS/MICS
We subtracted NGO from DHS/MICS estimates to calculate difference and absolute difference between estimates.
To compare data from NGO and DHS/MICS we used: same or different season of data collection; number of years difference between data collection (DHS/MICS year minus NGO year); and number of geographical levels difference (DHS/MICS level minus NGO level). If data collection spanned two years, for instance data collection started in 2013 and was completed in 2014, the year of data collection was coded as "2013.5". Geographical level difference was calculated by subtracting the NGO level from the DHS/MICS level. For example, we subtracted district level data available from the Mozambique NGO survey (level=2) from province level data collected in the DHS (level=3), making the geographical level difference one. We grouped geographical level differences as: no difference; one level difference; 2-3 levels difference.
We plotted how difference and absolute difference between DHS/MICS and NGO estimates varied with the indicator and indicator grouping. We used Analysis of Variance (ANOVA) to partition the variance of difference or absolute difference between estimates (DHS/MICS estimate minus NGO estimate) by indicator, geographical level difference (as 0, 1, 2+), year difference (continuous), and season (same season, different season, season unknown).
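The comparison variables defined above can be sketched in a few lines. The function names are hypothetical, the example estimates are invented, and two details are assumptions not stated in the text: the year is coded as the midpoint of the start and end years, and "within" a threshold is treated as inclusive.

```python
def code_year(start_year: int, end_year: int) -> float:
    """Code the collection year as a midpoint, so 2013-2014 becomes 2013.5 (assumed rule)."""
    return (start_year + end_year) / 2


def level_group(dhs_level: int, ngo_level: int) -> str:
    """Group the level difference (DHS/MICS level minus NGO level)."""
    diff = dhs_level - ngo_level
    if diff < 0:
        raise ValueError("this sketch assumes the DHS/MICS level is not below the NGO level")
    if diff == 0:
        return "no difference"
    if diff == 1:
        return "one level difference"
    return "2-3 levels difference"


def compare(dhs_estimate: float, ngo_estimate: float) -> tuple:
    """Return (difference, absolute difference), with difference = DHS/MICS minus NGO."""
    difference = dhs_estimate - ngo_estimate
    return difference, abs(difference)


def share_within(abs_differences, threshold: float) -> float:
    """Share of pairs whose absolute difference is at most `threshold` points."""
    return sum(d <= threshold for d in abs_differences) / len(abs_differences)


# Invented example pair: DHS 2011 at province level, NGO 2013-2014 at district level.
year_difference = code_year(2011, 2011) - code_year(2013, 2014)
print(year_difference, level_group(3, 2), compare(62.0, 55.5))
```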
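To illustrate what "partitioning the variance" means, the sketch below computes the share of the total sum of squares explained by a single grouping factor (eta squared). It is a deliberately simplified, one-factor stand-in: the analysis described above used several factors at once (indicator, geographical level difference, year difference, season), and the software used is not stated in the text. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def share_of_variance_explained(values, groups) -> float:
    """Between-group sum of squares divided by total sum of squares (eta squared)."""
    grand_mean = fmean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    ss_between = sum(
        len(members) * (fmean(members) - grand_mean) ** 2
        for members in by_group.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Toy differences (invented) with their geographical level difference group.
differences = [2.0, 4.0, 6.0, 8.0]
level_groups = ["0", "0", "1", "1"]
print(share_of_variance_explained(differences, level_groups))
```

A value near zero, as reported in this study for year, geographical level, and season, means that knowing the group tells you almost nothing about the size of the difference.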

DHS versus DHS
In order to better understand the contribution of difference in methods employed in the different sources of survey data (DHS/MICS and NGO) to the resulting difference in estimates, we repeated the analyses used to compare DHS/MICS and NGO estimates but this time comparing DHS data from one country, year and geographical level to a different year and/or geographical level from the same country. The assumption is that the DHS methods are similar between years and geographical levels, whereas DHS/MICS and NGOs may use somewhat different methods. There is a level of discordance between DHS/MICS and NGO estimates, and there would also be discordance between two DHS estimates. The difference between DHS/MICS-NGO discordance and DHS-DHS discordance will not be due to difference in years, or geographical levels, but rather due to difference in methods.
For the DHS-DHS comparisons, we compiled DHS data from the seven countries that contributed the most pairs in the DHS/MICS-NGO dataset: Bangladesh, Ethiopia, Kenya, Malawi, Pakistan, Tanzania, and Zambia. Retaining the same indicators as in the DHS/MICS-NGO comparisons, we calculated estimates for different geographical levels, i.e. at the country level, and for each region, province and district available. For this analysis, we included district data to mimic the NGO data, even though these estimates are not always representative at this level in the DHS. We excluded indicators based on a sample size smaller than ten observations (n=26,539).
We matched DHS indicators from different cycles and geographical levels using different combinations mimicking the actual DHS/MICS-NGO scenarios: indicators from the same level but different years (Scenario 1), indicators from the same year but different levels (Scenario 2), and indicators from different years and levels (Scenario 3). To mimic the NGO data, we used data from the most recent cycle and the lower geographical levels, whereas to represent the comparative DHS data we used older DHS cycle and higher geographical level data. Using DHS data only, we were not able to simulate a scenario where DHS/MICS and NGO data were from the same year and geographical level. Table 4 provides an example of how we compared the estimates for an ANC indicator in Zambia using 31 pairs from DHS in the three scenarios for this one country. Repeating across all indicators and all countries yielded 109,251 pairs of DHS-DHS indicators.
We calculated the difference and absolute difference between these pairs of estimates, mimicking the scenarios from the DHS/MICS-NGO data. Table 5 summarises the DHS cycles included as well as the geographical level comparison for each scenario in each of the seven countries.
Finally, as with DHS/MICS vs NGO estimates, we used ANOVA to partition the variance of difference or absolute difference between DHS estimates by indicator, geographical level difference, and year difference. We did not include season in this analysis since most DHS data are collected during the same season within a country.

Simulations
We simulated a situation where the only source of imprecision of the indicator's measures would be from sampling error, in order to separate this known and estimable source of error from other sources of error that lead to differences in indicator estimates. The simulation samples from a "true" prevalence (p) of 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 99%. We assumed an n of 500, which was a typical sample size of both DHS and NGO samples in our data set. We then generated a "Baseline Estimate 1" (to mimic the DHS/MICS estimates) by drawing randomly from a binomial distribution with mean np and variance np(1-p). A "Baseline Estimate 2" (to mimic the NGO estimate) was generated in the same way, and the difference between the first and second estimate was calculated. We ran 1,000 iterations to estimate the distribution of the differences.
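The simulation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name, the seed, and the use of summed Bernoulli draws to obtain a binomial count are choices made for this example.

```python
import random
from statistics import median


def simulate_differences(p: float, n: int = 500, iterations: int = 1000, seed: int = 0):
    """Differences (percentage points) between two estimates that differ only by sampling error."""
    rng = random.Random(seed)

    def estimate() -> float:
        # A Binomial(n, p) count obtained as the sum of n Bernoulli(p) draws.
        successes = sum(rng.random() < p for _ in range(n))
        return 100 * successes / n

    # The first draw mimics the DHS/MICS estimate, the second the NGO estimate.
    return [estimate() - estimate() for _ in range(iterations)]


if __name__ == "__main__":
    prevalences = [0.01, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.99]
    for p in prevalences:
        abs_diffs = [abs(d) for d in simulate_differences(p)]
        print(f"p={p:.2f} median |diff|={median(abs_diffs):.1f} max |diff|={max(abs_diffs):.1f}")
```

Under these assumptions the standard deviation of the difference is sqrt(2p(1-p)/n), which at p = 0.5 and n = 500 is about 3.2 percentage points, so sampling error alone produces its widest spread near 50% prevalence and its narrowest at the extremes.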
In order to investigate how absolute differences vary by the nature of the point prevalence estimates we used box plots to compare simulated, DHS-DHS and DHS/MICS-NGO absolute differences.

Results
The NGO reports often presented over 100 indicators. On average, 18 of their indicators were also available in the DHS/MICS datasets. The estimate sample size for the NGO surveys ranged from 12 to 16,530 and from 10 to 98,446 for the DHS/MICS. Table 6 presents, by indicator subgroup, mean DHS/MICS and NGO percentage prevalence estimates, mean difference between pairs (DHS/MICS minus NGO) and percentage of differences falling within 5 and 20 percentage points. Some subgroups have mean difference close to zero, but almost all have at least some pairs that are widely different (not within 20%). Fifteen subgroups had positive (DHS<NGO) and 21 had negative (DHS>NGO) mean differences, but we identified no meaningful pattern in which indicators were negative and which were positive, and all the differences (except for consumption of vitamin A-rich foods) were within 1 standard deviation of 0.
Figure 2 shows the boxplot distribution of the mean difference between estimates by subgroup. The only subgroups that had all the pairs of indicators within ±20% were "Consumption of vitamin A-rich foods", "Bottle fed yesterday", "Diarrhea in the last two weeks: 0-5m", "Diarrhea in the past two weeks: given more to eat", and "Household has a car". Other indicators that had most of their pairs within ±20% were "Household treats drinking water" and "Ever married". All the indicators with the smallest differences between estimates had very low or very high prevalence (Table 6), except for "Consumption of vitamin A-rich foods" (that was based on only four pairs of estimates).
Table 7 summarizes the absolute differences between DHS/MICS and NGO, and between DHS and DHS. They are summarized according to the similarity of data collection timing (year and season), geographical level, and sample size. Using the absolute difference enabled us to see the size of the difference without taking the direction into account.
The absolute difference between DHS/MICS and NGO estimates increases as year difference increases, as geographical level difference increases, and as sample sizes decrease. The differences between DHS and DHS show similar patterns in terms of broad geographical level, sample size, and ≥3.5 years versus 0 to 3 years' time differences.
Figure 3 shows boxplots of the absolute difference between estimates by the indicator reference value (the DHS estimate or the estimate simulating DHS). The distribution of absolute differences is similar between DHS/MICS-NGO and DHS-DHS, with DHS/MICS-NGO showing only a slightly larger spread. For all three types of comparisons, the distribution of the absolute difference between estimates is narrower in the extremes and larger when the reference value is between 35% and 65%. Since the simulated sampling error differences are small (range <10%), only a small proportion of the differences can be attributed to sampling error.

Discussion
Our study showed that many indicators presented large differences between NGO and DHS/MICS estimates. Almost all indicators had at least some pairs that were widely different. Only about 33% of the pairs of indicators were within 5%, and about 80% of the pairs of indicators were within 20%. Agreement between indicators was higher when comparing indicators that had low or high prevalence (e.g. <15% or >85%), which is consistent with sampling theory, but throughout the prevalence range, the distribution of differences in the DHS/MICS-NGO and DHS-DHS comparisons is larger than that found from sampling error alone (reflected in the simulation distribution). An NGO could obtain an accurate estimate using DHS/MICS data for indicators with expected values close to 0% or 100%.
We had hoped that if DHS/MICS and NGO estimates were similar, then NGOs could forego baseline data collection and substitute DHS/MICS estimates, or estimates from some other publicly available dataset, saving NGO time and money, and reducing respondent burden. While we cannot give a blanket recommendation that DHS and MICS could always replace NGO baseline surveys, there are at least some situations where DHS/MICS could be used to the NGO's advantage: when the estimate is expected to be less than 15% or above 85%; when the indicator of interest is one of the few with consistent similarity between DHS/MICS and NGO estimates; and when the NGO has tolerance for estimates of low or unknown accuracy.
We had hypothesized that publicly available data can provide estimates of baseline conditions similar to those reported in NGO baseline reports when matched as closely as possible for location, year, and season of data collection. From the descriptive analyses, we found that as year difference increased, the mean difference between estimates slightly increased, and estimates derived from lower geographical levels (such as village or district from NGO and province for DHS/MICS) contributed to a higher mean absolute difference between estimates. In general, larger sample sizes were obtained at higher geographical levels and the larger the sample size (with its smaller sampling error) from DHS/MICS or NGO, the smaller the mean absolute difference between estimates. This meant that the advantage of geographical proximity is offset by the larger sampling error associated with small sample sizes. Whether the seasons of data collection were matched or different did not make a measurable difference to the similarity between estimates.
However, the partition of variance analyses showed that DHS/MICS and NGO estimates differed, for the most part, in unpredictable ways, and geographical levels, years difference and seasons explained only a small part of the variation.
We hypothesize that large differences between estimates from NGO baseline reports and DHS/MICS data are due to three main reasons: (i) It is possible that NGOs' estimates are collected from different populations with different underlying true values. NGOs often try to target lower wealth villages, and so baseline estimates may be worse off than the nationally representative DHS/MICS estimates. Note, however, that differences in household wealth indicators were small (e.g. "Household has electricity" 0.8% difference; "Household has a car" 0.2% difference). Additionally, the differences between DHS/MICS and NGO estimates might reflect actual changes over the years or across different geographical locations. Results from the analyses comparing data from the same source (DHS) but from different years and geographical levels also resulted in large differences between estimates.
(ii) Different methods employed while sampling, collecting, processing and analyzing data might also have contributed to the differences between DHS/MICS and NGO estimates.
(iii) Several indicators related to maternal and child health included in this study have not been validated and some have been shown to have low validity, such as maternal report of skilled birth attendance (Blanc et al., 2016).
Inappropriate conflation of answer options and inconsistent coding and analysis of DHS surveys has also been documented (Footman et al., 2015). High measurement error can result in bias in unpredictable direction and dimension, resulting in large differences between estimates. Whatever the cause of the large differences between estimates was, it was not possible to know which of the data sources (DHS/MICS or NGO) provided the most accurate estimation of the true prevalence in the NGOs' target populations. Furthermore, while we have been comparing DHS/MICS and NGO point estimates, these indicators are measured with error. The standard error (SE) for the DHS indicators is greater than 5% in eleven percent of the estimates. An estimate with a standard error of 5% will have a 95% confidence interval of ± 9.8%.
Our analyses document and try to understand the large differences between NGO and DHS/MICS estimates. However, a study comparing DHS data to a small population-based survey from Rwanda showed that nine out of fifteen indicators related to maternal, newborn and child health were within a 10% difference (Langston et al., 2015). Similarly, in case studies from Nepal and Vietnam (HealthBridge, 2020) there were many indicators where the DHS/MICS and NGO estimates were similar. In Nepal 70% of indicators were within 20% of one another. Estimates for ANC, iron-folic acid uptake, vitamin A supplementation at 18-23 months and mobile ownership were similar while breastfeeding, child dietary diversity and tetanus vaccination in pregnancy differed widely. In contrast, in Vietnam NGO estimates for exclusive/continued breastfeeding and dietary diversity at 6-8 months were close to DHS, while others differed by >30%.
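The arithmetic behind the ± 9.8% figure is the usual normal-approximation half-width, 1.96 × SE. The short sketch below reproduces it and, for comparison, computes the standard error a simple random sample would give. The function names are hypothetical, and the simple-random-sample formula is an assumption made for illustration: DHS uses a clustered design, so its actual standard errors are generally larger than this formula suggests.

```python
from math import sqrt

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def ci_half_width(standard_error: float) -> float:
    """Half-width of a 95% confidence interval, in the same units as the SE."""
    return Z_95 * standard_error


def srs_standard_error(p: float, n: int) -> float:
    """SE in percentage points of a proportion p from a simple random sample of size n."""
    return 100 * sqrt(p * (1 - p) / n)


print(f"{ci_half_width(5.0):.1f}")            # 9.8
print(f"{srs_standard_error(0.5, 500):.2f}")  # 2.24
```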
Using secondary data may be useful, especially in situations of budget or mobility restraint, such as during the COVID-19 pandemic with limited data collection opportunities. However, use of DHS surveys may risk underestimating the scale of problems for poor and marginalised groups such as nomads or slum dwellers (Carr-Hill, 2017). When using DHS/MICS data, the user must keep in mind the potential differences between DHS/MICS and NGO estimates.
This study had some limitations. Most NGO data we used came from unpublished, non-peer-reviewed reports created for internal use only. Indicators extracted from NGO reports were not necessarily consistent across all reports and often SDs or SEs were missing. Although we matched the methods employed by the NGO as closely as possible in order to obtain the same indicators from DHS/MICS, some reports provided limited information concerning methods of data collection and analysis. Dates and season of data collection were impossible to assess for eight reports. Assigning the geographical level of data from the NGO report was also challenging for some settings due to lack of contextual information. However, we were able to communicate with several NGOs in order to obtain supplementary information about the reports' methods.

Conclusion
Our hypothesis was that publicly available data can provide estimates of baseline conditions similar to those reported in NGO baseline reports when matched as closely as possible for location, year, and season of data collection. Our answer to this, in brief, is that publicly available data can be used, if the NGO is tolerant of imprecise estimates.
While an NGO may use the evidence presented here to justify forgoing their own baseline survey, they should keep in mind that DHS and MICS provide estimates for only some of the indicators of interest to the NGO. On average, we estimated 18 of the NGO's indicators using DHS/MICS, but NGOs were often reporting 100+ estimates. Furthermore, collecting data in the NGO working area can provide valuable insights for project design and implementation.

Data availability
This study used data owned by the DHS, the MICS and the NGOs that shared their baseline reports. The DHS data can be downloaded at: https://www.dhsprogram.com, and the MICS data can be obtained at: https://mics.unicef.org. The DHS and MICS require registration, and access to the data is only granted for legitimate research purposes.
The NGO reports were either available online on each NGO website or obtained by personal contact by email. The full list of NGO reports used in this study including report title, year of publication, organization name and how to access each report can be found at: Harvard Dataverse: Details on reports used in the Maxdata project. https://doi.org/10.7910/DVN/32FUQV (Berti, 2021).
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Neff Walker
Senior Scientist, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
This paper addresses a very interesting question but I must admit I remain unconvinced that the analyses presented here actually answer the question raised. From my perspective and that of the authors, the answer to this question is primarily related to how representative the data from the big surveys are of the program population. So for me, at one level the answer to this question is rather straightforward. If an NGO is working at provincial level and there has been a recent (in the last year) DHS or MICS survey that was sampled to be representative at the provincial level, then I would always use the DHS and MICS data as the baseline data, forgoing the data collection by the NGO. Not only does this save time and energy, but my bias is that I believe the methods (e.g., sample size, mapping and household listing for sampling) used by major surveys like DHS and MICS are almost better than what an NGO would use. Of course, the answer to this question becomes less clear as the DHS or MICS survey data become less representative of the ideal program baseline, either in time or population. This is in large part what the analyses presented in this paper were all about.
On the time issue, would I still use the DHS data if it was two years old? In large part the answer to that question depends on how rapidly the indicators you are measuring change. If, for example, one was interested in measuring baseline values such as total fertility rate, household composition, or wealth, which change very slowly, a two- or even five-year period between the survey and the start of the NGO program would not be a big concern. However, for indicators that may change quickly, say coverage of interventions like bednet ownership or vitamin A supplementation, which can quickly be scaled up with campaigns, data from a survey that is two or more years old would probably not provide a very accurate measure. While the analyses presented here did some work to look at this issue, I am not really sure that it was captured in their analyses.
On population representation, using data from a very recent DHS survey that is sampled to be representative at a provincial level as baseline data for a program that covers 80% of the province may be okay. The data are not perfectly representative, but unless there is extreme heterogeneity in the indicators of interest, using the indicator values from the province probably provides a reasonable estimate of baseline coverage and therefore could replace a separate NGO-run survey. As with time, the key is how much variability there is within the population that was sampled for the DHS or MICS survey. This is hard to know, but clearly urban-rural differences, ethnic mixes, and topology all link to this. I would think this would be a major issue that, again, I am not sure is captured in the current analyses.
I think these are exactly the issues that the authors were seeking to address in their analyses, so why do I feel like these results do not really help us much in finding the answer? In their analyses they matched DHS/MICS data to data from NGOs and then looked at how well these points matched when they varied in time or population. One major issue for me revolves around the NGO survey data. The analytical approach basically assumes that NGO surveys produced the right answer (as they are for the correct population and for the right time), ignoring that methodological or procedural weaknesses in the NGO surveys (e.g., lack of household listing, mapping during sampling, small sample size, poor training and supervision of interviewers) may make their results far from a gold standard comparison. I think I would be happier with analyses that restricted the comparisons to NGO surveys where a review of the methods and procedures and the sample size of the survey make me more confident about the quality of the NGO estimates.
The second issue I have is the inclusion of data from DHS and MICS surveys that are broken down into smaller geographic regions than were part of the sampling frame. Even if there are sufficient households at the district level, I am left wondering why we would expect the data to be very representative if that was not part of the sampling frame. I suppose that was part of the authors' point in the analyses, but then we are not surprised to find very weak correspondence between estimates of variables at the smaller geographic areas. I am a bit surprised the authors did not build a bit more on previous work on small area estimation that has tried to address many of the same issues, focusing not just on differences due to population differences but also on techniques to adjust for these.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Luay Basil
Canadian Red Cross Society, Ottawa, ON, Canada

The paper addresses an important issue tackled frequently by NGOs that struggle to decide whether to use existing data or collect their own at baseline. It follows a rigorous methodology and uses a good number of studies to draw conclusions from. Thank you for the excellent work. Below are some reflections, questions and suggestions.

Introduction:
Statement in p3 "In addition, the accuracy of the indicator estimates in NGO-led surveys may be insufficient for project design and monitoring purposes, due to relatively small sample sizes and the inherent high variability of the indicators of interest." Since the statement is in the introduction section, and since the paper shows later that sampling errors are not a factor for the studies that met the inclusion criteria, how do we reconcile the two?
○ An NGO's sample size might be adequate for indicators such as antenatal care, postnatal care, and delivery by skilled birth attendant, but not for illnesses among children under 5 years when we ask about children with symptoms in the past 2 weeks, or when measuring the prevalence of early child marriage among adolescents, since the age group is between 14 and 17, especially if the focus is on girls. Have such indicators been part of the simulations to see the potential errors by NGOs in sampling? Can the paper mention whether sample size by NGOs is a factor in such cases? Are such indicators the ones with the small sampling error in Figure 3? If so, would it be better for NGOs to use DHS/MICS data even when the surveys are a few years old or for a higher level of geography compared to their areas?
○ In addition to the challenges in sample size, sometimes the questionnaires in NGOs' baseline surveys are not technically sound to measure a standard indicator. Has this been observed in the cases reviewed? If so, the paper could refer to that and suggest using the DHS/MICS questionnaire if the NGO is going to collect its own baseline data, especially since these questionnaires are adjusted to local context.
○ Table 4: Scenario 1 shows improvement for a geography over time, but there is not much difference within the provinces or geographies in the same year. Scenario 2 shows there is not much difference between the 3rd level and the national level in the same survey. Scenario 3 shows estimates of a later study that are not much different from the earlier one (slight improvement) because the earlier performance was quite high at 88.5%. Question: The table illustrates an important point; to what extent is it representative of all examined studies? It would be good to mention that. If it is not representative, could the paper cite another example where there are significant differences? Or at least mention that there is one, if this is the case. See related comment on Table 4 later.
○ Table 6: Is there a need to explain in the methodology the rationale behind selecting 5% and 20% as the thresholds for comparison of differences? What would the picture be if the thresholds were 10% and 20%?
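The threshold question on Table 6 could be explored with a simple sensitivity check. The following is a minimal sketch, assuming the thresholds refer to absolute differences in percentage points; the function name `share_within` and the example values are hypothetical and are not taken from the paper's data.

```python
from typing import Iterable


def share_within(abs_diffs: Iterable[float], threshold: float) -> float:
    """Return the fraction of absolute differences at or below `threshold`.

    Assumption: differences are expressed in percentage points.
    """
    diffs = list(abs_diffs)
    if not diffs:
        raise ValueError("no differences supplied")
    return sum(d <= threshold for d in diffs) / len(diffs)


# Hypothetical absolute differences (percentage points), NOT from the paper.
example_abs_diffs = [1.0, 3.5, 4.9, 7.2, 12.0, 18.5, 22.0, 40.0]

# Compare how the picture changes across candidate thresholds.
for t in (5, 10, 20):
    print(f"within {t} points: {share_within(example_abs_diffs, t):.0%}")
```

Running the same function over the matched DHS/MICS and NGO pairs with 5, 10 and 20 as thresholds would show how sensitive the Table 6 summary is to the chosen cut-offs.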

Discussion:
P19 statement "In general, larger sample sizes were obtained at higher geographical levels and the larger the sample size (with their smaller sampling error) from DHS/MICS or NGO, the smaller the mean absolute difference between estimates. This meant that the advantage of geographical proximity is offset by the larger sampling error associated with small sample sizes." One would expect that the NGO would calculate the adequate sample size required using a sample size calculator (like the one from the RADAR project). If the resources available would result in a sample size that is significantly smaller than the adequate one, should there be a recommendation that the NGO use DHS/MICS data?
P19 statement "It is possible that NGOs' estimates are collected from different populations with different underlying true values. NGOs often try to target lower wealth villages, and so baseline estimates may be worse off than the nationally representative DHS/MICS estimates. Note, however, that differences in household wealth indicators were small (e.g. "Household has electricity" 0.8% difference; "Household has a car" 0.2% difference)." Since the DHS presents some findings by wealth quintile, one would expect that findings from the lowest quintile could represent the areas NGOs work in and be close to those from NGOs' data. Was a comparison made between indicator values from NGOs and those from the lowest wealth quintile in DHS? If so, could you add a statement to reflect that?
P19 statement "Additionally, the differences between DHS/MICS and NGO estimates might reflect actual changes over the years or across different geographical locations. Results from the analyses comparing data from the same source (DHS) but from different years and geographical levels also resulted in large differences between estimates." Table 4 for Zambia does not show large differences. Are there other studies that show them?
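As a rough illustration of the sample size point raised under the first P19 statement, the following is a minimal sketch of the standard sample size formula for estimating a single proportion, n = deff × z² × p(1 − p) / e². It is not necessarily the formula used by the RADAR calculator, and the function name, default values and design effect handling are assumptions made for illustration.

```python
import math


def sample_size_for_proportion(p: float, margin: float,
                               z: float = 1.96, deff: float = 1.0) -> int:
    """Respondents needed to estimate a proportion `p` within +/- `margin`.

    p      : expected prevalence, strictly between 0 and 1
    margin : desired half-width of the confidence interval (e.g. 0.05)
    z      : normal quantile (1.96 corresponds to a 95% interval)
    deff   : assumed design effect for cluster sampling (1.0 = simple random)
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    n = deff * z ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(n)


# Worst case (p = 0.5) versus a low-prevalence indicator (p = 0.1),
# both with a +/- 5 percentage point margin under simple random sampling.
print(sample_size_for_proportion(0.5, 0.05))
print(sample_size_for_proportion(0.1, 0.05))
# Same worst case with an assumed design effect of 2 for cluster sampling.
print(sample_size_for_proportion(0.5, 0.05, deff=2.0))
```

Because p(1 − p) is largest at p = 0.5, indicators with very low or very high prevalence need fewer respondents for the same absolute margin. Note that the count refers to eligible respondents: for indicators measured on a small subgroup (for example, children with symptoms in the past 2 weeks), many more households would have to be visited to reach it.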

Conclusions:
P22 statement "Our hypothesis was that publicly available data can provide estimates of baseline conditions similar to those reported in NGO baseline reports when matched as closely as possible for location, year, and season of data collection. Our answer to this, in brief, is that publicly available data can be used, if the NGO is tolerant of imprecise estimates." The paper also shows that NGOs can use DHS/MICS when the values of the indicators are very high or very low.
P22 statement "While an NGO may use the evidence presented here to justify forgoing their own baseline survey, they should keep in mind that DHS and MICS provide estimates for only some of the indicators of interest to the NGO. On average, we estimated 18 of the NGO's indicators using DHS/MICS, but NGOs were often reporting 100+ estimates. Furthermore, collecting data in the NGO working area can provide valuable insights for project design and implementation." It would be good to expand on this in the discussion section. NGOs need to measure different outcome levels on knowledge, attitudes and practice; the first two guide project design, implementation and target setting. NGOs also need to report on the different outcome levels to their donors. DHS/MICS focus more on practice/utilization of services and less on attitudes and knowledge.

Additional point
In a webinar presenting the paper on March 2, 2021, it was mentioned that if an NGO wants a baseline so that it can compare with the end-line, it can instead conduct a properly randomized and controlled end-line, which can give good findings on the project's impact. Could that be added to the paper?

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.