CELL5M: A geospatial database of agricultural indicators for Africa South of the Sahara

Recent progress in large-scale georeferenced data collection is widening opportunities for combining multi-disciplinary datasets from biophysical to socioeconomic domains, advancing our analytical and modeling capacity. Granular spatial datasets provide critical information necessary for decision makers to identify target areas, assess baseline conditions, prioritize investment options, set goals and targets and monitor impacts. However, key challenges in reconciling data across themes, scales and borders restrict our capacity to produce global and regional maps and time series. This paper provides an overview of the structure and coverage of CELL5M, an open-access database of geospatial indicators at 5 arc-minute grid resolution, and introduces a range of analytical applications and use cases. CELL5M covers a wide set of agriculture-relevant domains for all countries in Africa South of the Sahara and supports our understanding of multi-dimensional spatial variability inherent in farming landscapes throughout the region.

Over 70 percent of the population in Africa South of the Sahara (SSA) live in rural areas, their livelihood and food security often depending on smallholdings and rainfed agriculture (Livingstone et al., 2011). Many are also farming some of the most degraded soils in the world (Cox & Koo, 2014), a challenge exacerbated by over-reliance on low-yielding crop varieties (Mueller et al., 2012) and inadequate market infrastructure (Guo & Cox, 2014). Erratic shifts in weather and climate-related shocks are particularly hard felt in the region (Challinor et al., 2007). Development practitioners recognize that Africa's economic development largely hinges on smallholder investment through improved agricultural yields, nutrition, ecosystem services and marketing opportunities (Dixon et al., 2001). Historically, however, there has been a lack of reliable, granular data to inform and monitor food and agricultural policies at appropriate scales. With the launch of the Sustainable Development Goals (SDGs) (http://unstats.un.org/sdgs), which include zero global poverty and hunger by 2030, more granular global and regional-level data need to reach decision makers to monitor countries' progress toward the goals.
Recent progress in georeferenced data collection and dissemination has widened access to multi-disciplinary datasets and created opportunities to advance data analytics (Azzarri et al., 2016). As data capacity improves, however, the potential of georeferenced socioeconomic datasets has not been fully utilized (Azzarri et al., 2016). A key challenge is reconciling and harmonizing multi-disciplinary indicators that can inform agricultural investments across scales and borders. To this end, HarvestChoice (http://harvestchoice.org), a joint project between the International Food Policy Research Institute (IFPRI) and the University of Minnesota, developed the CELL5M database (http://dx.doi.org/10.7910/DVN/G4TBLF), an open access catalog of georeferenced baseline indicators covering a broad range of agriculture-relevant domains. In this paper, we provide an overview of CELL5M and present a range of tools and applications for spatial targeting and strategic decision-making.

CELL5M Overview
What is CELL5M?
CELL5M is a geospatial database of biophysical and socioeconomic indicators for SSA covering four broad research domains: agriculture, agroecology, demographics and markets (Table 1). All indicators are referenced to a uniform geographical information systems (GIS) grid: a flat table populated by over 300,000 grid cells overlaying SSA at 5 arc-minute spatial resolution. Each grid cell (or pixel) is approximately 10 kilometers × 10 kilometers and holds a stack of georeferenced data layers. CELL5M currently consists of over 750 data layers, providing a unique platform for multi-faceted analysis and fine-grain visualization at the nexus of agriculture and economic development. The database serves as the core of a decision-support system enabling development practitioners and analysts to explore complex relationships between major agroecological challenges (e.g., soil and land degradation) and socioeconomic trends (e.g., poverty, health, and nutrition) (Azzarri et al., 2016). The structure of CELL5M allows for simplified numerical aggregations of gridded data along specific geographic domains, either sub-nationally (e.g., across administrative boundaries, agroecological zones or watersheds) or across country borders for regional analyses (e.g., Omamo et al., 2006).

Systematic assignment of grid cell ID
To refer to a cell's boundary at any given spatial resolution, we created a universal identification system based on a basic unit of spatial analysis: the global grid cell (HarvestChoice, 2016b). In GIS, one typically uses coordinates (latitude and longitude) of the upper-left and lower-right corners of the grid cell's bounding box, or coordinates of the centroid, along with information on the projection system. To simplify identification, we universally label each cell with a sequential integer number, or grid cell ID. The grid cell ID can facilitate raster-based data analyses, aggregations and data sharing. The upper-left corner of the [...]
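To make the idea of a sequential grid cell ID concrete, the following is a minimal sketch of one possible numbering scheme. It assumes a global 5 arc-minute grid anchored at the upper-left corner (90°N, 180°W), numbered row by row starting at 0; the exact convention used by CELL5M is not stated in the text above, so the function names, the origin and the zero-based numbering are illustrative assumptions only.

```python
from typing import Tuple

# Assumptions (not taken from the CELL5M documentation): cells are numbered
# row by row, starting at 0, from the upper-left corner of a global grid
# anchored at (90N, 180W).
CELLS_PER_DEG = 12               # 60 arc-minutes / 5 arc-minutes
N_COLS = 360 * CELLS_PER_DEG     # 4320 columns around the globe
N_ROWS = 180 * CELLS_PER_DEG     # 2160 rows from pole to pole


def cell_id(lat: float, lon: float) -> int:
    """Return the hypothetical sequential ID of the cell containing (lat, lon)."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    # Clamp so points on the southern or eastern edge fall in the last row/column.
    row = min(int((90.0 - lat) * CELLS_PER_DEG), N_ROWS - 1)
    col = min(int((lon + 180.0) * CELLS_PER_DEG), N_COLS - 1)
    return row * N_COLS + col


def cell_centroid(cid: int) -> Tuple[float, float]:
    """Return the (lat, lon) centroid of a cell ID under the same assumptions."""
    row, col = divmod(cid, N_COLS)
    lat = 90.0 - (row + 0.5) / CELLS_PER_DEG
    lon = -180.0 + (col + 0.5) / CELLS_PER_DEG
    return lat, lon
```

With a scheme of this kind, a single integer is enough to recover a cell's position, which is what makes joins between tabular indicators and raster layers straightforward.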
Raw datasets are provided in multiple spatio-temporal resolutions, geographical extents, and formats (e.g., tabular, vector and raster). They undergo harmonization routines that aim to generate standardized, cross-regionally comparable statistics at uniform scale (Figure 1). Raster and vector layers are typically re-projected to World Geodetic System (WGS) 84, a standard coordinate system for the Earth. Raster datasets of finer resolution (e.g., 30 arc-second) are aggregated using weights (e.g., land or population weights) or summarized (e.g., population headcounts) to 5 arc-minute resolution. Conversely, we apply a disaggregation process when the source data are coarser, which is generally the case with socioeconomic datasets that are georeferenced to administrative units. Where applicable, care is taken to ensure that country totals of disaggregated data are consistent with official national statistics. To maximize coverage across SSA, missing data are imputed using coarser statistics and prior information. The result is a stack of harmonized, interoperable datasets based on a standardized grid system. CELL5M complies with open-data standards (Open Knowledge Foundation, 2016).
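The two directions of resampling described above can be sketched as follows. This is an illustrative, pure-Python outline rather than the HarvestChoice routines: `aggregate` collapses blocks of fine cells (a factor of 10 takes 30 arc-second data to 5 arc-minutes) by summing or by weighted averaging, and `disaggregate` spreads an administrative-unit total over its cells in proportion to assumed prior weights so that the cell values still add up to the reported total. Both function names and the proportional allocation rule are assumptions made for the example.

```python
from typing import List, Optional

Grid = List[List[float]]


def aggregate(fine: Grid, factor: int, weights: Optional[Grid] = None) -> Grid:
    """Collapse each factor x factor block of `fine` into one coarse cell.

    Without weights the block is summed (suitable for headcounts); with
    weights the block's weighted mean is returned (suitable for rates).
    """
    n_rows, n_cols = len(fine) // factor, len(fine[0]) // factor
    coarse = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total, weight_sum = 0.0, 0.0
            for i in range(r * factor, (r + 1) * factor):
                for j in range(c * factor, (c + 1) * factor):
                    w = 1.0 if weights is None else weights[i][j]
                    total += w * fine[i][j]
                    weight_sum += w
            if weights is None:
                coarse[r][c] = total
            else:
                # A block with zero total weight has no defined mean.
                coarse[r][c] = total / weight_sum if weight_sum > 0 else float("nan")
    return coarse


def disaggregate(unit_total: float, cell_weights: List[float]) -> List[float]:
    """Split an administrative-unit total across its cells in proportion to
    `cell_weights`, so the cell values add back up to `unit_total`."""
    weight_sum = sum(cell_weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    return [unit_total * w / weight_sum for w in cell_weights]
```

Because `disaggregate` preserves the unit total by construction, summing the cells of all units in a country reproduces the national figure, which mirrors the consistency check mentioned in the text.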

Key data layers
This section provides additional methodological details on example key datasets included in CELL5M.

Spatially-disaggregated crop production statistics
Beyond national-level assessments, spatially-disaggregated crop production statistics are the cornerstone of any analysis that explores the social, economic and environmental consequences of agricultural change and policies. The Spatial Production Allocation Model (SPAM) developed by the International Food Policy Research Institute (IFPRI) generates a highly disaggregated, global distribution of area, production and yield for 42 commodities, accounting for 90 percent of the world's crop production (You et al., 2014). To generate these data layers, geospatial information on crops (including subnational crop production statistics, satellite-derived land cover imagery, maps of irrigated areas, biophysical crop suitability assessments, population densities, cropping intensities and prices) is integrated to generate a set of prior estimates. These priors are then fed into an optimization model that applies cross-entropy principles, together with area and production accounting constraints, to allocate crops into individual pixels of a global grid at 5 arc-minute resolution (You & Wood, 2006; You et al., 2009) (Figure 2). The result for each grid cell is the area, production, value of production, and yield of each crop, split by the shares grown under irrigated, high-input rainfed, low-input rainfed and subsistence rainfed conditions. CELL5M includes the SSA extent of SPAM; global coverage of SPAM data layers is available at http://mapspam.info.

Figure 1. Using a variety of data sources and methods, CELL5M covers four broad research domains: biophysical, agricultural production, socio-economics and infrastructure (1). Using a combination of data resampling and harmonization routines (2), raw datasets are converted to a standard raster grid with a resulting set of uniform indicators across space and time (3). Indicators are distributed across platforms via application program interface and web mapping services (4). These services are freely and openly accessible through end-user tools (e.g., Mappr and Tablr, available at http://harvestchoice.org/) and decision-support systems (5); Africa RISING, FAOSTAT, the World Bank's Living Standards Measurement Study-Integrated Surveys on Agriculture (LSMS-ISA) and the Bill and Melinda Gates Foundation (BMGF) already consume CELL5M into their own analytical platforms.
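The cross-entropy allocation step used by SPAM can be written compactly. The formulation below is a simplified single-crop sketch under stated assumptions, not the full specification given in You & Wood (2006): $s_i$ is the share of a crop's total area $A$ allocated to pixel $i$, $\pi_i$ is the prior share for that pixel, and $\bar{a}_i$ is an assumed upper bound on the cropland available in pixel $i$. These symbols are introduced here for illustration only.

```latex
\begin{aligned}
\min_{s}\quad & \sum_{i} s_i \ln\frac{s_i}{\pi_i} \\
\text{s.t.}\quad & \sum_{i} s_i = 1, \\
& A\, s_i \le \bar{a}_i \quad \forall i, \\
& s_i \ge 0 \quad \forall i .
\end{aligned}
```

When no area constraint binds, the minimizer is simply $s_i = \pi_i$; the constraints are what pull the allocation away from the priors so that pixel totals stay within the available land and add up to the reported statistics.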

Market accessibility
Farm households need access to markets to support agricultural and rural development, particularly in poorer regions. Challenging road conditions and inadequate infrastructure add to travel time and transportation cost, limiting farmers' opportunity to purchase inputs and sell produce from remote crop production areas. The conventional method of measuring the Euclidean distance between two points in space (i.e., farm-gate and market) ignores the terrain, road conditions and infrastructure status, hence does not accurately capture travel time. Estimates of the travel time to markets provide a better proxy for market accessibility since they combine distance with other information including road quality, slope, land cover, and mode of transportation (Guo & Cox, 2014). To estimate market accessibility, we first identify the locations of different market centers and their sizes using population estimates from the Global Rural Urban Mapping Project (CIESIN et al., 2011). Then the travel times from farm-gate to the nearest cities of different population sizes are calculated using a spatial cost-distance algorithm and a combination of global spatial data layers including road network and type, elevation, slope, country boundaries, and land cover. CELL5M includes travel times to markets where populations are 20K (Figure 3), 50K, 100K, 250K, and at least 500K.
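The cost-distance idea can be illustrated with a small multi-source shortest-path search over a friction grid. This is a minimal sketch, not the algorithm or parameterization used to build the CELL5M layers: it assumes each cell carries a single friction value (minutes needed to cross it, already combining road type, slope and land cover), allows only four-neighbor moves, and treats cells with infinite friction as impassable. The function name `travel_time` and these simplifications are assumptions made for the example.

```python
import heapq
import math
from typing import List, Tuple


def travel_time(friction: List[List[float]],
                markets: List[Tuple[int, int]]) -> List[List[float]]:
    """Return, for every cell, the least accumulated travel time (minutes)
    to the nearest market cell, using Dijkstra's algorithm on the grid."""
    n_rows, n_cols = len(friction), len(friction[0])
    best = [[math.inf] * n_cols for _ in range(n_rows)]
    queue: List[Tuple[float, int, int]] = []
    for r, c in markets:
        best[r][c] = 0.0                      # markets are the zero-cost sources
        heapq.heappush(queue, (0.0, r, c))
    while queue:
        t, r, c = heapq.heappop(queue)
        if t > best[r][c]:
            continue                          # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n_rows and 0 <= nc < n_cols:
                # Moving between cell centers crosses half of each cell.
                step = 0.5 * (friction[r][c] + friction[nr][nc])
                if t + step < best[nr][nc]:
                    best[nr][nc] = t + step
                    heapq.heappush(queue, (t + step, nr, nc))
    return best
```

Running the search once per market-size class (for example, with only cities above a given population threshold as sources) would yield one travel-time layer per class, analogous to the set of layers listed above.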

Subnational poverty
Poverty data layers in CELL5M are based on the comparison between household per-capita consumption expenditure and the $1.90 or $3.10 per capita per day poverty lines (Figure 4), expressed in international equivalent purchasing power parity (PPP) dollars, circa 2011 (World Bank, 2014). By basing indicators on nationally- and regionally-representative household survey data, such as the Household Income and Consumption Expenditure Survey (HICE), Integrated Household Survey (IHS), and Living Standards Measurement Study (LSMS), we avoid challenges with methods that combine national accounts and microdata (Chen & Ravallion, 2008; Deaton, 2005; Ravallion, 2003). Using microdata with expansion factors and national PPP adjustments guarantees the validity of national and subnational estimates and, along with [...]

Figure 2. SPAM integrates information on crops (e.g., subnational crop production statistics, land cover satellite data, maps of irrigated areas, biophysical crop suitability assessments, population densities, cropping intensities and prices) and cross-entropy principles to allocate crops into individual pixels of a GIS database. The result for each pixel is the area (shown above), production, value of production and yield of each crop.

Figure 3. We estimate travel time to nearest market centers (cities) of different population sizes using a spatial cost-distance algorithm and a combination of global spatial data layers including road network and type, elevation, slope, country boundaries, water and land cover. Source: Authors (available from CELL5M).

[...] development of interventions linking value-chain actors with producers. A similar domain-based approach was used to analyze the biophysical suitability of agricultural innovations to local contexts (e.g., Cox et al., 2015).

Agriculture and nutrition outcomes
The last decade has witnessed a surge of interest in leveraging agricultural development for better nutrition. However, there is a dearth of rigorous evidence and policy-relevant research on agriculture-nutrition linkages (Pinstrup-Andersen, 2013). As part of the Advancing Research on Nutrition and Agriculture (AReNA) initiative, HarvestChoice overlaid CELL5M indicators onto an extensive series of georeferenced Demographic and Health Surveys (DHS; http://www.dhsprogram.com). Figure 5 shows the location of 28,866 clusters in SSA. Combining such datasets allows for more advanced econometric analyses to explore, for example, the spatial relationships between farming systems, biophysical characteristics, agricultural performance, market access and rural diets. For example, by overlaying agroecological indicators from CELL5M with childhood stunting data from DHS, Azzarri et al. (2016) showed that early childhood wasting is significantly more prevalent in the arid and semi-arid zones of SSA.

Typology of food production systems
Africa has a rich landscape of farming systems and agricultural biodiversity. This diversity presents a challenge for quantitative analyses at regional scale. In Benin et al. (2011), data layers from CELL5M were used to construct a typology of food production systems across SSA. Agricultural productivity zones (APZs) were developed by first intersecting farming systems (Dixon et al., 2001) with other indicators related to natural endowment and socioeconomic development, calculated from data retrieved from CELL5M, and then applying spatial clustering techniques (Guo & Yu, 2015). The resulting APZs (Figure 6) provide a more refined set of spatially-explicit typologies than conventional country-level typologies and allow policy makers to refine agricultural investment strategies.

Tools for visualization and spatial analyses
CELL5M serves as the core database powering a growing number of open-access tools (see the list at http://harvestchoice.org/products/tool) and third-party applications reaching out to multiple audiences, from research analysts to decision makers (Figure 1). Gridded datasets are particularly easy to store in numerical matrices, making them relatively manageable and simple to query. This allows us to serve CELL5M indicators through a RESTful Application Programming Interface (API), which lets computer programs access and query CELL5M data using HTTP requests. CELL5M's centroid coordinates (i.e., latitude and longitude) may be used to graph and summarize indicators using simple visualization tools (e.g., Tableau® or Microsoft Excel). Web-based interactive tools developed by HarvestChoice, for example Mappr (http://harvestchoice.org/mappr) and Tablr (http://harvestchoice.org/tablr), use the API to return tabular, graphical and spatial representations of CELL5M indicators. CELL5M raster layers are also served through a series of map services and may be queried via any GIS software compatible with the OGC Web Map Service Standard (Open Geospatial Consortium, 2016) (e.g., ArcMap, QGIS, Leaflet or GDAL). For GIS users, the gridded data are also available in common raster formats (GeoTIFF and Esri ASCII). The World Bank's micro-level datasets from the Living Standards Measurement Study-Integrated Surveys on Agriculture (LSMS-ISA) program use CELL5M services to retrieve data for each survey site, including agroecological and market accessibility characteristics, to enrich its own data products (communications with the LSMS-ISA team, March 19, 2015).
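As an illustration of how a client might request a map image from a WMS-compatible service, the sketch below assembles a standard WMS 1.3.0 `GetMap` request. The endpoint `https://example.org/wms` and the layer name `example_layer` are placeholders, not actual CELL5M service addresses or layer identifiers; only the request parameters follow the OGC standard.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and layer name; replace them with the values
# published by the map service you want to query.
BASE_URL = "https://example.org/wms"
LAYER = "example_layer"


def getmap_url(south: float, west: float, north: float, east: float,
               width: int = 512, height: int = 512) -> str:
    """Build a WMS 1.3.0 GetMap URL for a bounding box in WGS 84."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": LAYER,
        "STYLES": "",                 # empty value requests the default style
        "CRS": "EPSG:4326",
        # WMS 1.3.0 uses latitude, longitude axis order for EPSG:4326.
        "BBOX": f"{south},{west},{north},{east}",
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

The resulting URL can be opened with any HTTP client; GIS packages such as QGIS build equivalent requests internally once the service address is registered.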

Conclusions
Through open and transparent sharing of high-resolution, harmonized multi-disciplinary datasets, CELL5M supports our understanding of multi-dimensional spatial variability in farming landscapes throughout SSA and helps better target potential interventions. A growing list of use-cases shows that CELL5M's reach has moved well beyond its initial scope and is now used by a larger pool of scientists and decision makers. With the double challenge of climate change mitigation and global food security, we anticipate an ever-growing demand for easy-to-access and easy-to-use, harmonized open datasets for agricultural research and economic development.
It is worth noting that many methodological shortcomings in harmonizing and imputing raw data from various sources still prevail. More research is required to develop reliable statistical methods to interpolate point-and administrative-level data and especially to generate reliable confidence intervals. This will also require more open datasets becoming available. Many institutions are already committed to freely open their agriculture and nutrition datasets, yet a broad community-wide effort is still needed to improve data interoperability and utilization (GODAN, 2015).
With advances in earth monitoring systems and image frequency and resolution, data products such as CELL5M necessitate further, continued investments to ensure that new data sources are incorporated, updated, modeled, and thoroughly validated. In that context, increased engagement with the broader community of data scientists and users is necessary for future success. We anticipate further collaboration with other emerging global data initiatives and partnerships (e.g., Global Partnership for Sustainable Development Data), especially those aimed at monitoring mechanisms towards achieving global development goals.

I have been following the HarvestChoice webpage for a while now and used data they provided a few years ago, so it's good to see the publication of their spatial database.

Data availability
The title and abstract are appropriate for the content of the article. The basic methods for generating CELL5M are explained; however, I would like to ask the authors to describe not only the sources of key layers, but all data sources they have used (maybe in the SI). This description should include at least the input data for each data set, the original resolution or spatial units, the base year(s) and a reference to a full documentation. This is important in my opinion because at the moment you reconcile data sets only spatially, but they might diverge in their methods and assumptions, which prevents particular applications. A simple example is the livestock densities you use from the Gridded Livestock of the World 2007, which are modelled based on (among other things) climate as in WorldClim, whereas the climate variables you present in CELL5M use CRU. What also happens often when developing a global data set is that areas are masked out or typologies of certain areas are created with distinct thresholds, and if this information is not available to a user, they might misinterpret a spatial overlay of two data sets. Your description of how to assign a grid cell ID is a bit over the top in my view and not worth mentioning, or could be shortened to one sentence and added to the introduction. Geographic Information Systems have been around for 40 years now and any GIS works with grid cell IDs; they might just not be in the order of your IDs.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

CELL5M works toward solving a major problem in the area of food policy decision making: working with the glut of disparate data with differing spatial and temporal resolutions to identify accurate insights relevant to policy makers. Koo and colleagues provide an excellent overview of the CELL5M database.
The paper also provides the basic information on the methods used to harmonize the multiple data sets.
It appears that the CELL5M team has thought through most of the pitfalls of decision making at this scale. Our main concern in presenting this to policy makers is in abstracting some of the possible inconsistencies in data scale. Here are a few specific comments related to this concern: Is there some acknowledgement where data has been disaggregated from the national scale, yet presented at a smaller scale? While there is no choice but to use spatially mismatched data in this type of work, it should be made very transparent when the data is not presented at its true scale.
It should also be transparent where indicators are created from source data at multiple scales.
Combining national and sub-national data to create a fine-scale indicator can create a false sense of precision.
Many of the indicators are likely derived from data sets that have similar features, creating an uber metric. How do you avoid double counting or weighting some features more than others?
Was the harmonization mainly spatial or did you also standardize feature names and units?
More general comments: The abstract could be strengthened. It does not explicitly address this paper until after three long sentences. It's then very general. A few points to make the abstract more concrete include: harmonized 750+ data sets for feature names, units, and spatial resolution. main themes are: w,x,y,z.
Provide the type of analysis that is possible and how it can be used (generalize one of the nice examples in the section "Agricultural development domains"). Be explicit that it can integrate social, economic, and biophysical data. How did you choose among the many data sets that provide similar information? For example, there are a few sources of data on crop production, yield gaps, and market access. Since different primary data (and methods) were used to create the various data sets, you will get different results when they are integrated here. For your audience, it's probably better to only have a single data source for each feature, but it would be helpful to be clear on your general criteria for which data are included.
The unique cell ID is a great feature for integrating multiple data sets. This also allows for faster, more stable queries and spatial operations using the web mapper or offline. Although we have not yet used it, the CELL5M data set is a great source for harmonized data for accessing, exploring, and analyzing data for the many uses the authors reference (baseline, setting goals, targeting actions, assessing scenarios, etc.).
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.