Data Note

A raster-based dataset for spatio-temporal analysis of forest fires in the Amazon rainforest from 2001 to 2020

[version 1; peer review: awaiting peer review]
PUBLISHED 12 Sep 2025

Abstract

Forest fires are becoming increasingly common worldwide, posing a threat to the environment, economy, and society. Spatiotemporal analysis of forest fires is important to understand their characteristics and causes and to inform decision-making. This type of analysis requires the availability of a number of factors that contribute to fire occurrence, such as land use, environment, climate, and human activities, at high spatial and temporal resolutions. The South American Amazon rainforest covers a large area, and acquiring a useful dataset for analysis requires extensive effort and computer-intensive processing. This study investigates potential data sources, establishes a methodology, and prepares a dataset of attributes useful for spatiotemporal fire analysis. We provide a raster-based dataset that includes fires, land use, environment, and climate factors at a spatial resolution of 500 m and monthly resolution from 2001 to 2020, which facilitates the analysis of forest fires in the Amazon. Moreover, because data sources and implementation procedures are detailed, this work also encourages similar research in other parts of the world.

Keywords

Amazon; Fires; Burnt Area; Land Cover; Elevation; Precipitation; Humidity; Temperature

Introduction

The alarming increase in the frequency and severity of forest fires has become a significant threat to forested areas worldwide. These wildfires not only threaten human lives and property but also continue to reshape local and global ecosystems. Because of their varying spatiotemporal nature at multiple scales, they differ substantially in frequency, size, intensity, and pattern.1 Similarly, ignition results from a combination of factors such as weather, climate, and land use, together with other causes such as lightning, volcanic eruptions, rockfalls, and combustible material.2 This persistent exposure of forests to wildfires is alarming in itself, and its ecological and socio-economic consequences pose a major challenge to fire management authorities and related stakeholders.3

To ensure better preparedness and deploy improved preventive measures, the spatio-temporal relations between the probable causes of wildfires and the characteristics of those fire incidents must be analyzed. Such analysis will not only assist with mitigation but may also aid in the prediction and forecasting of future events by improving the understanding of the underlying processes that drive fire occurrence.4 Such in-depth spatio-temporal statistical investigations of these complex interactions require the collection of all available associated attributes, combined from heterogeneous sources (with varying extents, spatial scales, temporal resolutions, file formats, etc.) into a unified, processed structure with common specifications. The South American Amazon is one of the largest rainforests in the world5 and hosts thousands of wildfires annually.6 Although numerous studies on the spatio-temporal statistical analysis of forest fires have been conducted in many regions of the world,2,4,7–9 similar studies do not exist for the Amazon region, mainly because of the lack of data readily available for analysis. Existing Amazon-specific studies3,10 tend to focus on sub-regions within the Amazon. For a study area of this size, data collection is a time-intensive task, and the exhaustive pre-processing requires cumbersome setups. Hence, an Amazon-wide database that includes all available attributes related to fires, integrated into a common format, is required.

The aim of this work is to provide the scientific community with a dataset for spatiotemporal forest fire analysis in the Amazon region. The dataset includes 20 years of historical data (2001-2020) at a monthly temporal resolution for the complete extent of the Amazon region at a spatial scale of 500 m. Because the study area of the entire Amazon rainforest is large, the raw data sources must be at a global or regional level (in South America). Otherwise, data for the same attribute would have to be gathered from multiple local-level sources, raising concerns regarding data integrity. Global- and regional-level satellite-based raster products were acquired and further clipped for the South American region to compute three types of data: (a) raw data, (b) pre-processed data and (c) working data. A schematic overview of this study is presented in Figure 1. Raw data refer to the data file(s) extracted from the accessed data packages (i.e., the data layer of the subject attribute, taken out of a data package that also contains various other attribute layers). The extracted attribute layers have varying spatial resolutions, dissimilar spatial extents, different spatial projections, and inconsistent file formats. Raw data are pre-processed to obtain Pre-processed data, in which the attribute layers share a consistent file format and the same projection system. Finally, all attribute layers are processed to obtain Working data, with the data extent confined to the Amazon region and with a fixed spatial resolution, such that each raster cell of an attribute layer aligns exactly with the corresponding raster cell of every other attribute layer.

Figure 1. Schematic overview of the data processing process.

This manuscript presents the complete process of data collection for raster-based attributes of forest fires in the Amazon rainforest, along with a description of the methodological baseline and details of the implementation process. The availability of such a ready-made dataset with a detailed methodology of data collection and computer-intensive preprocessing procedures will be useful to many researchers working in the domain of forest fire analysis. For example, this dataset has been used to map the geographic and temporal distributions of burned areas and risk factors in the Amazon from 2001 to 2020 using an ensemble approach that harnesses a range of machine learning algorithms.11 Moreover, this dataset encourages the creation of similar datasets for different study regions, spatial resolutions, and research domains.12

Methods

The Amazon rainforest has an area of over 5.2 million square kilometers, covers approximately one-third of South America, and extends into eight countries.5 Within this region, data management authorities in each country generally focus on their own territories. To create a database for the entire extent of the Amazon rainforest and to ensure that all areas of potential importance are included, we defined the study area for this work as the entire Amazon basin, as shown in Figure 2. The extent of the study area is [-79.43629, -18.00816 : -44.49108, 8.66346], given as minimum longitude, minimum latitude : maximum longitude, maximum latitude, in the coordinate reference system EPSG:4326 - World Geodetic System (WGS) 84 - Geographic. For spatiotemporal modeling, the chosen data period must span a considerable temporal range and have data available throughout. A review of the literature related to spatiotemporal modeling of forest fires, as summarized in Table 1, indicates that a period of 5-30 years with monthly or yearly frequency is used for the temporal characterization of forest fires. Considering what is available for the whole Amazon rainforest, we chose a data period of 20 years from 2001 to 2020, with a monthly frequency as the temporal resolution. The spatial resolution of the final spatial grid was set to 500 m. This choice is based not only on the available data for the Amazon rainforest but also on the computational complexity involved in a study area of approximately 5 million square kilometers.

Figure 2. Study area of Amazon rainforest.

Amazon boundary obtained from reference 20.

Table 1. Summary of study characteristics from previous works related to forest fire analysis.

Reference | Study region | Study area | Data period | Temporal resolution
A (ref. 4) | Southern France | 40,000 sq.km | 1995-2018 | Monthly
B (ref. 21) | Autazes, Brazil | 7,632 sq.km | 1985-2015 | Monthly
C (ref. 22) | South Korea | 99,720 sq.km | 1980-2000 | Annual
D (refs. 2, 7, 23) | Catalonia, Spain | 30,000 sq.km | 2004-2008 | Multi-Year
E (ref. 8) | Castellon, Spain | 6,632 sq.km | 2001-2006 | Multi-Year
F (ref. 24) | Islamabad, Pakistan | 158 sq.km | 2005-2018 | Multi-Month
G (ref. 25) | California and Nevada, USA | 120,000 sq.km | 1984-2006 | Multi-Month

In addition to the study design involving spatial resolution, temporal frequency, and spatial data extent, another equally important aspect is the selection of covariates. These variables can be broadly categorized as attributes related to land use, climate, the environment, topography, and human activities. Land use and land cover (LULC) variables are highly related to forest fires, as the type of land surface not only determines fire ignition but also its propagation. Climatic variables, such as humidity, precipitation, wind speed, and temperature, also influence the occurrence of forest fires. Topographic variables such as elevation, slope, and aspect are also of core importance as they regulate how quickly a fire will move up or down the hills. Finally, human activities also play a critical role in the initiation of forest fires. Hence, variables such as population density, buildings, and the urban-forest interface are of high significance. Table 2 summarizes the list of potential forest fire analysis attributes discussed in the literature.

Table 2. Summary of study attributes from previous works related to forest fire analysis.

Reference | Description of attributes
F (refs. 3, 8, 9, 26, 27) | Land Use Effects/Vegetation Type/Deforestation/Forest Type/Land Cover
G (refs. 4, 9, 26) | Population Density/Housing Density/Buildings
H (refs. 3, 4, 26, 27) | Elevation, Slope and Aspect
I (ref. 9) | Humidity
J (ref. 9) | Wind Speed
K (refs. 4, 9, 26, 27) | Temperature
L (refs. 4, 9) | Precipitation
M (ref. 8) | Isothermality
N (ref. 4) | Protected Zones
O (refs. 3, 8, 9, 26) | Road Density, Distance to Road
P (ref. 3) | Maximum Cumulative Water Deficit
Q (refs. 3, 8) | Soil Type/Soil Texture/Soil Permeability

Data collection

Of the attributes identified in the literature as potentially related to forest-fire analysis (Table 2), not all are available for the entire Amazon rainforest, let alone for the study period 2001-2020. Specifically, variables such as protected zones, isothermality, and maximum cumulative water deficit were only available for certain regions and for particular time periods. Similarly, elevation-related attributes were only available for certain years within the period 2001-2020. In this study, the attributes that were available for the complete Amazon region and for the selected time period of 2001-2020 were identified and acquired, as detailed in Table 3. All data sources were accessed on 01 May 2022. This section details the complete data-acquisition process for each collected attribute.

Table 3. Summary of collected attributes related to forest fire analysis, with original temporal resolution of monthly frequency (except Land Cover which is Annual, and Elevation which is One time).

These attributes were pre-processed to acquire working data at 500 meters and monthly resolution, for the period of 2001 to 2020.

S# | Variable name | Description | Spatial resolution | Source
1 | Burnt Area | Classes (Burnt, Not Burnt, Water) | 500 meters | MODIS (ref. 28)
2 | Land Cover (Annual) | 11 Classes of Land Cover | 5,600 meters | MODIS (ref. 29)
3 | Precipitation | Average rate of precipitation | 10,000 meters | GES DISC (ref. 30)
4 | Soil Moisture | Model-calculated | 37,000 meters | CPC (ref. 31)
5 | Elevation (One-time) | Based on Digital Elevation Model | 1,000 meters | EarthEnv (ref. 32)
6 | Land Surface Temperature | Daytime observations | 5,000 meters | MODIS (ref. 33)
7 | Specific Humidity | Model-calculated | 10,000 meters | GES DISC (ref. 34)
8 | Evapotranspiration (ET) | Model-calculated | 10,000 meters | GES DISC (ref. 34)
9 | Near Surface Wind Speed | Model-calculated | 10,000 meters | GES DISC (ref. 34)
10 | Near Surface Air Temperature | Model-calculated | 10,000 meters | GES DISC (ref. 34)

Burnt Area (BA)

The data product acquired was MODIS MCD64A1 Version 6.1, a gridded burnt area product at a resolution of 500 m, available in Hierarchical Data Format (HDF). The product provides the date of burn (as the day of the year) for individual cells, with additional classes for unburnt, missing data, and water. The data product is available for the period 2000 to the present (2022), with global spatial coverage in the form of regional subsets. The layers extracted from the data source are for regions 5 and 6, which cover the Amazon area. The data layer values are in units of days, with valid burn dates ranging from 1 to 366 (representing the day of the year). Further details related to the product, including the quality assessment and known issues, are available at MODIS MCD64A1 (https://lpdaac.usgs.gov/products/mcd64a1v061/).

As the burnt area product is available at the regional level, an additional processing step is the merging of the two separate regional-level products to cover the entire Amazon basin boundary. Additionally, the data were re-classified to assign a single value of 1 to all burn dates (1-366), so that any cell with a burn date is identified simply as burnt. Hence, the working data have four classes (burnt, unburnt, missing, and water) with values (1, 0, -1, and -2), respectively.
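The re-classification described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that the non-burn classes in the extracted layer are already stored as 0 (unburnt), -1 (missing) and -2 (water), and the function names are hypothetical.

```python
# Output codes for the four classes named in the text.
BURNT, UNBURNT, MISSING, WATER = 1, 0, -1, -2


def reclassify_burn_value(value):
    """Collapse one cell value (burn day of year or class code) into a class."""
    if 1 <= value <= 366:
        # Any valid day-of-year means the cell burnt during that month.
        return BURNT
    if value in (UNBURNT, MISSING, WATER):
        # Assumption: non-burn classes already use the codes 0, -1 and -2.
        return value
    raise ValueError(f"unexpected burn-date value: {value}")


def reclassify_layer(layer):
    """Apply the re-classification to a raster given as a list of rows."""
    return [[reclassify_burn_value(v) for v in row] for row in layer]


example = [[0, 45, 366], [-1, -2, 1]]
print(reclassify_layer(example))
```

Raising an error on unexpected values, instead of silently mapping them to missing, makes it easier to notice a layer that was extracted from the wrong band.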

Land Cover (LC)

The data product acquired was MODIS MCD12C1 Version 6, which consists of three gridded land cover classification schemes at a resolution of 5,600 m, available in the HDF format. The three available classification schemes are the International Geosphere-Biosphere Programme (IGBP) scheme with 17 classes, the University of Maryland (UMD) scheme with 16 classes, and the Leaf Area Index (LAI) scheme with 11 classes. The LAI classification scheme was extracted from the data product because 11 classes are sufficient to represent the different land covers (Water, Urban, Forest, Grassland, etc.), and the additional classes available in the other schemes are further subdivisions of forest and grassland types. The data product is available for the period 2000 to the present (2022) with global spatial coverage. The details of the land cover classes of the LAI scheme are provided in Table 4. The name of the layer extracted from the data source is Land Cover Type-3, with class values ranging from 0 to 10. Further details related to the product, including the quality assessment and known issues, are available at MODIS MCD12C1 (https://lpdaac.usgs.gov/products/mcd12c1v006/).

Table 4. Class details of Leaf Area Index (LAI) classification scheme, from MODIS.29

Class name | Value | Description
Water Bodies | 0 | Permanent water bodies
Grasslands | 1 | Dominated by herbaceous annuals (<2 m)
Shrublands | 2 | Shrub (1-2 m)
Broadleaf Croplands | 3 | Dominated by herbaceous annuals (<2 m) - cultivated with broadleaf crops
Savannas | 4 | From 10% to 60% tree cover (>2 m)
Evergreen Broadleaf Forests | 5 | Dominated by evergreen broadleaf and palmate trees (>2 m)
Deciduous Broadleaf Forests | 6 | Dominated by deciduous broadleaf trees (>2 m)
Evergreen Needleleaf Forests | 7 | Dominated by evergreen conifer trees (>2 m)
Deciduous Needleleaf Forests | 8 | Dominated by deciduous needleleaf (larch) tree (>2 m)
Non-Vegetated Lands | 9 | Non-vegetated barren (sand, rock, soil) /permanent snow and ice
Urban and Built-up Lands | 10 | Impervious surface area including building materials, asphalt, and vehicles
Unclassified | 255 | Missing inputs
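For readers who want to decode the land cover layer in code, the class values of Table 4 can be held in a simple lookup table. The sketch below is illustrative only; the dictionary name, the helper functions, and the grouping of values 5 to 8 as "forest" are assumptions made for this example and are not part of the dataset.

```python
# Class values and names as listed in Table 4 (LAI scheme).
LAI_CLASSES = {
    0: "Water Bodies",
    1: "Grasslands",
    2: "Shrublands",
    3: "Broadleaf Croplands",
    4: "Savannas",
    5: "Evergreen Broadleaf Forests",
    6: "Deciduous Broadleaf Forests",
    7: "Evergreen Needleleaf Forests",
    8: "Deciduous Needleleaf Forests",
    9: "Non-Vegetated Lands",
    10: "Urban and Built-up Lands",
    255: "Unclassified",
}


def class_name(value):
    """Return the class name for a cell value, or 'Unknown' if not in Table 4."""
    return LAI_CLASSES.get(value, "Unknown")


def is_forest(value):
    """Hypothetical grouping: treat the four forest classes (5-8) as forest."""
    return 5 <= value <= 8


print(class_name(5), is_forest(5))
```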

Precipitation

The data product acquired is the Integrated Multi-satellite Retrievals for GPM (IMERG), a Global Precipitation Measurement (GPM)-based multi-satellite precipitation product, Version 06B, available in Hierarchical Data Format version 5 (HDF5). The product provides monthly average precipitation rates at a 0.1° × 0.1° spatial resolution (approximately 10,000 m at the equator), estimated from numerous precipitation-relevant satellite passive microwave (PMW) sensors. The dataset is available for 2000–2021 with global spatial coverage. The values are represented in millimeters per hour (mm/h), with a scale factor of 1000 and missing values marked with -9999. Thus, a stored value of 500 indicates 500/1000 = 0.5 mm/h. Further details related to the product are available at the GES DISC GPM IMERG Final Precipitation L3 page (https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGM_06/summary).
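The scale factor and missing-value convention described above can be applied as in the following minimal Python sketch. The function names are hypothetical, and the conversion from a mean hourly rate to a monthly total is an optional extra step added here for illustration; it is not described in the source.

```python
import calendar

MISSING_VALUE = -9999   # marker for missing precipitation values
SCALE_FACTOR = 1000     # stored value / 1000 = rate in mm/h


def decode_precipitation(stored):
    """Convert a stored cell value to a precipitation rate in mm/h."""
    if stored == MISSING_VALUE:
        return None
    return stored / SCALE_FACTOR


def monthly_total_mm(rate_mm_per_h, year, month):
    """Illustrative only: monthly total implied by a mean hourly rate."""
    days = calendar.monthrange(year, month)[1]
    return rate_mm_per_h * 24 * days


rate = decode_precipitation(500)
print(rate, monthly_total_mm(rate, 2020, 1))
```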

Soil moisture

The data product acquired is a model-calculated (not directly observed) average soil moisture, expressed as a water height equivalent, namely CPC Soil Moisture Version 2, available in the GEOTIFF format. The data are a monthly product at a 0.5° × 0.5° spatial resolution (approximately 37,000 m at the equator), with data available from 1948 to the present (2022). The spatial coverage of the product is [89.75N–89.75S, 0.25E–359.75E]. The values are represented in millimeters (mm), with missing values marked as -9999. Further details related to the product are available at CPC Soil Moisture (https://psl.noaa.gov/data/gridded/data.cpcsoil.html).

In the preprocessing of the Soil Moisture data product, data transformation is implemented as an additional step. As the source data have a spatial offset, not aligning with the reference base map, the data are transformed to correct alignment using the Geospatial Data Abstraction Library (GDAL).13
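The source does not state the exact nature of the spatial offset. One plausible cause, given the stated coverage of 0.25E–359.75E, is that longitudes are stored on a 0° to 360° axis instead of -180° to 180°. Under that assumption only, the correction amounts to wrapping longitudes and rotating each raster row by half its width, as in the hypothetical Python sketch below. The authors performed their correction with GDAL, which this sketch does not reproduce.

```python
def wrap_longitude(lon):
    """Map a longitude on a 0..360 axis to the -180..180 axis."""
    return ((lon + 180.0) % 360.0) - 180.0


def rotate_row(row):
    """Move the eastern half of a 0..360 row in front of the western half.

    Assumes an even number of columns whose centres start just east of 0.
    """
    half = len(row) // 2
    return row[half:] + row[:half]


print(wrap_longitude(359.75), wrap_longitude(0.25))
print(rotate_row([0, 1, 2, 3]))
```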

Elevation

The acquired data product is a global multivariate package of terrain features, which can serve many large-scale research applications. The data product is based on a 250 m Digital Elevation Model (DEM), available in Tagged Image File Format (TIFF), from the Global Multi-Resolution Terrain Elevation Data 2010 (GMTED2010).14 This data product provides many topographic variables, such as elevation, slope, aspect, northness, eastness, roughness index, and topographic position index, at resolutions of 1, 10, 50, or 100 km, with global spatial coverage; however, our focus is only on elevation. Elevation values are represented in meters (m). Further details related to this product are available at EarthEnv (https://www.earthenv.org/topography).

Land Surface Temperature (LST)

The data product acquired was MODIS MOD11C3 Version 6, which is a monthly Land Surface Temperature & Emissivity (LST&E) value product at a spatial resolution of 0.05 ° (approximately 5,600 m), available in the HDF format. The data product provides values for both daytime and nighttime observations, along with other details related to the quality assessment. The data product is available for the period 2000 to the present (2022) with global spatial coverage. The temperature values are represented in kelvin (K), with a scale factor of 0.02 and a range of values between 7,500 and 65,535. Thus, the LST value equal to X represents X*0.02 kelvin. Further details related to this product are available at MODIS MOD11C3 ( https://lpdaac.usgs.gov/products/mod11c3v006/ ).
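The decoding rule above (stored value times 0.02 gives kelvin, with valid stored values between 7,500 and 65,535) can be written as a short Python sketch. The function names are hypothetical, and treating out-of-range values as missing is an assumption made for this example.

```python
LST_SCALE = 0.02                   # kelvin per stored unit
VALID_MIN, VALID_MAX = 7500, 65535  # valid range of stored values


def lst_to_kelvin(stored):
    """Convert a stored LST value to kelvin; None if outside the valid range."""
    if not VALID_MIN <= stored <= VALID_MAX:
        # Assumption: values outside the valid range are treated as missing.
        return None
    return stored * LST_SCALE


def kelvin_to_celsius(kelvin):
    """Optional convenience conversion, not part of the dataset itself."""
    return kelvin - 273.15


k = lst_to_kelvin(15000)
print(k, kelvin_to_celsius(k))
```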

Specific humidity, Evapotranspiration (ET), wind and air temperature

The acquired data provide a set of parameters related to land surface conditions. The data are a simulation-based product of the Noah 3.6.1 model from the Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System (FLDAS). All variables are provided as a monthly product at a 0.10° spatial resolution (approximately 10,000 m at the equator) and are available as layers in the NETCDF file format. The dataset is available for the period from 1982 to the present (2022), with global spatial coverage. Specific Humidity is expressed in kg/kg, that is, kilograms of water (moisture) per kilogram of air, whereas Evapotranspiration, Wind Speed and Air Temperature are expressed in kg/(m²·s), m/s and kelvin (K), respectively. Further details related to the product are available at the GES DISC FLDAS Noah Land Surface Model L4 page (https://disc.gsfc.nasa.gov/datasets/FLDAS_NOAH01_C_GL_M_001/summary).

Data processing

The attributes collected in the database have different spatial resolutions, as described in Table 3. Similarly, not all variables are available at monthly resolution, as Land Cover and Elevation are annual and one-time, respectively. Moreover, these variables cover different spatial extents and have dissimilar spatial orientations. To obtain a dataset with all the variables at a fixed spatial extent and resolution, we constructed a spatial grid of 500 m resolution covering the Amazon region and obtained the cell values for this raster following the steps described below. Similarly, we processed all variables to achieve a monthly temporal resolution, with the data period from 2001 to 2020.
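The monthly time axis can be illustrated with a short Python sketch. The source does not spell out how annual and one-time layers are brought to monthly resolution; the sketch assumes the simplest option, namely that an annual layer is reused for all twelve months of its year and the one-time layer is reused for every month. All names are hypothetical.

```python
def monthly_keys(start_year=2001, end_year=2020):
    """List every (year, month) pair in the study period."""
    return [(y, m) for y in range(start_year, end_year + 1) for m in range(1, 13)]


def source_layer_key(frequency, year, month):
    """Pick the source layer that fills a given month (assumed rule)."""
    if frequency == "monthly":
        return (year, month)
    if frequency == "annual":
        return (year,)      # same annual layer for all 12 months
    if frequency == "one-time":
        return ()           # single layer reused for every month
    raise ValueError(f"unknown frequency: {frequency}")


keys = monthly_keys()
print(len(keys), keys[0], keys[-1])
print(source_layer_key("annual", 2010, 7))
```

The count of 240 monthly steps matches the number of monthly files per attribute over the 20-year period.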

Although the collected data packages for different attributes have heterogeneous specifications, their processing generally follows a common workflow. A methodological baseline of the processing steps is shown in Figure 3. Specifically, Accessed Data refers to the data packages downloaded from the data sources in various formats, such as HDF, HDF5, NETCDF, GEOTIFF, and TIF. Accessed data in the HDF, HDF5, and NETCDF formats contained several layers with different attributes, and the layer related to the subject attribute was extracted from this set of layers. Accessed data in the GEOTIFF or TIF formats contained only the required layer, which was extracted as is. These extracted layers are referred to as the Raw Data. To bring all attributes to a common local projection, all Raw Data layers were projected onto ESRI:102033 (South America Albers Equal Area Conic) to obtain the Projected Layer. Such a projected coordinate system is best suited for study areas such as the one in this study, where the land mass extends in the east-west orientation rather than the north-south orientation.15 The projected layers are either global or regional-level (based on the specifications of the data source), and to confine them all to the Amazon Basin Boundary, these layers were further clipped using a shapefile-based (vector) Amazon Basin Boundary. This clipped layer is labelled as Pre-Processed Data.

Figure 3. Overview of methodology for data processing.

Although all layers are cropped to the Amazon basin boundary, their respective cells may not exactly align with each other owing to differences in their source data extent, cell-grid orientation, and spatial resolution. To obtain layers of the same spatial extent, resolution, and orientation, we constructed a fixed spatial grid (with data extent based on the Amazon Basin boundary) with 500 m spatial resolution and transferred the cell value information from each attribute layer to this spatial grid, repeating the process for all attributes. The transfer of cell information includes an intermediate step of disaggregating the Pre-Processed Data layer of varying spatial resolution to approximately 500 m resolution, so that the transfer can be one-to-one, from raster grids of varying cell orientation, cell size, and data extent to a fixed spatial grid template. The disaggregation factor is different for each attribute, based on the spatial resolution of the source data. The spatial grids obtained are the final layers readily available for analysis and are termed Working Data. This workflow was followed for each monthly file (240 files over 20 years) for each attribute. Figure 4 illustrates an example of Land Surface Temperature in January 2020 for all three categories of raw data, pre-processed data and working data. Similarly, Figure 5 presents an example of a single monthly instance from January 2020 for all the variables collected.
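The disaggregation step can be illustrated with a small, dependency-free Python sketch. It is not the authors' R code: it assumes that the disaggregation factor is the rounded ratio of source to target resolution and that each coarse cell value is simply replicated into the finer cells. The subsequent transfer onto the fixed 500 m grid template is not shown, and the function names are hypothetical.

```python
def disaggregation_factor(source_res_m, target_res_m=500):
    """Number of fine cells per coarse cell along each axis (assumed rule)."""
    return max(1, round(source_res_m / target_res_m))


def disaggregate(layer, factor):
    """Split every cell into factor x factor cells carrying the same value."""
    fine = []
    for row in layer:
        # Repeat each value `factor` times along the row...
        fine_row = [value for value in row for _ in range(factor)]
        # ...and repeat the widened row `factor` times down the column.
        fine.extend(list(fine_row) for _ in range(factor))
    return fine


print(disaggregation_factor(5600))    # e.g. a 5,600 m layer
print(disaggregation_factor(10000))   # e.g. a 10,000 m layer
print(disaggregate([[1, 2], [3, 4]], 2))
```

Because the values are replicated rather than interpolated, categorical layers such as land cover keep valid class codes after disaggregation.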

Figure 4. Land surface temperature for January 2020.

Top: Raw data (Global), Bottom: Pre-processed data (cropped) and working data (re-sampled spatial grid).

Figure 5. Plots of the variables related to forest fires for the region of Amazon Rainforest in January 2020.

In terms of implementation, the pre-processing work was completed using GIS software, and the processing work was executed in the statistical computing software R.16,17 All data layers were managed using the SpatRaster data structure of the terra package.18

Technical validation

The raster-based dataset of covariates presented in this study is a collection of established datasets that do not include any newly created data records. This work mainly focuses on exhaustive data search and its acquisition process, followed by computer-intensive pre-processing to develop a dataset for the Amazon region. The technical validation process for each individual dataset is available in their respective documentation, as highlighted in the Data Collection section.

License

The raster-based dataset of covariates presented in this study is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the authors and the source, a link to the license is provided, and any changes made are indicated.

how to cite this article
Mahmood M and Moraga P. A raster-based dataset for spatio-temporal analysis of forest fires in the Amazon rainforest from 2001 to 2020 [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:916 (https://doi.org/10.12688/f1000research.164537.1)