Keywords
Perovskite solar cell, SCAPS-1D, dataset, photovoltaic technology, numerical simulation, machine learning.
This article is included in the Artificial Intelligence and Machine Learning gateway.
This article is included in the Energy gateway.
This article is included in the Perovskite Solar Cells collection.
This article is included in the Solar Fuels and Storage Technologies collection.
This paper presents a synthetic dataset for studying the performance of perovskite solar cells (PSC) simulated with SCAPS-1D. The dataset consists of 18,570 simulated devices generated from four baseline device architectures, together with their photovoltaic performance values. It was generated through numerical simulation, and the electrical performance of each device was evaluated from current density-voltage (J-V) curves under fixed working conditions: standard illumination, temperature, and maximum applied voltage. The dataset can be used to train different machine learning (ML) models using supervised methods or unsupervised techniques such as clustering or dimensionality reduction, which facilitate the identification of patterns or relationships between parameters. Thus, it can be useful in reverse design strategies to determine optimal configurations based on defined objectives. This work contributes to the development of PSC by providing a broad dataset for further analysis and optimization.
This version clarifies the sources and criteria used to set simulation conditions and parameter ranges, expands the discussion of dataset limitations (dimensionality, ideal conditions or surface defects), and introduces Figure 2 depicting the dataset creation workflow. We also standardized notation and units and made minor editorial revisions to improve readability.
See the authors' detailed response to the review by Ejaz Hussain
Perovskite solar cells (PSC) have emerged as one of the most promising technologies in the field of photovoltaic energy because of their high absorption coefficient, low manufacturing costs and great versatility in device design.1–3 However, the optimization of these solar cells involves a complex interaction between the optical, electrical, and structural properties of multiple functional layers.4–6 The experimental exploration of this design space is expensive and time-consuming, which has driven the increasing use of computational simulations as a complementary tool to understand and predict the performance of these devices.7–9 Multiple configurations have already been studied under standardized and comparable conditions using different simulation tools,10–18 which are essential to validate the implementation of new numerical analyses.
The software SCAPS-1D (Solar Cell Capacitance Simulator) has become popular because it is freely available and versatile for modeling thin-film heterojunction solar cells.19 SCAPS-1D solves the Poisson and continuity equations to calculate the photovoltaic performance, considering charge generation, recombination mechanisms, and transport through multilayer structures.20 SCAPS-1D results allow us to understand how the photoelectric properties of a PSC affect its performance,21 making it a useful tool for designing PSC. Additionally, the integration of machine learning (ML) with device simulations has been proposed, showing promise for accelerating materials development and device optimization.22–25 Supervised models like XGBoost and random forest,24,26–28 often paired with hyperparameter tuning techniques such as GridSearchCV, have successfully predicted device parameters and performance with high accuracy. For instance, a recent study integrated SCAPS-1D with ML models to analyze cesium-based perovskites, achieving a coefficient of determination (R²) of 99.99% with XGBoost and using SHAP analysis to identify the most influential device parameters.7 In this context, the development of synthetic databases acquires strategic relevance. These databases not only allow the systematization of knowledge about the relationships between material parameters and photovoltaic performance but also allow the training of ML models,25,27,29,30 the application of optimization techniques,31 and the reverse design of solar cells.28
This work presents a structured database composed of 18,570 simulations of PSC generated with SCAPS-1D.32 Four structures were analyzed, considering systematic variations of the active material and geometric parameters of the cell; the materials most widely used in the literature were included to ensure the practical relevance of the dataset. Each entry in the dataset includes the input parameters that describe optical, electrical, and physical properties of the solar cell, as well as the electrical performance results in terms of open circuit voltage (Voc), short circuit current density (Jsc), fill factor (FF) and power conversion efficiency (PCE). Models trained on these data can facilitate the integration and fusion of domain knowledge into more complex machine learning models that include synthesis conditions for solar cells. They would also allow the application of multi-objective optimization techniques to improve solar cell efficiency. In this way, this work aims to contribute to the accelerated advancement of the design of photovoltaic devices through reproducible computational approaches, provide a validated dataset for training ML models for PSC performance prediction, and supplement the existing database, which does not store most of the simulation parameters.
The structure shown in Figure 1 corresponds to a nip-type PSC33 with five layers, which is the configuration studied in this work. The first and last layers are the electrical contacts, while the internal layers are responsible for the device’s energy conversion. For the top contact, fluorine-doped tin oxide (FTO) was used since it is ideal to function as a transparent electrode. Titanium oxide (TiO₂) and tin oxide (SnO₂) were used for the electron transport layer (ETL) due to their electronic properties and proven use in the scientific literature.17,34–38 For the perovskite absorber layer, methylammonium lead iodide (MAPbI₃), methylammonium tin iodide (MASnI₃) and formamidinium lead iodide (FAPbI₃) were used because these materials have the highest reported energy efficiencies and have been extensively studied by the scientific community.10,25,34,39–41 For the hole transport layer (HTL), Spiro-OMeTAD and copper(I) thiocyanate (CuSCN) were used because configurations with these materials have demonstrated remarkable performance in hole mobility and effective energy alignment.41–43 Finally, the last layer is generally made of gold (Au) since it has high electrical conductivity.43 These materials were chosen for their optoelectronic properties, energy compatibility, and the high performance demonstrated in experimental and simulated studies available in the literature.11,20,21,35,41,43–47 To convert solar energy into electrical energy, the PSC absorbs photons from solar radiation in the perovskite layer, generating electron-hole pairs. These are separated and transported by the transport layers: the ETL extracts the electrons, while the HTL collects the holes. The top electrode, usually made of a transparent conductive oxide like FTO, allows the entry of light and the collection of carriers, while the bottom metallic electrode completes the circuit, allowing the flow of external current under load conditions.
SCAPS-1D analyzes the electrical response of a PSC by solving a coupled set of differential equations that includes the Poisson equation (1), the continuity equations for electrons (2) and holes (3), and the performance metric equations (4)-(7).
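For orientation, the system solved by one-dimensional drift-diffusion simulators of this kind is commonly written as below. This is a sketch in standard textbook notation, not a transcription of the article's numbered equations, and every symbol is an assumption: ψ is the electrostatic potential, ε the permittivity, q the elementary charge, n and p the electron and hole densities, N_D⁺ and N_A⁻ the ionized donor and acceptor densities, ρ_def the charge stored in defects, J_n and J_p the electron and hole current densities, G the generation rate, R_n and R_p the recombination rates, (V_mp, J_mp) the maximum power point, and P_in the incident power density. Only the Poisson equation, the two continuity equations, and the usual definitions of FF and PCE are shown.

```latex
\begin{align*}
\frac{d}{dx}\left(\varepsilon(x)\,\frac{d\psi}{dx}\right)
  &= -q\left[\,p(x) - n(x) + N_D^{+}(x) - N_A^{-}(x)\,\right] - \rho_{\mathrm{def}}(x) \\
\frac{\partial n}{\partial t}
  &= \frac{1}{q}\,\frac{\partial J_n}{\partial x} + G - R_n \\
\frac{\partial p}{\partial t}
  &= -\frac{1}{q}\,\frac{\partial J_p}{\partial x} + G - R_p \\
FF &= \frac{V_{mp}\,J_{mp}}{V_{oc}\,J_{sc}} \\
PCE &= \frac{V_{oc}\,J_{sc}\,FF}{P_{in}}
\end{align*}
```

In steady state, which is the regime used for J-V simulations, the time derivatives on the left-hand side of the two continuity equations are zero.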
The dataset was generated through numerical simulation using the freely available software SCAPS-1D version 3.3.09, and the electrical performance of the device was evaluated from J-V curves under the standard working conditions of AM1.5G illumination (1000 W/m²), a temperature of 300 K and a maximum applied voltage of 1.2 V.48 From these curves, the main electrical parameters that characterize the performance of the system were determined, including Voc, Jsc, FF and PCE. These parameters were obtained directly from the software after simulating the optoelectronic behavior of the device, allowing a precise evaluation of the expected performance and a comparative analysis in terms of efficiency, stability, and robustness against variations in the materials, properties or thickness of the studied layers.
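The extraction of the four metrics from a J-V curve can be sketched as follows. SCAPS-1D reports these values itself, so this is only an illustration of the definitions, not the software's internal routine. The function name `jv_metrics`, the toy ideal-diode curve, and the values of `J_PH` and `J_0` are hypothetical; the 300 K temperature, the 1.2 V sweep limit, and the 1000 W/m² (100 mW/cm²) illumination follow the working conditions stated above.

```python
import math


def jv_metrics(voltages, currents, p_in=100.0):
    """Return Voc (V), Jsc (mA/cm2), FF (%) and PCE (%) from a sampled J-V curve.

    voltages: ascending list in V, assumed to start at 0 V.
    currents: delivered current density in mA/cm2 (positive under illumination).
    p_in: incident power density in mW/cm2 (100 mW/cm2 = 1000 W/m2).
    """
    if voltages[0] != 0.0:
        raise ValueError("the sweep is assumed to start at V = 0")
    points = list(zip(voltages, currents))
    jsc = currents[0]
    voc = None
    # Voc is located by linear interpolation at the first sign change of J.
    for (v0, j0), (v1, j1) in zip(points, points[1:]):
        if j0 > 0.0 >= j1:
            voc = v0 + (v1 - v0) * j0 / (j0 - j1)
            break
    if voc is None:
        raise ValueError("the curve does not cross J = 0 within the sweep")
    p_max = max(v * j for v, j in points)  # mW/cm2, on the sampled grid
    return {
        "Voc": voc,
        "Jsc": jsc,
        "FF": 100.0 * p_max / (voc * jsc),
        "PCE": 100.0 * p_max / p_in,
    }


# Toy ideal-diode curve (hypothetical parameters, not taken from the dataset).
THERMAL_VOLTAGE = 8.617333262e-5 * 300.0  # kT/q at 300 K, in V
J_PH = 22.0    # photocurrent density, mA/cm2 (assumed)
J_0 = 1e-15    # saturation current density, mA/cm2 (assumed)

voltages = [i * 0.001 for i in range(1201)]  # 0 V to 1.2 V in 1 mV steps
currents = [J_PH - J_0 * math.expm1(v / THERMAL_VOLTAGE) for v in voltages]
metrics = jv_metrics(voltages, currents)
```

With `p_in` in mW/cm² and current in mA/cm², PCE in percent equals Voc × Jsc × FF / P_in, which is the relation the function relies on.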
We selected the varied parameters because they directly affect the device’s optical and electrical behaviour. The thickness of the perovskite layer (T_PVK) must be adjusted to absorb as many photons as possible, maximizing Jsc without exceeding the carrier diffusion length, since excessive thicknesses increase recombination losses and degrade Voc and FF.49 Similarly, the thicknesses of the transport layers (T_ETL and T_HTL) must be optimized to ensure efficient electron and hole transport with low recombination and series resistance.50,51 Additionally, the properties of the perovskite have a direct influence on the performance of the cell: the bandgap (EG_PVK) establishes the balance between the current density and voltage, based on the Shockley-Queisser limit52; the dielectric permittivity (ER_PVK) influences exciton dissociation53,54; the acceptor density (NA_PVK) models the internal electric field, essential for charge separation and a high Voc55; and the defect density (NT_PVK) represents the main pathway for non-radiative recombination loss and limits the carrier lifetime.56 The variation ranges were established based on physical models and published data available in the literature,11,34,43,50,51 and are specified in Table 1. Parameter combinations leading to convergence errors in the software were discarded, as these typically arose from physically realistic ranges, disrupting the solution of Poisson’s equation, the continuity equations, or the boundary conditions. Figure 2 shows the workflow developed for parameter selection, simulation development, parameter variation, result extraction, and data storage in the final file.
| Layer | Parameter | Range | Units |
|---|---|---|---|
| ETL | T_ETL | 0.02 – 0.2 | μm |
| HTL | T_HTL | 0.1 – 0.7 | μm |
| Perovskite | T_PVK | 0.1 – 1.7 | μm |
| Perovskite | Eg | 1.1 – 1.9 | eV |
| Perovskite | εr | 8 – 20 | – |
| Perovskite | NA | 1 × 10¹³ – 1 × 10¹⁷ | cm⁻³ |
| Perovskite | Nt | 4 × 10¹³ – 4 × 10¹⁵ | cm⁻³ |

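
A parametric sweep over the ranges in Table 1 can be sketched as a Cartesian product of sampled axes. This is not the SCAPS-1D batch mechanism and does not reproduce the sampling actually used for the dataset: the number of points per axis, the logarithmic spacing of the two densities, and the function names are assumptions made for illustration. Only the range limits and the parameter names come from the text.

```python
import itertools

# Range limits from Table 1 (thicknesses in um, bandgap in eV, densities in cm^-3).
LINEAR_RANGES = {
    "T_ETL": (0.02, 0.2),
    "T_HTL": (0.1, 0.7),
    "T_PVK": (0.1, 1.7),
    "EG_PVK": (1.1, 1.9),
    "ER_PVK": (8.0, 20.0),
}
# Assumption: densities spanning several decades are sampled logarithmically.
LOG_RANGES = {
    "NA_PVK": (1e13, 1e17),
    "NT_PVK": (4e13, 4e15),
}


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi, both included (n >= 2)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def logspace(lo, hi, n):
    """n geometrically spaced values from lo to hi, both included (n >= 2)."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]


def build_grid(n_points=3):
    """Return every combination of the sampled parameter values as dicts."""
    axes = {name: linspace(lo, hi, n_points) for name, (lo, hi) in LINEAR_RANGES.items()}
    axes.update({name: logspace(lo, hi, n_points) for name, (lo, hi) in LOG_RANGES.items()})
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]


grid = build_grid(n_points=3)  # 3 points on each of 7 axes
```

Each dictionary in `grid` corresponds to one candidate device; in a real workflow, the combinations that fail to converge would be dropped after simulation.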

| Reference | Reported VOC (V) | Reported JSC (mA/cm²) | Reported FF (%) | Reported PCE (%) | Simulated VOC (V) | Simulated JSC (mA/cm²) | Simulated FF (%) | Simulated PCE (%) |
|---|---|---|---|---|---|---|---|---|
| 59 | 1.04 | 30.5 | 82.69 | 26.95 | 1.02 | 29.8 | 78.28 | 26.12 |
| 60 | 0.98 | 18.6 | 82.50 | 13.40 | 0.96 | 18.2 | 79.97 | 13.07 |
| 61 | 1.02 | 22.7 | 62.67 | 21.42 | 1.01 | 22.4 | 65.92 | 21.26 |
| 62 | 0.91 | 24.1 | 54.19 | 16.08 | 0.93 | 24.3 | 78.98 | 16.49 |
| 63 | 0.87 | 24.9 | 85.80 | 15.50 | 0.85 | 24.4 | 81.26 | 15.14 |

The accuracy of the simulation methodology was validated by reproducing the results reported in research articles, as shown in Table 2. For this purpose, over 50 recently published scientific articles were collected that included most of the parameters needed to simulate a nip-type PSC in SCAPS-1D and also reported the values of Voc, Jsc, FF, and PCE. After a systematic review, to avoid unrealistic results, articles reporting energy efficiencies above the Shockley-Queisser limit57 for cells based on MAPbI₃, MASnI₃ and FAPbI₃ were excluded, as, according to experimental validations,58 the maximum efficiency achieved for these materials does not exceed 22.2%, 14.35% and 24.66%, respectively. It is important to note that the simulations of PSC do not incorporate variable physical factors (surface roughness, grain size, and orientation), chemical factors (temperature and drying time, solvent and antisolvent engineering, and additives) or environmental factors (temperature variations, cloud cover, irradiance, etc.), all of which can affect the actual performance of the cells. Therefore, it is expected that the values obtained through simulation will be higher than those observed experimentally while maintaining consistency with realistic values.34,58
A quantitative assessment was conducted to determine the agreement between simulations and literature reports for the cases summarised in Table 2. For the five reproduced studies, the root-mean-square error (RMSE) and mean absolute percentage error (MAPE) were calculated using simulated and reported data (Voc, Jsc, FF and PCE). The RMSE values are Voc = 0.018 V, Jsc = 0.454 mA/cm², FF = 11.59%, and PCE = 0.474%. The corresponding MAPE values are Voc = 1.89%, Jsc = 1.72%, FF = 12.92%, and PCE = 2.23%. These results indicate close agreement for Voc, Jsc and PCE, while the error in FF is driven by a single case (reported FF = 54.19% vs simulated FF = 78.98%). Such deviations from the reported values can be attributed to multiple causes, as very few authors report in detail all the simulation conditions or all the parameters used. Some studies include models for recombination, absorption or defects without specifying numerical values (defect density, defect type, or recombination coefficients); therefore, the exact replication of the simulation conditions is limited. These details are crucial because the simulation accounts for recombination phenomena and for losses due to defects in the layers or interfaces, derived from manufacturing processes or impurities in the materials that compose the cell, which can significantly affect the performance of the system. Although reported parameter variability prevents exact replication, the simulated results remain consistent with published data and therefore support the robustness of our methodology.
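The error metrics quoted above can be recomputed directly from the rows of Table 2. The sketch below assumes the usual definitions, with RMSE taken over the five reported/simulated pairs and MAPE normalised by the reported value; the helper names `rmse` and `mape` are illustrative.

```python
import math

# (reported, simulated) values transcribed from Table 2, in row order (refs. 59-63).
TABLE_2 = {
    "Voc": ([1.04, 0.98, 1.02, 0.91, 0.87], [1.02, 0.96, 1.01, 0.93, 0.85]),
    "Jsc": ([30.5, 18.6, 22.7, 24.1, 24.9], [29.8, 18.2, 22.4, 24.3, 24.4]),
    "FF": ([82.69, 82.50, 62.67, 54.19, 85.80], [78.28, 79.97, 65.92, 78.98, 81.26]),
    "PCE": ([26.95, 13.40, 21.42, 16.08, 15.50], [26.12, 13.07, 21.26, 16.49, 15.14]),
}


def rmse(reported, simulated):
    """Root-mean-square error, in the units of the metric."""
    squared = [(r - s) ** 2 for r, s in zip(reported, simulated)]
    return math.sqrt(sum(squared) / len(squared))


def mape(reported, simulated):
    """Mean absolute percentage error relative to the reported value, in %."""
    ratios = [abs(r - s) / abs(r) for r, s in zip(reported, simulated)]
    return 100.0 * sum(ratios) / len(ratios)


errors = {
    name: {"rmse": rmse(rep, sim), "mape": mape(rep, sim)}
    for name, (rep, sim) in TABLE_2.items()
}

for name, values in errors.items():
    print(f"{name}: RMSE = {values['rmse']:.3f}, MAPE = {values['mape']:.2f}%")
```

Dropping the fourth row (reference 62) from the FF lists shows how strongly that single case dominates the FF error.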
The dataset comprises 18,570 simulated PSC generated from four device architectures: TiO₂/MAPbI₃/CuSCN,60 TiO₂/MASnI₃/Spiro-OMeTAD,59 SnO₂/FAPbI₃/Spiro-OMeTAD,64 and TiO₂/MAPbI₃/Spiro-OMeTAD.65 The performance results of the cells were obtained using the SCAPS-1D option “Batch set-up”, which allows a parametric study of PSC over specific value ranges and returns the results associated with all the combinations; the ranges specified in Table 1 were used, and only combinations that produced convergence errors were discarded. For each record, the dataset includes nineteen PSC features (which can be taken as inputs or “X” values in an ML application) and four associated results, namely Voc, Jsc, FF, and PCE (which can be used as outputs or “y” values). The columns follow this naming convention:
Material (M):
• Column A: Material of the ETL layer (M_ETL).
• Column B: Material of the perovskite absorber layer (M_PVK).
• Column C: Material of the HTL layer (M_HTL).
Ranging parameters:
The next columns correspond to parameters that were varied as shown in Table 1:
• Column D: Thickness of ETL layer in μm (T_ETL).
• Column I: Thickness of absorber layer in μm (T_PVK).
• Column J: Bandgap of absorber layer in eV (EG_PVK).
• Column K: Dielectric permittivity of absorber layer (ER_PVK).
• Column L: Shallow acceptor density of absorber layer in cm⁻³ (NA_PVK).
• Column N: Defect density of absorber layer in cm⁻³ (NT_PVK).
• Column O: Thickness of HTL layer in μm (T_HTL).
Constant value parameters:
The next columns have constant values for the specific parameters. It is important to include them to validate the results presented in this work.
• Column E: Bandgap of ETL layer in eV.
• Column F: Dielectric permittivity value of ETL layer.
• Column G: Shallow donor density of ETL layer in cm⁻³.
• Column H: Defect density value of ETL layer in cm⁻³.
• Column M: Shallow donor density value of absorber layer in cm⁻³.
• Column P: Bandgap of HTL layer in eV.
• Column Q: Dielectric permittivity of HTL layer.
• Column R: Shallow donor density of HTL layer in cm⁻³.
• Column S: Defect density of HTL layer in cm⁻³.
Performance metrics:
• Column T: Open circuit voltage in V (Voc).
• Column U: Short circuit current density in mA/cm² (Jsc).
• Column V: Fill factor in percentage (FF).
• Column W: Power conversion efficiency in percentage (PCE).
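The column layout described above (A to S as features, T to W as performance metrics) can be turned into a feature/target split as sketched below. The helper names are hypothetical, and the example row holds placeholder labels rather than values from the dataset.

```python
def column_index(letter):
    """Convert a spreadsheet column letter ('A', 'W', 'AA', ...) to a zero-based index."""
    index = 0
    for char in letter.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


# Columns A-S hold the nineteen device features, columns T-W the four metrics.
FEATURE_COLUMNS = [chr(code) for code in range(ord("A"), ord("S") + 1)]
TARGET_COLUMNS = ["T", "U", "V", "W"]  # Voc, Jsc, FF, PCE


def split_row(row):
    """Split one spreadsheet row (a sequence of 23 cells) into features and targets."""
    features = [row[column_index(col)] for col in FEATURE_COLUMNS]
    targets = [row[column_index(col)] for col in TARGET_COLUMNS]
    return features, targets


# Placeholder row: each cell is labelled with its own column letter.
example_row = [f"cell_{chr(ord('A') + i)}" for i in range(23)]
x, y = split_row(example_row)
```

The same index lists can be passed to a spreadsheet reader to select the "X" and "y" blocks in bulk.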
Basic descriptive statistics were computed for the dataset, generating the data distribution for the performance metrics (Voc, Jsc, FF, and PCE). Figure 3(a) presents the distribution of Voc, which is multimodal with one main peak, a mean of 0.98 V, a median of 1.00 V, a standard deviation of 0.14 V and an interquartile range (IQR) of 0.90 V–1.10 V. A smaller number of parametric combinations are observed for values below 0.7 V, which can be attributed to increased recombination due to the geometric configuration of the device or improper band alignment.57,66 The distribution of Jsc, presented in Figure 3(b), is multimodal with four marked peaks around 16 mA/cm², 21 mA/cm², 27 mA/cm², and 34 mA/cm², a mean of 20.7 mA/cm², a median of 20.2 mA/cm², and a standard deviation of 9.4 mA/cm². The peaks suggest subsets defined by discrete thicknesses of the perovskite or by steps in the optical absorption imposed during the parametric sweep. Physically, current densities below 10 mA/cm² are associated with thin films or large bandgaps, while values above 30 mA/cm² are associated with sufficiently thick layers with low defect density, where carrier absorption and collection are maximized.67,68 Figure 3(c) presents the distribution of the FF, where a noticeable peak is observed at values close to 79%, with a mean of 65.2%, a median of 69.2%, and a standard deviation of 15.8%.
This indicates a high dispersion, because a considerable share of the data lies below 60%; these values significantly affect the FF distribution and show that some configurations generate losses, either due to recombination or cell defects.69 Finally, Figure 3(d) presents the distribution of the PCE, which shows a peak around 9% with a mean of 12.8%, a median of 12.1%, and a standard deviation of 6.26%, exhibiting a broad and slightly bimodal shape: one cluster between 5% and 15%, associated with devices with one or two suboptimal parameters (e.g., moderate Jsc and acceptable FF), and another between 18% and 23% that is associated with the Voc and FF values. The decline in counts toward 25% reflects the limit imposed by maximum absorption and residual non-radiative losses, consistent with the Shockley-Queisser model.57,70 In summary, the dispersion demonstrates how the joint variation of thickness, bandgap, and defects controls the efficiency, reproducing the range of values reported experimentally.
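
Summary statistics of this kind can be computed with the Python standard library, as sketched below. The sample list holds placeholder numbers, not values from the dataset, and the choice of the sample standard deviation and of the "inclusive" quartile method is an assumption, since the text does not state which conventions were used.

```python
import statistics


def describe(values):
    """Return mean, median, sample standard deviation and IQR bounds of a metric."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
        "iqr": (q1, q3),
    }


# Placeholder Voc values in V, for illustration only.
sample_voc = [0.62, 0.88, 0.91, 0.97, 1.00, 1.02, 1.05, 1.08, 1.10, 1.12]
summary = describe(sample_voc)
```

A gap between mean and median, as reported for FF, is a quick indicator of a skewed distribution with a tail of low values.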

Compared with existing open-access repositories that primarily aggregate experimental perovskite solar cell data (e.g., NREL or the perovskite database project), the present synthetic dataset was systematically generated under controlled SCAPS-1D conditions. It provides dense and reproducible coverage of the photovoltaic parameter space, including photoelectrical descriptors that are rarely available in experimental datasets, such as defect density profiles, acceptor/donor density, and dielectric constants of individual layers. These parameters make it possible to understand the charge transport, recombination dynamics, and interfacial effects that govern device performance, yet they are often missing or inconsistently reported in empirical databases. The dataset enables researchers to perform sensitivity analyses, build predictive ML models, and explore how intrinsic material and interfacial parameters influence key performance metrics. Thus, this resource complements experimental databases by providing a physically interpretable simulation benchmark that bridges the gap between device physics and data-driven optimization strategies.
Regarding the dataset’s limitations, it was generated under one-dimensional, steady-state assumptions, with a single absorbing layer and ideal boundary conditions. Physical phenomena that are not captured include: three-dimensional optical, thermal and electrical effects; degradation mechanisms; detailed interfacial chemical variations and trap distributions that affect FF and Voc in experiments; and measurement uncertainties and spectral mismatch in experimental J–V characterization. Prior studies have shown that interface defect density and energy distribution can substantially alter Voc and FF; therefore, differences between reported and simulated FF can frequently be traced to missing interface defect details in the original publications.
The dataset was stored in OSF HOME, an open-source platform for managing and sharing research data. The project, titled “Synthetic dataset to study the performance of perovskite solar cell simulations”, with DOI: 10.17605/OSF.IO/ZX4AJ includes the file “Synthetic dataset to study the performance of perovskite solar cell simulations.xlsx”, which contains all simulated device configurations and their corresponding performance metrics.
This dataset can be used to analyze, design, and optimize PSC, as it contains a considerable number of simulations (18,570) with their respective photovoltaic performance values, which is useful for studying the relationships between design parameters and electrical performance. Because of its composition and the variety of parameters that constitute the device, the dataset can be used to train different ML models using supervised methods such as random forest, gradient boosting, support vector machines (SVM), and deep neural networks to predict the performance metrics Voc, Jsc, FF, and PCE, or unsupervised techniques like clustering or dimensionality reduction (PCA, t-SNE) that allow discovering patterns or relationships between parameters. Additionally, multi-objective optimization techniques can be implemented, such as genetic algorithms, Bayesian methods, or particle swarm methods.
Moreover, as each entry represents a set of specific parameters along with their performance results, researchers can identify optimal regions of the design space to maximize efficiency, minimize recombination losses, or reduce the use of high-cost materials. Thus, it can be useful in reverse design strategies to determine optimal configurations based on defined objectives.
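The simplest form of such a reverse-design query is a constrained lookup: keep the records that satisfy the design constraints and return the one with the best target metric. The sketch below illustrates the idea only; the function name and the three records are hypothetical placeholders, not entries from the dataset, and a real study would more likely use a trained surrogate model with an optimizer.

```python
def best_configuration(records, target="PCE", constraints=None):
    """Return the record with the highest target value among those meeting all constraints.

    constraints maps a numeric field name to an inclusive (low, high) interval.
    Returns None when no record is feasible.
    """
    constraints = constraints or {}
    feasible = [
        record
        for record in records
        if all(low <= record[name] <= high for name, (low, high) in constraints.items())
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda record: record[target])


# Placeholder records (made-up numbers, for illustration only).
records = [
    {"M_PVK": "MAPbI3", "T_PVK": 0.3, "PCE": 14.0},
    {"M_PVK": "MAPbI3", "T_PVK": 0.9, "PCE": 19.0},
    {"M_PVK": "FAPbI3", "T_PVK": 0.5, "PCE": 17.0},
]

# Best device whose absorber is at most 0.5 um thick.
thin_best = best_configuration(records, constraints={"T_PVK": (0.0, 0.5)})
```

Adding further intervals to `constraints`, for example on defect density, narrows the search to a region of the design space that is considered manufacturable.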
Additionally, it is important to mention that this database was generated for planar PSC with a single absorbing layer, a limitation that should be considered when using it. Moreover, all the simulations were generated under constant, one-dimensional conditions, so three-dimensional effects, long-term degradation, and real environmental conditions (temperature, humidity, material degradation, etc.) are not captured. Although a cross-validation was conducted with data reported in the literature, there may be discrepancies attributable to the lack of detail in the parameters reported by some authors, which prevents an exact replication of the reported results. Finally, some parametric combinations were discarded due to numerical convergence failures, which may slightly bias the exploration of the design space.
Y.V.-G.: conceptualization, methodology, validation, investigation, data curation, and writing—original draft preparation, E.G.-V.: conceptualization, methodology, validation, formal analysis, investigation, data curation, and writing—original draft preparation, visualization, and supervision. A.S.-S.: conceptualization, formal analysis, investigation, data curation, and writing—review and editing, project administration, supervision, and funding acquisition. N.G.-C.: conceptualization, methodology, formal analysis, investigation, resources, writing—review and editing, supervision, project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.
Open Science Framework (OSF). Synthetic dataset to study the performance of perovskite solar cell simulations. DOI: 10.17605/OSF.IO/ZX4AJ.71
This project contains the following underlying data:
Synthetic dataset to study the performance of perovskite solar cell simulations.xlsx. All simulated device configurations and their corresponding photovoltaic performance metrics (Voc, Jsc, FF, PCE) were generated with SCAPS-1D under the conditions described in the methods. Data are available under the terms of the CC-By Attribution 4.0 International.
The authors are grateful to Marc Burgelman and his colleagues at the University of Gent, Belgium, for providing the SCAPS-1D simulator and gratefully acknowledge Prof. Monica Botero Londoño (School of Electrical, Electronic and Telecommunications Engineering, Universidad Industrial de Santander) for her valuable guidance in defining the set of parameters explored in this work.
Competing Interests: No competing interests were disclosed.
Is the rationale for creating the dataset(s) clearly described?
Partly
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of methods and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Yes
Version 2 (revision): 22 Oct 25. Version 1: 22 Sep 25.