Keywords
Synthetic data generation, Internet of Things, edge analytics, predictive model, machine learning
The widespread adoption of the Internet of Things (IoT) in business and industry has resulted in significant investment in the development of advanced applications (Brous et al., 2020). These applications focus on increasing efficiency and reducing costs while speeding up analytics at the receiving end. However, the volume of data that IoT devices generate, together with the dependence on cloud-based data storage and processing, has limited the success of IoT applications. “Roughly 10% of enterprise-generated data is processed outside of an established centralised data centre or cloud,” according to Gartner, and this figure is predicted to grow to 75% by 2025 (Van der Meulen, 2018).
As a result, Edge Computing (EC) is emerging as a key enabling technology for network edge-based analytics and real-time decision-making. Edge computing places processing and analytics capability close to the source of the data; by extracting high-level information from raw sensory input near its origin, this strategy reduces network latency. The integration of Machine Learning (ML) capabilities into EC has enabled sensor-based, application-specific analytics at the IoT network edge. Owing to technological advancements in processor power, energy efficiency, and memory capacity, together with reductions in device size, machine learning computation can now be performed at edge nodes (Murshed et al., 2022).
The development of ML-based edge analytics for IoT applications differs from that of traditional machine learning because of the hardware limitations and the limited availability of sensory data at the edge (Li et al., 2021). Finding real-world datasets that reflect the sensory data seen by a prediction model is one of the most troublesome issues in ML-based edge analytics development (Chen et al., 2020), and it undeniably hampers the rapid adoption of machine learning methods in IoT edge analytics.
Thus, realistic synthetic dataset generation can help meet the need to speed up the methodological use of machine learning in edge analytics. Synthetic data is information that is artificially designed to represent real-world events (Nikolenko, 2019). Researchers have used various techniques, such as stochastic processes (Salim et al., 2018), rule-based data generation (Jeske et al., 2005) and deep generative models (Alzantot et al., 2017), to generate synthetic data for many applications (Anderson et al., 2014). Synthetic data first arose in data science because of privacy and legal concerns, and it has been widely used to supplement the ever-increasing demand for data to predict various global business and environmental phenomena (Howe et al., 2017). When real-world datasets are simply unavailable, synthetic data becomes critical, although its use becomes more difficult when compounded by the need for privacy preservation. Due to a scarcity of genuine datasets, medical researchers have employed synthetic data to evaluate medical applications in conditional and controlled environments (Azizi et al., 2021). Similarly, self-driving vehicle re-identification systems use randomised synthetic data as a precursor to training (Tang et al., 2019).
The scarcity of accurate IoT-generated datasets that represent localised real-world environments, together with the inaccuracy of pre-defined or public repository datasets, has driven IoT researchers to generate synthetic data to cater to their unique requirements (Sengupta & Chinnasamy, 2020; Tazrin, 2021; Zualkernan et al., 2021).
The use of synthetic data brings several benefits when compensating for unavailable real-world data. Firstly, the advent of personal and community privacy advocacy has limited many developers and researchers in their use of real datasets (Oh et al., 2020). Creating synthetic data that preserves the details of real-world data, including distributions, non-linear relationships, and noise, can eliminate such legal issues (Tucker et al., 2020). Moreover, instead of relying on costly real-world data to build a predictive model, synthetic data can represent the various predictive events of interest found in real datasets (Jordon et al., 2018).
The remainder of the paper is structured as follows. Section 2 discusses related work in the area of synthetic data creation. Section 3 discusses the JSON-based synthetic data generation framework. Section 4 presents the generated synthetic datasets and the results of the synthetic data validation experiment. Section 5 outlines the conclusions.
According to Coimbra et al. (2020), obtaining suitable datasets that cover all the possibilities of a domain for training a machine learning model for detection and classification may be expensive and may raise privacy concerns. Thus, synthetic data offers a potential answer, lowering the cost of data acquisition while also addressing data privacy concerns.
The bulk of synthetic data creation methods rely on extracting properties from existing real-world datasets; the extracted features are then used, guided by several criteria, to create a new dataset. Deep Neural Network (DNN)-based methods include Auto-Encoders (AE) and Generative Adversarial Networks (GAN) (Frid-Adar et al., 2018).
However, Torres (2018) points out a weakness in the pattern-identification phase of the proposed generation process: after the impact of each feature has been determined quantitatively, the dataset columns are ordered according to how important they are, either from most influential to least influential or vice versa. This strategy may not be optimal or time-efficient, and it may have a detrimental influence on training and processing durations if a computationally expensive feature is checked at the start of the queue.
However, some researchers create synthetic data based on actual data formats such as Comma-Separated Values (CSV), JavaScript Object Notation (JSON), and Extensible Markup Language (XML). Anderson et al. (2014) designed and built a Hadoop-based synthetic IoT data creation system capable of producing vast amounts of data. Using the Document Type Definition (DTD) and a recreation synthesis set, this system extracts data structures from IoT XML data, and through different XML data extensions this modular system supports many statistical distributions.
JSON is a human-readable data format that can represent both metadata and machine-readable data (Sun et al., 2020). It comprises two pre-defined data structures: arrays, which provide ordered lists, and objects, which represent name-value pairs. JSON metadata offers an effective way to examine the structure and variables of any data file, particularly a large dataset. In contrast to XML, JSON is a native format of the JavaScript language, is simple to read, and serves as a data exchange language. Many programming languages, such as C++, C#, ColdFusion, Java, Perl, PHP, Python and Ruby, support it for online data integration.
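As an illustration of these two structures, the following minimal Python sketch builds and parses a hypothetical IoT sensor reading; the field names are invented for illustration and are not taken from the datasets used later in this paper:

```python
import json

# A JSON object holds name-value pairs; a JSON array holds an ordered list.
# All field names below are hypothetical.
reading = {
    "station_id": "ST001",
    "timestamp": "2020-01-01T00:00:00Z",
    "measurements": [
        {"parameter": "PM2.5", "value": 54.3, "unit": "ug/m3"},
        {"parameter": "NO2", "value": 21.7, "unit": "ug/m3"},
    ],
}

# Serialise to a JSON string and parse it back.
text = json.dumps(reading, indent=2)
parsed = json.loads(text)
print(parsed["measurements"][0]["parameter"])  # -> PM2.5
```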
The features of IoT data are critical to the creation and structure of synthetic data and can be classified according to the method by which the data is collected. Applications such as IoT-based digital healthcare systems (Ed-daoudy & Maalmi, 2019) and autonomous vehicle monitoring systems (Kavitha & Ravikumar, 2020) generate huge amounts of data in continuous streams. It is also important to consider how massive data is gathered and stored away from IoT and edge devices; such large databases are processed and analysed in order to forecast long-term business trends. Streaming datasets, on the other hand, require rapid pre-processing and analytics in order to extract immediate and relevant information for a speedy decision, as in the autonomous vehicle example (Elsaleh et al., 2020). A large quantity of streaming IoT data is generated as time series (Kumar et al., 2020), with data from streaming IoT devices collected and analysed according to the time at which it was generated. The ability to synchronise time across a pool of IoT devices spread over a large monitoring environment is critical to evaluating the quality of such data (Ferrari et al., 2020).
The primary objective of this synthetic data creation is to provide an experimental framework that can be used to train a machine learning prediction model. The framework can produce correct synthetic datasets based on JSON schema characteristics. This section introduces the IoT synthetic data creation framework, which is divided into two stages. The first stage involves extracting the structure and variables from the original time-series dataset in comma-separated values (CSV) format. In the second stage, the extracted structure and variables are used to design and construct synthetic CSV datasets. Figure 1 shows a schematic diagram of the framework for generating synthetic data.
Figure 1 abbreviations: CSV, Comma-Separated Values.
The original dataset for this experiment (Air Quality Data in India 2015-2020) was obtained from the Kaggle dataset repository (see Underlying data). We chose this dataset because of its data format (CSV) and because it is publicly available online. The dataset contains hourly air quality data and the Air Quality Index (AQI) for stations in 26 Indian cities. The air quality indices, rated in parts per million (ppm), are categorised as satisfactory (80-99), moderate (100-199), poor (200-299), and very poor (300-400). These categories reflect the quality of the air based on the data collected in the dataset and are utilised as the predictive classes in the trained model.
The dataset is formatted as a CSV file with several columns that influence the air quality category, such as “StationId” for the station where the sensors were placed, “PM2.5” and “PM10” for particulate matter with diameters of 2.5 and 10 micrometres respectively, and the remaining columns for the concentrations of chemicals and gases in the air that influence the AQI. Table 1 shows the data components of the AQI dataset.
The CSV file containing the original AQI dataset is converted to JSON using the freely available online CSV to JSON parser, CSVJSON (Drapeau et al., 2014), and the output is saved as a text file (csvjson1.txt). The JSON output of the original dataset contains arrays of data objects but no explicit structure or syntax. To identify the structure and variables, the JSON file is converted to JSON Schema using the freely available online JSON to JSON Schema converter (Liquid Technologies Limited, 2001). The JSON Schema output confirms the correctness of the syntax and structure and exposes the file's complete structure and variable components. The JSON Schema in Figure 2 is derived from the JSON file containing the original AQI dataset.
Figure 2 abbreviations: PM2.5, particulate matter of 2.5 micrometres; PM10, particulate matter of 10 micrometres; NO, nitric oxide; NO2, nitrogen dioxide; NOx, nitrogen oxides; NH3, ammonia; CO, carbon monoxide; SO2, sulphur dioxide; O3, ozone.
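For readers who prefer a programmatic route, the following sketch shows one way these two conversion steps could be reproduced with the Python standard library instead of the online tools; the input file name and the simple type-inference rule are assumptions, not part of the published pipeline:

```python
import csv
import json

# Step 1 (CSV -> JSON): read the AQI CSV and write an array of JSON objects,
# one object per row with the column names as keys.
with open("city_hour.csv", newline="") as f:   # file name is an assumption
    rows = list(csv.DictReader(f))
with open("csvjson1.txt", "w") as f:
    json.dump(rows, f, indent=2)

# Step 2 (JSON -> JSON Schema): infer a flat schema from the first record.
# CSV values are read as strings, so a numeric interpretation is also tried.
def infer_type(value):
    try:
        float(value)
        return "number"
    except (TypeError, ValueError):
        return "string"

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {name: {"type": infer_type(value)}
                       for name, value in rows[0].items()},
    },
}
print(json.dumps(schema, indent=2))
```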
Once the variable components and structure of the AQI dataset's JSON schema have been identified, they are mapped and written into the Python Faker (version 8.1.2) data generator (Faraglia, 2017). For this experiment, the Faker data generator produces 240,000 synthetic data records in CSV format; we limited the size to facilitate the extraction and processing of the output structure and variables.
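The paper does not list the exact field mappings supplied to Faker, so the following is only a minimal sketch of how such a generator could be scripted; the station-code pattern and the uniform value ranges are illustrative assumptions rather than the authors' configuration:

```python
import csv
import random

from faker import Faker

fake = Faker()
Faker.seed(42)
random.seed(42)

# Pollutant columns mirroring the original AQI dataset.
POLLUTANTS = ["PM2.5", "PM10", "NO", "NO2", "NOx", "NH3", "CO", "SO2", "O3"]
FIELDNAMES = ["StationId", "Datetime"] + POLLUTANTS + ["AQI"]

def make_record():
    # Station-code pattern and value ranges below are assumptions.
    record = {
        "StationId": f"AP{random.randint(1, 26):03d}",
        "Datetime": fake.date_time_between(start_date="-5y").isoformat(sep=" "),
    }
    for name in POLLUTANTS:
        record[name] = round(random.uniform(0.0, 400.0), 2)
    record["AQI"] = round(random.uniform(80, 400), 2)
    return record

with open("synthetic_aqi.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    for _ in range(240_000):
        writer.writerow(make_record())
```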
The synthetic dataset's CSV file is then converted to JSON using the same online CSV to JSON parser, and the output is saved as a text file (csvjson2.txt). The JSON-structured text file is then converted to JSON Schema using the online JSON to JSON Schema converter.
The JSON schema of this synthetic dataset is verified by checking that its structure and variable elements match those of the JSON schema of the actual AQI dataset. This is the first step of the validation process and is used to gauge the quality of the generated synthetic dataset. The JSON Schema structure of the generated synthetic dataset is depicted in Figure 3; see Underlying data (Kannan, 2021a).
Figure 3 abbreviations: PM2.5, particulate matter of 2.5 micrometres; PM10, particulate matter of 10 micrometres; NO, nitric oxide; NO2, nitrogen dioxide; NOx, nitrogen oxides; NH3, ammonia; CO, carbon monoxide; SO2, sulphur dioxide; O3, ozone; AQI, Air Quality Index.
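A structural comparison of this kind can also be automated. The sketch below reports variables that are missing from the synthetic schema or declared with a different type; the file names and the flat array-of-objects schema layout are assumptions:

```python
import json

def schema_properties(path):
    # Assumes the schema describes an array of flat objects, as sketched earlier.
    with open(path) as f:
        schema = json.load(f)
    props = schema.get("items", {}).get("properties", {})
    return {name: spec.get("type") for name, spec in props.items()}

original = schema_properties("original_schema.json")    # file names are assumptions
synthetic = schema_properties("synthetic_schema.json")

missing = sorted(set(original) - set(synthetic))
mismatched = sorted(name for name in set(original) & set(synthetic)
                    if original[name] != synthetic[name])

print("Variables missing from the synthetic schema:", missing or "none")
print("Variables with mismatched types:", mismatched or "none")
```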
For this experiment, accuracy is measured as the proportion of correct predictions of the air quality category. Precision is the proportion of predictions for a given category that are correct; for example, how many of the readings classified as 'poor' actually belong to that category? Recall measures how many of the readings that truly belong to a given category were correctly classified by the model. An ideal model would achieve high accuracy, precision, and recall.
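For reference, using per-category counts of true positives (TP), false positives (FP), and false negatives (FN), these metrics follow their standard definitions (precision and recall are computed per category and then averaged across categories):

\[
\text{Accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]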
The full analysis code is available under Software availability (Kannan, 2021b); the files can be opened with Jupyter. Python Faker generated approximately 240,000 synthetic data points for this experiment. The first comparison is between the JSON Schema of the original dataset and the JSON Schema of the synthetic dataset, which match closely in structure and variable components. We further validated our experimental approach by training a machine learning model to predict the four air quality categories specified in the dataset, using both the actual AQI dataset and the synthetic dataset.
The Python scikit-learn library (Pedregosa et al., 2011) is used to train a Logistic Regression machine learning model on both datasets. Logistic Regression appears to be a suitable model because it estimates the probability that given conditions or variables determine the air quality category; for instance, if the air quality index (AQI) and other variables are between 80 and 99, the air quality is considered satisfactory.
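The archived notebooks contain the authors' exact pipeline; the sketch below is only an illustrative reconstruction under stated assumptions. The feature set (including AQI itself), the derivation of the category label from the AQI thresholds quoted above, the 80/20 split, and the input file names are all assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed feature columns; including AQI makes the label largely a threshold
# check on AQI, consistent with the paper's description of the categories.
FEATURES = ["PM2.5", "PM10", "NO", "NO2", "NOx", "NH3", "CO", "SO2", "O3", "AQI"]

def air_quality_category(aqi):
    # Thresholds follow the categories quoted in the paper; values below 80
    # are grouped with 'satisfactory' here for simplicity (an assumption).
    if aqi < 100:
        return "satisfactory"
    if aqi < 200:
        return "moderate"
    if aqi < 300:
        return "poor"
    return "very poor"

def evaluate(csv_path):
    df = pd.read_csv(csv_path).dropna(subset=FEATURES)
    X = df[FEATURES]
    y = df["AQI"].apply(air_quality_category)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

evaluate("city_hour.csv")       # original AQI dataset (file name is an assumption)
evaluate("synthetic_aqi.csv")   # synthetic dataset generated earlier
```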
The model's performance characteristics, such as accuracy, precision, and recall, vary according to the AQI dataset used. The effectiveness of the Logistic Regression model in predicting both the real and the synthetic AQI datasets is shown in Table 2. The synthetic dataset appears to outperform the real dataset by 1.03 percentage points. This difference is likely because the synthetic data is nearly identical to the original, while the original dataset contains incomplete rows whose missing values we did not remove; these incomplete rows are due to a sensor failure at one of the stations that caused partial recording. A confusion matrix is constructed to explain the classification model's performance on the two datasets and, as it demonstrates, the model is effective at classification. The confusion-matrix performance of the classification model on both the original AQI and synthetic datasets is shown in Figure 4.
Table 2. Performance of the Logistic Regression model on the original and synthetic AQI datasets.

| Metric | Original dataset | Synthetic dataset |
|---|---|---|
| Accuracy (%) | 98.18 | 99.21 |
| Precision (%) | 98.18 | 99.21 |
| Recall (%) | 98.18 | 99.21 |
In this paper, we presented a synthetic data generation framework and its experimental results. Based on JSON schema attributes, the framework can produce correct synthetic datasets. In addition to the early schema validation, we used both the original and the synthetic datasets to train a machine learning model. The logistic regression model appears to handle the classification successfully, as the accuracy, precision, and recall scores for both the real and synthetic datasets were approximately 98-99 percent. A predictive comparison model was developed using the synthetic and original datasets based on the benchmark dataset, and analysis of the synthetic dataset's predictive model shows that it can be deployed in edge analytics in place of real-world datasets; there is no significant difference between the real-world dataset and the synthetic dataset.
The Air Quality Data in India (2015-2020) dataset used in this study is freely available on Kaggle: https://www.kaggle.com/rohanrao/air-quality-data-in-india. Access requires free registration to Kaggle and agreement to the terms of use.
Zenodo: Synthetic time series data generation for edge analytics. https://doi.org/10.5281/zenodo.5673924 (Kannan, 2021a).
This project contains the following underlying data:
- Json Schema for Original Datasets.docx (JSON Schema showing the original sensor data variables).
- Json Schema for Synthetic Dataset.txt (JSON Schema showing generated synthetic data variables).
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Source code available from: https://github.com/Subar1/synthetic/tree/1.0
Archived source code at time of publication: https://doi.org/10.5281/zenodo.5726027 (Kannan, 2021b)
License: MIT
Open peer review summary (Version 1):

- Is the rationale for developing the new method (or application) clearly explained? Partly
- Is the description of the method technically sound? Partly
- Are sufficient details provided to allow replication of the method development and its use by others? Yes
- If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
- Are the conclusions about the method and its performance adequately supported by the findings presented in the article? No

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Industrial time series analytics; machine learning for time series data; edge analytics.