<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.123616.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>An optimized approach for class imbalance problem in heterogeneous cross project defect prediction</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 1 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Goel</surname>
                        <given-names>Lipika</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-1609-2475</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Nandal</surname>
                        <given-names>Neha</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-2566-5925</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Gupta</surname>
                        <given-names>Sonam</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Computer Science and Engineering, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana, 500090, India</aff>
                <aff id="a2">
                    <label>2</label>Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, Uttar Pradesh, 201009, India</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:lipika1670@grietcollege.com">lipika1670@grietcollege.com</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>16</day>
                <month>9</month>
                <year>2022</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2022</year>
            </pub-date>
            <volume>11</volume>
            <elocation-id>1060</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>26</day>
                    <month>8</month>
                    <year>2022</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2022 Goel L et al.</copyright-statement>
                <copyright-year>2022</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/11-1060/pdf"/>
            <abstract>
                <p>
                    <bold>Background:</bold> Recent studies have shown that Cross Project Defect Prediction (CPDP) is feasible for software defect prediction. When the source and target projects have the same metric sets, the setting is termed homogeneous CPDP. Current CPDP strategies are difficult to apply across projects with different metric sets. In addition, training data often suffers from class imbalance: the numbers of defective (bug-ridden) and non-defective (clean) instances in the source data are usually unequal. To address these issues, we propose a heterogeneous cross-project defect prediction framework that can predict defects across projects with different metric sets.</p>
                <p>
                    <bold>Methods:</bold> To construct a prediction framework between projects with heterogeneous metric sets, our heterogeneous cross-project defect prediction approach uses metric selection, metric matching, and class imbalance (CIB) learning, followed by ensemble modelling. For our study, we considered six open-source object-oriented projects.</p>
                <p>
                    <bold>Results:</bold> The proposed model resolved the class imbalance issue and recorded the highest recall value of 7.5 with an f-score value of 7.4 in comparison with other baseline models. The highest AUC (area under curve) value of 0.86 was also recorded. K-fold cross validation was performed to evaluate the training accuracy of the model. The proposed optimized model was validated using the Wilcoxon signed rank test (WSR) with a significance level of 5% (i.e., &#x03b1; = 0.05).</p>
                <p>
                    <bold>Conclusions:</bold> Our empirical research on these six projects shows that predictions based on our methodology outperform or are statistically comparable to Within-Project Defect Prediction (WPDP) and other heterogeneous CPDP baseline models.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Cross Project Defect Prediction (CPDP)</kwd>
                <kwd>Class Imbalance (CIB)</kwd>
                <kwd>Ensemble Modeling</kwd>
                <kwd>Heterogeneous</kwd>
                <kwd>Metric Matching</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec1" sec-type="intro">
            <title>Introduction</title>
            <p>Every Software Quality Model (SQM) mainly includes two activities for achieving good-quality software: software quality assurance (SQA), which aims at the best achievable quality, and software defect prediction (SDP), which aims to predict as many defects as possible. Together, these help ensure a high-quality software product. Over the last few decades, many SDP models have been built by applying various data mining techniques to software defect databases.
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>
                </sup>
            </p>
            <sec id="sec2">
                <title>Within-Project Defect Prediction (WPDP)</title>
                <p>Defect prediction models are mainly designed for within-project defect prediction. In this scenario, the model is trained on one part of a project's dataset whose instances carry defective or non-defective labels (the training set) and is tested on the remaining part of the same project, for which the labels are predicted (the testing set).
                    <sup>
                        <xref ref-type="bibr" rid="ref2">2</xref>
                    </sup>
                </p>
            </sec>
            <sec id="sec3">
                <title>Cross-Project Defect Prediction (CPDP)</title>
                <p>However, as technologies and the demand for applied technologies grow, the challenge of defect prediction increases manifold, because obtaining a labeled dataset for every project in order to train a within-project defect prediction model is not cost-effective.
                    <sup>
                        <xref ref-type="bibr" rid="ref3">3</xref>
                    </sup> When software is newly built and has no historical record of defects, how can defects be predicted for such a newly developed project?
                    <sup>
                        <xref ref-type="bibr" rid="ref4">4</xref>
                    </sup>
                </p>
                <p>Several approaches have been proposed for CPDP.
                    <sup>
                        <xref ref-type="bibr" rid="ref5">5</xref>
                    </sup> In such approaches, a prediction model is trained on the labeled instances of one project (the training project) and tested on the unlabeled instances of another project (the testing project). CPDP can be further categorized as follows: 1) A model trained using only the common metrics that exist in both the training and the testing project is called Homogeneous CPDP (HDP). In this technique, predictions are made for the unlabeled instances of the testing project. 2) When the metrics/features are specific to the training and the testing project, i.e., the source and the target projects have different feature sets, it becomes a challenge for the prediction model to predict the defects of the target project. This problem may arise when projects are written in different languages
                    <sup>
                        <xref ref-type="bibr" rid="ref6">6</xref>
                    </sup> and leads to poor accuracy for defect prediction.
                    <sup>
                        <xref ref-type="bibr" rid="ref7">7</xref>
                    </sup> This is called the Heterogeneous CPDP (HCPDP). 
                    <xref ref-type="fig" rid="f1">Figure 1</xref> gives the description of WPDP, HDP and HCPDP.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Description of WPDP (Within-Project Defect Prediction), HDP (Homogeneous Cross Project Defect Prediction), and HCPDP (Heterogeneous Cross Project Defect Prediction).</title>
                    </caption>
                    <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure1.gif"/>
                </fig>
                <p>In the classification problem of HCPDP, the target variable is binary (classes 0 and 1), i.e., defective and non-defective. The numbers of instances in these two classes are usually unequal. A class imbalance (CIB) problem occurs when there is a distributional disparity between the defective and non-defective classes, resulting in an unbalanced training dataset. When the distribution of the two classes is imbalanced, predictions become biased.</p>
                <p>In this paper, an optimized approach to handle the CIB issue in HCPDP is proposed. First, the framework matches the metrics of the source and target datasets. It then deals with the dataset's disparity by partitioning the training data into data frames, each with roughly equal numbers of defective and non-defective instances. Second, the proposed framework performs HCPDP with an ensemble model that uses maximum voting. The training accuracy of the proposed framework is assessed using K-fold cross validation. Finally, the proposed framework is validated using the Wilcoxon signed rank test.
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup> This research report addresses the following research questions:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Q1. Are the results obtained using the optimized approach comparable to those of WPDP?</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Q2. (i) Does the proposed model resolve the CIB problem in the source data?</p>
                            <p>(ii) Does it outperform an HCPDP model trained on an imbalanced dataset?</p>
                        </list-item>
                    </list>
                </p>
                <p>The significant contributions are:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>To propose an optimized approach for the CIB issue in HCPDP.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>To develop a standalone ensemble framework for HCPDP.</p>
                        </list-item>
                    </list>
                </p>
                <p>The rest of the paper is organized as follows: The HCPDP literature review is discussed in Section 2. Section 3 presents the preliminaries, covering the datasets, the feature selection process, and the correlation methodology. The proposed ensemble framework, which involves the flow from data acquisition to modelling, is described in Section 4. The experimental setup is outlined in Section 5, and the experimentally observed results are presented in Section 6. The validation of the results is covered in Section 7. Section 8 concludes the paper.</p>
            </sec>
        </sec>
        <sec id="sec4">
            <title>State of the art</title>
            <p>In 2002, researchers presented work on Multivariate Adaptive Regression Splines (MARS), which gained attention in CPDP.
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup> The work was based on design data of Xpose and Jwriter.
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup> The model was trained on the Xpose dataset and used for fault prediction in Jwriter. Compared with linear regression, the MARS model provided better results.</p>
            <p>In 2009, researchers drew defect datasets for different software systems from several distinct sources and pre-processed the data by removing redundant, noisy and irrelevant data before model training. The work was performed on 10 different projects using the nearest-neighbor methodology.
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> The study concluded that the model performed well for WPDP. Also in 2009, log transformation was applied to find similarity between the testing and training projects and thereby avoid project dependency.
                <sup>
                    <xref ref-type="bibr" rid="ref7">7</xref>
                </sup> In 2011, investigators worked on defect prediction for web browsers. Classification was performed using process parameters and coding standards. The model was trained on the Mozilla Firefox web browser and prediction was performed on Internet Explorer. The reported results claimed that the proposed model performs well with Mozilla Firefox as the testing project.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup> The authors of this work also asserted that the quality and accuracy of a defect model can vary when examined from different aspects, and supported this claim with experiments.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup> The results state that local behavior is better than global behavior. In 2012, a scholar presented experimental work considering evaluation measures including precision, recall and F-measure. The work states that these measures alone are not enough to assure the quality of defect prediction across different models, and that the area under the curve provides an effective measurement in WPDP.
                <sup>
                    <xref ref-type="bibr" rid="ref9">9</xref>
                </sup>
            </p>
            <p>In 2013, authors introduced a multi-objective approach to overcome the problems of single-objective models. In this work, a Logistic Regression model was trained with NSGA-II (Non-dominated Sorting Genetic Algorithm II).
                <sup>
                    <xref ref-type="bibr" rid="ref10">10</xref>
                </sup> In 2014, researchers presented a Universal Defect Prediction Model that used a total of 1398 projects from SourceForge and Google Code. The metrics were compared between the test and training projects; a match of at least 26 metrics was considered a success, and predictions were then made.
                <sup>
                    <xref ref-type="bibr" rid="ref11">11</xref>
                </sup> In a further study, authors considered characteristic vectors of instances as a new metric to overcome the above limitation.
                <sup>
                    <xref ref-type="bibr" rid="ref12">12</xref>
                </sup> Feature disparity was compared in the CPDP setting, and the results were negative. This experiment was done with 11 projects using three datasets. The authors also compared the performance of within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) using a feature selection methodology. Higher precision was observed in WPDP with fewer training project features, whereas CPDP provided better Recall and F-score results.</p>
            <p>In 2015, investigators presented work on Canonical Correlation Analysis (CCA) for defect prediction. It was the first work to present the concept of Heterogeneous Defect Prediction. The metric disparity problem was resolved in this work by embedding dummy metrics with null values. The experiment was done with 14 projects using 4 datasets.
                <sup>
                    <xref ref-type="bibr" rid="ref13">13</xref>
                </sup> Other scholars applied a novel transfer cost-sensitive boosting methodology to CPDP.
                <sup>
                    <xref ref-type="bibr" rid="ref14">14</xref>
                </sup> A CPDP approach based on a multi-objective Na&#x00ef;ve Bayes technique that accounts for CIB was also introduced; it performed better than all other WPDP models as well as the single-objective models. In a further study of HCPDP, researchers introduced the concept of Heterogeneous Defect Prediction (HDP).
                <sup>
                    <xref ref-type="bibr" rid="ref15">15</xref>
                </sup> An optimized HDP with metric matching and selection was also proposed.
                <sup>
                    <xref ref-type="bibr" rid="ref16">16</xref>
                </sup> The work was done on 28 different projects, and the results were compared with WPDP, which performed better than some statistical projects.</p>
            <p>In 2016, authors worked on a transfer learning method using 34 projects from five datasets. Unlike in CCA, metrics with null values were not embedded. The results of the work were compared with WPDP.
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup>
            </p>
            <p>A comparison of filtration methods for defect prediction was carried out in 2017.
                <sup>
                    <xref ref-type="bibr" rid="ref18">18</xref>
                </sup> The work states that choosing the right filtration method can strongly affect the model&#x2019;s capability. Four filtration methods were compared: the Data Characteristic Based Filter (DCBF), the Source project data Guided Filter (SGF), the Target project data Guided Filter (TGF) and the Local Cluster Based Filter (LCBF). The work also proposed the Hierarchical Selection Based Filter (HSBF) to overcome the drawbacks of existing filters on large datasets and improve scalability.</p>
            <p>Also in 2017, a novel approach named FESCH (Feature Selection using Clusters of Hybrid-data) was proposed, which performed better than ALL, WPDP and TCA+ in different domains. The results were stable and independent of the classifiers used.
                <sup>
                    <xref ref-type="bibr" rid="ref19">19</xref>
                </sup> In 2018, authors worked on reducing high-dimensional features using a domain adaptation technique. Dictionary learning techniques were applied to understand the differences between feature spaces. Three open-source projects, including NetGene, AEEEM and NASA, were used, with F-measure and Recall as evaluation measures, for comparison of HAD (Heterogeneous Defect Adaptation).
                <sup>
                    <xref ref-type="bibr" rid="ref20">20</xref>
                </sup> Existing CPDP models were further investigated and compared in search of the best-suited methods. AUC (area under curve) was used to compare against other works and check the difference in results. In 2019, a simple neural network was proposed for CPDP to tackle HDP, in which a cross-entropy function was applied for the classification error.
                <sup>
                    <xref ref-type="bibr" rid="ref21">21</xref>
                </sup>
            </p>
            <p>In 2021, researchers presented work on semi-supervised learning for heterogeneous defect prediction. Open-source projects were used for the analysis. Metric representation and canonical correlation analysis were introduced to analyze projects from different companies.
                <sup>
                    <xref ref-type="bibr" rid="ref22">22</xref>
                </sup> A total of 26,407 unlabeled modules were collected from GitHub, and this dataset can be extended as required. The reported results stated that prediction was optimized through this work. The authors trained and predicted on projects from different companies, for which a cross-project domain prediction task was used. The work concluded that the cost of FPR increased as DDR (defect detection rate) increased. Also in 2021, authors used correlation coefficients for heterogeneous defect prediction.
                <sup>
                    <xref ref-type="bibr" rid="ref23">23</xref>
                </sup> The work was entirely experimental and was compared with baseline models. The reported results stated that the proposed model performed better than the other baseline models.</p>
            <p>In 2022, investigators worked on heterogeneous feature selection using nested stacking.
                <sup>
                    <xref ref-type="bibr" rid="ref24">24</xref>
                </sup> The work presented experiments on two datasets, KAMEI and PROMISE, and was evaluated using two indicators, area under the curve and F1-score. The reported results indicated that the proposed model outperformed other baseline models.</p>
            <p>The major gap identified in the state of the art is a solution to the class imbalance issue in HCPDP. This issue arises from the imbalance between the numbers of faulty and non-faulty instances, and it limits the effectiveness of defect prediction models in practical scenarios. Although many studies have considered this issue, a more optimized solution to the CIB problem is still needed, particularly in HCPDP.</p>
        </sec>
        <sec id="sec5">
            <title>Preliminaries</title>
            <sec id="sec6">
                <title>Datasets</title>
                <p>This HCPDP study relied on well-defined datasets. For the experiment, six open-source projects were considered, and the datasets were taken from 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/">OpenML</ext-link>, an open platform for sharing datasets. The various versions of these datasets can be downloaded for free (please see 
                    <italic toggle="yes">Underlying data</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref34">34</xref>
                    </sup> for information on where the full data can be accessed).</p>
                <p>Data related to each class of the project was taken for the experiment. 
                    <xref ref-type="table" rid="T1">Table 1</xref> contains a summary of the dataset. We considered the object-oriented projects of NASA and SOFTLAB with different sets of features in our analysis since we concentrated on HCPDP.</p>
                <table-wrap id="T1" orientation="portrait" position="float">
                    <label>Table 1. </label>
                    <caption>
                        <title>Details of the datasets.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">S.No.</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Group</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Dataset</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Total No of instances</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">No. of defective instances</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">No. of non-defective instances</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">1</td>
                                <td align="left" colspan="1" rowspan="3" valign="middle">NASA</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Pc2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">705</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">626</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">2</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Pc3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">1077</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">134</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">943</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Pc4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">1458</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">178</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">1280</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">4</td>
                                <td align="left" colspan="1" rowspan="3" valign="middle">SOFTLAB</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Ar3</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">63</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">8</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">55</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">5</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Ar4</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">107</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">20</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">87</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="middle">6</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">Ar6</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">101</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">15</td>
                                <td align="left" colspan="1" rowspan="1" valign="middle">86</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
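                <p>The degree of class imbalance can be read directly from the counts in 
                    <xref ref-type="table" rid="T1">Table 1</xref>. As a rough illustration, the Python sketch below computes, for each dataset, the share of defective instances and the number of non-defective instances per defective instance. The names <monospace>DATASETS</monospace> and <monospace>imbalance_summary</monospace> are illustrative assumptions and are not taken from the original study.</p>
                <code language="python"><![CDATA[
```python
# Defective and non-defective instance counts copied from Table 1.
# The dictionary layout and the function name are illustrative assumptions.
DATASETS = {
    "Pc2": (79, 626),
    "Pc3": (134, 943),
    "Pc4": (178, 1280),
    "Ar3": (8, 55),
    "Ar4": (20, 87),
    "Ar6": (15, 86),
}


def imbalance_summary(defective, non_defective):
    """Return the total, the defective share, and the majority/minority ratio."""
    total = defective + non_defective
    return {
        "total": total,
        "defect_rate": defective / total,
        "imbalance_ratio": non_defective / defective,
    }


if __name__ == "__main__":
    for name, (defective, non_defective) in DATASETS.items():
        s = imbalance_summary(defective, non_defective)
        print(f"{name}: {s['total']} instances, "
              f"{s['defect_rate']:.1%} defective, "
              f"{s['imbalance_ratio']:.1f} non-defective per defective")
```
]]></code>
                <p>With these counts, the defective class is the minority in all six datasets, at roughly 11% to 19% of the instances.</p>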
            </sec>
            <sec id="sec7">
                <title>Feature selection</title>
                <p>The Extra Tree Classifier
                    <sup>
                        <xref ref-type="bibr" rid="ref33">33</xref>
                    </sup> is an ensemble learning technique in which the outputs of several distinct, mutually uncorrelated decision trees, collected into a &#x201c;forest,&#x201d; are aggregated to generate the predictive model. For feature selection, the Gini Importance of each feature is computed. The features are then sorted in descending order of their Gini Importance, and the user selects the top k features as required by the experiment.
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>STEP 1: Build an extra-trees forest over the given data set with the desired number of decision trees. For each decision tree, choose the number of attributes/features in the random sample.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>STEP 2: During construction of the forest, compute for every attribute/feature the normalized total reduction in the splitting criterion obtained from splits on that feature (the Gini Importance of the feature).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>STEP 3: Rank the features by their Gini Importance and select the top k features.</p>
                        </list-item>
                    </list>
                </p>
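                <p>As an illustration, the three steps above can be sketched in Python with the ExtraTreesClassifier of scikit-learn. This is a minimal sketch, not the implementation used in this study: the synthetic data, the function name top_k_features, and the parameters n_trees and seed are assumptions made for the example.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def top_k_features(X, y, feature_names, k, n_trees=100, seed=0):
    """Rank features by Gini Importance and return the k best as (name, score) pairs."""
    # STEP 1: build the extra-trees forest (n_trees and seed are illustrative choices).
    forest = ExtraTreesClassifier(n_estimators=n_trees, criterion="gini", random_state=seed)
    forest.fit(X, y)
    # STEP 2: normalized total impurity reduction contributed by each feature.
    importances = forest.feature_importances_
    # STEP 3: sort in descending order and keep the top k.
    order = np.argsort(importances)[::-1][:k]
    return [(feature_names[i], float(importances[i])) for i in order]


# Synthetic stand-in for a defect data set (not the data used in the study).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
names = [f"metric_{i}" for i in range(X.shape[1])]
selected = top_k_features(X, y, names, k=3)
for name, score in selected:
    print(f"{name}: {score:.3f}")
```

                <p>The importances returned by scikit-learn are normalized to sum to one, so the printed scores can be read as each feature's share of the total impurity reduction.</p>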
                <p>The following approach is used in our work to select the k important attributes/features from the target and the source projects.</p>
                <p>To calculate the Gini Importance, the entropy is first calculated using the following formula:
                    <disp-formula id="e1">
                        <mml:math display="block">
                            <mml:mtext mathvariant="italic">Entropy</mml:mtext>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>S</mml:mi>
                            </mml:mfenced>
                            <mml:mo>=</mml:mo>
                            <mml:munderover>
                                <mml:mo>&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>i</mml:mi>
                                    <mml:mo>=</mml:mo>
                                    <mml:mn>1</mml:mn>
                                </mml:mrow>
                                <mml:mi>c</mml:mi>
                            </mml:munderover>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:msub>
                                <mml:mi>p</mml:mi>
                                <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:msub>
                                <mml:mo>log</mml:mo>
                                <mml:mn>2</mml:mn>
                            </mml:msub>
                            <mml:mfenced close=")" open="(">
                                <mml:msub>
                                    <mml:mi>p</mml:mi>
                                    <mml:mi>i</mml:mi>
                                </mml:msub>
                            </mml:mfenced>
                        </mml:math>
                        <label>(1)</label>
                    </disp-formula>where, 
                    <italic toggle="yes">c</italic> is the number of unique class labels and 
                    <italic toggle="yes">p</italic>
                    <sub>
                        <italic toggle="yes">i</italic>
                    </sub> is the proportion of rows with output label as 
                    <italic toggle="yes">i</italic>.
                    <disp-formula id="e2">
                        <mml:math display="block">
                            <mml:mtext mathvariant="italic">Gain</mml:mtext>
                            <mml:mfenced close=")" open="(" separators=",">
                                <mml:mi>S</mml:mi>
                                <mml:mi>A</mml:mi>
                            </mml:mfenced>
                            <mml:mo>=</mml:mo>
                            <mml:mtext mathvariant="italic">Entropy</mml:mtext>
                            <mml:mfenced close=")" open="(">
                                <mml:mi>S</mml:mi>
                            </mml:mfenced>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:munder>
                                <mml:mo>&#x2211;</mml:mo>
                                <mml:mrow>
                                    <mml:mi>v</mml:mi>
                                    <mml:mo>&#x2208;</mml:mo>
                                    <mml:mtext mathvariant="italic">Values</mml:mtext>
                                    <mml:mfenced close=")" open="(">
                                        <mml:mi>A</mml:mi>
                                    </mml:mfenced>
                                </mml:mrow>
                            </mml:munder>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mo>|</mml:mo>
                                    <mml:msub>
                                        <mml:mi>S</mml:mi>
                                        <mml:mi>v</mml:mi>
                                    </mml:msub>
                                    <mml:mo>|</mml:mo>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mo>|</mml:mo>
                                    <mml:mi>S</mml:mi>
                                    <mml:mo>|</mml:mo>
                                </mml:mrow>
                            </mml:mfrac>
                            <mml:mtext mathvariant="italic">Entropy</mml:mtext>
                            <mml:mfenced close=")" open="(">
                                <mml:msub>
                                    <mml:mi>S</mml:mi>
                                    <mml:mi>v</mml:mi>
                                </mml:msub>
                            </mml:mfenced>
                        </mml:math>
                        <label>(2)</label>
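                    </disp-formula>where <italic toggle="yes">A</italic> is the feature (metric) under consideration, Values(<italic toggle="yes">A</italic>) is the set of values it takes, and <italic toggle="yes">S</italic><sub><italic toggle="yes">v</italic></sub> is the subset of <italic toggle="yes">S</italic> for which <italic toggle="yes">A</italic> has the value <italic toggle="yes">v</italic>.</p>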
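                <p>Equations (1) and (2) can be checked on a small example. The following sketch uses only the Python standard library; the function names and the toy data (a two-valued metric and a defect label for eight modules) are hypothetical and serve only to illustrate the arithmetic.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i), as in Equation (1)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v), as in Equation (2)."""
    total = len(labels)
    subsets = {}
    # Group the class labels by the value v that feature A takes.
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy example: metric A for eight modules and whether each module is defective.
A = ["high", "high", "high", "high", "low", "low", "low", "low"]
defective = [1, 1, 1, 0, 0, 0, 0, 0]
print(round(entropy(defective), 4))
print(round(information_gain(A, defective), 4))
```

                <p>Here the entropy of the full set is about 0.95 and the gain of the metric is about 0.55, because the &#x201c;low&#x201d; subset is pure and contributes no entropy.</p>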
            </sec>
            <sec id="sec8">
                <title>Spearman correlation</title>
                <p>To understand the Spearman correlation,
                    <sup>
                        <xref ref-type="bibr" rid="ref11">11</xref>
                    </sup> we first need the notion of a monotonic function: a function that is either entirely non-increasing or entirely non-decreasing. Spearman correlation is a non-parametric statistic that measures the strength of a monotonic relationship between two variables. The correlation coefficient is usually denoted by &#x03c1; or <italic toggle="yes">r</italic><sub>s</sub> and takes a value from &#x2212;1 to 1. The test makes no assumption about the distribution of the data and is suitable for variables measured on at least an ordinal scale.</p>
                <p>The Spearman formula to evaluate rank correlation is as follows:</p>
                <p>If there are no tied ranks, then we use the following formula:
                    <disp-formula id="e3">
                        <mml:math display="block">
                            <mml:mi mathvariant="normal">&#x03c1;</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mn>1</mml:mn>
                            <mml:mo>&#x2212;</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:mn>6</mml:mn>
                                    <mml:mo>&#x2211;</mml:mo>
                                    <mml:msubsup>
                                        <mml:mi>d</mml:mi>
                                        <mml:mi>i</mml:mi>
                                        <mml:mn>2</mml:mn>
                                    </mml:msubsup>
                                </mml:mrow>
                                <mml:mrow>
                                    <mml:mi>n</mml:mi>
                                    <mml:mfenced close=")" open="(">
                                        <mml:mrow>
                                            <mml:msup>
                                                <mml:mi>n</mml:mi>
                                                <mml:mn>2</mml:mn>
                                            </mml:msup>
                                            <mml:mo>&#x2212;</mml:mo>
                                            <mml:mn>1</mml:mn>
                                        </mml:mrow>
                                    </mml:mfenced>
                                </mml:mrow>
                            </mml:mfrac>
                        </mml:math>
                        <label>(3)</label>
                    </disp-formula>where, 
                    <italic toggle="yes">d</italic>
                    <sub>
                        <italic toggle="yes">i</italic>
                    </sub> = difference in paired ranks and 
                    <italic toggle="yes">n</italic> = number of cases.</p>
                <p>The formula when there are tied ranks is:
                    <disp-formula id="e4">
                        <mml:math display="block">
                            <mml:mi>&#x03c1;</mml:mi>
                            <mml:mo>=</mml:mo>
                            <mml:mfrac>
                                <mml:mrow>
                                    <mml:msub>
                                        <mml:mo>&#x2211;</mml:mo>
                                        <mml:mi>i</mml:mi>
                                    </mml:msub>
                                    <mml:mfenced close=")" open="(">
                                        <mml:mrow>
                                            <mml:msub>
                                                <mml:mi>x</mml:mi>
                                                <mml:mi>i</mml:mi>
                                            </mml:msub>
                                            <mml:mo>&#x2212;</mml:mo>
                                            <mml:mover accent="true">
                                                <mml:mi>x</mml:mi>
                                                <mml:mo stretchy="true">&#x00af;</mml:mo>
                                            </mml:mover>
                                        </mml:mrow>
                                    </mml:mfenced>
                                    <mml:mfenced close=")" open="(">
                                        <mml:mrow>
                                            <mml:msub>
                                                <mml:mi>y</mml:mi>
                                                <mml:mi>i</mml:mi>
                                            </mml:msub>
                                            <mml:mo>&#x2212;</mml:mo>
                                            <mml:mover accent="true">
                                                <mml:mi>y</mml:mi>
                                                <mml:mo stretchy="true">&#x00af;</mml:mo>
                                            </mml:mover>
                                        </mml:mrow>
                                    </mml:mfenced>
                                </mml:mrow>
                                <mml:msqrt>
                                    <mml:mrow>
                                        <mml:msub>
                                            <mml:mo>&#x2211;</mml:mo>
                                            <mml:mi>i</mml:mi>
                                        </mml:msub>
                                        <mml:msup>
                                            <mml:mfenced close=")" open="(">
                                                <mml:mrow>
                                                    <mml:msub>
                                                        <mml:mi>x</mml:mi>
                                                        <mml:mi>i</mml:mi>
                                                    </mml:msub>
                                                    <mml:mo>&#x2212;</mml:mo>
                                                    <mml:mover accent="true">
                                                        <mml:mi>x</mml:mi>
                                                        <mml:mo stretchy="true">&#x00af;</mml:mo>
                                                    </mml:mover>
                                                </mml:mrow>
                                            </mml:mfenced>
                                            <mml:mn>2</mml:mn>
                                        </mml:msup>
                                        <mml:msub>
                                            <mml:mo>&#x2211;</mml:mo>
                                            <mml:mi>i</mml:mi>
                                        </mml:msub>
                                        <mml:msup>
                                            <mml:mfenced close=")" open="(">
                                                <mml:mrow>
                                                    <mml:msub>
                                                        <mml:mi>y</mml:mi>
                                                        <mml:mi>i</mml:mi>
                                                    </mml:msub>
                                                    <mml:mo>&#x2212;</mml:mo>
                                                    <mml:mover accent="true">
                                                        <mml:mi>y</mml:mi>
                                                        <mml:mo stretchy="true">&#x00af;</mml:mo>
                                                    </mml:mover>
                                                </mml:mrow>
                                            </mml:mfenced>
                                            <mml:mn>2</mml:mn>
                                        </mml:msup>
                                    </mml:mrow>
                                </mml:msqrt>
                            </mml:mfrac>
                            <mml:mspace width="0.25em"/>
                        </mml:math>
                        <label>(4)</label>
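                    </disp-formula>where 
                    <italic toggle="yes">x</italic><sub><italic toggle="yes">i</italic></sub> and <italic toggle="yes">y</italic><sub><italic toggle="yes">i</italic></sub> are the ranks of the two variables for the <italic toggle="yes">i</italic>th paired score, and the overbars denote the mean ranks.</p>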
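                <p>The two formulas can be compared on a small example. The sketch below uses only the Python standard library and assigns tied values the average of their positions, a common convention that is assumed here. The function names and the two toy metric vectors are hypothetical.</p>

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the block while the next value equals the current one (a tie).
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_no_ties(x, y):
    """Equation (3): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def spearman_general(x, y):
    """Equation (4): Pearson correlation of the ranks, usable with tied ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den


source_metric = [10, 20, 30, 40, 50]
target_metric = [1, 3, 2, 5, 4]
print(spearman_no_ties(source_metric, target_metric))
print(spearman_general(source_metric, target_metric))
```

                <p>With no tied ranks, as in this example, the two functions agree (both give a coefficient of 0.8, up to floating-point rounding). When ties are present, only the second form should be used.</p>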
                <p>In our experiment, we computed the correlation between the important features (metrics) of the training and testing datasets. The details of metric matching are given in Section 4.</p>
            </sec>
            <sec id="sec9">
                <title>Class imbalance</title>
                <p>The main objective of classification predictive modeling is to predict the target class of a given data instance. When the distribution of the two classes is imbalanced, predictions become biased. This difference between the sizes of the minority and majority classes is known as the class imbalance (CIB) problem.
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup> Most machine learning models are designed under the assumption that each class has an equal number of instances, so an imbalanced dataset degrades the predictive performance of the model. Spam prediction, defect prediction, and churn prediction are real-world classification problems with class imbalance.
                    <sup>
                        <xref ref-type="bibr" rid="ref26">26</xref>
                    </sup> Resampling, by oversampling the minority class or undersampling the majority class, is one of the simplest ways to address class imbalance. Oversampling by duplicating instances may lead to overfitting, while undersampling may discard important information. The Synthetic Minority Oversampling Technique (SMOTE) and Random Undersampling (RUS) are examples of oversampling and undersampling techniques, respectively.
                    <sup>
                        <xref ref-type="bibr" rid="ref27">27</xref>
                    </sup> 
                    <xref ref-type="fig" rid="f2">Figure 2</xref> depicts an imbalanced distribution of the two classes.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Imbalanced distribution of two classes.</title>
                        <p>Authors&#x2019; own figure.</p>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure2.gif"/>
                </fig>
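                <p>The two resampling directions can be illustrated with a short sketch using only the Python standard library. It implements random undersampling and random oversampling by duplication; it is not SMOTE, which synthesizes new minority instances by interpolation. The function names and the seed parameter are hypothetical, and the 8 / 55 class split mirrors the counts listed for Ar3 in Table 1, assumed here to be defective and non-defective modules.</p>

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Drop majority-class rows at random until every class has the minority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority_size = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        keep.extend(rng.sample(idx, minority_size))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class has the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        extra = [rng.choice(idx) for _ in range(majority_size - n)]
        out_rows.extend(rows[i] for i in extra)
        out_labels.extend(labels[i] for i in extra)
    return out_rows, out_labels


# Illustrative data: 8 instances of class 1 and 55 of class 0, one dummy feature each.
labels = [1] * 8 + [0] * 55
rows = [[float(i)] for i in range(len(labels))]
under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(Counter(under_labels), Counter(over_labels))
```

                <p>Undersampling leaves 8 instances per class and discards 47 majority instances, while oversampling yields 55 per class at the cost of repeated minority instances, which matches the trade-off described above.</p>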
            </sec>
            <sec id="sec10">
                <title>Ensemble learning model (random forest &amp; XGBoost)</title>
                <p>Machine learning offers a wide range of classifiers, and observations must be classified accurately to achieve the required outcome. The random forest classifier is a prominent classifier that combines unstable models, such as decision trees, into a more robust model.
                    <sup>
                        <xref ref-type="bibr" rid="ref28">28</xref>
                    </sup> The building block of a random forest is the decision tree, an intuitive model that asks a sequence of questions about the data until it reaches a decision. The random forest is a collection of such trees, each of which outputs a class prediction; the class with the most votes becomes the model's prediction.</p>
                <p>Random forest becomes robust by having each tree sample from the data set at random with replacement, resulting in distinct trees.
                    <sup>
                        <xref ref-type="bibr" rid="ref29">29</xref>
                    </sup> This is termed bagging, and it reduces variance. Boosting, in contrast, reduces bias by training each successive model on the errors made by the preceding models.</p>
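                <p>Bagging with majority voting can be sketched as follows, using only the Python standard library. The weak learner fit_stump is a hypothetical stand-in for a decision tree (it thresholds a single feature at the midpoint of the two class means), and the toy data, the seed, and the number of models are assumptions made for the example.</p>

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples at random with replacement (the bagging step)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def fit_stump(rows, labels):
    """Hypothetical weak learner on one feature: split at the midpoint of the class means."""
    pos = [r[0] for r, lab in zip(rows, labels) if lab == 1]
    neg = [r[0] for r, lab in zip(rows, labels) if lab == 0]
    if not pos or not neg:  # the bootstrap sample contains a single class
        only = 1 if pos else 0
        return lambda row: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos > mean_neg:
        return lambda row: 1 if row[0] > threshold else 0
    return lambda row: 0 if row[0] > threshold else 1


def bagged_predict(models, row):
    """Majority vote: the class predicted by most models wins."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
rows = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
labels = [0, 0, 0, 1, 1, 1]
# Each model sees a different bootstrap sample, so the models differ from one another.
models = [fit_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(25)]
print(bagged_predict(models, [1.5]), bagged_predict(models, [8.5]))
```

                <p>An odd number of models is used so that a vote between two classes cannot end in a tie.</p>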
                <p>The two major algorithms used for boosting are:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Adaptive boosting: an algorithm that combines weak classifiers into a strong one.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Gradient boosting: a method that trains each successive model on the errors left by the models before it.</p>
                        </list-item>
                    </list>
                </p>
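                <p>The idea behind gradient boosting can be sketched for a single feature and squared-error loss, where the quantity each stage must fit is simply the residual of the current prediction. The sketch uses only the Python standard library and is not the XGBoost algorithm; the function names, the number of stages, and the learning rate are assumptions made for the example.</p>

```python
def fit_regression_stump(xs, residuals):
    """Pick the threshold on one feature that minimises the squared error of the residuals."""
    best = None
    # The largest value is excluded so that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if not x > threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or best[0] > error:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: right_mean if x > threshold else left_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Each stage fits a stump to the residuals left by the stages before it."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(2), 3), round(model(5), 3))
```

                <p>On this toy data the residuals shrink by the learning rate at every stage, so the predictions approach the targets 0 and 1 as stages are added.</p>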
                <p>XGBoost is a gradient-boosted decision tree technique that is highly flexible and efficient.
                    <sup>
                        <xref ref-type="bibr" rid="ref30">30</xref>
                    </sup> It is designed for high performance and speed, and its ability to compute in parallel on a single machine makes the algorithm faster. Instead of evaluating the loss for every possible split when generating a new branch, it considers the distribution of the features across the data set and uses this information to drastically reduce the search space for feature splits.</p>
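                <p>The reduction of the search space can be illustrated by comparing the number of candidate thresholds under an exhaustive search with the number obtained from the quantiles of the feature distribution. This is a simplified illustration using the Python standard library, not the split-finding procedure implemented in XGBoost; the function names and the n_bins parameter are hypothetical.</p>

```python
from statistics import quantiles


def exact_candidates(values):
    """Every midpoint between consecutive distinct values (exhaustive search)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def quantile_candidates(values, n_bins=4):
    """Only the cut points that divide the feature distribution into n_bins parts."""
    return sorted(set(quantiles(values, n=n_bins)))


# A feature with 100 distinct values, for example a lines-of-code metric.
feature = list(range(1, 101))
print(len(exact_candidates(feature)), len(quantile_candidates(feature)))
```

                <p>For this feature the exhaustive search evaluates 99 thresholds, whereas the quantile-based proposal evaluates 3, at the cost of possibly missing the exact best split.</p>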
            </sec>
        </sec>
        <sec id="sec11">
            <title>Proposed methodology</title>
            <p>The flowchart of the proposed framework is presented in 
                <xref ref-type="fig" rid="f3">Figure 3</xref>. The proposed methodology is discussed under the following heads:
                <list list-type="order">
                    <list-item>
                        <label>1.</label>
                        <p>Data Acquisition and Understanding.</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>Data Preprocessing and Preparation.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>Modeling.</p>
                    </list-item>
                </list>
            </p>
            <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                <label>Figure 3. </label>
                <caption>
                    <title>Flowchart of the proposed framework.</title>
                </caption>
                <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure3.gif"/>
            </fig>
            <sec id="sec12">
                <title>Data acquisition and understanding</title>
                <p>In this study, we considered six open-source object-oriented projects. The details of the projects are given in 
                    <xref ref-type="table" rid="T1">Table 1</xref>. Source&#x2013;target project pairs are formed from various combinations of these projects. The feature sets of the projects are not identical. The datasets used are described in detail in Section 3.</p>
            </sec>
            <sec id="sec13">
                <title>Data pre-processing and preparation</title>
                <p>Data pre-processing is an effective way of improving the model&#x2019;s performance, as it enhances the quality of both the training and the testing data. First, the data are checked for missing and null values; missing values of an attribute are filled with the mean of that attribute. A label encoder is used for encoding. This is followed by data normalization. The pre-processing was carried out using 
                    <ext-link ext-link-type="uri" xlink:href="https://www.python.org/">Python</ext-link> 3.8.3.</p>
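                <p>The three pre-processing steps can be sketched with the Python standard library as follows. This is an illustrative sketch, not the code used in this study: the function names and the toy columns are hypothetical, and min-max scaling is assumed because the normalization method is not specified above.</p>

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def label_encode(column):
    """Map each distinct category to an integer, in sorted order of the categories."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping


def min_max_normalize(column):
    """Rescale values to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical columns: a size metric with one missing value and a defect label.
loc_total = [10.0, None, 30.0, 50.0]
defects = ["N", "Y", "N", "Y"]
filled = impute_mean(loc_total)
encoded, mapping = label_encode(defects)
scaled = min_max_normalize(filled)
print(filled, encoded, scaled)
```

                <p>In practice, the fill value and the scaling range would be computed on the training data and then reused for the testing data, so that no information from the target project leaks into the model.</p>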
            </sec>
            <sec id="sec14">
                <title>Modeling</title>
                <p>A set of classifiers is combined in an ensemble learning model, implemented in 
                    <ext-link ext-link-type="uri" xlink:href="https://www.python.org/">Python</ext-link> 3.8.3 (see 
                    <italic toggle="yes">Software availability</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref35">35</xref>
                    </sup>). All experiments in this study were performed in Python 3.8.3. Combining the base classifiers increases the classification model's overall efficiency. Once all of the base classifiers are integrated, the class label of a data tuple is decided by the majority vote of the base classifiers. The phases of modelling in the proposed framework are as follows.</p>
            </sec>
            <sec id="sec15">
                <title>Feature and metric selection</title>
                <p>For each of the training and testing datasets, the Extra Tree Classifier implemented in Python 3.8.3 (see 
                    <italic toggle="yes">Software availability</italic>
                    <sup>
                        <xref ref-type="bibr" rid="ref35">35</xref>
                    </sup>) was used to select the important features. The working of the Extra Tree Classifier is discussed in Section 3.
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup> The top 15% of metrics were selected using the feature selection technique. The ten features with the highest Gini Importance were selected.
                    <sup>
                        <xref ref-type="bibr" rid="ref31">31</xref>
                    </sup> The list of features selected from each dataset is stated in 
                    <xref ref-type="table" rid="T2">Table 2</xref>. 
                    <xref ref-type="fig" rid="f4">Figure 4</xref> shows the Gini Importance of the top ten features of each of the considered projects in defect prediction.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>Table 2. </label>
                    <caption>
                        <title>The list of important features from each open-source project considered in this study.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">PC2</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">PC3</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">PC4</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR3</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR4</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR6</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Volume</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Decision Density</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc_Code_And_Comment</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Total_Operators</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Volume</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Length</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Num Operands</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Number of Lines</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Percent_Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Length</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Length</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Executable Loc</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Length</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Call Pairs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc_Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Difficulty</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Error</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Prog Time</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">No. of Unique Operands</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc_Blank</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Effort</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Effort</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Condition Count</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Num Unique Operators</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Parameter Count</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Num_Unique_Operators</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cyclomatic_Complexity</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Error</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Branch Count</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Content</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc_Total</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Vocabulary</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Total Operands</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Decision Count</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Difficulty</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc Code &amp; Comment</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Length</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Design_Density</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Unique Operands</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Level</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Percent Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cyclomatic_Density</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead_Error</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Total Loc</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cyclomatic Density</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Effort</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Percent Comments</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Node_Count</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Branch_Count</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Halstead Vocabulary</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Normalized Cyclomatic Complexity</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Parameter Count</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Loc Blank</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Call_Pairs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Total_Operands</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Unique Operators</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cyclomatic Complexity</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Gini importance of the features of projects ar4, ar6, pc2 and pc3 in panels (a), (b), (c) and (d), respectively (y-axis: features; x-axis: Gini importance values).</title>
                    </caption>
                    <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure4.gif"/>
                </fig>
                <p>
                    <italic toggle="yes">Matching metrics</italic>
                </p>
                <p>The source and target datasets have different features/metrics. The similarity between each pair of source and target features/metrics is measured using Spearman's rank correlation. The essential idea is to calculate a matching (correlation) score for every combination of a source (training) metric and a target (testing) metric. 
                    <xref ref-type="fig" rid="f5">Figure 5</xref> depicts a sample matching.</p>
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>Figure 5. </label>
                    <caption>
                        <title>Sample matching.</title>
                    </caption>
                    <graphic id="gr5" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure5.gif"/>
                </fig>
                <p>Let us consider two source/training metrics (A1, A2) and two target/testing metrics (B1, B2). There are thus four possible matching pairs of metrics, i.e., (A1, B1), (A2, B1), (A1, B2) and (A2, B2). A cutoff threshold is set to discard the poorly correlated/matched metric pairs. In our experiment, the cutoff is set to 0.05, a commonly used value that has been reported to have a positive impact on predictions.
                    <sup>
                        <xref ref-type="bibr" rid="ref11">11</xref>
                    </sup> We retain only those pairs whose matching score is &gt;0.05 and build the ensemble prediction model from the retained pairs. In 
                    <xref ref-type="fig" rid="f5">Figure 5</xref>, the pair (A2, B2) is discarded since its correlation/matching value is &lt;0.05. Thus, the matching pairs to be considered are (A2, B1), (A1, B1) and (A1, B2). Because a metric can still appear in more than one of the remaining pairs, we convert the relationship to one-to-one by selecting the pairs with the maximum matching scores, while also aiming to keep the maximum number of matched attributes/features in each source and target pair. 
                    <xref ref-type="table" rid="T3">Table 3</xref> tabulates a few samples of the final selected source and target matching pairs with their matching scores.</p>
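                <p>The cutoff and one-to-one selection steps described above can be illustrated with a minimal Python sketch. It assumes that the matching scores have already been computed, and it resolves conflicts greedily from the highest score downwards; this greedy rule is an assumption made for illustration and is not necessarily the exact procedure used in the experiments. The function name <monospace>select_matching_pairs</monospace> and the example scores are hypothetical.</p>
                <code language="python"><![CDATA[
```python
def select_matching_pairs(scores, cutoff=0.05):
    """Return one-to-one (source, target, score) matches above the cutoff.

    scores maps (source_metric, target_metric) to a matching score.
    Greedy selection is an illustrative assumption, not the paper's code.
    """
    # Step 1: discard poorly matched pairs (score not above the cutoff).
    candidates = [(s, t, v) for (s, t), v in scores.items() if v > cutoff]

    # Step 2: visit the surviving pairs from highest to lowest score.
    candidates.sort(key=lambda item: item[2], reverse=True)

    # Step 3: keep a pair only if neither of its metrics is already matched,
    # which makes the final relationship one-to-one.
    used_source, used_target, selected = set(), set(), []
    for s, t, v in candidates:
        if s not in used_source and t not in used_target:
            selected.append((s, t, v))
            used_source.add(s)
            used_target.add(t)
    return selected


# Hypothetical scores for the A1/A2 versus B1/B2 example in the text.
example_scores = {
    ("A1", "B1"): 0.10,
    ("A2", "B1"): 0.30,
    ("A1", "B2"): 0.25,
    ("A2", "B2"): 0.02,  # below the cutoff, so it is discarded
}

matches = select_matching_pairs(example_scores)
print(matches)
```
]]></code>
                <p>With these hypothetical scores, (A2, B2) is removed by the cutoff, (A2, B1) and (A1, B2) are selected, and (A1, B1) is dropped because both of its metrics are already matched. A maximum-weight bipartite matching could be substituted for the greedy step if the total matching score should be maximized instead.</p>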
                <table-wrap id="T3" orientation="portrait" position="float">
                    <label>Table 3. </label>
                    <caption>
                        <title>(i), (ii), (iii) - Source and target matching pairs with their matching scores.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="11" rowspan="1" valign="top">(i) (PC2&#x2192;AR4)</th>
                            </tr>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR4
                                    <break/>PC2</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Volume</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Length</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Difficulty</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Effort</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Error</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Total Operands</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Unique Operands</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Total Loc</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Vocabulary</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Unique Operators</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Volume</bold>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.25</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Num Operands</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Length</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.22</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Prog Time</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.19</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Num Unique Operators</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.17</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Loc Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Difficulty</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.16</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Percent Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.21</td>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Effort</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Parameter Count</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.21</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                        </tbody>
                    </table>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="11" rowspan="1" valign="top">(ii) (PC3&#x2192;AR3)</th>
                            </tr>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR3
                                    <break/>PC3</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Total_Operators</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead_Length</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead_Time</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead_Effort</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Cyclomatic_Complexity</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead_Vocabulary</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Design_Density</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead_Error</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Branch_Count</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Total_Operands</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Decision Density</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.16</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Number of Lines</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.21</td>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Call Pairs</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>No. of Unique Operands</bold>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.23</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Parameter Count</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Content</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.19</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Code &amp; Comment</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.23</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.16</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Percent Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.21</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Blank</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                        </tbody>
                    </table>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="11" rowspan="1" valign="top">(iii) (PC3&#x2192;AR6)</th>
                            </tr>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">AR6
                                    <break/>PC3</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Length</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Executable Loc</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Error</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Condition Count</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Branch Count</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Decision Count</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Halstead Level</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Cyclomatic Density</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Normalized Cyclomatic Complexity</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Cyclomatic Complexity</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Decision Density</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.16</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Number of Lines</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Call Pairs</bold>
                                </td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.19</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>No. of Unique Operands</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Parameter Count</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.18</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Halstead Content</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.19</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Code &amp; Comment</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.21</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.22</td>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>Percent Comments</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.15</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <bold>LOC Blank</bold>
                                </td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.17</td>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                                <td colspan="1" rowspan="1"/>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>
                    <italic toggle="yes">Class imbalance learning</italic>
                </p>
                <p>The datasets used in this study are highly imbalanced: the numbers of defect-prone and non-defect-prone instances differ considerably. This imbalance, known as the class imbalance (CIB) learning problem, degrades the model's overall performance. The numbers of defective and non-defective instances for each project are given in 
                    <xref ref-type="table" rid="T1">Table 1</xref>. The following steps, as shown in 
                    <xref ref-type="fig" rid="f6">Figure 6</xref>, are taken to overcome this issue.</p>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>Figure 6. </label>
                    <caption>
                        <title>Optimized approach for resolving class imbalance.</title>
                        <p>XGB, XGBoost; RF, Random Forest.</p>
                    </caption>
                    <graphic id="gr6" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure6.gif"/>
                </fig>
                <p>After the important and correlated features in the source and target projects have been selected, the defect-prone and non-defect-prone instances in the source (training) dataset are separated.</p>
                <p>Let x denote the number of non-defect-prone instances and y the number of defect-prone instances. The number of non-defective instances is approximately &#x03b1; (= 7 in the experiment) times the number of defective instances, i.e. x/y &#x2248; &#x03b1;.</p>
                <p>The non-defect-prone instances of the training dataset were therefore partitioned into &#x03b1; frames of y instances each (1&#x2026;y, y+1&#x2026;2y, 2y+1&#x2026;3y, 3y+1&#x2026;4y, &#x2026;, (&#x03b1;&#x2212;1)y+1&#x2026;&#x03b1;y). Since &#x03b1; = 7, the non-defective instances of the training dataset were segmented into seven frames in this experiment. After this separation, the defect-prone instances of the training dataset are appended to each data frame.</p>
                <p>As a result, each frame contains the y defect-prone instances together with about y non-defect-prone instances, so the two classes are approximately balanced. These seven balanced, independent data frames are then fed to the ensemble model. 
                    <xref ref-type="fig" rid="f7">Figure 7</xref> depicts the ensemble approach of the proposed methodology.</p>
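                <p>The framing step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name <monospace>make_balanced_frames</monospace> is hypothetical, instances are represented as plain list items, and letting the last frame absorb any remainder when x is not an exact multiple of y is an assumption that the text does not specify.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_balanced_frames(defective: Sequence[T],
                         non_defective: Sequence[T]) -> List[List[T]]:
    """Cut the majority class into chunks of y rows and append all y minority rows to each."""
    y = len(defective)
    if y == 0:
        raise ValueError("need at least one defect-prone instance")
    # alpha is the imbalance ratio x / y rounded down (7 in the experiment).
    alpha = max(1, len(non_defective) // y)
    frames = []
    for k in range(alpha):
        start = k * y
        # Assumption: the last frame takes any leftover majority rows.
        stop = (k + 1) * y if k != alpha - 1 else len(non_defective)
        frames.append(list(non_defective[start:stop]) + list(defective))
    return frames
```
]]></code>
                <p>With 2 defect-prone and 14 non-defect-prone instances, for example, the sketch returns seven frames of four instances each, every frame holding both defect-prone instances.</p>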
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>Figure 7. </label>
                    <caption>
                        <title>Ensemble approach of the proposed methodology.</title>
                        <p>XGB, XGBoost; RF, Random Forest.</p>
                    </caption>
                    <graphic id="gr7" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure7.gif"/>
                </fig>
                <p>
                    <italic toggle="yes">Building the prediction model</italic>
                </p>
                <p>Modelling consists of a model-fitting step and a voting step. After the CIB problem has been resolved, each training dataset yields seven balanced dataframes.</p>
                <p>Random Forest and XGBoost are ensemble models that improve classification performance; therefore, in the experiment the odd-numbered dataframes were modeled using the Random Forest classifier and the even-numbered dataframes using XGBoost.</p>
                <p>For a given instance, the outputs of all these classifiers are combined using maximum (majority) voting. 
                    <xref ref-type="fig" rid="f7">Figure 7</xref> shows the ensemble approach of the proposed methodology.</p>
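                <p>The alternating assignment of classifiers to dataframes might look like the sketch below. It assumes only that each classifier object exposes a <monospace>fit(features, labels)</monospace> method; the function name and the two factory arguments are hypothetical, and the article does not state which software libraries were used.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
from typing import Any, Callable, List, Sequence, Tuple

# A frame is a pair (feature rows, labels) for one balanced dataframe.
Frame = Tuple[Sequence[Any], Sequence[int]]


def fit_frame_models(frames: Sequence[Frame],
                     make_rf: Callable[[], Any],
                     make_xgb: Callable[[], Any]) -> List[Any]:
    """Fit one model per frame: odd-numbered frames use make_rf, even-numbered use make_xgb."""
    models = []
    for number, (features, labels) in enumerate(frames, start=1):
        factory = make_rf if number % 2 == 1 else make_xgb
        model = factory()
        model.fit(features, labels)
        models.append(model)
    return models
```
]]></code>
                <p>If scikit-learn and xgboost were the chosen libraries (an assumption), the factories could be <monospace>RandomForestClassifier</monospace> and <monospace>XGBClassifier</monospace>. With seven frames this gives four Random Forest models and three XGBoost models.</p>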
                <p>
                    <italic toggle="yes">Model fitting</italic>
                </p>
                <p>Random Forest and XGBoost are used as classifiers for the seven dataframes. To improve performance, the hyperparameters of each learning model are set to specific values, a process known as hyperparameter tuning. Unlike model parameters, hyperparameters are not learned from the data and must be specified explicitly before training. The learning models in this experiment are hyperparameter-optimized.</p>
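                <p>The article does not state which search strategy or which hyperparameters were tuned. As one common possibility, an exhaustive grid search is sketched below. The function name, the <monospace>score</monospace> callback (for instance, validation accuracy of a model trained with the given settings) and any hyperparameter names passed in are illustrative assumptions.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
from itertools import product
from typing import Any, Callable, Dict, Mapping, Sequence, Tuple


def grid_search(grid: Mapping[str, Sequence[Any]],
                score: Callable[[Dict[str, Any]], float]) -> Tuple[Dict[str, Any], float]:
    """Try every combination in the grid and return the best-scoring one with its score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)  # higher is better
        if current > best_score:
            best_params, best_score = params, current
    if best_params is None:
        raise ValueError("the grid contains an empty list of candidate values")
    return best_params, best_score
```
]]></code>
                <p>In practice the callback would train a classifier on one balanced dataframe with the candidate settings and return a validation score. Grid search is simple but grows multiplicatively with the number of candidate values per hyperparameter.</p>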
                <p>
                    <italic toggle="yes">Voting system</italic>
                </p>
                <p>Maximum voting is used to build the final ensemble model. The prediction of each model, one per data frame, is taken into account, and for any given instance the class with the most votes is chosen as the final binary prediction.</p>
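                <p>A minimal sketch of maximum voting is given below, assuming each model's predictions are available as a list of class labels. The function name <monospace>majority_vote</monospace> is hypothetical. With seven models and two classes, a tie cannot occur.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """predictions[m][i] is the label that model m assigns to instance i."""
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all models must predict the same number of instances")
    final = []
    for votes in zip(*predictions):  # votes of every model for one instance
        final.append(Counter(votes).most_common(1)[0][0])
    return final
```
]]></code>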
            </sec>
        </sec>
        <sec id="sec16">
            <title>Experimental setup</title>
            <sec id="sec17">
                <title>Research questions</title>
                <p>To systematically evaluate the proposed HCPDP models, two research questions are set:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Q1. Is the proposed model comparable to WPDP?</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Q2. (i) Is the proposed model capable of addressing the source dataset's CIB issue?</p>
                            <p>(ii) Does it outperform the HCPDP model with the imbalanced dataset?</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Q1 and Q2 lead us to investigate whether our HCPDP model is comparable to WPDP (Baseline 1) and to HCPDP with an imbalanced dataset (Baseline 2).</p>
                        </list-item>
                    </list>
                </p>
            </sec>
            <sec id="sec18">
                <title>Baselines</title>
                <p>The work in this paper is compared with two baselines: WPDP (Baseline 1) and HCPDP with imbalanced source data (Baseline 2). Comparing the proposed framework with within-project defect prediction (WPDP) gives statistical evidence of the feasibility and applicability of our model. Because the proposed framework handles the CIB problem before predicting the target class, HCPDP with CIB is also included as a baseline.</p>
            </sec>
            <sec id="sec19">
                <title>Experimental design</title>
                <p>Three experiments were conducted to evaluate the applicability and predictive performance of the framework. For the machine learning classification model, we have used Random Forest and XGBoost.</p>
                <p>Experiment 1: For WPDP, the source dataset was randomly split in a 70:30 ratio into training and testing sets, respectively. The classification model was applied, and the results were recorded.</p>
                <p>Experiment 2: After metric selection and matching, the imbalanced source dataset was used to train the classification model. The predictions for the target instances of the testing dataset were recorded.</p>
                <p>Experiment 3: The proposed framework (as discussed in Section 4) was used for training and testing on the source and target project pair, respectively. The prediction results were recorded and analyzed.</p>
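                <p>The random 70:30 split of Experiment 1 can be illustrated with the short sketch below. The function name and the fixed seed are assumptions added for reproducibility; the article does not say whether the split was seeded or stratified.</p>
                <code language="python" xml:space="preserve"><![CDATA[
```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_30(rows: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the rows and return (training 70%, testing 30%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = (7 * len(shuffled)) // 10        # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]
```
]]></code>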
            </sec>
        </sec>
        <sec id="sec20">
            <title>Experimental results and discussion</title>
            <p>This section describes the findings of the experiments above. 
                <xref ref-type="table" rid="T4">Tables 4</xref>, 
                <xref ref-type="table" rid="T5">5</xref>, and 
                <xref ref-type="table" rid="T6">6</xref> summarize the findings. The Wilcoxon Signed Rank (WSR) test is used to further validate the proposed framework. 
                <xref ref-type="table" rid="T4">Table 4</xref> lists the performance metrics obtained with the proposed method, 
                <xref ref-type="table" rid="T5">Table 5</xref> shows the results of using HCPDP with an imbalanced source dataset, and 
                <xref ref-type="table" rid="T6">Table 6</xref> shows the results of using WPDP.</p>
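            <p>For readers unfamiliar with the WSR test, the sketch below computes its rank sums for two paired samples, for example the same metric measured for the same dataset pairs under two methods. It is an illustration under common conventions (zero differences dropped, tied absolute differences given their average rank) and not necessarily the exact variant used in this study. The function name is hypothetical, and no p-value is computed.</p>
            <code language="python" xml:space="preserve"><![CDATA[
```python
from typing import Sequence, Tuple


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    if len(a) != len(b):
        raise ValueError("samples must be paired and of equal length")
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of equal absolute differences (ties).
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus
```
]]></code>
            <p>The usual test statistic is the smaller of the two rank sums, which is then compared with a critical value or converted to a p-value. A statistics package such as SciPy (<monospace>scipy.stats.wilcoxon</monospace>) can be used for that step.</p>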
            <table-wrap id="T4" orientation="portrait" position="float">
                <label>Table 4. </label>
                <caption>
                    <title>Accuracy, recall, F-Score and AUC (area under curve) using the proposed framework.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="2" rowspan="1" valign="top">Dataset</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Accuracy</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Recall</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">F-Score</th>
                            <th align="left" colspan="1" rowspan="2" valign="top">AUC</th>
                        </tr>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.87</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.62</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.775</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.792</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.89</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.812</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.843</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.750</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.810</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.760</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.820</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.733</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.793</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.679</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.86</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.69</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.778</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.69</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.746</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.827</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.710</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.750</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.720</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.720</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.733</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.734</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.894</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.86</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.789</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.68</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.784</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.840</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.747</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.793</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.770</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.770</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.753</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.822</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
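            <p>As a reminder of how metrics of this kind are typically derived from predicted and actual labels, a small sketch follows. It assumes that the True and False columns refer to recall and F-Score computed with the defective and the non-defective class, respectively, as the class of interest; the article does not state this explicitly. The function names are hypothetical, and AUC is omitted because it requires predicted scores rather than labels.</p>
            <code language="python" xml:space="preserve"><![CDATA[
```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of instances whose predicted label equals the actual label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall_and_f_score(y_true: Sequence[int], y_pred: Sequence[int],
                       positive: int) -> Tuple[float, float]:
    """Recall and F-Score when the label `positive` is the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return recall, f_score
```
]]></code>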
            <table-wrap id="T5" orientation="portrait" position="float">
                <label>Table 5. </label>
                <caption>
                    <title>Accuracy, recall, F-Score and AUC (area under curve) using HCPDP (Heterogeneous Cross Project Defect Prediction) with class imbalance.</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="2" rowspan="1" valign="top">Dataset</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Accuracy</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Recall</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">F-Score</th>
                            <th align="left" colspan="1" rowspan="2" valign="top">AUC</th>
                        </tr>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.63</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.772</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.69</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.612</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.685</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.803</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.740</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.767</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.720</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.777</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.723</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.690</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.69</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.68</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.621</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.65</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.67</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.69</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.812</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.71</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.7</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.68</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.712</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.753</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.697</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.743</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.693</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.707</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.697</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.715</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.774</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.796</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.62</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.73</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.785</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.783</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.717</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.787</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.740</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.740</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.750</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.785</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <table-wrap id="T6" orientation="portrait" position="float">
                <label>Table 6. </label>
                <caption>
                    <title>Accuracy, recall, F-Score and AUC (area under curve) using WPDP (Within-Project Defect Prediction).</title>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="2" rowspan="1" valign="top">Dataset</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Accuracy</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">Recall</th>
                            <th align="left" colspan="2" rowspan="1" valign="top">F-Score</th>
                            <th align="left" colspan="1" rowspan="2" valign="top">AUC</th>
                        </tr>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Training</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">Testing</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">True</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">False</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC2</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.91</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.85</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.67</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.774</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.92</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.86</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.74</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.784</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.88</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.89</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.878</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.903</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.867</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.783</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.760</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.810</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.763</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.812</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.94</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.791</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.89</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.765</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.88</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.83</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.778</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.883</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.857</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.800</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.763</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.783</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.797</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.778</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="4" valign="top">PC4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR3</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.91</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.84</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.72</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.78</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.889</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR4</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.93</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.86</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.77</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.852</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">AR6</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.89</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.89</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.82</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.81</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.79</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.863</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="top">Avg.</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.910</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.863</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.783</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.760</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.780</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.780</td>
                            <td align="left" colspan="1" rowspan="1" valign="top">0.868</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <p>In this experiment, we used 7-fold cross validation. The value k = 7 was chosen so that the training and testing partitions of the source dataset are each large enough to be statistically representative of it. Note, however, that there is no standard protocol for choosing the value of k. 
                <xref ref-type="fig" rid="f8">Figure 8</xref> depicts the 7-fold cross validation of the proposed framework.</p>
            <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                <label>Figure 8. </label>
                <caption>
                    <title>7-fold cross validation for the proposed framework.</title>
                </caption>
                <graphic id="gr8" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure8.gif"/>
            </fig>
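            <p>As a rough illustration of the 7-fold cross validation used in this experiment, the following Python sketch shuffles the instances, splits them into seven near-equal folds, and averages a score over the folds. The helper names (k_fold_indices, cross_validate), the majority-class placeholder learner, the toy data and the fixed random seed are all assumptions made for this example; they are not the classifier or the datasets used in the study.</p>
            <preformat preformat-type="code"><![CDATA[
```python
import random


def k_fold_indices(n_samples, k=7, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, score, k=7, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return sum(scores) / k, scores


def fit_majority(X_train, y_train):
    # Placeholder learner (assumption): always predicts the most frequent label.
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    # The "model" is just the predicted label, so accuracy is its share in y_test.
    return sum(1 for label in y_test if label == model) / len(y_test)


# Toy, imbalanced data (assumption): 70 instances, 14 of them defective (label 1).
X = [[i] for i in range(70)]
y = [1 if i % 5 == 0 else 0 for i in range(70)]

mean_acc, fold_scores = cross_validate(X, y, fit_majority, accuracy, k=7)
print(f"mean accuracy over 7 folds: {mean_acc:.3f}")
```
]]></preformat>
            <p>On such imbalanced toy data the placeholder learner scores well on accuracy while never identifying a defective instance, which is one reason the analysis below concentrates on the true value of Recall and on AUC rather than on accuracy.</p>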
            <p>The two research questions stated in the previous section are analyzed here. The AUC value is used to assess the overall performance of the model. In defect prediction, the true value of Recall measures the proportion of defective classes in the project that were correctly identified as defective. This is an important factor in CPDP, because identifying and testing the defective classes is more vital than identifying the non-defective ones. For this reason, the research questions are answered mainly in terms of the true value of Recall and the AUC.</p>
            <p>Q1. Is the proposed model comparable to WPDP?</p>
            <p>Ans. From 
                <xref ref-type="table" rid="T4">Table 4</xref> and 
                <xref ref-type="table" rid="T6">Table 6</xref> the following inferences were made:</p>
            <p>Using the WPDP approach, the average true values of Recall with PC2, PC3 and PC4 as the source project are 0.783, 0.800, and 0.783, respectively, whereas using the proposed framework these values are 0.810, 0.750 and 0.793, respectively. In five of the nine (i.e., 55.56%) source-target project combinations, the proposed framework has a higher true Recall value than WPDP. As a result, it produces statistically significant findings that are superior or equivalent to those of WPDP.</p>
            <p>Similarly, using the WPDP approach, the average AUC values with PC2, PC3 and PC4 as the source project are 0.812, 0.778, and 0.868, respectively, whereas using the proposed framework these values are 0.793, 0.734, and 0.822, respectively. In four of the nine (i.e., 44.44%) source-target project combinations, the proposed framework has a higher AUC value than WPDP. As a result, it produces statistically significant findings that are superior or equivalent to those of WPDP. From the above two observations, it can be inferred that the proposed framework has comparable performance to WPDP and is feasible.</p>
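            <p>The WPDP averages quoted above can be rechecked from the per-target rows of Table 6, and the quoted percentages from the stated counts. The short Python sketch below does this arithmetic. The values are copied from Table 6, while the function names (source_averages, win_share) are illustrative assumptions; the per-combination values of the proposed framework come from Table 4 and are not repeated here, so only the stated counts (five and four of nine) are used.</p>
            <preformat preformat-type="code"><![CDATA[
```python
# True Recall and AUC per target (AR3, AR4, AR6) for each source project,
# copied from Table 6 (WPDP).
wpdp_recall_true = {
    "PC2": [0.75, 0.81, 0.79],
    "PC3": [0.75, 0.82, 0.83],
    "PC4": [0.76, 0.77, 0.82],
}
wpdp_auc = {
    "PC2": [0.774, 0.784, 0.878],
    "PC3": [0.791, 0.765, 0.778],
    "PC4": [0.889, 0.852, 0.863],
}


def source_averages(table):
    """Mean over the three targets for each source project, rounded to 3 places."""
    return {src: round(sum(vals) / len(vals), 3) for src, vals in table.items()}


def win_share(wins, total):
    """Percentage of source-target combinations counted as wins."""
    return round(100 * wins / total, 2)


avg_recall = source_averages(wpdp_recall_true)
avg_auc = source_averages(wpdp_auc)
recall_share = win_share(5, 9)  # five of nine combinations (true Recall)
auc_share = win_share(4, 9)     # four of nine combinations (AUC)
print(avg_recall, avg_auc, recall_share, auc_share)
```
]]></preformat>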
            <p>Q2. i. Does the proposed model handle the CIB problem of the source dataset?</p>
            <p>ii. Does it outperform the HCPDP model with the imbalanced dataset?</p>
            <p>Ans. From 
                <xref ref-type="table" rid="T4">Table 4</xref> and 
                <xref ref-type="table" rid="T5">Table 5</xref> the following inferences were made:</p>
            <p>Using the HCPDP approach with the imbalanced source dataset, the average true values of Recall with PC2, PC3 and PC4 as the source project are 0.767, 0.743, and 0.787, respectively, whereas using the proposed framework these values are 0.810, 0.750 and 0.793, respectively. In seven of the nine (i.e., 77.8%) source-target project combinations, the proposed framework has a higher true Recall value than the HCPDP approach with the imbalanced source dataset. Hence, it leads to better results than the HCPDP approach with the imbalanced source dataset, with statistical significance. Similarly, using the HCPDP approach with the imbalanced source dataset, the average AUC values with PC2, PC3 and PC4 as the source project are 0.690, 0.715, and 0.785, respectively, whereas using the proposed framework these values are 0.793, 0.734, and 0.822, respectively. In six of the nine (i.e., 66.7%) source-target project combinations, the proposed framework has a higher AUC value than the HCPDP approach with the imbalanced source dataset. Hence, it again leads to better results than the HCPDP approach with the imbalanced source dataset, with statistical significance.</p>
            <p>From the above two observations, it can be inferred that the proposed framework handles the CIB problem, thereby leading to better results than the HCPDP model with the imbalanced dataset. 
                <xref ref-type="fig" rid="f9">Figure 9</xref> compares the AUC, Recall (True), F-Score and Accuracy obtained using the proposed framework, WPDP and HCPDP.</p>
            <fig fig-type="figure" id="f9" orientation="portrait" position="float">
                <label>Figure 9. </label>
                <caption>
                    <title>Comparative graph of (a) accuracy and recall (true), and (b) AUC (area under curve) and F-Score using proposed framework, WPDP (Within-Project Defect Prediction) and HCPDP (Heterogeneous Cross Project Defect Prediction).</title>
                </caption>
                <graphic id="gr9" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure9.gif"/>
            </fig>
            <p>
                <xref ref-type="table" rid="T7">Table 7</xref> compares the results of some of the existing models with those of the proposed model in terms of average F-score and Recall values. The same comparison is illustrated as a line graph in 
                <xref ref-type="fig" rid="f10">Figure 10</xref>.</p>
            <table-wrap id="T7" orientation="portrait" position="float">
                <label>Table 7. </label>
                <caption>
                    <title>Comparison of proposed framework with the existing models.</title>
                    <p>CPDP, Cross Project Defect Prediction; TNB, transfer na&#x00ef;ve bayes; CCA, Canonical Correlation Analysis; WPDP, Within-Project Defect Prediction; HCPDP, Heterogeneous Cross Project Defect Prediction.</p>
                </caption>
                <table content-type="article-table" frame="hsides">
                    <thead>
                        <tr>
                            <th align="left" colspan="3" rowspan="1" valign="top">Heterogeneous CPDP</th>
                        </tr>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top"/>
                            <th align="left" colspan="2" rowspan="1" valign="top">AVERAGE</th>
                        </tr>
                        <tr>
                            <th align="left" colspan="1" rowspan="1" valign="top"/>
                            <th align="left" colspan="1" rowspan="1" valign="top">PD (recall)</th>
                            <th align="left" colspan="1" rowspan="1" valign="top">FSCORE</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">TNB</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.59</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.33</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">NN FILTER</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.55</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.45</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">TCA+</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.47</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.36</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">CCA</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.7</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.74</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">PROPOSED MODEL</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.74</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">WPDP</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.76</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.78</td>
                        </tr>
                        <tr>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">HCPDP (WITH CLASS IMBALANCE)</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.75</td>
                            <td align="left" colspan="1" rowspan="1" valign="bottom">0.72</td>
                        </tr>
                    </tbody>
                </table>
            </table-wrap>
            <fig fig-type="figure" id="f10" orientation="portrait" position="float">
                <label>Figure 10. </label>
                <caption>
                    <title>Graphical representation of the comparison of proposed framework with the existing baseline models.</title>
                    <p>CPDP, Cross Project Defect Prediction; TNB, transfer na&#x00ef;ve bayes; CCA, Canonical Correlation Analysis; WPDP, Within-Project Defect Prediction; HDP, Heterogeneous Cross Project Defect Prediction.</p>
                </caption>
                <graphic id="gr10" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/135744/83b99b28-15b9-4f5e-b4ee-28c7e1201eea_figure10.gif"/>
            </fig>
            <p>The TNB (transfer na&#x00ef;ve bayes) model showed the lowest F-score (0.33) but a noticeably higher Recall value (0.59). TCA+ showed the lowest Recall value (0.47). The proposed model matches or exceeds all the existing CPDP models, including HCPDP with class imbalance, on both scores, reaching about 0.75 on each. The proposed model is also comparable to WPDP, with minor differences of 0.01 in Recall and 0.04 in F-score.</p>
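            <p>To make the comparison in Table 7 concrete, the following Python sketch computes the gap between the proposed model and each of the other rows on both measures. The numbers are copied from Table 7; the dictionary layout and the helper name gaps are assumptions made for this illustration.</p>
            <preformat preformat-type="code"><![CDATA[
```python
# (PD/recall, F-score) averages copied from Table 7.
table7 = {
    "TNB": (0.59, 0.33),
    "NN FILTER": (0.55, 0.45),
    "TCA+": (0.47, 0.36),
    "CCA": (0.70, 0.74),
    "WPDP": (0.76, 0.78),
    "HCPDP (WITH CLASS IMBALANCE)": (0.75, 0.72),
}
proposed = (0.75, 0.74)


def gaps(reference, others):
    """Return reference minus each other model, per measure, rounded to 2 places."""
    return {
        name: (round(reference[0] - rec, 2), round(reference[1] - f, 2))
        for name, (rec, f) in others.items()
    }


gap_to = gaps(proposed, table7)
for name, (d_recall, d_fscore) in gap_to.items():
    # A positive value means the proposed model scores higher on that measure.
    print(f"{name}: recall {d_recall:+.2f}, F-score {d_fscore:+.2f}")
```
]]></preformat>
            <p>A negative gap appears only in the WPDP row, which is consistent with treating WPDP as the reference that the cross-project models try to approach.</p>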
            <sec id="sec21">
                <title>Wilcoxon signed rank test</title>
                <p>The proposed optimized model was validated using the Wilcoxon signed rank test (WSR)
                    <sup>
                        <xref ref-type="bibr" rid="ref32">32</xref>
                    </sup> at a significance level of 5% (i.e., &#x03b1; = 0.05). This test determines whether the performance of our proposed ensemble framework differs significantly from that of the Random Forest classifier. For the evaluation, we took into account the Recall and AUC measures of both models. The hypotheses considered in this scenario are:</p>
                <p>H0: The performance of the two models is identical.</p>
                <p>H1: The performance of the two models differs.</p>
                <p>If P-value &lt;0.05 then H0 is rejected.</p>
                <p>When Recall was examined for the proposed ensemble framework and Random Forest, the null hypothesis of the WSR test was rejected, with a P-value of 0.00172. When AUC was taken into account for the proposed ensemble framework and Random Forest, similar results were obtained: H0 was again rejected, with a P-value of 0.0004. Rejecting H0 indicates that the performance of the two models differs, and thus the two models are distinct.</p>
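                <p>For readers unfamiliar with the WSR test, the following Python sketch shows one way to compute it for paired scores. It is a minimal standard-library version that uses an exact two-sided P-value obtained by enumerating sign patterns, which is practical only for small samples; it drops zero differences and gives tied values their average rank. The function name and the paired values are hypothetical and are not the study's measurements, and the study may have used a different implementation (for example, SciPy provides scipy.stats.wilcoxon).</p>
                <preformat preformat-type="code"><![CDATA[
```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W = min(W+, W-). Enumerates 2**n sign patterns,
    so it is intended for small n only.
    """
    # Paired differences; zero differences are dropped.
    diffs = [round(a - b, 9) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Under H0 every assignment of signs to the ranks is equally likely.
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=n):
        pos = sum(r for r, s in zip(ranks, signs) if s)
        if min(pos, total - pos) <= w:
            hits += 1
    return w, hits / 2 ** n


# Hypothetical paired Recall values for two models (illustration only).
model_a = [0.81, 0.84, 0.78, 0.75, 0.79, 0.74, 0.80, 0.77, 0.82]
model_b = [0.76, 0.80, 0.77, 0.70, 0.72, 0.75, 0.71, 0.69, 0.74]

w_stat, p_value = wilcoxon_signed_rank(model_a, model_b)
print(f"W = {w_stat}, P-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```
]]></preformat>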
            </sec>
        </sec>
        <sec id="sec22" sec-type="conclusions">
            <title>Conclusions</title>
            <p>Cross-project defect prediction with heterogeneous metric sets has received little attention. In this work, a novel HCPDP ensemble architecture based on metric matching has been proposed, and an optimized approach to resolve the CIB issue in HCPDP has been implemented and validated. It is a decision-making tool for predicting software project defects using minimal historical data and a variety of metrics. The architecture was tested on six open-source object-oriented projects whose datasets were highly imbalanced. In our scenario, the proposed ensemble framework serves a dual purpose. First, it corrects the CIB issue in the source dataset by balancing the number of defective and non-defective instances. Second, it is used to predict defect-prone classes. Based on our statistical review, the proposed system appears to be feasible and yields promising results. The framework&#x2019;s training accuracy was assessed using k-fold cross validation. We used the WSR test at a significance level of 5% (i.e., &#x03b1; = 0.05) to demonstrate the validity of the proposed framework, and concluded that the proposed model differs from the baseline models.</p>
            <p>Our future work will focus on generalizing the proposed approach using more defect data with heterogeneous metrics, and on implementing more complex learning approaches to achieve better defect prediction performance.</p>
        </sec>
        <sec id="sec23">
            <title>Data availability</title>
            <sec id="sec24">
                <title>Underlying data</title>
                <p>The datasets from the six open-source projects used in this study are freely available from OpenML:</p>
                <p>PC2: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1069">https://www.openml.org/search?type=data&amp;status=active&amp;id=1069</ext-link>
                </p>
                <p>PC3: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1050">https://www.openml.org/search?type=data&amp;status=active&amp;id=1050</ext-link>
                </p>
                <p>PC4: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1049">https://www.openml.org/search?type=data&amp;status=active&amp;id=1049</ext-link>
                </p>
                <p>AR3: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1060">https://www.openml.org/search?type=data&amp;status=active&amp;id=1060</ext-link>
                </p>
                <p>AR4: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1061">https://www.openml.org/search?type=data&amp;status=active&amp;id=1061</ext-link>
                </p>
                <p>AR6: 
                    <ext-link ext-link-type="uri" xlink:href="https://www.openml.org/search?type=data&amp;status=active&amp;id=1064">https://www.openml.org/search?type=data&amp;status=active&amp;id=1064</ext-link>
                </p>
                <p>Figshare: software defect prediction dataset. 
                    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.20209142.v1">https://doi.org/10.6084/m9.figshare.20209142.v1</ext-link>.
                    <sup>
                        <xref ref-type="bibr" rid="ref34">34</xref>
                    </sup>
                </p>
                <p>This project contains the following underlying data, which can be read directly using the Python source code provided in 
                    <italic toggle="yes">Software availability</italic>:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2010;</label>
                            <p>ar3.arff (class-level defect data from the ar3 project).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2010;</label>
                            <p>ar4.arff (class-level defect data from the ar4 project).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2010;</label>
                            <p>ar6.arff (class-level defect data from the ar6 project).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2010;</label>
                            <p>pc2.arff (class-level defect data from the pc2 project).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2010;</label>
                            <p>pc3.arff (class-level defect data from the pc3 project).</p>
                        </list-item>
                        <list-item>
                            <label>&#x2010;</label>
                            <p>pc4.arff (class-level defect data from the pc4 project).</p>
                        </list-item>
                    </list>
                </p>
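                <p>As a rough illustration of how such ARFF files might be inspected for class imbalance, the sketch below parses a dense ARFF stream using only the Python standard library. It is not the code from the repository. The helper names, the sample data and the assumption that the defect label is the last attribute are all hypothetical, and a dedicated ARFF library would be more robust for real use.</p>
                <code language="python" xml:space="preserve"><![CDATA[
import csv
import io
from collections import Counter


def read_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Simplified sketch: skips comments, ignores attribute types, and does not
    handle sparse rows or quoted attribute names containing spaces.
    """
    names, data_lines, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        if in_data:
            data_lines.append(line)
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            names.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    rows = [[cell.strip() for cell in row] for row in csv.reader(data_lines)]
    return names, rows


def class_distribution(rows, label_index=-1):
    """Count instances per class label (assumed here to be the last attribute)."""
    return Counter(row[label_index] for row in rows)


def imbalance_ratio(counts):
    """Majority-to-minority class size ratio; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())


# Tiny hypothetical sample in ARFF syntax (not taken from the real datasets).
SAMPLE = """\
@relation demo
@attribute loc numeric
@attribute cyclomatic numeric
@attribute defects {false,true}
@data
10,2,false
250,31,true
42,5,false
17,3,false
"""

names, rows = read_arff(io.StringIO(SAMPLE))
counts = class_distribution(rows)
print(names)
print(dict(counts), imbalance_ratio(counts))
]]></code>
                <p>For a downloaded file, the same functions could be applied to an opened file handle, for example one obtained from open("ar3.arff"), in place of the in-memory sample.</p>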
                <p>Data are available under the terms of the 
                    <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Zero "No rights reserved" data waiver</ext-link> (CC0 1.0 Public domain dedication).</p>
            </sec>
        </sec>
        <sec id="sec25">
            <title>Software availability</title>
            <p>Source code available from: 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/lipika-amity/Heterogeneous-CPDP">https://github.com/lipika-amity/Heterogeneous-CPDP</ext-link>
            </p>
            <p>Archived source code at time of publication: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.6961342">https://doi.org/10.5281/zenodo.6961342</ext-link>.
                <sup>
                    <xref ref-type="bibr" rid="ref35">35</xref>
                </sup>
            </p>
            <p>License: 
                <ext-link ext-link-type="uri" xlink:href="https://opensource.org/licenses/Apache-2.0">Apache-2.0</ext-link>
            </p>
        </sec>
    </body>
    <back>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Han</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hoh</surname>
                            <given-names>IP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <source>

                        <italic toggle="yes">Micro interaction metrics for defect prediction, in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering.</italic>
</source>
                    <publisher-loc>New York, USA</publisher-loc>:
                    <publisher-name>ACM</publisher-name>;<year>2011</year>.</mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>D&#x2019;Ambros</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Lanza</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Robbes</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Evaluating defect prediction approaches: a benchmark and an extensive comparison.</article-title>
                    <source>

                        <italic toggle="yes">Empir. Softw. Eng.</italic>
</source>
                    <year>2012</year>;<volume>17</volume>(<issue>4-5</issue>):<fpage>531</fpage>&#x2013;<lpage>577</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s10664-011-9173-9</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goel</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sharma</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Khatri</surname>
                            <given-names>SK</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>An empirical analysis of the statistical learning models for different categories of Cross Project Defect Prediction.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Comput. Aided Eng. Technol.</italic>
</source>
                    <year>2021</year>;<volume>14</volume>(<issue>2</issue>):<fpage>233</fpage>.
                    <pub-id pub-id-type="doi">10.1504/IJCAET.2021.113549</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Canfora</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>De Lucia</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Oliveto</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Multiobjective cross-project defect prediction.</article-title>
                    <italic toggle="yes">IEEE Sixth International Conference on Verification and Validation in Software Testing.</italic> IEEE, Luxembourg, Luxembourg. ISSN 2159-4848. <year>2013</year>.</mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bener</surname>
                            <given-names>AB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Menzies</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Di Stefano</surname>
                            <given-names>J</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>On the relative value of cross-company and within-company data for defect prediction.</article-title>
                    <source>

                        <italic toggle="yes">Empir. Softw. Eng.</italic>
</source>
                    <year>2009</year>;<volume>14</volume>:<fpage>540</fpage>&#x2013;<lpage>578</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s10664-008-9103-7</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Butcher</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cok</surname>
                            <given-names>DR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Marcus</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Local vs. global models for effort estimation and defect prediction.</article-title>
                    <italic toggle="yes">In 26th IEEE/ACM International Conference on Automated Software Engineering ASE 2011.</italic> IEEE, Lawrence, KS, USA.
                    <year>2011</year>; pp.<fpage>343</fpage>&#x2013;<lpage>351</lpage>.</mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Camargo Cruz</surname>
                            <given-names>AE</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ochimizu</surname>
                            <given-names>K</given-names>
                        </name>
</person-group>:
                    <article-title>Towards logistic regression models for predicting fault-prone code across software projects.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the Third International Symposium on Empirical Software Engineering and Measurement (ESEM), Lake Buena Vista, Florida, USA.</italic>
</source>
                    <year>2009</year>; pp.<fpage>460</fpage>&#x2013;<lpage>463</lpage>.</mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Briand</surname>
                            <given-names>LC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Melo</surname>
                            <given-names>WL</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wurst</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>Assessing the applicability of fault-proneness models across object-oriented software projects.</article-title>
                    <source>

                        <italic toggle="yes">IEEE Trans. Softw. Eng.</italic>
</source>
                    <year>2002</year>;<volume>28</volume>:<fpage>706</fpage>&#x2013;<lpage>720</lpage>.
                    <pub-id pub-id-type="doi">10.1109/TSE.2002.1019484</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Devanbu</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Posnett</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rahman</surname>
                            <given-names>F</given-names>
                        </name>
</person-group>:
                    <article-title>Recalling the imprecision of cross-project defect prediction.</article-title>
                    <italic toggle="yes">In Proceedings of the ACM-Sigsoft 20th International Symposium on the Foundations of Software Engineering (FSE-20).</italic> ACM, Research Triangle Park, NC, USA. <year>2012</year>; pp.<fpage>61</fpage>&#x2013;<lpage>65</lpage>.</mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Canfora</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>De Lucia</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Oliveto</surname>
                            <given-names>R</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Multiobjective cross-project defect prediction.</article-title>
                    <italic toggle="yes">IEEE Sixth International Conference on Verification and Validation in Software Testing.</italic> IEEE, Luxembourg, Luxembourg. ISSN 2159-4848. <year>2013</year>.</mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jing</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dong</surname>
                            <given-names>X</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015, ita.</italic>
</source>
                    <year>September 2015</year>; pp.<fpage>496</fpage>&#x2013;<lpage>507</lpage>.</mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>He</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Li</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ma</surname>
                            <given-names>Y</given-names>
                        </name>
</person-group>:
                    <article-title>Towards cross-project defect prediction with imbalanced feature sets.</article-title>
                    <source>

                        <italic toggle="yes">CoRR.</italic>
</source>
                    <year>2014</year>;<volume>abs/1411.4228</volume>.</mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Jing</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dong</surname>
                            <given-names>X</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015, ita.</italic>
</source>
                    <year>September 2015</year>; pp.<fpage>496</fpage>&#x2013;<lpage>507</lpage>.
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ryu</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jang</surname>
                            <given-names>J-I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baik</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>A transfer cost-sensitive boosting approach for cross-project defect prediction.</article-title>
                    <source>

                        <italic toggle="yes">Softw. Qual. J.</italic>
</source>
                    <year>2015</year>;<volume>25</volume>(<issue>1</issue>):<fpage>235</fpage>&#x2013;<lpage>272</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s11219-015-9287-1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Yin</surname>
                            <given-names>X</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wu</surname>
                            <given-names>Q</given-names>
                        </name>
</person-group>:
                    <article-title>Heterogeneous cross-project defect prediction with multiple source projects based on transfer learning.</article-title>
                    <source>

                        <italic toggle="yes">Math. Biosci. Eng.</italic>
</source>
                    <year>2020</year>;<volume>17</volume>(<issue>2</issue>):<fpage>1020</fpage>&#x2013;<lpage>1040</lpage>.
                    <pub-id pub-id-type="doi">10.3934/mbe.2020054</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fu</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Menzies</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Heterogeneous defect prediction.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE, ACM, New York, NY, USA.</italic>
</source>
                    <year>2015</year>; pp.<fpage>508</fpage>&#x2013;<lpage>519</lpage>.</mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Fu</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Menzies</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Heterogeneous defect prediction.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE, ACM, New York, NY, USA.</italic>
</source>
                    <year>2016</year>; pp.<fpage>508</fpage>&#x2013;<lpage>519</lpage>.</mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <collab>Choosing software metrics for defect prediction</collab>:
                    <article-title>An investigation: comparison and improvements.</article-title>
                    <source>

                        <italic toggle="yes">IEEE Access.</italic>
</source>
                    <year>2017</year>;<volume>5</volume>:<fpage>25646</fpage>&#x2013;<lpage>25656</lpage>.
                    <pub-id pub-id-type="doi">10.1109/ACCESS.2017.2771460</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ni</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Liu</surname>
                            <given-names>W</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Gu</surname>
                            <given-names>Q</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>FeSCH: A Feature Selection Method using Clusters of Hybrid-data for Cross-Project Defect Prediction.</article-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 41st IEEE Annual Computer Software and Applications Conference, COMPSAC 2017, ita.</italic>
</source>
                    <year>July 2017</year>; pp.<fpage>51</fpage>&#x2013;<lpage>56</lpage>.</mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Xu</surname>
                            <given-names>Z</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yuan</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zhang</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>HDA: Cross Project Defect Prediction via Heterogeneous Domain Adaptation With Dictionary Learning.</article-title>
                    <source>

                        <italic toggle="yes">IEEE Access.</italic>
</source>
                    <year>2018</year>;<volume>6</volume>:<fpage>57597</fpage>&#x2013;<lpage>57613</lpage>.
                    <pub-id pub-id-type="doi">10.1109/ACCESS.2018.2873755</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gong</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jiang</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Yu</surname>
                            <given-names>Q</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Unsupervised Deep Domain Adaptation for Heterogeneous Defect Prediction.</article-title>
                    <source>

                        <italic toggle="yes">IEICE Trans. Inf. Syst.</italic>
</source>
                    <year>2019</year>;<volume>E102.D</volume>(<issue>3</issue>):<fpage>537</fpage>&#x2013;<lpage>549</lpage>.
                    <pub-id pub-id-type="doi">10.1587/transinf.2018EDP7289</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sun</surname>
                            <given-names>Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jing</surname>
                            <given-names>X-Y</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Fei</surname>
                            <given-names>W</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Semi-supervised Heterogeneous Defect Prediction with Open-source Projects on GitHub.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Softw. Eng. Knowl. Eng.</italic>
</source>
                    <year>2021</year>;<volume>31</volume>(<issue>06</issue>):<fpage>889</fpage>&#x2013;<lpage>916</lpage>.
                    <pub-id pub-id-type="doi">10.1142/s0218194021500273</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Kim</surname>
                            <given-names>E</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Baik</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Ryu</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <article-title>A Selection Technique of Source Project in Heterogeneous Defect Prediction based on Correlation Coefficients.</article-title>
                    <source>

                        <italic toggle="yes">J. KIISE.</italic>
</source>
                    <year>2021</year>;<volume>48</volume>(<issue>8</issue>):<fpage>920</fpage>&#x2013;<lpage>927</lpage>.
                    <pub-id pub-id-type="doi">10.5626/jok.2021.48.8.920</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>C</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Song</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>Software defect prediction based on nested-stacking and heterogeneous feature selection.</article-title>
                    <source>

                        <italic toggle="yes">Complex Intell. Syst.</italic>
</source>
                    <year>2022</year>;<volume>8</volume>:<fpage>3333</fpage>&#x2013;<lpage>3348</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s40747-022-0676-y</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goel</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sharma</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Khatri</surname>
                            <given-names>SK</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>A Framework for Homogeneous Cross Project Defect Prediction.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Softw. Innov.</italic>
</source>
                    <year>2020</year>;<volume>9</volume>(<issue>1</issue>):<fpage>52</fpage>&#x2013;<lpage>68</lpage>.
                    <pub-id pub-id-type="doi">10.4018/IJSI.2021010105</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goel</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sharma</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Khatri</surname>
                            <given-names>SK</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Cross-project defect prediction using data sampling for class imbalance learning: an empirical study.</article-title>
                    <source>

                        <italic toggle="yes">Int. J. Parallel Emergent Distrib. Syst.</italic>
</source>
                    <year>2021</year>;<volume>36</volume>(<issue>2</issue>):<fpage>130</fpage>&#x2013;<lpage>143</lpage>.
                    <pub-id pub-id-type="doi">10.1080/17445760.2019.1650039</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Malhotra</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kamal</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>An Empirical Study to Investigate Oversampling Methods for Improving Software Defect Prediction Using Imbalanced Data.</article-title>
                    <source>

                        <italic toggle="yes">Neurocomputing.</italic>
</source>
                    <year>2019</year>;<volume>343</volume>:<fpage>120</fpage>&#x2013;<lpage>140</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.neucom.2018.04.090</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref28">
                <label>28</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Nandal</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Tanwar</surname>
                            <given-names>R</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Pruthi</surname>
                            <given-names>J</given-names>
                        </name>
</person-group>:
                    <article-title>Machine learning based aspect level sentiment analysis for Amazon products.</article-title>
                    <source>

                        <italic toggle="yes">Spat. Inf. Res.</italic>
</source>
                    <year>2020</year>;<volume>28</volume>:<fpage>601</fpage>&#x2013;<lpage>607</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s41324-020-00320-2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Liaw</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wiener</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>Classification and Regression by RandomForest.</article-title>
                    <source>

                        <italic toggle="yes">Forest.</italic>
</source>
                    <year>2001</year>;<volume>23</volume>.</mixed-citation>
            </ref>
            <ref id="ref30">
                <label>30</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Guestrin</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>XGBoost: A Scalable Tree Boosting System.</article-title>
                    <year>2016</year>;<fpage>785</fpage>&#x2013;<lpage>794</lpage>.
                    <pub-id pub-id-type="doi">10.1145/2939672.2939785</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref31">
                <label>31</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gao</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Khoshgoftaar</surname>
                            <given-names>TM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Choosing software metrics for defect prediction: An investigation on feature selection techniques.</article-title>
                    <source>

                        <italic toggle="yes">Softw. Pract. Exper.</italic>
</source>
                    <year>2011</year>;<volume>41</volume>(<issue>5</issue>):<fpage>579</fpage>&#x2013;<lpage>606</lpage>.
                    <pub-id pub-id-type="doi">10.1002/spe.1043</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref32">
                <label>32</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Durango</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Refugio</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <article-title>An empirical study on Wilcoxon Signed Ranked Test.</article-title>
                    <year>2018</year>.
                    <pub-id pub-id-type="doi">10.13140/RG.2.2.13996.51840</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref33">
                <label>33</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ampomah</surname>
                            <given-names>EK</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Qin</surname>
                            <given-names>Z</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Nyame</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement.</article-title>
                    <source>

                        <italic toggle="yes">Information.</italic>
</source>
                    <year>2020</year>;<volume>11</volume>(<issue>6</issue>):<fpage>332</fpage>.
                    <pub-id pub-id-type="doi">10.3390/info11060332</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref34">
                <label>34</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goel</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>software defect prediction dataset. figshare [Dataset].</article-title>
                    <year>2022</year>.
                    <pub-id pub-id-type="doi">10.6084/m9.figshare.20209142.v1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref35">
                <label>35</label>
                <mixed-citation publication-type="other">
                    <collab>Lipika-amity</collab>:
                    <article-title>lipika-amity/Heterogeneous-CPDP: (v1.0). Zenodo. [Software].</article-title>
                    <year>2022</year>.
                    <pub-id pub-id-type="doi">10.5281/zenodo.6961342</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report173231">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.135744.r173231</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Rathore</surname>
                        <given-names>Santosh</given-names>
                    </name>
                    <xref ref-type="aff" rid="r173231a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r173231a1">
                    <label>1</label>Atal Bihari Vajpayee Indian Institute of Information Technology and Management, Gwalior, Madhya Pradesh, India</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>18</day>
                <month>7</month>
                <year>2023</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2023 Rathore S</copyright-statement>
                <copyright-year>2023</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport173231" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.123616.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper focuses on class imbalance learning in the cross-project defect prediction (CPDP) environment. I have the following comments on the paper:
                <list list-type="order">
                    <list-item>
                        <p>The motivation behind the presented work is unclear. The authors should provide a more explicit explanation as to why class imbalance learning in the CPDP environment is important and relevant.</p>
                    </list-item>
                    <list-item>
                        <p>It is worth noting that most existing works in the field of SDP already address data imbalance during the pre-processing step. The authors fail to introduce any significant contribution in their research.</p>
                    </list-item>
                    <list-item>
                        <p>The paper lacks new and meaningful conclusions. The authors should draw relevant conclusions from their research findings that give the reader valuable insights.</p>
                    </list-item>
                    <list-item>
                        <p>The overall writing quality of the paper is poor, with numerous typos and grammatical errors throughout. For example, in the abstract, recall and F1-score are reported as 7.4 and 7.5, which is clearly erroneous. The authors should thoroughly revise and proofread their paper to ensure clarity, accuracy, and professionalism in their writing.</p>
                    </list-item>
                </list>
            </p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Partly</p>
            <p>Is the description of the method technically sound?</p>
            <p>No</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Software fault prediction, applied machine learning</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
</article>
