A Machine Learning Approach to Predictive Modelling of Student Performance

Background - Many factors affect student performance, such as the individual’s background, habits, absenteeism and social activities. Using these factors, corrective actions can be determined to improve performance. This study presents a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal.
Methods - In this study, the two datasets are merged to increase the sample size. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias towards heavily weighted attributes. The selected features are then assigned to four groups comprising student background, lifestyle, history of grades and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB) and Multilayer Perceptron (MLP) are designed and their performance evaluated.
Results - The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features, after feature selection. The classification was performed with SVM, NB and MLP with 60-40 and 50-50 train-test splits and 10-fold cross-validation. GridSearchCV was applied to perform hyperparameter tuning. The performance metrics were accuracy, precision, recall and F1-score. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary-level classification. SVM also obtained the highest accuracy for five-level classification, with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance.


Introduction
A student is a person who attends a school or other educational institution to attain a certain level of knowledge or skill in a course under the supervision of an educator. Almost everyone was once a student with a responsibility to acquire a proper education. Acquiring knowledge through the right education is of utmost importance, and each person should have equal access to education.
When discussing education at the secondary level, a vital aspect to consider is student performance. Student performance can be assessed along a variety of dimensions, either through exam-based assessment or participation-based assessment. Exam-based assessment includes quizzes, midterms and final exams, while participation-based assessment covers two-way communication during learning and group activities.
Many factors can affect student performance, such as individual habits, absenteeism and social activities after school. This motivates the use of machine learning to learn patterns from data and predict how well a student will perform, so that these factors can be acknowledged and weak performance can be detected and improved as early as possible.
The contributions of this paper are the identification of significant features that influence student assessment, which in turn can be used to develop various predictive models to ascertain student performance. This will assist educators in forming corrective or remedial actions that can help to improve student performance. In addition, this may also assist in formulating curricula that direct students to the career pathways most suitable for them.
The paper is structured as follows: Related Works, Methodology and Results, followed by Conclusions.

Related Works
Student performance is an essential part of secondary-level education, as it shows where the student stands when continuing to higher education. Daud et al. 1 noted that the ability to predict the success of a student is essential and is a fascinating area to explore.
Sokkhey et al. 2 found that mathematics is one of the subjects that has scientific progression on students. All aspects of human life at various levels are influenced by mathematics and there are no instances in life where mathematics is not used.
Exploratory data analysis (EDA) is typically carried out for the following purposes:
• to find errors.
• to check assumptions.
• to determine the tentative choice of suitable models and tools.
• to determine the relationship between the dependent and independent variables.
• to detect the directions and size of the relationship between variables.
Feature selection is a component of dimensionality reduction that reduces the number of features to maximize the performance of a machine learning model. Too many features in a dataset can overwhelm a machine learning classifier and potentially reduce its efficacy. 12 The Boruta feature selection algorithm is a wrapper algorithm built around the random forest model. In the results reported by Tang et al., 12 feature selection was able to effectively identify relevant features and improve the overall evaluation metrics on their medical dataset.
Support Vector Machine (SVM) builds the best possible decision boundary, called a hyperplane, which segregates the feature space into classes. Sekeroglu et al. 13 achieved good results with SVM on the Mathematics and Portuguese subjects from two secondary schools.
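The shadow-feature idea behind Boruta can be illustrated with a small sketch. This is not the Boruta algorithm itself: Boruta relies on random forest importances and a statistical test over many iterations, whereas the sketch below swaps in absolute Pearson correlation as a stand-in importance score and a simple majority vote. The function names, feature names and data are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_feature_selection(features, target, rounds=20, seed=0):
    """Keep features that beat the best shuffled (shadow) copy in most rounds.

    Stand-in importance: absolute correlation (Boruta uses random forests).
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        # Shadow features: shuffled copies that carry no real signal.
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, target))
        threshold = max(shadow_scores)
        # A real feature scores a hit when it beats every shadow feature.
        for name, values in features.items():
            if abs_correlation(values, target) > threshold:
                hits[name] += 1
    return [name for name, count in hits.items() if count > rounds / 2]


# Hypothetical toy data: one informative feature and one constant feature.
toy_features = {
    "previous_grade": list(range(20)),
    "constant_flag": [1] * 20,
}
toy_target = [2 * g + 1 for g in toy_features["previous_grade"]]
selected = shadow_feature_selection(toy_features, toy_target)
print(selected)
```

A feature is retained only if it is consistently more informative than randomized copies of the data, which is the intuition behind removing irrelevant features before classification.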
Naïve Bayes (NB) is based on Bayes' rule of conditional probability and is highly capable of dealing with big datasets. 4 The method uses Bayes' theorem to estimate the probability of a property given a set of data as evidence. The posterior is calculated as the product of the likelihood and the prior, divided by the evidence.
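The posterior calculation described above can be sketched in a few lines. The priors and likelihoods below are hypothetical numbers chosen for illustration, not values from the study, and the function name is an assumption. Under the naïve independence assumption, the likelihood of a class is the product of the per-feature likelihoods.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods):
    """Posterior = likelihood * prior / evidence, for each class.

    priors: {class: P(class)}
    likelihoods: {class: [P(feature_i | class), ...]} for the observed features
    """
    joint = {label: priors[label] * prod(likelihoods[label]) for label in priors}
    evidence = sum(joint.values())  # normalizing constant
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical example: two observed features for one student.
example_priors = {"pass": 0.7, "fail": 0.3}
example_likelihoods = {"pass": [0.2, 0.5], "fail": [0.6, 0.5]}
posterior = naive_bayes_posterior(example_priors, example_likelihoods)
print(posterior)
```

Here the joint scores are 0.07 for pass and 0.09 for fail, so the evidence is 0.16 and the posteriors are 0.4375 and 0.5625 respectively (up to floating-point rounding).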
Multilayer perceptron (MLP) is a type of artificial neural network (ANN). 4 It consists of interconnected perceptrons through which information flows in a single direction, from the input to the output, along multiple routes.
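The single-direction flow from input to output can be illustrated with a minimal forward pass. This is a sketch under stated assumptions: the sigmoid activation, the layer sizes and the weights are all hypothetical, and no training is shown.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for every unit."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer; each output unit acts as a perceptron."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the inputs through each layer in turn (input -> hidden -> output)."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
example_layers = [
    ([[0.5, -0.25], [0.1, 0.3]], [0.0, -0.1]),
    ([[1.0, -1.0]], [0.2]),
]
print(mlp_forward([1.0, 2.0], example_layers))
```

Each hidden unit receives every input, so a single input value reaches the output along several routes, one per hidden unit.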

Methodology and Results
In this research work, the approach consists of seven stages, namely data acquisition, data processing, data integration, data discretization, data transformation, feature selection and classification. The flow of the research is shown in Figure 1.

a) Data acquisition
The student performance dataset is drawn from two Portuguese secondary schools, namely Gabriel Pereira Secondary School (395 students) 14 and Mousinho da Silveira Secondary School (649 students). 15 The surveyed students were taking the subjects Mathematics and Portuguese. The two datasets were combined, giving the personal data and scores of 1044 students for the two subjects. The datasets are visualized in Figures 2 to 6.
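Combining the two datasets amounts to stacking their records. A minimal sketch follows, assuming both datasets share the same columns; the placeholder rows, the `source` tag and the function name are hypothetical and only the sample sizes come from the text.

```python
def merge_datasets(first_rows, second_rows):
    """Stack two lists of student records, tagging each row with its origin."""
    merged = [dict(row, source="first") for row in first_rows]
    merged += [dict(row, source="second") for row in second_rows]
    return merged


# Placeholder rows with the sample sizes reported in the text.
first = [{"student_id": i} for i in range(395)]
second = [{"student_id": i} for i in range(649)]
combined = merge_datasets(first, second)
print(len(combined))  # 1044
```

Keeping an origin tag makes it possible to trace each merged record back to its original dataset.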

b) Data processing
This process validates the two datasets by making sure that no feature has missing values.
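A minimal sketch of this validation step, together with the linear scaling mentioned in the abstract, is given below. It assumes that linear scaling means min-max scaling to the range [0, 1] and that missing values appear as `None` or empty strings; both are assumptions, and the function names are hypothetical.

```python
def has_missing(rows):
    """Return True if any feature value in any record is missing."""
    return any(
        value is None or value == ""
        for row in rows
        for value in row.values()
    )


def min_max_scale(values):
    """Linearly rescale numeric values to [0, 1] (assumed form of linear scaling)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


print(has_missing([{"age": 15, "absences": 4}]))  # False
print(min_max_scale([0, 10, 20]))  # [0.0, 0.5, 1.0]
```

Scaling every feature to the same range keeps attributes with large numeric values, such as absences, from dominating attributes with small ranges.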

c) Data integration
The two datasets were combined into 1044 student records with 33 features. By adopting EDA, 11 the selected features are then assigned to four groups comprising student background (12 features), lifestyle (18 features), history of grades (three features) and all features. Tables 1 to 3 show the features in student background, lifestyle and history of grades respectively. The category 'all' consists of the entire 33 features. Tables 4 and 5 show the binary levels and five levels 5 after discretization, representing the grades of the students.
Table 4. Binary levels classification.
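The discretization of grades into binary and five-level labels can be sketched as follows. Tables 4 and 5 are not reproduced in this text, so the thresholds below are assumptions: a 0 to 20 grading scale with a pass mark of 10, and five bands of 16-20 (A), 14-15 (B), 12-13 (C), 10-11 (D) and 0-9 (F). The function names are hypothetical.

```python
PASS_MARK = 10  # assumed pass threshold on an assumed 0-20 scale

# Assumed lower bounds for each band, checked from highest to lowest.
FIVE_LEVEL_BANDS = [(16, "A"), (14, "B"), (12, "C"), (10, "D"), (0, "F")]


def to_binary_level(final_grade):
    """Map a numeric grade to 'pass' or 'fail'."""
    return "pass" if final_grade >= PASS_MARK else "fail"


def to_five_level(final_grade):
    """Map a numeric grade to one of the letter grades A, B, C, D or F."""
    for lower_bound, label in FIVE_LEVEL_BANDS:
        if final_grade >= lower_bound:
            return label
    raise ValueError("grade must be non-negative")


for grade in (8, 11, 13, 15, 18):
    print(grade, to_binary_level(grade), to_five_level(grade))
```

With these assumed bands, every grade labelled D or better is also a pass in the binary scheme, so the two labelings stay consistent.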

g) Classification
Three supervised machine learning techniques were implemented, namely support vector machine, naïve Bayes and multilayer perceptron, with 60-40 and 50-50 train-test splits and 10-fold cross-validation. The experiments cover four categories of features: student background, student lifestyle, student history of grades (history) and all features. Experiments are carried out on binary-level and five-level classification. Binary-level classification indicates fail or pass, while five-level classification assigns student scores to the grades F, D, C, B and A.
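The evaluation protocols named above can be sketched as index partitions. This is an illustrative sketch using only the standard library, not the authors' code; the shuffling, the seed and the function names are assumptions.

```python
import random


def train_test_split_indices(n_samples, test_fraction, seed=0):
    """Shuffle sample indices and split them into train and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# 60-40 split on the 1044 merged records.
train_idx, test_idx = train_test_split_indices(1044, test_fraction=0.4)
print(len(train_idx), len(test_idx))  # 626 418

# 10-fold cross-validation on the same records.
for fold_train, fold_test in k_fold_indices(1044, k=10):
    pass  # fit on fold_train, evaluate on fold_test
```

In 10-fold cross-validation every record is used for testing exactly once, which gives a less split-dependent estimate than a single train-test split.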
GridSearchCV is applied to perform hyperparameter tuning. The performance metrics are accuracy, precision, recall and F1-score. The experimental results are shown in Tables 6 to 11. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary classification (pass or fail). SVM also obtained the highest accuracy for the five-class classification (grades A, B, C, D and F), with 39%, 38%, 73% and 71% for the four categories respectively. Based on the results, history of student grades shows a significant contribution to good student performance. Table 12 shows the comparison of our models with other research work in 50-50 train-test splits for binary classification (pass or fail) on the dataset from the two Portuguese secondary schools.
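The four performance metrics can be computed from the counts of true positives, false positives and false negatives. The sketch below covers the binary case only, treating "pass" as the positive class, which is an assumption; the labels are hypothetical and are not results from the study.

```python
def binary_metrics(y_true, y_pred, positive="pass"):
    """Accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for five students.
actual = ["pass", "pass", "pass", "fail", "fail"]
predicted = ["pass", "pass", "fail", "pass", "fail"]
scores = binary_metrics(actual, predicted)
print(scores)
```

In this toy case three of five predictions are correct, so accuracy is 0.6, while precision, recall and F1-score are each 2/3.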

Conclusions
The paper presented predictive modelling of student performance based on four categories of features. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary classification (pass or fail). SVM also obtained the highest accuracy for five-class classification (grades A, B, C, D and F), with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance. The study uses data only from Portugal and may not reflect the general case. Future work will include more datasets from different countries and will explore other classifiers.

Open Peer Review
4. The title of the fourth part of the paper, "Methodology", should be changed to "Methodology and Experimental Results".
5. The number of students aged 20 and 21 is not given in Figure 2, is it a problem with the data set?
6. With the experimental results in Table 6-11, there are differences in the results obtained by different classifiers. What is the theoretical basis of the paper for the choice of classifier?
7. In the conclusion, the contributions and flaws of the proposed method are not discussed.

Hu NG, Multimedia University, Cyberjaya, Malaysia
Dear Prof Huiling Chen,
We are greatly appreciative of the insightful comments and helpful suggestions that you have provided. The following are our responses to the issues that you have highlighted:
Your comment: 1. In the results of the Abstract, the authors summarize the classification results of SVM on the dataset. However, we do not see the impact and contribution of the proposed method on the experimental results.

Our response:
Thank you for the comments. From this research work, we found that the history of grades has a significant influence on student performance. This is the main impact and contribution.
Your comment: 2. Introduction -include the description of the innovations, contributions, and the structure of the article.

Our response:
The last 2 paragraphs of the Introduction have been rewritten to reflect this.
The contributions of this paper are the identification of significant features that influence student assessment, which in turn can be used to develop various predictive models to ascertain student performance. This will assist educators in forming corrective or remedial actions that can help to improve student performance. In addition, this may also assist in formulating curricula that direct students to the career pathways most suitable for them. The paper is structured as follows: Related Works, Methodology and Results, followed by Conclusions.
Your comment: 3. In the introduction, I also suggest the authors make a comprehensive investigation on the machine learning method such as the works by
Your comment: 4. The title of the fourth part of the paper, "Methodology", should be changed to "Methodology and Experimental Results".

Our response:
The title has been rephrased to 'Methodology and Results'.
Your comment: 5. The number of students aged 20 and 21 is not given in Figure 2, is it a problem with the data set?
Our response: Figure 2 has been edited to show the number of students aged 20 and aged 21.
Your comment: 6. With the experimental results in Table 6-11, there are differences in the results obtained by different classifiers. What is the theoretical basis of the paper for the choice of classifier?
Our response: From previous work, we found that these classifiers work well for our use cases, which is why only these were applied in this work. We will compare other classifiers in future work.
complete from an experimental perspective, but it needs improvement in related works and the introduction section. After these modifications, the study will be approvable.
1. There are grammatical errors in this paper, which should be revised.
2. The literature review should be more detailed and add at least five more papers.
3. The conclusion should also be elaborated a little more. The major findings in their study should be discussed. For example, out of the classifiers applied, which classifier demonstrated the best accuracy? An evaluation of the methodology that the authors deployed would be welcome.
4. The introduction section should also focus on the research problem. Why is this kind of research beneficial, and to whom? How can management take advantage of it and how can companies evaluate these results to find the best students/universities for job placements?
expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 17 May 2022
Hu NG, Multimedia University, Cyberjaya, Malaysia
Dear Prof Sadiq Hussain,
We are greatly appreciative of the insightful comments and helpful suggestions that you have provided. The following are our responses to the issues that you have highlighted:
Your comment: a) There are grammatical errors in this paper, which should be revised.

Our response:
The paper has been proofread for grammatical errors. Thank you for pointing this out.
Your comment: b) The literature review should be more in detail and add at least five more papers.

Our response:
Six papers on student dropout, interpersonal influences, as well as career decisions have been added to the Literature Review.
Your comment: g) What were the most influential features in student performance? Was there any unnecessary feature that was taken into account?

Our response:
The most influential features come from the history of student grades (Table 3). The following statement can be found in the discussion of the classification results: Based on the results, history of student grades shows a significant contribution to good student performance, where the classification rates obtained are the highest among the four respective categories in each respective classifier.
No unnecessary features were taken into account.
Competing Interests: No competing interests were disclosed.