Research Article
Revised

A Machine Learning Approach to Predictive Modelling of Student Performance

[version 2; peer review: 2 approved]
PUBLISHED 23 May 2022

This article is included in the Research Synergy Foundation gateway.

Abstract

Background - Many factors affect student performance, such as the individual’s background, habits, absenteeism and social activities. Once these factors are identified, corrective actions can be determined to improve performance. This study presents a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal.
Methods – In this study, two datasets are merged to increase the sample size. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias towards heavily weighted attributes. The selected features are then assigned to four groups comprising student background, lifestyle, history of grades and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB) and Multilayer Perceptron (MLP) are designed and their performances evaluated.
Results - The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features, after feature selection. The classification was performed with SVM, NB and MLP with 60-40 and 50-50 train-test splits and 10-fold cross validation. GridSearchCV was applied to perform hyperparameter tuning. The performance metrics were accuracy, precision, recall and F1-Score. SVM obtained the highest accuracy with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary-level classification. SVM also obtained the highest accuracy for five-level classification with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance.

Keywords

Student performance, data mining, support vector machine, naïve bayes, multilayer perceptron

Revised Amendments from Version 1

Referring to comments from the reviewers, we have made the following changes: 1) We have added papers into the Literature Review. 2) We have also amended the Introduction to better show the contributions, significant findings as well as the structure of the paper. 3) We have also added statements that better present the research problem in the Introduction. 4) To improve the flow of the paper, we have also changed the title of the Section 'Methodology' to 'Methodology and Results'. 5) We have also emphasized the significant findings in the discussions and the Conclusions. 6) Updates have been made to Fig 2 to show missing values. 7) The Conclusions have been revised to include the significant results, contribution and the weakness, which will be addressed in future work.

See the authors' detailed response to the review by Huiling Chen
See the authors' detailed response to the review by Sadiq Hussain

Introduction

A student is a person who attends a school or other educational institution to achieve a certain level of knowledge or skill in a course under the supervision of an educator. Almost everyone was once a student with a responsibility to acquire a proper education. Acquiring knowledge through the right education is of utmost importance, and each person should have equal access to education.

When discussing education at the secondary level, a vital aspect to consider is student performance. Student performance can be assessed in a variety of dimensions, either through exam-based assessment or participation-based assessment. Exam-based assessment includes quizzes, midterms, and final exams, while participation-based assessment is a two-way communication during learning and group activities.

Apart from the obvious, many factors can affect student performance, such as individual habits, absenteeism and social activities after school. This motivates the use of machine learning to learn patterns from data and predict how well a student will perform, so that these factors can be acknowledged and weak performance can be detected and improved as early as possible.

The contributions of this paper are the identification of significant features that influence student assessment, which in turn can be used to develop various predictive models to ascertain student performance. This will assist educators in forming corrective or remedial actions that can help to improve student performance. In addition, this may also assist in formulating curricula that direct students to the career pathways that are most suitable for them.

The paper is structured as follows: Related Works, Methodology and Results, followed by Conclusions.

Related Works

Student performance is an essential part of secondary-level education, as it shows where the student stands when continuing to higher education. Daud et al.1 noted that the ability to predict the success of a student is essential and is a fascinating area to explore.

Sokkhey et al.2 noted that mathematics is one of the subjects that drives students’ scientific progression. Mathematics influences all aspects of human life at various levels, and there are no instances in life where mathematics is not used.

Akhtar et al.3 found that social status is correlated with a family’s social and monetary wealth. They identified an effect of monetary wealth on students’ grades in Pakistan.

Amazona et al.4 as well as Hussain et al.5 adopted educational data mining (EDM) methods to gather, archive and study information concerning students’ assessment and learning.

On another note, researchers have also looked at student dropout,6 interpersonal influences7 as well as career decisions after graduation, albeit at the tertiary level.8–10

Exploratory data analysis (EDA)11 is a method of analyzing a dataset to summarize its important features via visualization. EDA helps:

  • to find errors.

  • to check assumptions.

  • to determine the tentative choice of suitable models and tools.

  • to determine the relationship between the dependent and independent variables.

  • to detect the directions and size of the relationship between variables.

Feature selection is a form of dimensionality reduction that reduces the number of features to maximize the performance of a machine learning model. Too many features in a dataset can overwhelm a machine learning classifier and potentially reduce its efficacy.12

The Boruta feature selection algorithm is a wrapper algorithm built around the random forest model. Tang et al.12 reported that feature selection effectively identified relevant features and improved the overall evaluation metrics on their medical dataset.

Support Vector Machine (SVM) builds the best possible decision boundary, called a hyperplane, which segregates the feature space into classes. Sekeroglu et al.13 achieved good results with SVM on Mathematics and Portuguese subjects from two secondary schools.

Naïve Bayes (NB) is based on Bayes’ rule of conditional probability and is capable of dealing with big datasets.4 The method uses Bayes’ theorem to estimate the probability of a property given a set of data as evidence. The posterior is calculated as the product of the likelihood and the prior, divided by the evidence.

Multilayer perceptron (MLP) is a type of artificial neural network (ANN).4 It consists of interconnected perceptrons in which information flows in a single direction from the input to the output through multiple routes.
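
The posterior described above can be written out explicitly. The symbols here are assumptions introduced for illustration and do not appear in the paper: C denotes a class label (for example Pass or Fail) and x = (x_1, ..., x_n) denotes the feature vector of one student. The second line states the conditional independence assumption that gives the method its name.

```latex
P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\, P(C)}{P(\mathbf{x})},
\qquad
P(\mathbf{x} \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
```

Because the evidence P(x) is the same for every class, the predicted class is the one that maximizes the numerator.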

Methodology and Results

In this research work, the approach consists of seven stages, namely data acquisition, data processing, data integration, data discretization, data transformation, feature selection and classification. The flow of the research is shown in Figure 1.

Figure 1. Flow of the processes.

a) Data acquisition

The dataset of student performance is taken from a population of two Portuguese secondary schools, namely Gabriel Pereira Secondary School (395 students)14 and Mousinho da Silveira Secondary School (649 students).15 In the survey, the students were taking the subjects Mathematics and Portuguese. The two datasets were combined and consisted of 1044 students’ personal data and scores for the two subjects. Visualizations of the datasets are shown in Figures 2 to 6.

Figure 2. Distribution of age.

Figure 3. Distribution of gender.

Figure 4. Distribution of subject.

Figure 5. Distribution of student accommodation.

Figure 6. Distribution of relationship with parents.

b) Data processing

This process validates the two datasets by making sure that no feature has missing values.

c) Data integration

The two datasets were combined and consisted of 1044 students’ records with 33 features. By adopting EDA,11 the selected features are then assigned to four groups comprising student background (12 features), lifestyle (18 features), history of grades (three features) and all features. Tables 1 to 3 show the features in student background, lifestyle and history of grades respectively. The category ‘all’ consists of the entire 33 features.

Table 1. Student background.

Feature | Description | Value
sex | Gender of student | Male or Female
age | Age of student | 15–22
school | School of student | Gabriel Pereira or Mousinho da Silveira
address | Type of student’s home address | Urban or Rural
famsize | Size of family | ≤3 or >3
Pstatus | Parent’s cohabitation status | Living together or apart
Medu | Education of parents | None, Primary education, 5th to 9th grade, Secondary education, Higher education
Fedu | Education of parents | None, Primary education, 5th to 9th grade, Secondary education, Higher education
Mjob | Job of parents | At home, Civil services, Teacher, Healthcare related, Other
Fjob | Job of parents | At home, Civil services, Teacher, Healthcare related, Other
reason | Reason to choose the school | Close to home, School reputation, Course preference, Other
guardian | Guardian of student | Father or mother, Other

Table 2. Student lifestyle.

Feature | Description | Value
traveltime | Travel time from home to school | <15 minutes; 15 to 30 minutes; 30 minutes to 1 hour; >1 hour
studytime | Weekly study time | <2 hours; 2 to 5 hours; 5 to 10 hours; >10 hours
failures | Number of past class failures | n if 1 ≤ n < 3, else 4
schoolsup | Extra educational school support | Yes or no
famsup | Educational support from family | Yes or no
paid | Extra paid classes within the course subject | Yes or no
activities | Extra-curricular activities | Yes or no
nursery | Attended nursery school | Yes or no
higher | Plans for higher education | Yes or no
internet | Have internet access at home | Yes or no
romantic | In a romantic relationship | Yes or no
famrel | Quality relationship with family | Very low (1) to very high (5)
freetime | Free time after school | Very low (1) to very high (5)
goout | Going out with friends | Very low (1) to very high (5)
Dalc | Weekday alcohol consumption | Very low (1) to very high (5)
Walc | Weekend alcohol consumption | Very low (1) to very high (5)
health | Current health status | Very low (1) to very high (5)
absences | Number of school absences | 0–93

Table 3. Student history of grades.

Feature | Description | Value
G1 | First period grade | 0–20
G2 | Second period grade | 0–20
G3 | Final grade | 0–20
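
The grouping in Tables 1 to 3 can be sketched as a simple lookup. This is only an illustrative sketch, not the authors' code: the names FEATURE_GROUPS and select_group are hypothetical, the column names follow the tables, and the spelling "activities" is assumed to match the public dataset.

```python
# Feature groups as listed in Tables 1 to 3.
# Assumption: column names match the tables; "activities" spelling is assumed.
BACKGROUND = ["sex", "age", "school", "address", "famsize", "Pstatus",
              "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian"]
LIFESTYLE = ["traveltime", "studytime", "failures", "schoolsup", "famsup",
             "paid", "activities", "nursery", "higher", "internet",
             "romantic", "famrel", "freetime", "goout", "Dalc", "Walc",
             "health", "absences"]
HISTORY = ["G1", "G2", "G3"]

# Hypothetical lookup from category name to its feature list.
FEATURE_GROUPS = {
    "background": BACKGROUND,
    "lifestyle": LIFESTYLE,
    "history": HISTORY,
    "all": BACKGROUND + LIFESTYLE + HISTORY,
}


def select_group(record, group):
    """Return only the columns of `record` (a dict) that belong to `group`."""
    return {name: record[name] for name in FEATURE_GROUPS[group]}
```

The group sizes (12, 18 and 3) add up to the 33 features of the 'all' category.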

d) Data discretization

Tables 4 and 5 show the binary levels and the five levels5 obtained after discretization, representing the grades of the students.

Table 4. Binary levels classification.

Ordinal categorical | Value
Pass | 10–20
Fail | 0–9

Table 5. 5 Levels classification.

Ordinal categorical | Value
A | 15–20
B | 13–14
C | 10–12
D | 8–9
F | 0–7
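
The two discretization schemes in Tables 4 and 5 can be expressed as small mapping functions. This is a minimal sketch under the assumption that grades are integers from 0 to 20. The function names are hypothetical and are not taken from the paper.

```python
def to_binary_level(grade):
    """Map a 0-20 grade to 'Pass' (10-20) or 'Fail' (0-9), as in Table 4."""
    if not 0 <= grade <= 20:
        raise ValueError("grade must be between 0 and 20")
    return "Pass" if grade >= 10 else "Fail"


def to_five_level(grade):
    """Map a 0-20 grade to one of A, B, C, D or F, as in Table 5."""
    if not 0 <= grade <= 20:
        raise ValueError("grade must be between 0 and 20")
    if grade >= 15:
        return "A"
    if grade >= 13:
        return "B"
    if grade >= 10:
        return "C"
    if grade >= 8:
        return "D"
    return "F"
```

For example, a final grade of 12 maps to Pass in the binary scheme and to C in the five-level scheme.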

e) Data transformation

The features are normalized with linear scaling to avoid bias towards heavily weighted attributes.
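
The paper does not state which form of linear scaling was used. One common choice is min-max scaling to the range [0, 1], and the sketch below assumes that form. The function name min_max_scale is hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1].

    Assumption: min-max scaling; the paper only says "linear scaling".
    A constant column is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Under this scheme a feature such as absences (0–93) and a feature such as G1 (0–20) end up on the same scale, so neither dominates a distance-based classifier purely because of its range.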

f) Feature selection

Next, Boruta feature selection was performed to remove irrelevant features.
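
In general terms, Boruta judges each real feature against randomly shuffled copies of the features, called shadow features, and keeps a feature only if its importance is consistently higher than that of the best shadow. The sketch below covers only the first step, building the shadow features, using the standard library. It is not the authors' implementation, the function name is hypothetical, and the random forest importance comparison is omitted.

```python
import random


def add_shadow_features(rows, seed=0):
    """Append one shuffled copy of every column to each row.

    `rows` is a list of equal-length lists (one list per student).
    Shuffling a column keeps its value distribution but breaks any
    relationship with the target, which is what makes it a baseline.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    shadows = []
    for col in columns:
        shuffled = col[:]
        rng.shuffle(shuffled)
        shadows.append(shuffled)
    return [list(row) + [shadow[i] for shadow in shadows]
            for i, row in enumerate(rows)]
```

A feature importance model would then be fitted on the widened table, and real features would be compared against the shadow columns.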

g) Classification

Three supervised machine learning techniques were implemented, namely support vector machine, naïve Bayes and multilayer perceptron, with 60–40 and 50–50 train-test splits and 10-fold cross validation. The experiments cover four categories: student background, student lifestyle, student history of grades (history) and all features. Experiments are carried out on binary-level and five-level classification. Binary-level classification indicates fail or pass, while five-level classification assigns the student scores F, D, C, B and A.
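
The splitting schemes named above can be sketched on record indices alone. This is a minimal illustration using the standard library. Whether the authors shuffled or stratified the records is not stated, so the shuffling, the seed and both function names are assumptions.

```python
import random


def train_test_split_indices(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Every record appears in exactly one test fold.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

With the 1044 records of the combined dataset, a 60–40 split gives 626 training and 418 test records, and a 50–50 split gives 522 of each.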

GridSearchCV is applied to perform hyperparameter tuning. The performance metrics are accuracy, precision, recall and F1-Score. The experimental results are shown in Tables 6 to 11.
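
The four metrics can be computed from predicted and true labels as follows for the binary case. This is a sketch using the standard definitions, with Pass assumed to be the positive class. The function name is hypothetical, and the paper does not state how the metrics were averaged in the five-level case.

```python
def binary_metrics(y_true, y_pred, positive="Pass"):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note that a model predicting Pass for every student obtains a recall of 1.0 and a precision equal to its accuracy, so accuracy should be read together with the other metrics when the classes are imbalanced.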

Table 6. SVM (Binary levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.768 | 0.789 | 0.899 | 0.895
Precision | 0.768 | 0.804 | 0.928 | 0.934
Recall | 1.000 | 0.958 | 0.941 | 0.929
F1 Score | 0.869 | 0.874 | 0.934 | 0.931
50 Train – 50 Test
Accuracy | 0.772 | 0.798 | 0.908 | 0.900
Precision | 0.772 | 0.809 | 0.932 | 0.931
Recall | 0.999 | 0.966 | 0.948 | 0.941
F1 Score | 0.871 | 0.880 | 0.940 | 0.936

Table 7. SVM (5 Levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.394 | 0.388 | 0.742 | 0.716
Precision | 0.195 | 0.322 | 0.750 | 0.715
Recall | 0.394 | 0.388 | 0.742 | 0.716
F1 Score | 0.246 | 0.286 | 0.742 | 0.708
50 Train – 50 Test
Accuracy | 0.389 | 0.381 | 0.729 | 0.708
Precision | 0.191 | 0.329 | 0.735 | 0.708
Recall | 0.389 | 0.381 | 0.729 | 0.708
F1 Score | 0.230 | 0.300 | 0.711 | 0.699

Table 8. NB (Binary levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.760 | 0.785 | 0.903 | 0.891
Precision | 0.775 | 0.826 | 0.954 | 0.948
Recall | 0.968 | 0.913 | 0.918 | 0.907
F1 Score | 0.861 | 0.867 | 0.935 | 0.927
50 Train – 50 Test
Accuracy | 0.761 | 0.787 | 0.907 | 0.894
Precision | 0.779 | 0.830 | 0.958 | 0.948
Recall | 0.964 | 0.911 | 0.922 | 0.912
F1 Score | 0.861 | 0.868 | 0.939 | 0.930

Table 9. NB (5 Levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.373 | 0.250 | 0.748 | 0.515
Precision | 0.283 | 0.263 | 0.752 | 0.515
Recall | 0.373 | 0.250 | 0.748 | 0.515
F1 Score | 0.306 | 0.166 | 0.747 | 0.497
50 Train – 50 Test
Accuracy | 0.370 | 0.257 | 0.719 | 0.542
Precision | 0.284 | 0.261 | 0.747 | 0.537
Recall | 0.370 | 0.257 | 0.739 | 0.542
F1 Score | 0.310 | 0.173 | 0.740 | 0.523

Table 10. MLP (Binary levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.767 | 0.787 | 0.899 | 0.886
Precision | 0.769 | 0.803 | 0.928 | 0.921
Recall | 0.995 | 0.957 | 0.941 | 0.934
F1 Score | 0.868 | 0.873 | 0.934 | 0.927
50 Train – 50 Test
Accuracy | 0.767 | 0.785 | 0.906 | 0.886
Precision | 0.773 | 0.791 | 0.932 | 0.919
Recall | 0.987 | 0.982 | 0.948 | 0.936
F1 Score | 0.867 | 0.786 | 0.940 | 0.927

Table 11. MLP (5 Levels).

Metrics | Background | Lifestyle | History | All
60 Train – 40 Test
Accuracy | 0.386 | 0.383 | 0.744 | 0.715
Precision | 0.236 | 0.305 | 0.751 | 0.707
Recall | 0.386 | 0.383 | 0.744 | 0.715
F1 Score | 0.264 | 0.301 | 0.735 | 0.700
50 Train – 50 Test
Accuracy | 0.371 | 0.375 | 0.720 | 0.705
Precision | 0.213 | 0.361 | 0.721 | 0.708
Recall | 0.391 | 0.385 | 0.720 | 0.715
F1 Score | 0.239 | 0.326 | 0.692 | 0.706

SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50–50 train–test splits for binary classification (pass or fail). SVM also obtained the highest accuracy for the five-class classification (grade A, B, C, D and F) with 39%, 38%, 73% and 71% for the four categories respectively. Based on the results, the history of student grades contributes significantly to predicting student performance: its classification rates are the highest among the four categories for each classifier. This finding is consistent with the observations from Hwang et al.,16 Mega et al.17 and Waheed et al.,18 that students’ performance is highly related to the history of grades.

Table 12 shows the comparison of our models, in 50–50 train–test splits for binary classification (pass or fail), with other research work on the dataset from the two Portuguese secondary schools.

Table 12. Comparison of our models with other research work on two Portuguese secondary schools.

Model and features | Data: Mathematics (395 students) | Data: Portuguese (649 students) | Data: Mathematics and Portuguese (1044 students)
SVM on all features [our model] | - | - | 0.90
SVM on history of grades [our model] | - | - | 0.91
SVM on all features (ref. 4) | 0.89 | - | -
Naive predictor on all features (ref. 17) | 0.92 | 0.90 | -
SVM on all features (ref. 17) | 0.86 | 0.91 | -

Conclusions

The paper presented predictive modelling of student performance based on four categories of features. SVM obtained the highest accuracy with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary classification (pass or fail). SVM also obtained the highest accuracy for five-class classification (grade A, B, C, D and F) with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance. The study looks at data only from Portugal and may not reflect a general view of the case. Future work will include more datasets from different countries. Other classifiers will also be explored and investigated.

Data availability

Underlying data

Kaggle: A machine learning approach to predictive modelling of student performance

https://www.kaggle.com/larsen0966/student-performance-data-set

and

https://archive.ics.uci.edu/ml/datasets/Student+Performance

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Ethics approval

Ethical Approval Number: EA1612021 (From Technology Transfer Office (TTO), Multimedia University).

how to cite this article
Ng H, bin Mohd Azha AA, Yap TTV and Goh VT. A Machine Learning Approach to Predictive Modelling of Student Performance [version 2; peer review: 2 approved]. F1000Research 2022, 10:1144 (https://doi.org/10.12688/f1000research.73180.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 13 Jun 2022
Huiling Chen, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, China 
Approved
The authors have solved all the comments, ... Continue reading
HOW TO CITE THIS REPORT
Chen H. Reviewer Report For: A Machine Learning Approach to Predictive Modelling of Student Performance [version 2; peer review: 2 approved]. F1000Research 2022, 10:1144 (https://doi.org/10.5256/f1000research.134172.r138617)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 16 Jun 2022
    Hu NG, Faculty of Computer and Informatics, Multimedia University, Cyberjaya, 63100, Malaysia
    16 Jun 2022
    Author Response
    Thank you for the approval.

    Cheers,
    nghu
    Competing Interests: No competing interests were disclosed.
Reviewer Report 24 May 2022
Sadiq Hussain, System Administrator, Dibrugarh University, Dibrugarh, Assam, India 
Approved
All my queries and concerns were handled ... Continue reading
HOW TO CITE THIS REPORT
Hussain S. Reviewer Report For: A Machine Learning Approach to Predictive Modelling of Student Performance [version 2; peer review: 2 approved]. F1000Research 2022, 10:1144 (https://doi.org/10.5256/f1000research.134172.r138618)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 24 May 2022
    Hu NG, Faculty of Computer and Informatics, Multimedia University, Cyberjaya, 63100, Malaysia
    24 May 2022
    Author Response
    Dearest Reviewer,

    Thank you for the comments. Thank you for everything.
    Competing Interests: No competing interests were disclosed.
Version 1
PUBLISHED 11 Nov 2021
Reviewer Report 12 May 2022
Huiling Chen, College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, China 
Approved with Reservations
The authors propose a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal. The overall structure of the article is also reasonable. The interpretation and description of ... Continue reading
HOW TO CITE THIS REPORT
Chen H. Reviewer Report For: A Machine Learning Approach to Predictive Modelling of Student Performance [version 2; peer review: 2 approved]. F1000Research 2022, 10:1144 (https://doi.org/10.5256/f1000research.76815.r136540)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 23 May 2022
    Hu NG, Faculty of Computer and Informatics, Multimedia University, Cyberjaya, 63100, Malaysia
    23 May 2022
    Author Response
    Dear Prof Huiling Chen,
    We are greatly appreciative of the insightful comments and helpful suggestions that you have provided.
    The following are our response on the issues that you have ... Continue reading
Reviewer Report 18 Nov 2021
Sadiq Hussain, System Administrator, Dibrugarh University, Dibrugarh, Assam, India 
Approved with Reservations
The paper presented a predictive model on student performance based on the data from schools in Portugal. Student grades are the most important feature observed in the study. The study is complete from an experimental perspective, but it needs improvement in ... Continue reading
HOW TO CITE THIS REPORT
Hussain S. Reviewer Report For: A Machine Learning Approach to Predictive Modelling of Student Performance [version 2; peer review: 2 approved]. F1000Research 2022, 10:1144 (https://doi.org/10.5256/f1000research.76815.r99883)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 23 May 2022
    Hu NG, Faculty of Computer and Informatics, Multimedia University, Cyberjaya, 63100, Malaysia
    23 May 2022
    Author Response
    Dear Prof Sadiq Hussain,
    We are greatly appreciative of the insightful comments and helpful suggestions that you have provided.
    The following are our response on the issues that you have ... Continue reading