Keywords
Student performance, data mining, support vector machine, naïve bayes, multilayer perceptron
This article is included in the Research Synergy Foundation gateway.
A student is a person who attends a school or other educational institution to achieve a certain level of knowledge or skill in a course under the supervision of an educator. Almost everyone was once a student with a responsibility to acquire a proper education. Acquiring knowledge through the right education is of utmost importance, and each person should have equal access to education.
When discussing education at the secondary level, a vital aspect to consider is student performance. Student performance can be assessed in a variety of dimensions, either through exam-based assessment or participation-based assessment. Exam-based assessment includes quizzes, midterms, and final exams, while participation-based assessment is a two-way communication during learning and group activities.
Apart from the obvious, many factors can affect student performance, such as individual habits, absenteeism and social activities after school. This motivates letting machines learn patterns from data so that they can predict how well a student will perform; by acknowledging these factors, weak performance can be detected and improved as early as possible. The objectives of this paper are to determine significant features with exploratory data analysis and then to use those features to develop various predictive models of student performance.
Student performance is an essential part of secondary-level education, as it shows where the student stands when continuing to higher education. Daud et al.1 noted that the ability to predict the success of a student is essential and is a fascinating area to explore.
Sokkhey et al.2 found that mathematics is one of the subjects that contributes to students’ scientific progression. All aspects of human life at various levels are influenced by mathematics and there are no instances in life where mathematics is not used.
Akhtar et al.3 discovered that social status is correlated with a family’s social and monetary wealth. They identified an effect of monetary wealth on students’ grades in Pakistan.
Amazona et al.4 adopted an educational data mining (EDM) method to gather, archive, and study information concerning students’ assessment and learning.
Exploratory data analysis (EDA)5 is a method of analyzing a dataset to summarize its important features via visualization. EDA helps:
• to find errors.
• to check assumptions.
• to determine the tentative choice of suitable models and tools.
• to determine the relationship between the dependent and independent variables.
• to detect the directions and size of the relationship between variables.
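The last two points can be illustrated with a correlation coefficient, whose sign gives the direction and whose magnitude gives the size of a linear relationship. The following is a minimal sketch, not the analysis used in this paper; the function name `pearson_r` and the toy values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (assumption): absences versus final grade.
absences = [0, 2, 4, 6, 8]
final_grade = [16, 14, 12, 10, 8]
r = pearson_r(absences, final_grade)
print(f"r = {r:.2f}")  # negative sign: grade falls as absences rise
```

A value of r close to -1 or +1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship; it says nothing about non-linear dependence.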
Feature selection is a form of dimensionality reduction that reduces the number of features to maximize the performance of a machine learning model. Too many features in a dataset can overwhelm a machine learning classifier and potentially reduce its efficacy.6
The Boruta feature selection algorithm is a wrapper algorithm built around the random forest model. In the results of Tang et al.,6 feature selection effectively identified relevant features and improved the overall evaluation metrics on their medical dataset.
A support vector machine (SVM) builds the best possible decision boundary, called a hyperplane, which separates the feature space into classes. Sekeroglu et al.7 achieved good results with SVM on the Mathematics and Portuguese subjects from two secondary schools.
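To make the idea of a separating hyperplane concrete, the sketch below shows only the prediction step of a linear classifier: a point is assigned to a class according to which side of the hyperplane w·x + b = 0 it falls on. It does not train an SVM (finding the maximum-margin w and b is the part an SVM library solves), and the weights, bias and feature meanings are hypothetical values chosen for illustration.

```python
def linear_decision(weights, bias, x):
    """Signed score w.x + b; the hyperplane is where this score equals zero."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Return +1 on the non-negative side of the hyperplane, otherwise -1."""
    return 1 if linear_decision(weights, bias, x) >= 0 else -1


# Hypothetical parameters (assumption): two features, e.g. G1 and G2,
# with the boundary G1 + G2 = 20.
weights = [1.0, 1.0]
bias = -20.0

print(predict(weights, bias, [12, 11]))  # score  3.0 -> class +1
print(predict(weights, bias, [6, 8]))    # score -6.0 -> class -1
```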
Naïve Bayes is based on Bayes’ rule of conditional probability and is highly capable of dealing with big datasets.4 The method uses Bayes’ theorem to estimate the probability of a property given a set of data as evidence. The posterior is calculated as the product of the likelihood and the prior, divided by the evidence.
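The posterior calculation can be sketched in a few lines. Under the naïve assumption that features are conditionally independent given the class, the likelihood is the product of per-feature likelihoods, so posterior = (prior × likelihood) / evidence. The function name `posterior` and all probabilities below are hypothetical and serve only as a worked example.

```python
def posterior(priors, likelihoods):
    """Posterior P(class | x) for each class under the naive Bayes assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(x_i | class) for each observed feature x_i]}
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for feature_likelihood in likelihoods[cls]:
            p *= feature_likelihood  # independence: multiply per-feature terms
        joint[cls] = p
    evidence = sum(joint.values())  # P(x), the normalizing constant
    if evidence == 0:
        raise ValueError("evidence is zero; posterior is undefined")
    return {cls: p / evidence for cls, p in joint.items()}


# Hypothetical numbers (assumption), not estimated from the study's data.
priors = {"pass": 0.6, "fail": 0.4}
likelihoods = {"pass": [0.5, 0.8], "fail": [0.5, 0.2]}
post = posterior(priors, likelihoods)
print(post)  # "pass" is about 0.857 (0.24 / 0.28), "fail" about 0.143
```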
The multilayer perceptron is a type of artificial neural network (ANN).4 It consists of interconnected perceptrons through which information flows in a single direction, from the input to the output, along multiple routes.
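The single-direction flow from input to output can be sketched as a forward pass through fully connected layers. This is an illustrative sketch only: the sigmoid activation, layer sizes and weight values are assumptions, and training (adjusting the weights) is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer; each row of weights feeds one neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate inputs through the layers in order (input -> output)."""
    activation = inputs
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


# Hypothetical network (assumption): 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, -0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.2]),                    # output layer
]
output = forward([0.3, 0.7], layers)
print(output)  # one value in (0, 1)
```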
In this research work, the approach consists of seven stages, namely data acquisition, data processing, data integration, data discretization, data transformation, feature selection and classification. The flow of the research is shown in Figure 1.
a) Data acquisition
The student performance dataset is taken from two Portuguese secondary schools, namely Gabriel Pereira Secondary School (395 students)8 and Mousinho da Silveira Secondary School (649 students).9 In the survey, the students were taking two subjects, Mathematics and Portuguese. The two datasets were combined and consisted of 1044 students’ personal data and scores for the two subjects. The datasets are visualized in Figures 2 to 6.
b) Data processing
This process validates the two datasets by making sure that no feature has missing values.
c) Data integration
The two datasets were combined into 1044 student records with 33 features. By adopting EDA,5 the selected features are then assigned to four groups: student background (12 features), lifestyle (18 features), history of grades (three features) and all features. Tables 1 to 3 show the features in the student background, lifestyle and history of grades groups respectively, while the all-features group consists of the entire 33 features.
Feature | Description | Value |
---|---|---|
G1 | First period grade | 0–20 |
G2 | Second period grade | 0–20 |
G3 | Final grade | 0–20 |
d) Data discretization
The values for G1, G2 and G3 in Table 3 are discretized into binary and 5-level classifications. Tables 4 and 5 show the binary classification and 5-level classification.
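The discretization step can be sketched as a mapping from a 0–20 grade to a label. The cut-off values below (pass at 10 and above; letter boundaries at 10, 12, 14 and 16) are assumptions chosen for illustration, since Tables 4 and 5 are not reproduced here; the actual thresholds should be taken from those tables.

```python
def check_grade(grade):
    """Grades in this dataset lie in the range 0-20."""
    if not 0 <= grade <= 20:
        raise ValueError(f"grade out of range 0-20: {grade}")


def to_binary(grade, pass_mark=10):
    """Binary label; pass_mark=10 is an assumed threshold."""
    check_grade(grade)
    return "pass" if grade >= pass_mark else "fail"


def to_five_level(grade, cutoffs=((16, "A"), (14, "B"), (12, "C"), (10, "D"))):
    """Five-level label; the cutoffs are assumed lower bounds per letter."""
    check_grade(grade)
    for lower_bound, label in cutoffs:
        if grade >= lower_bound:
            return label
    return "F"


for g in (7, 10, 13, 15, 18):
    print(g, to_binary(g), to_five_level(g))
```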
e) Data transformation
The features are normalized with linear scaling to avoid bias towards heavily weighted attributes.
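One common form of linear scaling is min-max normalization, which maps each feature onto a fixed range so that features with large numeric ranges do not dominate. The sketch below assumes this is the scaling meant and uses the range [0, 1]; the function name and sample values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min, max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical grade column (assumption) on the 0-20 scale.
grades = [0, 5, 10, 20]
print(min_max_scale(grades))  # [0.0, 0.25, 0.5, 1.0]
```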
f) Feature selection
Next, Boruta feature selection was performed to remove irrelevant features.
g) Classification
Three supervised machine learning techniques were implemented, namely support vector machine, naïve Bayes and multilayer perceptron, with 60–40 and 50–50 train–test splits and 10-fold cross-validation. The models were trained on four feature categories: student background, student lifestyle, student history of grades (history) and all features. Experiments were carried out on binary and five-level classification. Binary classification indicates fail or pass, while five-level classification assigns the student scores F, D, C, B and A.
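The two evaluation protocols can be sketched as index partitions over the 1044 records. This is a minimal, library-free illustration; it assumes that "60–40" means 60% training and 40% testing, uses a fixed random seed for repeatability, and does not stratify by class. The function names are hypothetical.

```python
import random


def train_test_split_indices(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 and cut them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n_students = 1044
train_idx, test_idx = train_test_split_indices(n_students, 0.6)
print(len(train_idx), len(test_idx))  # 626 418

for fold_no, (train, test) in enumerate(k_fold_indices(n_students), start=1):
    print(fold_no, len(train), len(test))
```

In 10-fold cross-validation the reported score is typically the average of the per-fold scores, which makes it less sensitive to one particular split than a single 50–50 or 60–40 partition.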
GridSearchCV is applied to perform hyperparameter tuning. The performance metrics are accuracy, precision, recall and F1-score. The experimental results are shown in Tables 6 to 11.
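For the binary task, the four metrics can be computed from counts of true positives, false positives and false negatives. The sketch below treats "pass" as the positive class, which is an assumption, and uses hypothetical labels rather than results from this study.

```python
def binary_metrics(y_true, y_pred, positive="pass"):
    """Accuracy, precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (assumption), for illustration only.
y_true = ["pass", "pass", "pass", "fail", "fail"]
y_pred = ["pass", "pass", "fail", "pass", "fail"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.6; precision, recall and F1 are each 2/3
```

For the five-level task, precision, recall and F1-score are computed per class and then averaged; which averaging scheme the paper used is not stated here.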
SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50–50 train–test splits for binary classification (pass or fail). SVM also obtained the highest accuracy for the five-class classification (grade A, B, C, D and F) with 39%, 38%, 73% and 71% for the four categories respectively. Based on the results, the history of student grades contributes significantly to student performance: its classification rates are the highest among the four categories for each classifier. This finding is consistent with the observations of Hwang et al.,10 Mega et al.11 and Waheed et al.12 that students’ performance is highly related to their previous results.
Table 12 compares our models with other research work in 50–50 train–test splits for binary classification (pass or fail) on the dataset from the two Portuguese secondary schools.
Model and features | Mathematics (395 students) | Portuguese (649 students) | Mathematics and Portuguese subjects |
---|---|---|---|
SVM on all features [our model] | - | - | 0.90 |
SVM on history of grades [our model] | - | - | 0.91 |
SVM on all features4 | 0.89 | - | - |
naive predictor on all features11 | 0.92 | 0.90 | - |
SVM on all features11 | 0.86 | 0.91 | - |
This paper presented predictive modelling of student performance based on four categories of features. Based on the results, the history of student grades shows a significant contribution to student performance. The study looks only at data from Portugal and may not reflect a general view of the case. Future work will include more datasets from different countries.
Kaggle: A machine learning approach to predictive modelling of student performance
https://www.kaggle.com/larsen0966/student-performance-data-set
and
https://archive.ics.uci.edu/ml/datasets/Student+Performance
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Ethical Approval Number: EA1612021 (From Technology Transfer Office (TTO), Multimedia University).
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
References
1. Wang Z, Liang G, Chen H: Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Applied Sciences. 2022; 12 (9).
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: data mining
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Educational Data Mining, Medical Analytics
Version 2 (revision): 23 May 22
Version 1: 11 Nov 21