Research Article

Application of K-Means Clustering for Job Applicant Analysis in Construction Firms Using R

[version 1; peer review: awaiting peer review]
PUBLISHED 10 Dec 2025

This article is included in the RPackage gateway.

Abstract

This study applies the K-Means Clustering algorithm to analyse job applicant data at a construction consulting firm. Considering three main variables—AutoCAD drawing skills, planning and supervision report writing skills, and adaptability—this study aims to categorize job applicants into three groups: “Rejected,” “Under Consideration,” and “Accepted.” The clustering process was conducted using the R program with an initial centroid-based approach and visualized through 2D and 3D scatter plots to map the distribution patterns of applicants based on their attributes. The results of the study show that this clustering not only provides a deep understanding of the characteristics of applicants but also supports the company in optimizing data-driven recruitment strategies. Applicants with high scores in all three variables tend to fall into the “Accepted” category, while those with moderate scores are categorized as “Under Consideration.” The visualization of results offers sharper insights into applicant distribution, which can be used to identify training needs or improvements to the selection process. This research contributes to the development of an efficient and objective data-driven recruitment system in the construction sector.

Keywords

K-Means Clustering; data-driven recruitment; workforce selection; cluster visualization; construction competencies

1. Introduction

1.1 Research background

In the modern workplace, an effective workforce selection process is one of the key factors in supporting human resource development. Career transformation, as part of this development, is influenced not only by technical skills but also by an individual’s adaptability and interpersonal competencies. As explained by Pala (2021), data-driven workforce characteristic grouping can provide deep insights into how workers adapt to various work environments. The recruitment process not only involves searching for candidates from external sources but also requires effective and efficient decision-making, considering recruitment sources, techniques used, and the potential of available local labor (Widodo, 2018). Recruitment begins with job analysis, which helps describe the tasks, responsibilities, and qualification requirements for each position, thereby facilitating the selection process and ensuring the suitability of applicants for the positions offered (Widodo, 2018).

In the context of job searching, individuals need to understand their personal assets, study job opportunities, develop career plans, and build networks to obtain jobs that match their skills and needs (London, 1973). This process reflects the importance of self-development to compete in an increasingly complex job market. Gangl (2003) emphasizes that job skills are the primary factor determining an individual’s chances of securing employment, particularly in sectors requiring specific technical expertise. In the construction industry, technical skills such as the ability to use AutoCAD for drafting, prepare planning reports, and adapt to collaborate with various stakeholders are highly valued attributes.

The increase in infrastructure development in Indonesia, such as the Nusantara Capital City (Ibu Kota Negara, hereafter IKN) project in East Kalimantan, is driving the need for competent construction workers. This national strategic project aims not only to relocate the country’s administrative functions from Jakarta but also to promote economic equality and development outside Java (Irmawan et al., 2023). The IKN project requires over 260,000 construction workers by 2024, spanning various educational levels and backgrounds, to meet the demands of technical work and social adaptation on-site (Supriyanti et al., 2023). However, the labor selection process for large projects like this often faces challenges in managing applicant data efficiently, given the large and diverse volume of data.

The cluster analysis-based approach offers a solution to help companies understand the characteristics of job applicants. Clustering is a statistical technique used to divide data into several groups based on the similarity of certain attributes, thereby facilitating interpretation and decision-making (Jain et al., 1999). One of the most commonly used clustering algorithms is K-Means Clustering, which divides data into several clusters based on the distance to the centroid, which is the center of the data. This algorithm works through iterations to minimize the distance between the data and the cluster centroid until an optimal result is achieved. In the context of workforce selection, K-Means allows companies to group applicants into specific categories such as Rejected, Under Consideration, and Accepted, based on their specific characteristics.

This process also opens up opportunities to identify applicants with the highest potential for further development, as outlined in the literature on career transformation that emphasizes mapping individual competency strengths (El Achmar & Bhagat, 2023). Through this mapping, companies can not only screen candidates more objectively but also guide future workforce development. The 2D and 3D scatter plots generated from this clustering provide an intuitive visual representation, making it easier to make data-driven decisions. This research is unique because it applies the K-Means algorithm directly to job applicant data obtained from a construction consulting company. The data includes three main variables: ability to draw using AutoCAD, ability to create planning and supervision reports, and social adaptability. 2D and 3D scatter plots are used to visualize the clustering results, providing deeper insights into the distribution patterns of job applicants in the construction sector.

Previous studies have demonstrated the effectiveness of the K-Means algorithm in various applications. Chen Yu (2020) explains that K-Means can help identify relevant patterns and trends in various fields, including education and industry. Wiharto and Suryani (2020) compared the K-Means algorithm with Fuzzy C-Means and found that K-Means is more efficient in terms of computation time. Additionally, Agustina and Prihandoko (2018) applied K-Means to group employees based on their level of discipline and performance, providing a basis for decision-making in human resource management.

Through this research, it is hoped that the clustering results will not only help companies screen job applicants more objectively but also serve as a basis for developing data-driven recruitment policies. This research also provides insights that can be utilized by educational institutions such as vocational schools and universities to align their curricula with the specific needs of the construction sector. Thus, this research contributes to enhancing the efficiency and accuracy of workforce selection while strengthening the connection between education and the construction industry, particularly in addressing the significant challenges of national development.

1.2 Literature review

Clustering is a statistical technique used to group objects into clusters based on attribute similarity. Objects within a cluster are more similar to one another than to objects in other clusters (Manikandan et al., 2018). This process separates the data into regions so that objects with similar characteristics fall in the same cluster region (Jain et al., 1999). Grouping in this way minimizes variation within a group and maximizes differences between groups, which facilitates the interpretation of complex data and aids data-driven decision-making (Darmi & Setiawan, 2016).

One of the most commonly used clustering methods is K-Means Clustering, which works by dividing data into several clusters based on attribute similarities. This method uses centroids as the center of data in each cluster. The position of the centroid will continue to be updated through iterations until stability is achieved, to minimize the distance between each data point and the cluster centroid (Jain et al., 1999). Fadhli (2017) explains that K-Means is very effective in clustering high-dimensional data, making it a powerful tool for understanding and visualizing complex data. In the context of education, Firza and Sarjono (2020) show that the K-Means algorithm can be applied to analyze student data, such as major preferences and learning achievement evaluations.

In previous research, Wiharto and Suryani (2020) compared the K-Means algorithm with Fuzzy C-Means. The results showed that K-Means was more efficient in terms of computation time, although Fuzzy C-Means could produce higher accuracy in certain cases. Additionally, Agustina and Prihandoko (2018) applied K-Means to cluster employees based on discipline levels and performance, providing a basis for decision-making in human resource management. These studies demonstrate the flexibility of the K-Means algorithm across various applications, including in the fields of education and industry.

1.2.1 K-Means algorithm

K-Means is one of the partitioning clustering methods used in data analysis to divide data into several groups based on similar characteristics. This clustering process is performed iteratively, to minimize the average distance between each data point and its cluster center (centroid). K-Means is highly effective in grouping data because it can create clusters with a high degree of similarity within the group while distinguishing them from other clusters (Widiyaningtyas et al., 2017).

The K-Means algorithm process begins by determining the desired number of clusters (k). Next, the cluster center or initial centroid is selected randomly to start the iteration. After the first iteration, the centroid value is updated based on the average position of the members in the cluster until the centroid position is stable. The stages of the K-Means algorithm process are generally as follows (Purba et al., 2018):

  • a. Determine the number of clusters

    The first step in the K-Means algorithm is to determine the number of clusters to be formed based on the analysis requirements. The value of k is the main parameter that must be set at the beginning of the process.

  • b. Determining the initial centroid value

    The centroid, or cluster center point, for the initial iteration is selected randomly as the first step in clustering. Once the iteration begins, the centroid position will be recalculated using the following formula.

    $$\bar{v}_{ij} = \frac{1}{N_i} \sum_{k=1}^{N_i} x_{kj}$$

Explanation:

$\bar{v}_{ij}$ = centroid (mean) of cluster i for variable j

$N_i$ = number of data points belonging to cluster i

i = cluster index

j = variable index

k = index of a data point within cluster i

$x_{kj}$ = value of the k-th data point in cluster i for variable j

  • c. Calculate the distance between the centroid and the object

    This stage calculates the distance between each data point and the cluster centroid using Euclidean Distance, which is formulated as:

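    $$D_e = \sqrt{(x_i - s_i)^2 + (y_i - t_i)^2}$$

Explanation:

$D_e$ = distance between the object and the centroid

x, y = object coordinates

s, t = centroid coordinates

i = index of the object

For the three variables used in this study, a third squared term for the z coordinate is added under the square root.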

  • d. Grouping objects into clusters

    Object data will be allocated to clusters based on the minimum distance to the centroid. Each object will have a membership value in the distance matrix, with a value of 1 if the data is allocated to a specific cluster, and 0 if it is allocated to another cluster. After the objects are grouped, the algorithm will return to step b to recalculate the centroid position. This process is repeated until the centroid position no longer changes and the data does not move between clusters, indicating that the grouping process is complete.

  • e. Repeat steps c and d

    The purpose of this repetition is to achieve a centroid position that no longer changes or to find that the centroid position difference is below the specified threshold.
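
Steps b to d can be illustrated with a single assignment-and-update pass in base R. This is a minimal sketch on a made-up two-variable toy dataset; the points, the starting centroids, and the helper name dist_to_centroids are assumptions for illustration, not the study's data or code. It builds the 0/1 membership matrix described in step d and then recomputes each centroid as the mean of its members.

```r
# Toy data: four points in two variables (illustrative values only)
x <- matrix(c(1, 1,
              2, 1,
              8, 9,
              9, 8),
            ncol = 2, byrow = TRUE)

# Step b: starting centroids, one row per cluster (illustrative values only)
centroids <- matrix(c(0, 0,
                      10, 10),
                    ncol = 2, byrow = TRUE)

# Step c: Euclidean distance from every point (rows) to every centroid (columns)
dist_to_centroids <- function(x, centroids) {
  sapply(seq_len(nrow(centroids)), function(k) {
    sqrt(rowSums(sweep(x, 2, centroids[k, ])^2))
  })
}
d <- dist_to_centroids(x, centroids)

# Step d: assign each point to its nearest centroid and
# build the 0/1 membership matrix (one column per cluster)
cluster <- apply(d, 1, which.min)
membership <- outer(cluster, seq_len(nrow(centroids)), "==") * 1

# Back to step b: each new centroid is the mean of the points assigned to it
new_centroids <- t(sapply(seq_len(nrow(centroids)), function(k) {
  colMeans(x[cluster == k, , drop = FALSE])
}))
```

In a full run, this pass would be repeated with new_centroids in place of centroids until the assignments stop changing.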

1.2.2 Worker recruitment

Recruitment is a strategic process carried out by companies to obtain quality workers in line with organizational needs. This process involves searching for candidates from internal and external sources and making effective decisions based on job analysis. Job analysis is an important first step in describing the tasks, responsibilities, and qualifications required for each position, thereby helping to match applicants with company needs (Widodo, 2018).

According to Smith and Todaro (2015), labor absorption is greatly influenced by the availability of jobs and the quality of human resources. Technical competencies such as the ability to draw using AutoCAD, prepare planning reports, and adaptability are highly valued attributes in the construction sector (Gangl, 2003). Additionally, Green et al. (2011) state that the job search process aims to match job seekers with suitable job opportunities, which can be facilitated through technology or data-driven methods.

In facing the challenges of managing large and diverse applicant data, algorithms such as K-Means Clustering can be a solution to simplify the analysis of job applicant data. This method divides data into clusters based on attribute similarities, enabling companies to categorize applicants into groups such as Rejected, Under Consideration, and Accepted (Jain et al., 1999). With this data-driven approach, recruitment can be conducted more efficiently and objectively, supporting companies in selecting the best candidates for their needs.

2. Methods

The research methodology consists of several stages designed to collect and analyze data on job applicants for the Construction Consulting Company.

The overall workflow for data collection, preprocessing, clustering, and evaluation is summarized in Figure 1. This research methodology is designed to analyse job applicant data at a construction consulting firm using the K-Means Clustering algorithm. The research process begins with the collection of information and a literature review on K-Means Clustering, the use of the R program, and an analysis of relevant skills in the construction sector. The literature used includes reliable sources such as Google Scholar, research journals, and reference books.


Figure 1. Workflow research diagram.

This figure illustrates the overall research workflow, including data collection, literature review, data selection, centroid initialization, K-Means clustering (Excel), and subsequent evaluation and visualization using R software.

Data collection was conducted directly from construction consulting companies, covering applicants’ ability to draw using AutoCAD, their ability to compile planning reports, and their social adaptability. Irrelevant or incomplete data was filtered through a data cleaning process to ensure the accuracy of the analysis. According to Kassambara (2017), the data cleaning step is very important in cluster analysis to produce valid and reliable results.

In this study, the K-Means clustering method was applied to analyze job applicant data. The algorithm is relevant because it can group applicants based on characteristics such as drawing skills, planning report writing, and adaptability. As explained by Lynn et al. (2023), this type of clustering approach is often used in the context of career development to map individuals into groups based on similar attributes.

The K-Means algorithm was applied using the R program. The number of clusters was determined using the elbow method, which is a common technique for determining the optimal number of clusters in K-Means analysis. The clustering process involved calculating the Euclidean distance between each data point and the cluster centroids, with iterations continuing until the cluster centers stabilized. The clustering results were then evaluated using the Davies-Bouldin index to assess the validity and quality of the clusters (Gie & Jollyta, 2020). This evaluation aimed to ensure that the resulting grouping was relevant and appropriate for the analysis needs.
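
The two checks named above can be sketched in base R as follows. This is a hedged illustration on toy data, not the study's script: davies_bouldin and elbow_wss are hypothetical helper names, and the elbow helper calls R's built-in kmeans(), whose default Hartigan-Wong algorithm differs from the manual iteration described in this article. The Davies-Bouldin index averages, over clusters, the worst ratio of within-cluster scatter to between-centroid separation, so lower values indicate better-separated clusters.

```r
# Hypothetical helper: Davies-Bouldin index for a numeric matrix `x` (two or
# more columns) and a vector `cluster` holding at least two distinct labels.
davies_bouldin <- function(x, cluster) {
  ids <- sort(unique(cluster))
  # Centroid of each cluster (one row per cluster)
  centers <- t(sapply(ids, function(k) colMeans(x[cluster == k, , drop = FALSE])))
  # Scatter: mean distance from the members of a cluster to its centroid
  scatter <- sapply(seq_along(ids), function(i) {
    members <- x[cluster == ids[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(members, 2, centers[i, ])^2)))
  })
  # Separation: Euclidean distance between every pair of centroids
  sep <- as.matrix(dist(centers))
  # For each cluster, keep the worst (largest) ratio against any other cluster
  worst <- sapply(seq_along(ids), function(i) {
    max((scatter[i] + scatter[-i]) / sep[i, -i])
  })
  mean(worst)
}

# Hypothetical helper: total within-cluster sum of squares for k = 1..k_max,
# the quantity usually plotted against k when looking for the "elbow"
elbow_wss <- function(x, k_max = 5, seed = 1) {
  set.seed(seed)
  sapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = 10)$tot.withinss
  })
}

# Toy data: two well-separated pairs of points (illustrative values only)
toy <- matrix(c(0, 0,
                2, 0,
                10, 0,
                12, 0),
              ncol = 2, byrow = TRUE)

db <- davies_bouldin(toy, c(1, 1, 2, 2))
wss <- elbow_wss(toy, k_max = 3)
```

Plotting wss against k (for example with plot(seq_along(wss), wss, type = "b")) shows where adding clusters stops reducing the within-cluster sum of squares substantially.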

The clustering results are visualized using 2D and 3D scatter plots to provide a clear picture of the distribution of applicants in each cluster. Further analysis is conducted to understand the characteristics of each cluster, including the dominant abilities and work preferences of applicants. With this approach, construction consulting companies can be more effective in developing data-driven recruitment strategies.

The final stage of this research is the preparation of a report containing findings, conclusions, and recommendations. This report is expected to make a significant contribution to companies in improving the efficiency of workforce selection, as well as supporting the construction sector in facing national development challenges.

3. Results and discussion

Data were obtained from CV Ardantama Putra Perkasa as part of their internal recruitment records. The company had posted a job vacancy on JobStreet Indonesia as the external recruitment channel; however, all applicant data used in this study originated from the company’s internal testing and selection process, not from the JobStreet platform.

In total, 161 applicants applied to the vacancy, and 30 candidates who met the minimum requirements were invited for in-person testing. The assessment consisted of three evaluation components: AutoCAD Drawing Skills (X), Ability to Prepare Planning and Supervision Reports (Y), and Adaptability (Z). Descriptive information for the 30 tested applicants is provided in Table 1.

Table 1. Applicant demographic data.

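| Respondent code | Gender | AutoCAD drawing skills (X) | Ability to prepare planning and supervision reports (Y) | Adaptability (Z) |
|---|---|---|---|---|
| Resp1 | Female | 92 | 75 | 68 |
| Resp2 | Male | 68 | 65 | 66 |
| Resp3 | Male | 73 | 86 | 87 |
| Resp4 | Male | 69 | 74 | 73 |
| Resp5 | Male | 78 | 72 | 91 |
| Resp6 | Female | 84 | 90 | 92 |
| Resp7 | Male | 69 | 76 | 87 |
| Resp8 | Female | 95 | 73 | 76 |
| Resp9 | Female | 90 | 80 | 85 |
| Resp10 | Male | 68 | 82 | 68 |
| Resp11 | Male | 63 | 75 | 71 |
| Resp12 | Male | 75 | 93 | 77 |
| Resp13 | Female | 62 | 72 | 68 |
| Resp14 | Male | 90 | 61 | 72 |
| Resp15 | Female | 84 | 63 | 90 |
| Resp16 | Female | 94 | 70 | 89 |
| Resp17 | Female | 73 | 87 | 80 |
| Resp18 | Female | 71 | 73 | 95 |
| Resp19 | Female | 93 | 62 | 70 |
| Resp20 | Male | 90 | 68 | 89 |
| Resp21 | Female | 87 | 94 | 87 |
| Resp22 | Male | 60 | 90 | 64 |
| Resp23 | Female | 65 | 64 | 93 |
| Resp24 | Male | 69 | 84 | 75 |
| Resp25 | Male | 66 | 63 | 72 |
| Resp26 | Male | 95 | 85 | 93 |
| Resp27 | Male | 75 | 80 | 83 |
| Resp28 | Male | 92 | 85 | 93 |
| Resp29 | Male | 71 | 71 | 85 |
| Resp30 | Male | 92 | 61 | 88 |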

With this data, the applicants were grouped into three clusters labeled Rejected, Under Consideration, and Accepted. The next step was to set the initial centroid values, which were chosen randomly as follows: Rejected = (60, 75, 85), Under Consideration = (62, 77, 88), Accepted = (70, 84, 92). The distance from each applicant to each of these centroids was then calculated. The initial centroids and the corresponding Euclidean distances for each applicant are reported in Table 2.

Table 2. Initial centroid distances.

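Centroids: C1 (Rejected) = (60, 75, 85); C2 (Under consideration) = (62, 77, 88); C3 (Accepted) = (70, 84, 92).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp1 | 92 | 75 | 68 | 36.24 | 36.11 | 33.78 |
| Resp2 | 68 | 65 | 66 | 22.91 | 25.77 | 32.26 |
| Resp3 | 73 | 86 | 87 | 17.15 | 14.25 | 6.16 |
| Resp4 | 69 | 74 | 73 | 15.03 | 16.82 | 21.49 |
| Resp5 | 78 | 72 | 91 | 19.21 | 17.03 | 14.46 |
| Resp6 | 84 | 90 | 92 | 29.15 | 25.87 | 15.23 |
| Resp7 | 69 | 76 | 87 | 9.27 | 7.14 | 9.49 |
| Resp8 | 95 | 73 | 76 | 36.19 | 35.34 | 31.65 |
| Resp9 | 90 | 80 | 85 | 30.41 | 28.32 | 21.56 |
| Resp10 | 68 | 82 | 68 | 20.05 | 21.47 | 24.17 |
| Resp11 | 63 | 75 | 71 | 14.32 | 17.15 | 23.90 |
| Resp12 | 75 | 93 | 77 | 24.76 | 23.37 | 18.19 |
| Resp13 | 62 | 72 | 68 | 17.38 | 20.62 | 28.00 |
| Resp14 | 90 | 61 | 72 | 35.57 | 36.00 | 36.46 |
| Resp15 | 84 | 63 | 90 | 27.29 | 26.15 | 25.32 |
| Resp16 | 94 | 70 | 89 | 34.60 | 32.77 | 27.95 |
| Resp17 | 73 | 87 | 80 | 18.38 | 16.88 | 12.73 |
| Resp18 | 71 | 73 | 95 | 15.00 | 12.08 | 11.45 |
| Resp19 | 93 | 62 | 70 | 38.51 | 38.86 | 38.69 |
| Resp20 | 90 | 68 | 89 | 31.06 | 29.43 | 25.79 |
| Resp21 | 87 | 94 | 87 | 33.08 | 30.25 | 20.35 |
| Resp22 | 60 | 90 | 64 | 25.81 | 27.37 | 30.33 |
| Resp23 | 65 | 64 | 93 | 14.49 | 14.25 | 20.64 |
| Resp24 | 69 | 84 | 75 | 16.19 | 16.34 | 17.03 |
| Resp25 | 66 | 63 | 72 | 18.68 | 21.63 | 29.27 |
| Resp26 | 95 | 85 | 93 | 37.27 | 34.32 | 25.04 |
| Resp27 | 75 | 80 | 83 | 15.94 | 14.25 | 11.05 |
| Resp28 | 92 | 85 | 93 | 34.47 | 31.45 | 22.05 |
| Resp29 | 71 | 71 | 85 | 11.70 | 11.22 | 14.80 |
| Resp30 | 92 | 61 | 88 | 35.06 | 34.00 | 32.08 |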

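The distance matrix is calculated with the Euclidean distance formula. For Resp1, with scores (92, 75, 68), the distances to the three initial centroids are:

$$C_1 = \sqrt{(60-92)^2 + (75-75)^2 + (85-68)^2} = 36.24$$
$$C_2 = \sqrt{(62-92)^2 + (77-75)^2 + (88-68)^2} = 36.11$$
$$C_3 = \sqrt{(70-92)^2 + (84-75)^2 + (92-68)^2} = 33.78$$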

This calculation is repeated for all 30 data points, and each applicant is then assigned to the cluster whose centroid is nearest. In the distance matrix, the minimum distances are colored red for Cluster 1, yellow for Cluster 2, and green for Cluster 3. Next, the members of each cluster are grouped together, and the second-stage centroids and distances are calculated. Updated cluster memberships and second-stage centroid distances are summarized in Table 3.
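
The first row of the distance matrix can be checked with a few lines of base R. This is a small sketch using the Resp1 scores from Table 1 and the initial centroids quoted in the text; the variable names are chosen here for illustration.

```r
# Scores of Resp1 (X, Y, Z) from Table 1
resp1 <- c(92, 75, 68)

# Initial centroids as stated in the text, one row per cluster
centroids <- rbind(
  Rejected            = c(60, 75, 85),
  Under_Consideration = c(62, 77, 88),
  Accepted            = c(70, 84, 92)
)

# Euclidean distance from Resp1 to each centroid
distances <- sqrt(rowSums(sweep(centroids, 2, resp1)^2))
round(distances, 2)
# Rejected 36.24, Under_Consideration 36.11, Accepted 33.78

# The nearest centroid decides the first-stage assignment
nearest <- names(which.min(distances))
# "Accepted"
```

The three values match the Resp1 row of Table 2, and the smallest of them places Resp1 in the Accepted group for the second stage.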

Table 3. Second-stage centroid values and distances.

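Centroids: C1 (Rejected) = (70.80, 72.80, 69.90); C2 (Under consideration) = (68.33, 70.33, 88.33); C3 (Accepted) = (84.71, 78.53, 86.06).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 9.16 | 22.96 | 29.40 |
| Resp4 | 69 | 74 | 73 | 3.78 | 15.78 | 20.92 |
| Resp10 | 68 | 82 | 68 | 9.80 | 23.44 | 24.84 |
| Resp11 | 63 | 75 | 71 | 8.18 | 18.73 | 26.65 |
| Resp13 | 62 | 72 | 68 | 9.04 | 21.36 | 29.74 |
| Resp14 | 90 | 61 | 72 | 22.63 | 28.69 | 23.09 |
| Resp19 | 93 | 62 | 70 | 24.69 | 31.84 | 24.49 |
| Resp22 | 60 | 90 | 64 | 21.15 | 32.38 | 35.05 |
| Resp24 | 69 | 84 | 75 | 12.44 | 19.10 | 19.97 |
| Resp25 | 66 | 63 | 72 | 11.11 | 18.06 | 28.08 |
| Resp7 | 69 | 76 | 87 | 17.49 | 5.86 | 15.94 |
| Resp23 | 65 | 64 | 93 | 25.39 | 8.54 | 25.45 |
| Resp29 | 71 | 71 | 85 | 15.21 | 4.32 | 15.67 |
| Resp1 | 92 | 75 | 68 | 21.40 | 31.55 | 19.79 |
| Resp3 | 73 | 86 | 87 | 21.71 | 16.40 | 13.92 |
| Resp5 | 78 | 72 | 91 | 22.31 | 10.17 | 10.58 |
| Resp6 | 84 | 90 | 92 | 30.96 | 25.41 | 12.94 |
| Resp8 | 95 | 73 | 76 | 24.96 | 29.50 | 15.42 |
| Resp9 | 90 | 80 | 85 | 25.47 | 23.96 | 5.60 |
| Resp12 | 75 | 93 | 77 | 21.82 | 26.20 | 19.64 |
| Resp15 | 84 | 63 | 90 | 25.97 | 17.38 | 16.04 |
| Resp16 | 94 | 70 | 89 | 30.18 | 25.68 | 12.95 |
| Resp17 | 73 | 87 | 80 | 17.56 | 19.21 | 15.67 |
| Resp18 | 71 | 73 | 95 | 25.10 | 7.66 | 17.27 |
| Resp20 | 90 | 68 | 89 | 27.50 | 21.80 | 12.15 |
| Resp21 | 87 | 94 | 87 | 31.69 | 30.17 | 15.67 |
| Resp26 | 95 | 85 | 93 | 35.61 | 30.79 | 14.00 |
| Resp27 | 75 | 80 | 83 | 15.53 | 12.90 | 10.28 |
| Resp28 | 92 | 85 | 93 | 33.64 | 28.23 | 11.97 |
| Resp30 | 92 | 61 | 88 | 30.27 | 25.44 | 19.09 |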

Based on the table above, the results of the calculation of the centroid point and distance point for the second stage have been obtained. It turns out that there is still a shift in cluster data, so it is necessary to calculate the centroid point and distance point for the third stage. The third-stage centroid updates and distances are presented in Table 4.

Table 4. Third-stage centroid values.

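Centroids: C1 (Rejected) = (68.33, 74.00, 69.89); C2 (Under consideration) = (70.80, 71.20, 90.20); C3 (Accepted) = (86.50, 78.25, 84.19).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 9.81 | 25.14 | 29.13 |
| Resp4 | 69 | 74 | 73 | 3.18 | 17.52 | 21.20 |
| Resp10 | 68 | 82 | 68 | 8.23 | 24.85 | 24.87 |
| Resp11 | 63 | 75 | 71 | 5.54 | 21.07 | 27.14 |
| Resp13 | 62 | 72 | 68 | 6.90 | 23.89 | 30.02 |
| Resp14 | 90 | 61 | 72 | 25.36 | 28.35 | 21.41 |
| Resp22 | 60 | 90 | 64 | 18.98 | 34.01 | 35.32 |
| Resp24 | 69 | 84 | 75 | 11.25 | 19.95 | 20.58 |
| Resp25 | 66 | 63 | 72 | 11.44 | 20.53 | 28.31 |
| Resp7 | 69 | 76 | 87 | 17.24 | 6.04 | 17.87 |
| Resp23 | 65 | 64 | 93 | 25.40 | 9.66 | 27.26 |
| Resp29 | 71 | 71 | 85 | 15.64 | 5.21 | 17.13 |
| Resp5 | 78 | 72 | 91 | 23.31 | 7.29 | 12.56 |
| Resp18 | 71 | 73 | 95 | 25.27 | 5.13 | 19.61 |
| Resp19 | 93 | 62 | 70 | 27.43 | 31.39 | 22.53 |
| Resp1 | 92 | 75 | 68 | 23.76 | 30.93 | 17.40 |
| Resp3 | 73 | 86 | 87 | 21.41 | 15.30 | 15.82 |
| Resp6 | 84 | 90 | 92 | 31.47 | 23.04 | 14.33 |
| Resp8 | 95 | 73 | 76 | 27.38 | 28.12 | 12.92 |
| Resp9 | 90 | 80 | 85 | 27.09 | 21.75 | 4.00 |
| Resp12 | 75 | 93 | 77 | 21.35 | 25.83 | 20.04 |
| Resp15 | 84 | 63 | 90 | 27.77 | 15.54 | 16.51 |
| Resp16 | 94 | 70 | 89 | 32.25 | 23.26 | 12.14 |
| Resp17 | 73 | 87 | 80 | 17.12 | 18.93 | 16.62 |
| Resp20 | 90 | 68 | 89 | 29.51 | 19.50 | 11.85 |
| Resp21 | 87 | 94 | 87 | 32.27 | 28.15 | 16.01 |
| Resp26 | 95 | 85 | 93 | 36.96 | 28.00 | 13.98 |
| Resp27 | 75 | 80 | 83 | 15.89 | 12.12 | 11.69 |
| Resp28 | 92 | 85 | 93 | 34.86 | 25.45 | 12.39 |
| Resp30 | 92 | 61 | 88 | 32.51 | 23.63 | 18.50 |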

Based on the table above, the results of the calculation of the centroid point and distance point for the third stage have been obtained, but it turns out that there is still a shift in cluster data, so it is necessary to calculate the centroid point and distance point for the fourth stage. The fourth-stage centroid updates and distances are presented in Table 5.

Table 5. Fourth-stage centroid values.

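Centroids: C1 (Rejected) = (65.63, 75.63, 69.63); C2 (Under consideration) = (73.00, 72.14, 89.71); C3 (Accepted) = (87.80, 77.60, 82.80).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 11.47 | 25.27 | 28.86 |
| Resp4 | 69 | 74 | 73 | 5.04 | 17.29 | 21.50 |
| Resp10 | 68 | 82 | 68 | 6.99 | 24.37 | 25.11 |
| Resp11 | 63 | 75 | 71 | 3.03 | 21.41 | 27.59 |
| Resp13 | 62 | 72 | 68 | 5.38 | 24.34 | 30.27 |
| Resp22 | 60 | 90 | 64 | 16.43 | 33.90 | 35.78 |
| Resp24 | 69 | 84 | 75 | 10.51 | 19.32 | 21.34 |
| Resp25 | 66 | 63 | 72 | 12.85 | 21.13 | 28.37 |
| Resp7 | 69 | 76 | 87 | 17.70 | 6.18 | 19.33 |
| Resp23 | 65 | 64 | 93 | 26.11 | 11.88 | 28.44 |
| Resp29 | 71 | 71 | 85 | 16.93 | 5.25 | 18.18 |
| Resp5 | 78 | 72 | 91 | 24.96 | 5.16 | 13.95 |
| Resp18 | 71 | 73 | 95 | 26.07 | 5.72 | 21.27 |
| Resp3 | 73 | 86 | 87 | 21.54 | 14.12 | 17.53 |
| Resp15 | 84 | 63 | 90 | 30.20 | 14.31 | 16.72 |
| Resp14 | 90 | 61 | 72 | 28.52 | 26.96 | 19.93 |
| Resp19 | 93 | 62 | 70 | 30.58 | 29.86 | 20.84 |
| Resp1 | 92 | 75 | 68 | 26.43 | 28.99 | 15.60 |
| Resp6 | 84 | 90 | 92 | 32.33 | 21.10 | 15.90 |
| Resp8 | 95 | 73 | 76 | 30.17 | 25.94 | 10.92 |
| Resp9 | 90 | 80 | 85 | 29.15 | 19.31 | 3.93 |
| Resp12 | 75 | 93 | 77 | 21.08 | 24.51 | 20.85 |
| Resp16 | 94 | 70 | 89 | 34.82 | 21.12 | 11.60 |
| Resp17 | 73 | 87 | 80 | 17.07 | 17.75 | 17.75 |
| Resp20 | 90 | 68 | 89 | 32.06 | 17.51 | 11.64 |
| Resp21 | 87 | 94 | 87 | 33.11 | 26.10 | 16.95 |
| Resp26 | 95 | 85 | 93 | 38.69 | 25.69 | 14.51 |
| Resp27 | 75 | 80 | 83 | 16.91 | 10.53 | 13.02 |
| Resp28 | 92 | 85 | 93 | 36.47 | 23.18 | 13.28 |
| Resp30 | 92 | 61 | 88 | 35.32 | 22.09 | 17.90 |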

Based on the table above, the results of the calculation of the centroid point and distance point for the fourth stage were obtained, but it was found that there was still a shift in the cluster data, so it was necessary to calculate the centroid point and distance point for the fifth stage. The fifth-stage centroid updates and distances are presented in Table 6.

Table 6. Fifth-stage centroid values.

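Centroids: C1 (Rejected) = (66.44, 76.89, 70.78); C2 (Under consideration) = (73.25, 73.13, 88.88); C3 (Accepted) = (89.92, 76.69, 83.00).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 12.91 | 24.84 | 30.11 |
| Resp4 | 69 | 74 | 73 | 4.45 | 16.46 | 23.35 |
| Resp10 | 68 | 82 | 68 | 6.02 | 23.28 | 27.09 |
| Resp11 | 63 | 75 | 71 | 3.93 | 20.69 | 29.52 |
| Resp13 | 62 | 72 | 68 | 7.17 | 23.74 | 32.04 |
| Resp22 | 60 | 90 | 64 | 16.10 | 32.85 | 37.86 |
| Resp24 | 69 | 84 | 75 | 8.66 | 18.13 | 23.56 |
| Resp25 | 66 | 63 | 72 | 13.95 | 20.97 | 29.68 |
| Resp17 | 73 | 87 | 80 | 15.17 | 16.47 | 20.04 |
| Resp7 | 69 | 76 | 87 | 16.45 | 5.46 | 21.31 |
| Resp23 | 65 | 64 | 93 | 25.73 | 12.97 | 29.70 |
| Resp29 | 71 | 71 | 85 | 16.05 | 4.96 | 19.86 |
| Resp5 | 78 | 72 | 91 | 23.80 | 5.32 | 15.11 |
| Resp18 | 71 | 73 | 95 | 24.95 | 6.53 | 22.71 |
| Resp3 | 73 | 86 | 87 | 19.73 | 13.01 | 19.72 |
| Resp15 | 84 | 63 | 90 | 29.51 | 14.81 | 16.48 |
| Resp27 | 75 | 80 | 83 | 15.24 | 9.21 | 15.29 |
| Resp14 | 90 | 61 | 72 | 28.44 | 26.69 | 19.16 |
| Resp19 | 93 | 62 | 70 | 30.45 | 29.50 | 19.86 |
| Resp1 | 92 | 75 | 68 | 25.78 | 28.12 | 15.24 |
| Resp6 | 84 | 90 | 92 | 30.50 | 20.25 | 17.12 |
| Resp8 | 95 | 73 | 76 | 29.29 | 25.28 | 9.40 |
| Resp9 | 90 | 80 | 85 | 27.69 | 18.52 | 3.87 |
| Resp12 | 75 | 93 | 77 | 19.27 | 23.22 | 22.91 |
| Resp16 | 94 | 70 | 89 | 33.75 | 20.98 | 9.87 |
| Resp20 | 90 | 68 | 89 | 31.08 | 17.52 | 10.56 |
| Resp21 | 87 | 94 | 87 | 31.28 | 25.07 | 18.00 |
| Resp26 | 95 | 85 | 93 | 37.08 | 25.12 | 13.96 |
| Resp28 | 92 | 85 | 93 | 34.82 | 22.57 | 13.17 |
| Resp30 | 92 | 61 | 88 | 34.67 | 22.35 | 16.60 |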

Based on the table above, the results of the centroid point and distance point calculations for the fifth stage were obtained, but it was found that there was still a shift in the cluster data, so it was necessary to calculate the centroid point and distance point for the sixth stage. The sixth-stage centroid updates and distances are presented in Table 7.

Table 7. Sixth-stage centroid values.

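Centroids: C1 (Rejected) = (67.30, 78.50, 71.40); C2 (Under consideration) = (73.25, 73.13, 88.88); C3 (Accepted) = (91.17, 75.33, 83.50).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) |
|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 14.56 | 24.84 | 30.82 |
| Resp4 | 69 | 74 | 73 | 5.07 | 16.46 | 24.56 |
| Resp10 | 68 | 82 | 68 | 4.93 | 23.28 | 28.66 |
| Resp11 | 63 | 75 | 71 | 5.56 | 20.69 | 30.82 |
| Resp13 | 62 | 72 | 68 | 9.05 | 23.74 | 33.20 |
| Resp22 | 60 | 90 | 64 | 15.50 | 32.85 | 39.58 |
| Resp24 | 69 | 84 | 75 | 6.79 | 18.13 | 25.27 |
| Resp25 | 66 | 63 | 72 | 15.57 | 20.97 | 30.29 |
| Resp17 | 73 | 87 | 80 | 13.37 | 16.47 | 21.87 |
| Resp12 | 75 | 93 | 77 | 17.35 | 23.22 | 24.81 |
| Resp7 | 69 | 76 | 87 | 15.89 | 5.46 | 22.45 |
| Resp23 | 65 | 64 | 93 | 26.12 | 12.97 | 30.06 |
| Resp29 | 71 | 71 | 85 | 15.97 | 4.96 | 20.68 |
| Resp5 | 78 | 72 | 91 | 23.26 | 5.32 | 15.52 |
| Resp18 | 71 | 73 | 95 | 24.51 | 6.53 | 23.33 |
| Resp3 | 73 | 86 | 87 | 18.22 | 13.01 | 21.36 |
| Resp15 | 84 | 63 | 90 | 29.41 | 14.81 | 15.68 |
| Resp27 | 75 | 80 | 83 | 14.00 | 9.21 | 16.83 |
| Resp14 | 90 | 61 | 72 | 28.67 | 26.69 | 18.41 |
| Resp19 | 93 | 62 | 70 | 30.57 | 29.50 | 19.06 |
| Resp1 | 92 | 75 | 68 | 25.18 | 28.12 | 15.53 |
| Resp6 | 84 | 90 | 92 | 28.91 | 20.25 | 18.40 |
| Resp8 | 95 | 73 | 76 | 28.61 | 25.28 | 8.74 |
| Resp9 | 90 | 80 | 85 | 26.50 | 18.52 | 5.04 |
| Resp16 | 94 | 70 | 89 | 33.09 | 20.98 | 8.17 |
| Resp20 | 90 | 68 | 89 | 30.58 | 17.52 | 9.24 |
| Resp21 | 87 | 94 | 87 | 29.52 | 25.07 | 19.44 |
| Resp26 | 95 | 85 | 93 | 35.72 | 25.12 | 14.09 |
| Resp28 | 92 | 85 | 93 | 33.45 | 22.57 | 13.58 |
| Resp30 | 92 | 61 | 88 | 34.52 | 22.35 | 15.05 |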

Based on the table above, red represents Cluster 1, which represents the group of rejected individuals, yellow represents Cluster 2, which represents the group of individuals under consideration, and green represents Cluster 3, which represents the group of accepted individuals. Final cluster assignments for all applicants are listed in Table 8.

Table 8. Final clustering results.

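Centroids: C1 (Rejected) = (67.30, 78.50, 71.40); C2 (Under consideration) = (73.25, 73.13, 88.88); C3 (Accepted) = (91.17, 75.33, 83.50).

| Name | MA (x) | LPP (y) | KA (z) | Distance to C1 (Rejected) | Distance to C2 (Under consideration) | Distance to C3 (Accepted) | Clustering |
|---|---|---|---|---|---|---|---|
| Resp2 | 68 | 65 | 66 | 14.56 | 24.84 | 30.82 | Cluster 1 (Rejected) |
| Resp4 | 69 | 74 | 73 | 5.07 | 16.46 | 24.56 | Cluster 1 (Rejected) |
| Resp10 | 68 | 82 | 68 | 4.93 | 23.28 | 28.66 | Cluster 1 (Rejected) |
| Resp11 | 63 | 75 | 71 | 5.56 | 20.69 | 30.82 | Cluster 1 (Rejected) |
| Resp13 | 62 | 72 | 68 | 9.05 | 23.74 | 33.20 | Cluster 1 (Rejected) |
| Resp22 | 60 | 90 | 64 | 15.50 | 32.85 | 39.58 | Cluster 1 (Rejected) |
| Resp24 | 69 | 84 | 75 | 6.79 | 18.13 | 25.27 | Cluster 1 (Rejected) |
| Resp25 | 66 | 63 | 72 | 15.57 | 20.97 | 30.29 | Cluster 1 (Rejected) |
| Resp17 | 73 | 87 | 80 | 13.37 | 16.47 | 21.87 | Cluster 1 (Rejected) |
| Resp12 | 75 | 93 | 77 | 17.35 | 23.22 | 24.81 | Cluster 1 (Rejected) |
| Resp7 | 69 | 76 | 87 | 15.89 | 5.46 | 22.45 | Cluster 2 (Under Consideration) |
| Resp23 | 65 | 64 | 93 | 26.12 | 12.97 | 30.06 | Cluster 2 (Under Consideration) |
| Resp29 | 71 | 71 | 85 | 15.97 | 4.96 | 20.68 | Cluster 2 (Under Consideration) |
| Resp5 | 78 | 72 | 91 | 23.26 | 5.32 | 15.52 | Cluster 2 (Under Consideration) |
| Resp18 | 71 | 73 | 95 | 24.51 | 6.53 | 23.33 | Cluster 2 (Under Consideration) |
| Resp3 | 73 | 86 | 87 | 18.22 | 13.01 | 21.36 | Cluster 2 (Under Consideration) |
| Resp15 | 84 | 63 | 90 | 29.41 | 14.81 | 15.68 | Cluster 2 (Under Consideration) |
| Resp27 | 75 | 80 | 83 | 14.00 | 9.21 | 16.83 | Cluster 2 (Under Consideration) |
| Resp14 | 90 | 61 | 72 | 28.67 | 26.69 | 18.41 | Cluster 3 (Accepted) |
| Resp19 | 93 | 62 | 70 | 30.57 | 29.50 | 19.06 | Cluster 3 (Accepted) |
| Resp1 | 92 | 75 | 68 | 25.18 | 28.12 | 15.53 | Cluster 3 (Accepted) |
| Resp6 | 84 | 90 | 92 | 28.91 | 20.25 | 18.40 | Cluster 3 (Accepted) |
| Resp8 | 95 | 73 | 76 | 28.61 | 25.28 | 8.74 | Cluster 3 (Accepted) |
| Resp9 | 90 | 80 | 85 | 26.50 | 18.52 | 5.04 | Cluster 3 (Accepted) |
| Resp16 | 94 | 70 | 89 | 33.09 | 20.98 | 8.17 | Cluster 3 (Accepted) |
| Resp20 | 90 | 68 | 89 | 30.58 | 17.52 | 9.24 | Cluster 3 (Accepted) |
| Resp21 | 87 | 94 | 87 | 29.52 | 25.07 | 19.44 | Cluster 3 (Accepted) |
| Resp26 | 95 | 85 | 93 | 35.72 | 25.12 | 14.09 | Cluster 3 (Accepted) |
| Resp28 | 92 | 85 | 93 | 33.45 | 22.57 | 13.58 | Cluster 3 (Accepted) |
| Resp30 | 92 | 61 | 88 | 34.52 | 22.35 | 15.05 | Cluster 3 (Accepted) |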

The results of this study indicate that applying K-Means Clustering to the test results of job applicants at a construction consulting company separates them into groups with different characteristics. Three (3) main clusters were formed based on the applicants’ test scores. Cluster 1 is dominated by applicants who did not meet the criteria and were declared Rejected, Cluster 2 includes applicants whose test scores met the criteria and who were declared Under Consideration, and Cluster 3 consists of applicants who exceeded the requirements and were declared Accepted.

After obtaining the cluster data using K-Means in Excel, the next step was to code the analysis in the R programming language to examine the multivariate structure of the data. The analysis begins with processing the data of participants in the selection test for a construction consulting company. The dataset consists of three main variables, namely AutoCAD Drawing, Planning and Supervision Report, and Adaptability, each of which is assessed on a scale of 0 to 100. These three variables describe the technical competencies and soft skills relevant to job selection. The next stage is the initialization of the centroids. The initial centroids were taken from the prior calculations in Microsoft Excel, and three of them were set to begin the clustering process. They represent the initial average values for the three planned cluster groups, namely Rejected, Under Consideration, and Accepted, and serve as the reference for determining the cluster membership of each data point.

To group data into clusters, the k-means_manual function is used. This function works iteratively to calculate the Euclidean distance between each data point and the predetermined initial centroid. Based on the distance calculation results, each data point will be assigned to the cluster with the minimum distance from the centroid. Once cluster membership is determined, a new centroid is calculated as the average value of the data points included in that cluster. This process is repeated until the centroid stabilizes, i.e., when the difference between the new centroid and the previous centroid becomes very small (converges), or until the iteration reaches the specified maximum limit. This process does not yet produce a final output in the form of cluster labels or visualizations, but it has completed an important step in the K-Means algorithm, namely determining the optimal centroid. Thus, the distribution of data in each cluster can better reflect the patterns or structures that exist in the dataset.
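
The iteration described above can be sketched in base R as follows. This is an assumed reconstruction for illustration, not the authors' function: the name kmeans_manual, its arguments max_iter and tol, and the rule that an empty cluster keeps its previous centroid are all choices made here. The scores are the 30 rows of Table 1 and the starting centroids are the ones quoted earlier in the text.

```r
# Applicant scores from Table 1 (rows = Resp1..Resp30; columns = X, Y, Z)
scores <- matrix(c(
  92, 75, 68,  68, 65, 66,  73, 86, 87,  69, 74, 73,  78, 72, 91,  84, 90, 92,
  69, 76, 87,  95, 73, 76,  90, 80, 85,  68, 82, 68,  63, 75, 71,  75, 93, 77,
  62, 72, 68,  90, 61, 72,  84, 63, 90,  94, 70, 89,  73, 87, 80,  71, 73, 95,
  93, 62, 70,  90, 68, 89,  87, 94, 87,  60, 90, 64,  65, 64, 93,  69, 84, 75,
  66, 63, 72,  95, 85, 93,  75, 80, 83,  92, 85, 93,  71, 71, 85,  92, 61, 88
), ncol = 3, byrow = TRUE,
dimnames = list(paste0("Resp", 1:30), c("X", "Y", "Z")))

# Initial centroids quoted in the text (Rejected, Under Consideration, Accepted)
init <- rbind(c(60, 75, 85), c(62, 77, 88), c(70, 84, 92))

# Hypothetical helper: plain assign-and-update K-Means from given start centroids
kmeans_manual <- function(x, centroids, max_iter = 100, tol = 1e-6) {
  cluster <- integer(nrow(x))
  converged <- FALSE
  iter <- 0
  while (iter < max_iter && !converged) {
    iter <- iter + 1
    # Euclidean distance from every row of x to every centroid
    d <- sapply(seq_len(nrow(centroids)), function(k) {
      sqrt(rowSums(sweep(x, 2, centroids[k, ])^2))
    })
    # Nearest centroid per row (first one wins on ties)
    cluster <- max.col(-d, ties.method = "first")
    # Recompute centroids; an empty cluster keeps its previous centroid
    updated <- centroids
    for (k in seq_len(nrow(centroids))) {
      if (any(cluster == k)) {
        updated[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
      }
    }
    # Stop once no centroid coordinate moves by more than tol
    converged <- max(abs(updated - centroids)) < tol
    centroids <- updated
  }
  list(cluster = cluster, centroids = centroids,
       iterations = iter, converged = converged)
}

fit <- kmeans_manual(scores, init)
table(fit$cluster)        # cluster sizes
round(fit$centroids, 2)   # final centroids, comparable with Table 8
```

Passing the starting centroids explicitly, rather than drawing them at random, keeps the run reproducible and makes it directly comparable with the spreadsheet calculation.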

At this stage, the clustering of the selection-test participants continues by adding the clustering results to the dataset. The cluster numbers are stored in a new column called Cluster. To facilitate interpretation, category labels are added based on these cluster numbers. The labels are defined using the factor() function with three categories, namely “Rejected,” “Under Consideration,” and “Accepted,” following the predetermined cluster order. Next, a serial number identifying each respondent is added in a new column named No, generated with the expression 1:nrow(participant_data), so that each row has a unique identity that makes the clustering results easy to trace. The data are then rearranged to display only the relevant columns: the respondent number (No), the variable values (AutoCAD Drawing, Planning and Supervision Report, Adaptability), the cluster number (Cluster), and the category label (Category). This rearrangement uses the select() function from the dplyr library. Finally, the updated dataset is printed to the console with the print() function, giving a simple table of each respondent’s variable values and clustering results that can be used for further analysis or visualization. The tidy R output showing the variables, cluster IDs, and category labels is shown in Table 9.
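
The labeling and tidying steps can be sketched as below. This is an illustration only: the three rows are taken from Table 8 as a stand-in for the full dataset, cluster_id is an assumed name for the output of the clustering step, and base R column indexing is used here in place of the dplyr select() call reported in the text so that the snippet has no package dependencies.

```r
# Illustrative stand-in for the clustered data (Resp2, Resp7, Resp9 from Table 8)
participant_data <- data.frame(
  AutoCAD_Drafting = c(68, 69, 90),
  Planning_Supervision_Reports = c(65, 76, 80),
  Adaptability = c(66, 87, 85)
)
cluster_id <- c(1, 2, 3)  # assumed output of the clustering step

# Store the cluster number and attach a readable category label
participant_data$Cluster <- cluster_id
participant_data$Category <- factor(
  cluster_id,
  levels = c(1, 2, 3),
  labels = c("Rejected", "Under Consideration", "Accepted")
)

# Give every row a running respondent number
participant_data$No <- 1:nrow(participant_data)

# Keep only the relevant columns, in display order
tidy <- participant_data[, c("No", "AutoCAD_Drafting",
                             "Planning_Supervision_Reports",
                             "Adaptability", "Cluster", "Category")]
print(tidy)
```

Fixing the levels and labels explicitly in factor() ties each label to its cluster number, so the mapping does not depend on the order in which cluster numbers happen to appear in the data.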

Table 9. R-generated data table.

No  AutoCAD_Drafting  Planning_Supervision_Reports  Adaptability  Cluster  Category
1   68                65                            66            1        Rejected
2   69                74                            73            1        Rejected
3   68                82                            68            1        Rejected
4   63                75                            71            1        Rejected
5   62                72                            68            1        Rejected
6   60                90                            64            1        Rejected
7   69                84                            75            2        Under Consideration
8   66                63                            72            1        Rejected
9   73                87                            80            2        Under Consideration
10  75                93                            77            2        Under Consideration
11  69                76                            87            3        Accepted
12  65                64                            93            1        Rejected
13  71                71                            85            2        Under Consideration
14  78                72                            91            3        Accepted
15  71                73                            95            2        Under Consideration
16  73                86                            87            2        Under Consideration
17  84                63                            90            3        Accepted
18  75                80                            83            2        Under Consideration
19  90                61                            72            2        Under Consideration
20  93                62                            70            2        Under Consideration
21  92                75                            68            2        Under Consideration
22  84                90                            92            3        Accepted
23  95                73                            76            3        Accepted
24  90                80                            85            3        Accepted
25  94                70                            89            3        Accepted
26  90                68                            94            3        Accepted
27  87                84                            87            3        Accepted
28  95                85                            93            3        Accepted
29  92                85                            93            3        Accepted
30  92                61                            88            2        Under Consideration

Table 9 shows the output of the R program after the data were initialized and the clustering instructions were run according to the K-Means algorithm. The clustering results are then visualized in a 2D scatter plot built with ggplot2 (Figure 2). The X-axis represents AutoCAD Drawing and the Y-axis represents the Planning and Supervision Report, and the color of each point indicates its category: Rejected, Under Consideration, or Accepted. A plot title and axis labels were added for clarity, and a minimalist theme was applied for a clean look. This visualization helps in understanding how the data are distributed across the clusters.

Figure 2. K-means clustering visualization in a 2D scatter plot.

This plot displays job applicants based on AutoCAD Drawing scores (X-axis) and Planning/Supervision Report scores (Y-axis). Data points are assigned to three clusters—Rejected, Under Consideration, and Accepted—highlighting the separation of applicant competency profiles.
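A plot of this kind could be produced along the following lines. This is a hedged sketch rather than the authors' script: the column names follow Table 9, the color values and title text are assumptions, and the ggplot2 call is guarded so the example still runs where the package is not installed.

```r
# Six applicants from Table 9 (rows 1, 2, 9, 10, 28, 29)
plot_data <- data.frame(
  AutoCAD_Drafting = c(68, 69, 73, 75, 95, 92),
  Planning_Supervision_Reports = c(65, 74, 87, 93, 85, 85),
  Category = factor(
    c("Rejected", "Rejected", "Under Consideration",
      "Under Consideration", "Accepted", "Accepted"),
    levels = c("Rejected", "Under Consideration", "Accepted")
  )
)

# Assumed color mapping for the three categories
category_colors <- c(
  "Rejected" = "red",
  "Under Consideration" = "orange",
  "Accepted" = "green"
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_data,
              aes(x = AutoCAD_Drafting,
                  y = Planning_Supervision_Reports,
                  color = Category)) +
    geom_point(size = 3) +
    scale_color_manual(values = category_colors) +
    labs(title = "K-Means clustering of applicants",
         x = "AutoCAD Drawing",
         y = "Planning and Supervision Report") +
    theme_minimal()
  print(p)
}
```

Naming the entries of `category_colors` after the factor levels keeps each category tied to the same color even if one category happens to be empty in a given dataset.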

Figure 2 is a scatter plot of the clustering results obtained with the K-Means method. The horizontal axis (X) shows the AutoCAD Drawing score, while the vertical axis (Y) shows the Planning and Supervision Report score. The points on the graph are grouped into three cluster categories:

  • a. Red (Rejected): Participants whose scores place them in the lowest cluster.

  • b. Yellow (Under Consideration): Participants with intermediate scores who are under further consideration.

  • c. Green (Accepted): Participants whose scores place them in the highest cluster.

The plot thus shows how the two variables are distributed and related across the clustering categories. After the 2D visualization, a 3D visualization is produced. It uses three variables: AutoCAD Drawing on the x-axis, Planning and Supervision Report on the y-axis, and Adaptability on the z-axis. The color of each data point is determined by its cluster category, with red for “Rejected,” orange for “Under Consideration,” and green for “Accepted.”

A three-dimensional visualization using AutoCAD Drawing (X), Planning/Supervision Reports (Y), and Adaptability (Z) is displayed in Figure 3. This 3D scatter plot depicts the clustering results of the selection participants based on three variables: AutoCAD Drawing (AD), Planning/Supervision Reports (PSR), and Adaptability (A). Each variable is represented on one of the X, Y, and Z axes, with participant scores in the range of 60 to 100. The visualization maps participants into a three-dimensional space to reveal distribution patterns and relationships between the variables. The points are colored according to the clustering category: red for “Rejected,” orange for “Under Consideration,” and green for “Accepted.” These colors make it easy to see at a glance how participants with similar characteristics are grouped.

Figure 3. K-means clustering visualization in a 3D scatter plot.

This three-dimensional plot represents AutoCAD Drawing (AD), Planning/Supervision Reports (PSR), and Adaptability (A) simultaneously. The spatial separation of the three clusters demonstrates how multivariate skill combinations distinguish applicants across the Rejected, Under Consideration, and Accepted groups.

The distribution in the graph shows that participants with high scores on all three variables tend to be in the “Accepted” category, while lower scores are more often associated with the “Rejected” category. The “Under Consideration” category lies in the middle, reflecting participants with moderate scores who could move to either of the other categories depending on additional criteria. The separation between clusters is relatively clear, indicating that the clustering has grouped participants with similar characteristics. Points that lie far from the center of their cluster can be identified as outliers and flag participants with unusual score profiles. This visualization provides insight into the structure of the data, the relationships between variables, and the clustering results relevant to decision making. Next, the participants are sorted by cluster. The dataset sorted by cluster and category is provided in Table 10.

Table 10. Sorted dataset by cluster categories.

No  AutoCAD_Drafting  Planning_Supervision_Report  Adaptability  Cluster  Category
1   68                65                           66            1        Rejected
2   69                74                           73            1        Rejected
3   68                82                           68            1        Rejected
4   63                75                           71            1        Rejected
5   62                72                           68            1        Rejected
6   60                90                           64            1        Rejected
7   66                63                           72            1        Rejected
8   65                64                           93            1        Rejected
9   69                84                           75            2        Under Consideration
10  73                87                           80            2        Under Consideration
11  75                93                           77            2        Under Consideration
12  71                71                           85            2        Under Consideration
13  71                73                           95            2        Under Consideration
14  73                86                           87            2        Under Consideration
15  75                80                           83            2        Under Consideration
16  90                61                           72            2        Under Consideration
17  93                62                           70            2        Under Consideration
18  92                75                           68            2        Under Consideration
19  92                61                           88            2        Under Consideration
20  69                76                           87            3        Accepted
21  78                72                           91            3        Accepted
22  84                63                           90            3        Accepted
23  84                90                           92            3        Accepted
24  95                73                           76            3        Accepted
25  90                80                           85            3        Accepted
26  94                70                           89            3        Accepted
27  90                68                           89            3        Accepted
28  87                94                           87            3        Accepted
29  95                85                           93            3        Accepted
30  92                85                           93            3        Accepted
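A sort of this kind can be sketched with base R's order(). The example is illustrative only: it uses six rows from Table 9 (applicants 6 to 9, 11, and 12), and the column name `Original_No` is a hypothetical addition that keeps the pre-sort respondent number visible after the rows are renumbered.

```r
# Six applicants from Table 9, in their original order
results <- data.frame(
  Original_No = c(6, 7, 8, 9, 11, 12),
  AutoCAD_Drafting = c(60, 69, 66, 73, 69, 65),
  Planning_Supervision_Reports = c(90, 84, 63, 87, 76, 64),
  Adaptability = c(64, 75, 72, 80, 87, 93),
  Cluster = c(1, 2, 1, 2, 3, 1)
)

# Sort by cluster, keeping the original order within each cluster
sorted <- results[order(results$Cluster, results$Original_No), ]

# Renumber the rows 1..n, as a sorted table would show them
sorted$No <- seq_len(nrow(sorted))
rownames(sorted) <- NULL

print(sorted)
```

Using the original number as a second sort key makes the within-cluster order explicit instead of relying on how ties are handled.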

After the sorted cluster table, the hierarchical clustering results are displayed visually. The hierarchical clustering, shown as a PCA biplot with convex hulls that summarizes the separation of the three groups, is depicted in Figure 4. Visualizing hierarchical clustering is useful because it exposes the hierarchical structure of the data and makes it possible to identify relationships between groups based on distance or similarity. A dendrogram, the tree produced by hierarchical clustering, supports analyses such as choosing the number of clusters by cutting the tree at a given height. The approach also describes the data in an intuitive way, especially when the relationships in the data are not linear, and so gives further insight into the patterns and distribution of the dataset.

Figure 4. Hierarchical clustering visualization using PCA-projected dimensions.

This figure maps hierarchical clustering results onto principal component space, with convex hulls outlining the boundaries of the three clusters. The grouping pattern confirms consistency between hierarchical clustering and the K-Means classification of job applicants.
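The general recipe behind a figure of this type (hierarchical clustering, a cut into three groups, projection onto principal components, and convex hulls) can be sketched in base R as follows. The source does not state the linkage method, distance, or scaling that were used, so Ward linkage on standardized scores is an assumption here, as are all object names. The nine rows are taken from Table 9 (applicants 1, 2, 5, 9, 10, 16, 22, 28, and 29).

```r
# Nine applicants from Table 9
scores <- data.frame(
  AutoCAD_Drafting = c(68, 69, 62, 73, 75, 73, 84, 95, 92),
  Planning_Supervision_Reports = c(65, 74, 72, 87, 93, 86, 90, 85, 85),
  Adaptability = c(66, 73, 68, 80, 77, 87, 92, 93, 93)
)

# Assumption: standardize the scores, then use Ward linkage on Euclidean distance
scaled <- scale(scores)
hc <- hclust(dist(scaled), method = "ward.D2")

# Cut the tree into three groups
groups <- cutree(hc, k = 3)

# Project onto the first two principal components
pca <- prcomp(scaled)
coords <- pca$x[, 1:2]

# Convex hull of each group, as row indices into coords
hulls <- lapply(split(seq_len(nrow(coords)), groups), function(idx) {
  idx[chull(coords[idx, , drop = FALSE])]
})

# Draw the points and outline each group with its hull
plot(coords, col = groups, pch = 19, xlab = "PC1", ylab = "PC2")
for (h in hulls) polygon(coords[h, , drop = FALSE], border = "grey40")
```

Calling `plot(hc)` on the same object would draw the corresponding dendrogram, which is the view normally used to judge where to cut the tree.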

This hierarchical clustering visualization provides an overview of the grouping of construction consultant job applicants based on three dimensions: AutoCAD Drawing (AD), Planning/Supervision Reports (PSR), and Adaptability (A). The results illustrate the differences in characteristics among applicants in the three clusters, which are relevant for determining job acceptance.

  • a. Cluster 1 (Red):

    • i. This cluster has characteristics that are concentrated on the left side of the graph, with scores that tend to be lower than the other clusters.

    • ii. Participants in this cluster are likely to have suboptimal performance on the main criteria tested.

    • iii. It can be interpreted that these participants fall into the “Rejected” category, as they do not meet the minimum standards for acceptance.

  • b. Cluster 2 (Yellow):

    • i. Located in the middle of the graph, this cluster shows participants with average performance on the main criteria.

    • ii. Participants in this group have potential, but may require further evaluation or additional training to meet admission standards.

    • iii. These participants fall into the “Under Consideration” category, meaning they still have a chance of being accepted.

  • c. Cluster 3 (Green):

    • i. Located at the bottom right of the graph, this cluster consists of participants with higher scores across all dimensions.

    • ii. These participants demonstrate the best and most consistent performance on the main criteria, thus meeting the admission requirements.

    • iii. Participants in this cluster are categorized as “Accepted.”

The results of this study show that the K-Means Clustering algorithm is an effective tool for grouping job applicants based on three main competencies: AutoCAD Drawing, Planning and Supervision Reports, and Adaptability. This clustering produces three categories: “Rejected,” “Under Consideration,” and “Accepted.” Applicants with high adaptability scores tend to fall into the “Under Consideration” or “Accepted” categories, which is in line with the findings of Brown & Hesketh (2005), who emphasize the importance of adaptability in supporting successful career transitions. Groups with strong technical skills in AutoCAD drawing and planning reports also show a higher tendency to be accepted, in line with the needs of the modern skills-based labor market.

Cluster visualization through 2D and 3D scatter plots provides intuitive insights into the distribution patterns of applicants based on their main characteristics. These plots show that applicants with high scores on the technical and adaptability variables are more likely to be in the “Accepted” cluster. As explained by Zhang et al. (2024), data visualization allows for a deeper understanding of the relationships between variables, thereby supporting evidence-based decision making. Additionally, the plots show a clear separation between clusters, allowing companies to focus on the group of applicants with the greatest potential. The “Under Consideration” cluster gives companies an opportunity to identify promising candidates who require further evaluation or additional training. As highlighted by Rawat et al. (2024), targeted training can improve applicants’ abilities, ensuring that they can meet company standards in the long term. Thus, this approach not only helps companies understand the technical strengths of applicants but also provides insight into how these individuals can sustainably contribute to the organization, as also explained by Hurbean et al. (2023).

The data-driven approach applied in this study aligns with the views of Akkermans et al. (2024), who emphasize that data-driven analysis provides a more objective and strategic basis for decision-making in the field of human resources. By basing decisions on clustering data, companies can improve the efficiency of the selection process and focus on the most suitable candidates to meet the needs of construction projects.

Overall, this study confirms that technical and adaptive abilities evaluated through clustering contribute significantly to individual and organizational success. As stated by Donald et al. (2024), this approach not only identifies applicants’ strengths but also provides strategic guidance on how to integrate those skills into organizational needs. Therefore, this clustering is relevant not only for the selection process but also for data-driven workforce development.

4. Conclusions

The results of this study confirm that the K-Means Clustering algorithm is an effective tool for grouping job applicants at construction consulting firms based on three key competencies: AutoCAD Drawing, Planning and Supervision Reports, and Adaptability. These three variables reflect the primary needs of the construction industry, where technical skills and soft skills are important indicators in determining the suitability of applicants.

The clustering process resulted in three groups: “Rejected,” “Under Consideration,” and “Accepted.” The “Rejected” cluster includes applicants with overall scores below the expected standard, indicating significant deficiencies in one or more of the core competencies evaluated. The “Under Consideration” cluster identifies applicants with average scores, indicating potential but requiring further evaluation or additional training to meet professional standards. Meanwhile, the “Accepted” cluster includes applicants with the highest scores, indicating readiness for immediate placement in a professional work environment, as they have fully met competency expectations.

This clustering process has direct implications for decision-making in recruitment. By understanding the distribution of applicants across these three clusters, management can focus more on candidates with the best potential to contribute effectively to construction projects. Additionally, visualizing the clustering results in the form of 2D and 3D scatter plots provides deep insights into the relationships between variables, enabling companies to identify priority competencies that need to be improved through targeted training programs.

The application of the K-Means Clustering method to job applicant data offers strategic insights into the workforce selection process. These results align with Diván’s (2017) perspective that data-driven analysis can strengthen the recruitment process, particularly in identifying groups of individuals most suited to the company’s needs. Additionally, as emphasized by Rawat et al. (2024), structured training programs can play a crucial role in enhancing the capabilities of applicants in the “Under Consideration” cluster, ensuring their alignment with the organization’s long-term needs. These findings also reflect the importance of adaptability in facing job insecurity, as outlined by Van der Heijden et al. (2024), who show that this ability contributes significantly to career sustainability and reduced labor instability.

Although this clustering approach has proven effective in identifying potential applicants, its success is highly dependent on the accuracy and completeness of the data used. Further research could explore additional variables or integrate other clustering algorithms to improve the categorization process.

The unique contribution of this research lies in the application of K-Means Clustering to job applicant data in the construction sector—an industry that heavily relies on a balance between technical skills and adaptability. This focused approach provides a framework that can be adapted by other sectors requiring alignment between recruitment processes and specific competency needs. Beyond its technical contributions, this research introduces a data-driven approach that enables companies to enhance hiring accuracy, selection efficiency, and the development of competencies aligned with industry needs. As highlighted by Donald et al. (2024) and Akkermans et al. (2024), a data-driven approach provides a solid foundation for aligning individual potential with organizational goals, supporting both short-term recruitment needs and long-term sustainable workforce management. Therefore, this approach not only addresses challenges in evaluating applicants but also builds a foundation for sustainable talent management in the construction industry.

Ethical approval

Ethical review and approval were not required for this study because the researchers analyzed fully anonymized secondary data that had been lawfully transferred by CV Ardantama Putra Perkasa under a formal Data Usage Agreement (No. 12/X/S-K/APP/2024). According to Indonesian national research ethics regulations (Permenkes RI No. 74/2016, Article 11) and the general principles of the Declaration of Helsinki, research involving secondary anonymized non-clinical data that cannot identify individuals is exempt from institutional ethical review. Therefore, this study qualifies for an ethics exemption.

Informed consent

Informed consent for data use was not obtained directly by the researchers, as all data were collected by CV Ardantama Putra Perkasa under standard recruitment procedures.

The company confirmed, through the Data Usage Agreement (No. 12/X/S-K/APP/2024), that job applicants had authorized the use of their anonymized recruitment test results for evaluation and administrative purposes in accordance with Indonesian data protection regulations (UU ITE and PP 71/2019). Because the researchers received only anonymized secondary data and had no access to identifiable information, this study meets the criteria for consent exemption.

Clinical trial registration

Not applicable.

how to cite this article
Jaya DJ, Ramdhani WM, Wati E et al. Application of K-Means Clustering for Job Applicant Analysis in Construction Firms Using R [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:1388 (https://doi.org/10.12688/f1000research.172383.1)