Introduction

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.130245.1

Research Article

Articles

Modeling document labels using Latent Dirichlet allocation for archived documents in Integrated Quality Assurance System (IQAS)

[version 1; peer review: 1 approved with reservations]

Freddie Prianes [a, 1] (Conceptualization, Writing – Original Draft Preparation; https://orcid.org/0000-0003-0168-6182)

Thelma Palaoag [b, 2] (Writing – Review & Editing; https://orcid.org/0000-0002-5474-7260)

1 College of Computer Studies, Camarines Sur Polytechnic Colleges, Nabua, Camarines Sur, 4432, Philippines

2 College of Information Technology and Computer Science, University of the Cordilleras, Baguio City, Benguet, 2600, Philippines

a fprianes@cspc.edu.ph

b tdpalaoag@uc-bcf.edu.ph

No competing interests were disclosed.

27 1 2023

2023

105

18 1 2023

2023

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background: As part of the transition of every higher education institution in the Philippines into an intelligent campus, the Commission on Higher Education has launched a program for the development of smart campuses for state universities and colleges to improve operational efficiency in the country. In line with the commitment of Camarines Sur Polytechnic Colleges to improve its accreditation operations and to resolve the evident problems in the accreditation process, the researchers propose this study as part of an Integrated Quality Assurance System. The study aims to develop an intelligent model for categorizing and automatically tagging the archived documents used during accreditation.

Methods: As a guide in modeling the study, the researchers use an agile method as it promotes flexibility, speed, and, most importantly, continuous improvement in developing, testing, documenting, and even after delivery of the software. This method helped the researchers in designing the prototype with the implementation of the said model to aid the process in file searching and label tagging. Moreover, a computational analysis is also included to further understand the result from the devised model.

Results: From the processed sample corpus, the generated document labels are faculty, activities, library, research, and materials. The labels are based on their total relative frequencies, which are 0.009884, 0.008825, 0.007413, 0.007413, and 0.006354, respectively. Each relative frequency is the ratio of the number of times the term is used in the document to the total word count of the whole document.

Conclusions: The devised model and prototype support the organization in file storing and categorization of accreditation documents. Through this, it is easier to retrieve and classify the data, which is the main problem for the task group. Further, other patterns in clustering, modeling, and text classification can be integrated in the prototype.

Keywords: Latent Dirichlet allocation, document labels, natural language processing, accreditation, quality assurance, intelligent model, CSPC

The author(s) declared that no grants were involved in supporting this work.

Introduction

The creation of a smart campus is a step toward the creation of a smart city. Teaching and learning will be more difficult in the future as a result of the rapid advancements in information and communication technology (Kwok, 2015). With this rapid advancement, there is already a shift from the “smart” era to the “intelligent” era. A “smart phone”, “smart building” or “smart home” is one that is capable of adapting to changing conditions. The term “intelligent”, on the other hand, refers to more than just being smart; it refers to having the ability to think, reason, and understand, as well as being able to adapt to changing conditions. Applied to devices, “smart devices” can perform tricks, but “intelligent devices” can learn new tricks in response to their changing surroundings (Ng et al., 2010).

As part of the transition of every Higher Educational Institution (HEI) to an intelligent campus, the Commission on Higher Education (CHED) has launched a program under CHED Memorandum Order No. 9 s. 2020 for the development of smart campuses for State Universities and Colleges (SUCs). CHED releases a budget to assist SUCs in developing smart campuses, in which HEIs use next-generation digital technologies woven seamlessly into a well-architected infrastructure to build tools that enhance teaching and learning, research, and extension, and that improve operational efficiency. In addition, to maintain the quality of education in HEIs, CHED assigns the accountability and responsibility for assessing and certifying the quality of education in accredited programs and institutions to accrediting bodies, such as the Accrediting Agency of Chartered Colleges and Universities of the Philippines (AACCUP), the Philippine Association of Colleges and Universities Commission on Accreditation (PACU-COA), the Philippine Accrediting Association of Schools, Colleges, and Universities (PAASCU), and many others, as stated in CHED Memorandum Order No. 1 s. 200.

Achieving a smart/intelligent campus requires the institution to consider different areas. Based on the study of Ng et al. (2010), there are six main areas of intelligence, namely (1) iLearning, (2) iManagement, (3) iGovernance, (4) iSocial, (5) iHealth, and (6) iGreen. The accreditation process itself falls under iManagement; however, the overall purpose of accreditation touches all six areas.

Camarines Sur Polytechnic Colleges (CSPC), a state college, will be one of the settings for the initial implementation of the system. As part of CSPC's goal to be a center of development and center of excellence, the institution opted to join the CHED program to become one of the smart campuses in the region. In connection with this, the institution also undergoes continuous accreditation through AACCUP, as depicted in Figure 1, and ISO quality assurance in order to gain university status as the Polytechnic University of Bicol.

Figure 1. Accrediting Agency of Chartered Colleges and Universities of the Philippines (AACCUP) accreditation process.

The accreditation process, as shown in Figure 1, passes through the following phases:

(a) Application: An educational institution submits an application to AACCUP for accreditation.

(b) Institutional self-survey: After the application has been approved, the applicant institution conducts an internal evaluation through its internal accreditors to determine whether the program is ready for an external review.

(c) Preliminary survey visit: External accreditors evaluate the program for the first time. A program that passes the assessment is eligible for Candidate status, which is valid for two years.

(d) First formal survey visit: The program that has obtained Candidate status is reviewed and, if it has met a higher standard of excellence, it is given Level I Accredited status, which is valid for three years.

(e) Second survey visit: The accredited program is evaluated and, if it has met a higher standard of quality than in the preceding survey visit, it may be eligible for Level II Re-accreditation status, which is valid for five years.

(f) Third survey visit: A program that has held Level II Re-accreditation status for five years is reviewed at this level. It must perform exceptionally in four areas: instruction and extension, which are essential, and two other areas selected from among research, performance in licensure exams, faculty development, and links.

(g) Fourth survey visit: This is a more difficult level that, if passed, may grant the organization institutional accreditation status.

Accompanying the tedious accreditation process are the many documents that need to be produced. In CSPC's current accreditation undertakings, the majority of the tasks are done manually. Although tools for cloud storage and automation, such as Google Drive and Dropbox, are available, personnel still experience problems such as repetition of work, invalid instruments, inefficient resource utilization, and inefficient monitoring before, during, and after accreditation. Given these problems, an integrated system dedicated to quality assurance processes is a must.

In line with CSPC's goal of becoming a university and a smart/intelligent campus, the researchers propose a centralized system that caters to the needs of the institution in the accreditation process, which is part of quality assurance. Through this study, CSPC will move toward being a smart/intelligent campus by using the system in the iManagement area while, at the same time, addressing the problems encountered during the accreditation process.

Based on the problems identified and the commitment of the institution to be a smart/intelligent campus, the researchers propose this study as a component of the Integrated Quality Assurance System (IQAS) (RRID:SCR_023146). The study focuses on the document archive needed for the accreditation process. The system will have a repository of archived documents, which it will analyze through intelligent modeling and categorize by means of the extracted labels.

In general, the study aims to create a model in support of the categorization and automated tagging of the archived documents used during accreditation.

Related works

The exponential growth of electronic documents, most of them unstructured, makes finding a relevant document increasingly difficult and time-consuming. Text document classification, which organizes unstructured documents into pre-defined classes, is therefore crucial to information processing and retrieval (Akhter et al., 2020). Text documents pose a number of difficult problems for data processing when retrieving pertinent data. Topic modeling is one of the popular methods for theme-based information retrieval from biomedical documents, but finding the correct topics in such documents is difficult, and redundancy in biomedical text has a detrimental effect on text mining quality. As a result, the exponential rise of unstructured documents necessitates the development of machine learning approaches to topic modeling (Rashid et al., 2019). In the context of document categorization, Martinčić-Ipšić et al. (2019) conducted a comparative analysis of three models for the feature representation of text documents: the most popular family of bag-of-words models, the more recent continuous-space models Word2Vec and Doc2Vec, and a model that represents text documents as language networks.

Jatnika et al. (2019) used word representation techniques to analyze how the similarity between English words is calculated. Their work used the Word2Vec model to express words as vectors, with 320,000 English Wikipedia articles serving as the corpus, and calculated similarity values using cosine similarity. Real-world text categorization problems frequently involve a multitude of closely related categories arranged in a taxonomy or hierarchical structure, and hierarchical multi-label text categorization becomes more difficult when processing large sets of such categories (Ma et al., 2021). A popular technique for clustering functional data is the functional k-means clustering algorithm. This approach, however, does not take derivative information into account when determining how similar two functional samples are, even though derivative information is crucial for spotting differences in trend characteristics among functional data. Meng et al. (2018) therefore established a novel distance that includes derivative information when comparing functional samples. Owing to its capacity to analyze data from numerous sources or views, multi-view clustering has drawn growing interest in recent years. Zhang et al. (2018) presented a multi-view clustering method called Two-level Weighted Collaborative k-means (TW-Co-k-means) that simultaneously addresses consistency across different views and the weighting of views to improve clustering results. Its objective function leverages the unique information in each view while cooperatively utilizing the complementarity and consistency between views.

Pattern matching algorithms locate every instance of a constrained set of patterns inside an input text or document in order to examine the content of the documents. Bhagya Sri et al. (2018) examined four string matching techniques currently in use: the brute-force approach, the Knuth–Morris–Pratt (KMP) algorithm, the Boyer–Moore algorithm, and the Rabin–Karp algorithm. All the literature reviewed here concerns text clustering, modeling, and classification, and supports the feasibility of this study: the proposed intelligent model can be integrated to further assist the accreditation process of CSPC.

Methods

As a guide in modeling the study, the researchers used the agile method (https://dx.doi.org/10.17504/protocols.io.n2bvj82mxgk5/v2), as it promotes flexibility, speed, and, most importantly, continuous improvement in developing, testing, and documenting the software, even after delivery. Because the phases of this model are lightweight, teams are not bound by a rigid process with pre-set constraints and restrictions, as in models such as the waterfall model, and can make changes whenever they are needed. This flexibility at every stage fosters creativity and freedom within processes. Furthermore, development teams can modify and re-prioritize the backlog, allowing for speedy implementation (Trivedi, 2021).

Following the agile methodology, the researchers adapted the stages presented in Figure 2:

(1) Plan: The researchers collected previous documents involved in the accreditation process, such as compliance reports under the areas of student, faculty, facility, library, and administration. They also studied the existing problems in tracking, tagging, and duplication of these documents during the accreditation process.

(2) Design: The requirement specifications were identified in relation to the existing problems of the HEI in tracking, tagging, and duplication of the documents for accreditation and quality assurance. The researchers also designed the process of the intelligent model, which is the basis of document labelling.

(3) Develop: This stage is devoted to the creation of the prototype, which processes the documents in order to identify the proper label for each document.

(4) Deploy: The prototype undergoes a test run during this stage.

(5) Review: The researchers conduct a checklist-based functional review to check whether each component is running properly.

(6) Launch: The prototype is embedded into the local system of the HEI.

Figure 2. Agile methodology.

Results and discussion

Intelligent model

The results from this intelligent model are visualized as a super word vector and a histogram. The super word vector is presented as a word cloud that visualizes the frequency of the words in the corpus, and the histogram presents the relative frequency of the words per document segment in the form of line graphs. The extracted labels and the generated word vector and histogram are tagged and linked to the uploaded document, following the process shown in Figure 3. This model is implemented in the IQAS to assist in categorization and searching in the file repository of accreditation documents.

Figure 3. Process of the intelligent model.

Prototype

The design prototype presented in this section is focused on the label extraction feature for automatic tagging of the archive documents used in accreditation.

Upload and clean

As shown in Figure 4, this phase allows the user to upload and clean the document through tokenization. Once uploaded, the user may set the configuration in cleaning the document. The options are removing numbers, symbols, and duplicates, adding and uploading additional stopwords, and showing and downloading the pre-processed data. There are other useful features particularly in managing the stopwords, such as showing the list of default stopwords and deleting the added and uploaded stopwords.

Figure 4. Phase I—upload and clean snapshot.
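The cleaning options described for this phase can be illustrated with a minimal Python sketch. It is not the prototype's code: the function name `clean_document`, its option names, and the short default stopword list are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting.

```python
import re

# Hypothetical default stopword list; the prototype's actual list is not given in the text.
DEFAULT_STOPWORDS = {"the", "and", "of", "a", "an", "in", "to", "is", "for", "on"}


def clean_document(text, remove_numbers=True, remove_symbols=True,
                   remove_duplicates=False, extra_stopwords=()):
    """Tokenize text and apply the cleaning options described for Phase I."""
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in extra_stopwords}
    cleaned = []
    seen = set()
    for token in text.lower().split():
        if remove_symbols:
            token = re.sub(r"[^\w\s]|_", "", token)  # strip punctuation and symbols
        if remove_numbers:
            token = re.sub(r"\d+", "", token)        # strip digits
        if not token or token in stopwords:
            continue                                 # drop empty tokens and stopwords
        if remove_duplicates:
            if token in seen:
                continue                             # keep only the first occurrence
            seen.add(token)
        cleaned.append(token)
    return cleaned
```

Calling `clean_document(text, extra_stopwords=["held"])` corresponds to adding a user-supplied stopword before pre-processing; the returned token list plays the role of the downloadable pre-processed data.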

Setting up parameters

Phase II is intended for setting up the parameters for topic modeling, as presented in Figure 5. Right after uploading and cleaning the document, the user can set the topic modeling parameters that will be used in identifying and extracting the labels. The parameters are the desired number of topics, the number of iterations, the number of words to be generated per topic, the optimization interval, and the model's name. These parameters are the primary factors in modeling the topics and identifying the labels for automatic tagging.

Figure 5. Phase II—setting up parameters snapshot.
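To show how parameters of this kind drive a topic model, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling. The text does not state which inference method or library the prototype uses, so the sampler, the class `TopicModelParams`, the function `fit_lda`, and the hyperparameters `alpha` and `beta` are all assumptions made for illustration. The optimization interval is only recorded; hyperparameter optimization is not implemented in this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class TopicModelParams:
    # Illustrative field names mirroring the Phase II options.
    num_topics: int = 5
    num_iterations: int = 200
    num_top_words: int = 5
    optimize_interval: int = 0  # recorded only; not used by this sketch
    model_name: str = "sample-model"


def fit_lda(docs, params, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the top words of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    K, V = params.num_topics, len(vocab)
    doc_topic = [[0] * K for _ in docs]       # tokens in document d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    topic_total = [0] * K                     # tokens assigned to topic k
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(params.num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    topics = []
    for k in range(K):
        ranked = sorted(range(V), key=lambda v: (-topic_word[k][v], vocab[v]))
        topics.append([vocab[v] for v in ranked[:params.num_top_words]])
    return topics
```

Here `docs` is a list of token lists, such as the output of the cleaning phase. A fixed `seed` makes runs repeatable, which helps when comparing parameter settings.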

Extract label

This phase, as shown in Figure 6, presents the result of processing the pre-processed document with the parameters set in the previous phase. It shows the number of documents uploaded, the total number of words in the document, the number of unique words, the vocabulary density, the readability index, the average words per sentence, and, most importantly, the most frequent words in the corpus. These frequently used words are extracted to serve as the labels for automatic tagging later on. The user can also set the number of items shown in the list of most frequent words.

Figure 6. Phase III—extract label snapshot.
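The word-level statistics listed for this phase can be sketched in a few lines of Python. The function name `corpus_summary` and the dictionary keys are hypothetical; the readability index and sentence-based measures are left out because they depend on sentence boundary detection.

```python
from collections import Counter


def corpus_summary(tokens, top_n=5):
    """Word-level summary statistics of the kind listed for Phase III."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "total_words": total,
        "unique_words": len(counts),
        # Unique words divided by total words.
        "vocabulary_density": len(counts) / total if total else 0.0,
        # The top_n most frequent words are the candidate labels.
        "most_frequent": counts.most_common(top_n),
    }
```

The `top_n` argument corresponds to the user-configurable number of items shown among the most frequent words.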

Word cloud

Along with the results of phase III, a word cloud is also generated. Phase IV, as depicted in Figure 7, is a super word vector view of the frequent words in the processed corpus. The most evident words in the word cloud are the frequently used words from the previous phase, which are faculty, activities, library, research, and materials. The font size of the word is based on how many times this word is used in the corpus.

Figure 7. Phase IV—word cloud snapshot.
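The relationship between word count and font size can be sketched as follows. The text states only that the font size depends on how often a word is used; the linear scaling, the point sizes, and the function name `font_sizes` are assumptions for illustration.

```python
def font_sizes(word_counts, min_size=12, max_size=48):
    """Linearly scale each word's count to a font size in points."""
    if not word_counts:
        return {}
    lo, hi = min(word_counts.values()), max(word_counts.values())
    span = hi - lo
    return {
        # If all counts are equal, every word gets the largest size.
        word: max_size if span == 0
        else round(min_size + (count - lo) * (max_size - min_size) / span)
        for word, count in word_counts.items()
    }
```

With the label counts from the sample corpus (faculty 28, activities 25, library 21, research 21, materials 18), the most frequent label receives the largest size and equally frequent labels receive the same size.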

LDA visualization

With the result generated during phase III, this phase provides the histogram presentation of the sample processed corpus with the support of the LDA visualization, as shown in Figure 8. The line graph provides the relative frequencies of each generated label per document segment.

Figure 8. Phase V—LDA visualization snapshot.

Auto-tagging to uploaded document

After the five phases, automatic tagging of the generated labels takes place, as shown in Figure 9. The document is then stored in the file repository of the IQAS. The uploaded document will have corresponding metadata such as filename, file size, user, date created, tags, and the link of the processed model. The filename can also be updated, and adding and removing tags is also possible.

Figure 9. Phase VI—auto-tagging of labels in the uploaded document snapshot.
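A possible shape for the stored record is sketched below. The class `ArchivedDocument`, its field names, and the helper `auto_tag` are hypothetical; they simply mirror the metadata listed above (filename, file size, user, date created, tags, and link to the processed model) and the ability to add and remove tags.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ArchivedDocument:
    filename: str
    file_size: int           # size in bytes
    user: str
    date_created: datetime
    model_link: str          # link to the processed model
    tags: list = field(default_factory=list)

    def add_tag(self, tag):
        tag = tag.strip().lower()
        if tag and tag not in self.tags:
            self.tags.append(tag)

    def remove_tag(self, tag):
        tag = tag.strip().lower()
        if tag in self.tags:
            self.tags.remove(tag)


def auto_tag(document, labels):
    """Attach every extracted label to the document as a tag."""
    for label in labels:
        document.add_tag(label)
    return document
```

Normalizing tags to lowercase and skipping duplicates is a design choice of this sketch; it keeps the same label from being attached twice.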

Computational analysis

For better understanding, this section provides the computational analysis of the actual result based on the processed document.

In reference to the results of phase III, four significant results are evident in Figure 6. Vocabulary density is the ratio of the number of unique words to the total number of words in the corpus. Both values come from the phase III summary of the processed corpus, which contains 720 unique words out of 2,833 words in total; for the sample computation see Equation 1.

\[
\mathrm{VD} = \frac{\text{Number of unique words (UW)}}{\text{Total word count (WC)}} = \frac{720}{2{,}833} \approx 0.254
\]

Equation 1. Vocabulary density computation.

The vocabulary density of the processed corpus is 0.254, which implies that the corpus contains complex text with many unique words. Moreover, the readability index and the average words per sentence are computed using Java's BreakIterator, a locale-sensitive class that maintains an imaginary cursor pointing to the current boundary in a string of natural-language text. It handles different kinds of boundaries: characters, words, sentences, and potential line breaks. These boundaries are the basis for the readability index and the average words per sentence, which are 16.106 and 21.5, respectively. Frequently used words are identified from the number of times each word occurs in the processed corpus.
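The boundary-based computation of average words per sentence can be approximated in Python as shown below. This is only a stand-in: it uses a naive regular expression rather than the locale-sensitive boundary analysis described above, so its sentence counts can differ on abbreviations and similar cases. The function names are hypothetical, and the readability index is not reproduced because the text does not say which formula is used.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def average_words_per_sentence(text):
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    # Words are approximated by whitespace-separated tokens.
    return sum(len(s.split()) for s in sentences) / len(sentences)
```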

The LDA visualization plots the relative frequency of each word per document segment, as shown in Figure 8. To compute the relative frequency, the number of document segments must first be chosen. For this study, the researchers divided the document into 10 segments. The number of words per segment is derived from the total word count. The prototype then counts how many times a particular word is used in each segment and divides that count by the total word count. For the sample computation, see Equations 2 and 3.

\[
\mathrm{WS} = \frac{\text{Total word count (WC)}}{\text{Desired number of segments (DNS)}} = \frac{2{,}833}{10} = 283.3^{*}
\]

* First seven segments contain 283 words while the last three segments contain 284 words.

Equation 2. Words per segment computation.

\[
\mathrm{RF} = \frac{\text{Word count per segment (WCS)}}{\text{Total word count (WC)}} = \frac{2}{2{,}833} \approx 0.0007060
\]

Equation 3. Sample computation for relative frequency (word: research, 2nd segment).
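The segmentation and relative frequency steps can be sketched in Python as follows. The function names are hypothetical, and the rule that leftover words go to the last segments is inferred from the note that the first seven segments hold 283 words and the last three hold 284.

```python
def split_into_segments(tokens, num_segments=10):
    """Split tokens into near-equal segments; leftover tokens go to the last segments."""
    base, extra = divmod(len(tokens), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        size = base + (1 if i >= num_segments - extra else 0)
        segments.append(tokens[start:start + size])
        start += size
    return segments


def relative_frequencies(tokens, word, num_segments=10):
    """Count of `word` in each segment divided by the total word count."""
    total = len(tokens)
    if total == 0:
        return [0.0] * num_segments
    return [seg.count(word) / total
            for seg in split_into_segments(tokens, num_segments)]
```

For a corpus of 2,833 tokens, `split_into_segments` yields seven segments of 283 tokens followed by three of 284, and a word that occurs twice in a segment has a relative frequency of 2 / 2,833 for that segment.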

For the overall results of the histogram, Tables 1 and 2 present the tabular representation of the relative frequency per label and per segment.

Table 1. Word count of labels per document segment.

Label	Segment 1	Segment 2	Segment 3	Segment 4	Segment 5	Segment 6	Segment 7	Segment 8	Segment 9	Segment 10	Total count
Faculty	1	13	3	5	1	2	1	1	0	1	28
Activities	0	3	5	1	6	0	0	2	2	6	25
Library	0	0	0	0	1	14	5	1	0	0	21
Research	0	2	2	12	0	1	0	0	4	0	21
Materials	3	6	0	3	0	0	0	0	3	3	18

Table 2. Relative frequency of labels per document segment.

Label	Segment 1	Segment 2	Segment 3	Segment 4	Segment 5	Segment 6	Segment 7	Segment 8	Segment 9	Segment 10	Total relative frequency
Faculty	0.000353	0.004589	0.001059	0.001765	0.000353	0.000706	0.000353	0.000353	0	0.000353	0.009884
Activities	0	0.001059	0.001765	0.000353	0.002118	0	0	0.000706	0.000706	0.002118	0.008825
Library	0	0	0	0	0.000353	0.004942	0.001765	0.000353	0	0	0.007413
Research	0	0.000706	0.000706	0.004236	0	0.000353	0	0	0.001412	0	0.007413
Materials	0.001059	0.002118	0	0.001059	0	0	0	0	0.001059	0.001059	0.006354
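The relative frequencies can be recomputed from the word counts in Table 1 by dividing each count by the total word count of 2,833. The short Python sketch below does this and ranks the labels by total relative frequency; the names `COUNTS` and `relative_frequency_table` are illustrative, and rounding to six decimal places is an assumption based on the precision shown in the tables.

```python
TOTAL_WORDS = 2833

# Word counts per document segment, taken from Table 1.
COUNTS = {
    "Faculty":    [1, 13, 3, 5, 1, 2, 1, 1, 0, 1],
    "Activities": [0, 3, 5, 1, 6, 0, 0, 2, 2, 6],
    "Library":    [0, 0, 0, 0, 1, 14, 5, 1, 0, 0],
    "Research":   [0, 2, 2, 12, 0, 1, 0, 0, 4, 0],
    "Materials":  [3, 6, 0, 3, 0, 0, 0, 0, 3, 3],
}


def relative_frequency_table(counts, total_words):
    """Return {label: (per-segment relative frequencies, total relative frequency)}."""
    table = {}
    for label, per_segment in counts.items():
        freqs = [round(c / total_words, 6) for c in per_segment]
        table[label] = (freqs, round(sum(per_segment) / total_words, 6))
    return table


table = relative_frequency_table(COUNTS, TOTAL_WORDS)
# Labels ordered from highest to lowest total relative frequency.
ranked = sorted(table, key=lambda label: table[label][1], reverse=True)
```

Computing each total from the summed counts, rather than from the already rounded per-segment values, avoids accumulating rounding error.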

Conclusions

CSPC is in an exploratory phase when it comes to solving this particular problem involving accreditation. It is evident that there are problems encountered by the organization pertaining to the accreditation process. Therefore, the researchers devised a model that supports the organization for accreditation. In addition, the researchers also designed a prototype with the implementation of the model to help the organization through the process. As a result, it is easier to retrieve and classify the data, which is the main problem of the task group. Furthermore, other text classification patterns may also be integrated into the system and the results compared with given parameters.

Software availability

Software available from: https://github.com/CraigList056/iqas/tree/v1.0.0-alpha

Source code available from: https://github.com/CraigList056/iqas

Archived source code at time of publication: https://www.doi.org/10.5281/zenodo.7507492

License: MIT License

Acknowledgements

We would like to express our great appreciation to our colleagues and friends for their undeniable support and for uplifting our spirits to make this research paper possible. We would like to also extend our appreciation to our respective institutions (Camarines Sur Polytechnic Colleges and University of the Cordilleras), which have been our second home and witness of our efforts during the research process.

References

Akhter, Jiangbin, Naqvi: Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network. IEEE Access. 2020;8(Ml):42689–42707. 10.1109/ACCESS.2020.2976744

Bhagya Sri, Bhavsar, Narooka: String Matching Algorithms. International Journal Of Engineering And Computer Science. 2018;7(03):23769–23772. 10.18535/ijecs/v7i3.19

CraigList056: CraigList056/iqas: Initial Release (v1.0.0-alpha). Zenodo. 2023. 10.5281/zenodo.7507492

Trivedi: Agile Methodologies. International Journal of Computer Science & Communication. 2021;12(2):91–100.

Jatnika, Bijaksana, Suryani: Word2vec model analysis for semantic similarities in English words. Procedia Computer Science. 2019;157:160–167. 10.1016/j.procs.2019.08.153

Kwok: A vision for the development of i-campus. Smart Learning Environments. 2015;2(1):1–12. 10.1186/s40561-015-0009-8

Ma, Liu, Zhao: Hybrid embedding-based text representation for hierarchical multi-label text classification. Expert Systems with Applications. 2021;187(July 2020):115905. 10.1016/j.eswa.2021.115905

Martinčić-Ipšić, Miličić, Todorovski: The influence of feature representation of text on the performance of document classification. Applied Sciences (Switzerland). 2019;9(4). 10.3390/app9040743

Meng, Liang, Cao: A new distance with derivative information for functional k-means clustering algorithm. Information Sciences. 2018;463-464:166–185. 10.1016/j.ins.2018.06.035

Ng JWP, Azarmi, Leida: The intelligent campus (iCampus): End-to-end learning lifecycle of a knowledge ecosystem. Proceedings - 2010 6th International Conference on Intelligent Environments, IE 2010. 2010;332–337. 10.1109/IE.2010.68

Rashid, Adnan Shah, Irtaza: Topic Modeling Technique for Text Mining over Biomedical Text Corpora through Hybrid Inverse Documents Frequency and Fuzzy K-Means Clustering. IEEE Access. 2019;7:146070–146080. 10.1109/ACCESS.2019.2944973

Zhang, Wang, Huang: TW-Co-k-means: Two-level weighted collaborative k-means for multi-view clustering. Knowledge-Based Systems. 2018;150:127–138. 10.1016/j.knosys.2018.03.009

10.5256/f1000research.142987.r165836

Reviewer response for version 1

Shahid Naseem [1], Referee (https://orcid.org/0000-0002-0791-541X)

1 Department of Information Sciences, Division of Science & Technology, University of Education, Lahore, Pakistan

Competing interests: No competing interests were disclosed.

29 3 2023

2023

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve-with-reservations

By looking at paper overall structure, presentation and above all the provided contents, I would say the authors of the paper requires minor changes to accept it for indexing.

In this study, the indexing of the tagging/titles, sub-titles is missing.

Number of sentences and grammar mistakes in different sections of the paper.

In result section, number of students studied in the batch to be accredited, financial statement, and infrastructure must also be included because these documents are also required in accreditation process.

In related work, there should be structured or labelled data instead of pre-defined data items.

In second paragraph of related work, the authors defined four types of machine learning techniques, but didn’t explain for what purpose, these four techniques were used in this study.

In figure 2, there should be one more step i.e. maintenance included.

All the equations used in this study must be numbering.

In equation 1, explain the procedure to calculate VD, from where we get the used valued to calculate VD.

Numbers of references used to validate this study are too short. There must be some more literature review to authenticate this study be used in this research.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Artificial Intelligence, Machine Learning, and Deep learning for analyzing healthcare data

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Freddie Prianes, Camarines Sur Polytechnic Colleges, Philippines

Competing interests: No competing interests were disclosed.

14 1 2024

We would like to express our sincere appreciation for your review of our paper. Your comments and suggestions are highly valued. Our responses are as follows:

1. In this study, the indexing of the tagging/titles, sub-titles is missing.

This comment is somewhat unclear to us. If it pertains to the generated tags/titles or sub-titles for the indexing of the documents, these are covered in Figure 9 (Phase VI).

2. Number of sentences and grammar mistakes in different sections of the paper.

Accomplished

3. In result section, number of students studied in the batch to be accredited, financial statement, and infrastructure must also be included because these documents are also required in accreditation process.

Since the study is exploratory, we focused first on the areas of Faculty and Library. Upon implementation of the prototype, we will include the other areas, i.e., Students, Finance, and Infrastructure.

4. In related work, there should be structured or labelled data instead of pre-defined data items.

Accomplished

5. In second paragraph of related work, the authors defined four types of machine learning techniques, but didn’t explain for what purpose, these four techniques were used in this study.

Accomplished

6. In figure 2, there should be one more step i.e. maintenance included.

We include maintenance as a sub-phase of launch. We did not elaborate on this phase because we are conducting separate research on prototype testing and implementation, which will cover the maintenance procedure.

7. All the equations used in this study must be numbering.

We believe that all the equations in this study have values and have been numbered.

8. In equation 1, explain the procedure to calculate VD, from where we get the used valued to calculate VD.

Accomplished

9. Numbers of references used to validate this study are too short. There must be some more literature review to authenticate this study be used in this research.

Accomplished

We have already submitted version 2 of our paper. Thank you.