Research Article
Revised

Developing an Application for Document Analysis with Latent Dirichlet Allocation: A Case Study in Integrated Quality Assurance System

[version 3; peer review: 1 approved, 1 approved with reservations, 2 not approved]
Previously titled: Modeling document labels using Latent Dirichlet allocation for archived documents in Integrated Quality Assurance System (IQAS)
PUBLISHED 09 Apr 2024

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background

As part of the transition of every higher education institution in the Philippines into an intelligent campus, the Commission on Higher Education has launched a program for the development of smart campuses in state universities and colleges to improve operational efficiency in the country. In line with the commitment of Camarines Sur Polytechnic Colleges to improve its accreditation operations and to resolve the evident problems in the accreditation process, the researchers propose this study as part of an Integrated Quality Assurance System that aims to develop an intelligent model for categorizing and automating the tagging of archived documents used during accreditation.

Methods

As a guide in modeling the study, the researchers used an agile method, as it promotes flexibility, speed, and, most importantly, continuous improvement in developing, testing, and documenting the software, even after delivery. This method helped the researchers design a prototype that implements the model to aid file searching and label tagging. A computational analysis is also included to further explain the results of the devised model.

Results

From the processed sample corpus, the extracted document labels are faculty, activities, library, research, and materials. These labels are based on their total relative frequencies of 0.009884, 0.008825, 0.007413, 0.007413, and 0.006354, respectively, each computed as the ratio of the number of times the term is used in the document to the total word count of the whole document.

Conclusions

The devised model and prototype support the organization in storing and categorizing accreditation documents. This makes retrieving and classifying the data easier, which was the main problem of the task group. Further, other clustering, modeling, and text classification techniques can be integrated into the prototype.

Keywords

Latent Dirichlet allocation, document labels, natural language processing, accreditation, quality assurance, intelligent model, CSPC

Revised Amendments from Version 2

Based on the feedback of the reviewer, we have adopted the new title as suggested.

See the authors' detailed response to the review by Zbigniew H. Gontar
See the authors' detailed response to the review by Shahid Naseem

Introduction

The creation of a smart campus is a step toward the creation of a smart city. Teaching and learning will become more challenging in the future due to the rapid advancements in information and communication technology (Kwok, 2015). With this rapid advancement, there is already a shift from the “smart” era to the “intelligent” era. A “smartphone,” “smart building,” or “smart home” is capable of adapting to changing conditions. The term “intelligent,” on the other hand, refers to more than just being smart: it refers to the ability to think, reason, understand, and adapt to changing conditions. Applied to devices, “smart devices” can perform tricks, but “intelligent devices” can learn new tricks in response to their changing surroundings (Ng et al., 2010).

As part of every Higher Education Institution (HEI)’s transition to an intelligent campus, the Commission on Higher Education (CHED) has launched a program under CHED Memorandum Order No. 9 s. 2020 for developing smart campuses in State Universities and Colleges (SUCs). In fact, CHED releases a budget to assist SUCs in developing smart campuses in which HEIs weave next-generation digital technologies seamlessly into a well-architected infrastructure, developing tools to enhance teaching and learning, research, and extension, as well as to improve operational efficiency. At the same time, to maintain the quality of education in HEIs, CHED assigns accountability and responsibility to accrediting bodies, such as the Accrediting Agency of Chartered Colleges and Universities of the Philippines (AACCUP), the Philippine Association of Colleges and Universities Commission on Accreditation (PACU-COA), the Philippine Accrediting Association of Schools, Colleges, and Universities (PAASCU), and many others, to assess and certify the quality of education in accredited programs/institutions, as stated in CHED Memorandum Order No. 1 s. 2005.

Achieving a smart/intelligent campus requires the institution to consider different areas. Based on the study of Ng et al., there are six main areas of intelligence, namely (1) iLearning, (2) iManagement, (3) iGovernance, (4) iSocial, (5) iHealth, and (6) iGreen. The accreditation process alone falls under iManagement; however, the overall aspect and purpose of accreditation cut across all the areas.

As a state college, Camarines Sur Polytechnic Colleges (CSPC) will be one setting for the initial implementation of the system. As part of CSPC’s goal to be a center for development and a center of excellence, the institution opted to go along with the launch of the CHED program to become one of the smart campuses in the region. In connection with this, the institution also undergoes continuous accreditation through AACCUP, as depicted in Figure 1, and ISO quality assurance to achieve this goal and gain university status as the Polytechnic University of Bicol.


Figure 1. Accrediting Agency of Chartered Colleges and Universities of the Philippines (AACCUP) accreditation process.

As shown in Figure 1, the accreditation process passes through various phases or actions: (a) Application: an educational institution submits an application to AACCUP for accreditation. (b) Institutional self-survey: after the application has been approved, the applicant institution is expected to conduct an internal evaluation by its internal accreditors to determine whether the program is ready for an external review. (c) Preliminary survey visit: external accreditors evaluate the program for the first time; after passing the assessment, the program is eligible for Candidate status, which is good for two years. (d) First formal survey visit: the program that has obtained Candidate status is reviewed, and if it has met a higher standard of excellence, it is given Level I Accredited status, valid for three years. (e) Second survey visit: an accredited program is evaluated, and if it has met the standards for a greater degree of quality than in the preceding survey visit, it may be eligible for Level II Re-accreditation status, valid for five years. (f) Third survey visit: after five years of holding Level II Re-accreditation status, the program completes the accreditation level; it is reviewed and must perform exceptionally in four categories, namely instruction and extension, which are essential, and two other areas selected from among research, performance in licensure examinations, faculty development, and linkages. (g) Fourth survey visit: a more difficult level that, if passed, may grant the organization institutional accreditation status.

The tedious accreditation process is accompanied by the many documents that need to be produced. In CSPC’s current accreditation undertakings, most of the tasks have been done manually. Though tools are available for cloud storage and automation, like Google Drive and Dropbox, personnel still experience problems such as repetition of work, invalid instruments, inefficient resource utilization, and inefficient monitoring before, during, and after the accreditation. Given these problems, an integrated system dedicated to quality assurance processes is a must.

In pursuit of CSPC’s goal of becoming a university and a smart/intelligent campus, the researchers propose a centralized system that will cater to the institution’s needs in the accreditation process, which is part of quality assurance. Through this study, CSPC will benefit as a smart/intelligent campus by using the system in the iManagement area while, at the same time, addressing the problems encountered during the accreditation process.

Based on the problems identified and the institution’s commitment to becoming a smart/intelligent campus, the researchers propose this study as a component of the Integrated Quality Assurance System (IQAS) (RRID: SCR_023146). The study focuses on the documents required for the accreditation process. The system will maintain a repository of archived documents and analyze them through intelligent modeling, categorizing the documents by the extracted labels.

The study aims to create a model supporting the categorization and automated tagging of archived documents used during accreditation.

Related works

The exponential growth of electronic documents makes finding a relevant document in unstructured data more difficult and time-consuming. Text document classification, which organizes unstructured documents into pre-defined classes, is crucial to information processing and retrieval (Akhter et al., 2020). Text documents pose several difficult data processing problems for retrieving pertinent data. One popular method for information retrieval based on themes in biomedical documents is topic modeling, where finding the correct subjects in the documents is difficult; additionally, redundancy in biomedical text documents has a detrimental effect on text mining quality. As a result, the exponential rise of unstructured documents necessitates developing topic-modeling machine-learning approaches (Rashid et al., 2019). In the framework of document categorization, Martinčić-Ipšić et al. (2019) conducted a comparative analysis of three models for feature representation of text documents: the most popular family of bag-of-words models, the recently proposed continuous-space models Word2Vec and Doc2Vec, and a model based on representing text documents as language networks.

Based on the previous articles, unstructured text data refers to textual information that lacks a predefined structure, making it challenging to analyze and extract meaningful insights. Latent Dirichlet Allocation (LDA) is a probabilistic generative model that can be used for topic modeling, a technique that helps uncover latent topics within a collection of documents (Curiskis et al., 2020). When applied to document labeling, LDA can assist in organizing unstructured text into structured representations based on underlying topics (Maier et al., n.d.).
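
A minimal sketch of this idea, using the gensim library on a toy corpus, is shown below; the documents, topic count, and training settings are illustrative assumptions, not the configuration used in the IQAS prototype.

```python
# Minimal LDA topic-modeling sketch with gensim (illustrative only).
from gensim import corpora
from gensim.models import LdaModel

documents = [
    "faculty development seminar on research and instruction",
    "library materials and resources for accreditation activities",
    "research publications of the faculty in the institution",
]

# Tokenize, then build the dictionary and bag-of-words corpus.
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LDA model and inspect the discovered topics;
# the top words per topic are candidate document labels.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=1)
for topic_id, words in lda.print_topics(num_words=3):
    print(topic_id, words)
```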

The LDA algorithm is well suited for text mining: it enables the user to extract important content from text data sets (Tong & Zhang, 2016). Apart from this, it converts inaccessible data into a structured format that can be used for further analysis, and it surfaces facts and relationships from large data sets (Yehia et al., 2016). This information is extracted and converted into structured data for visualization, analysis, and integration, and is refined using machine-learning methods (Gnanavel et al., 2022).

The output of LDA is structured data that organizes documents into topics, allowing for the identification of the most significant topics in the corpus and their associated words. This structured data provides insights into the underlying structure and themes of the corpus, enabling further analysis and interpretation (Liu et al., 2023).

By employing LDA, unstructured text data is transformed into structured representations through the identification of latent topics, facilitating improved organization, retrieval, and analysis of textual information (Camilleri & Miah, 2021). The labeled documents provide a meaningful and interpretable way to understand the content and themes within the corpus (Markowitz, 2021).

One study used word representation techniques to analyze how the similarity between English words is calculated. A similar work used the Word2Vec paradigm to express words as vectors; the 320,000 English Wikipedia articles included in that study’s model served as the corpus, and the similarity value was calculated using cosine similarity (Jatnika et al., 2019). Real-world text categorization problems frequently involve a multitude of closely related categories arranged in a taxonomy or hierarchical structure, and hierarchical multi-label text categorization has grown more difficult when processing huge sets of closely related categories (Ma et al., 2021).

A popular technique for clustering functional data is the functional k-means algorithm, which does not further consider derivative information when determining how similar two functional samples are, even though derivative information is crucial for spotting variances in trend characteristics among functional data; one paper establishes a novel distance that compares functional samples by including their derivative information (Meng et al., 2018). Due to its capacity to analyze data from numerous sources or views, multi-view clustering has drawn growing interest in recent years. One study presented a multi-view clustering method called Two-level Weighted Collaborative k-means (TW-Co-k-means) to simultaneously address consistency across different views and weigh the views to improve clustering results; its objective function leverages the unique information in each view while cooperatively exploiting the complementarity and consistency between views (Zhang et al., 2018).

Pattern matching algorithms locate every instance of a constrained set of patterns inside an input text or document to examine its content. One study compared four string matching techniques now in use: the Brute Force approach, the Knuth–Morris–Pratt (KMP) algorithm, the Boyer–Moore algorithm, and the Rabin–Karp algorithm (Bhagya Sri et al., 2018). Analogous to the way the researcher explores all possible combinations using the functionality of LDA, the Brute Force approach exhaustively considers all possible topics and their distribution in the document (Robinson & Quinn, 2018); however, it is inefficient, much like considering every possible combination of words as potential topics (Murray et al., 2022). The KMP algorithm’s efficiency in skipping unnecessary comparisons (Lu, 2019) parallels how LDA efficiently identifies topics in documents by leveraging models, optimizing the search for meaningful patterns (topics) in text (Rawat et al., 2022). In terms of skipping portions of the text based on information gathered during preprocessing, LDA skips irrelevant words and focuses on key terms that contribute to topic identification (Hwang et al., 2023), similar to the Boyer–Moore algorithm skipping portions of text during the matching process (Danvy & Rohde, 2006). Moreover, Rabin–Karp’s hashing for efficient matching (Siahaan, 2018) is akin to LDA’s modeling to identify relevant topics in documents while quickly bypassing irrelevant information (Asmussen & Møller, 2019).

All the literature listed shares similarities in text clustering, modeling, and classification. It supports the feasibility of the study and shows that the proposed intelligent model can be integrated to further assist in the accreditation process at CSPC.
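
As an illustration of the exact-matching side of these comparisons, the sketch below is a textbook Knuth–Morris–Pratt implementation in Python; it is provided for reference only and is not code from the IQAS prototype.

```python
def kmp_search(text, pattern):
    """Return the start indices of every occurrence of pattern in text."""
    if not pattern:
        return []
    # Failure table: length of the longest proper prefix of the pattern
    # that is also a suffix of pattern[:i + 1].
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, reusing the table to skip redundant comparisons.
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("accreditation documentation", "at"))  # [8, 22]
```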

Methods

As a guide in modeling the study, the researchers used the agile method (https://dx.doi.org/10.17504/protocols.io.n2bvj82mxgk5/v2), as it promotes flexibility, speed, and, most importantly, continuous improvement in developing, testing, and documenting the software, even after delivery. Since the phases of this model are light, teams are not bound to a rigid, systematic process with pre-set constraints and restrictions, as in some other models like the waterfall model, and can adjust to changes whenever needed. This flexibility at every stage propagates creativity and freedom within processes. Furthermore, development teams can modify and re-prioritize the backlog, allowing for speedy implementation (Trivedi, 2021).

Following the agile methodology, the researchers adopted the stages presented in Figure 2. These are (1) Plan: the researchers collected previous documents involved in the accreditation process, such as compliance reports for students, faculty, facilities, the library, and administration, and studied the existing problems in tracking, tagging, and duplicating these documents during the accreditation process. (2) Design: the requirement specifications were identified in this stage based on the HEI’s existing problems in tracking, tagging, and duplicating the documents for accreditation and quality assurance; the researchers also designed the process of the intelligent model, which serves as the basis for document labeling. (3) Develop: this stage is intended for the creation of the prototype, which involves processing the documents to identify the proper label for each document. (4) Deploy: the prototype undergoes a test run during this stage. (5) Review: the researchers conduct a checklist function review to check that each component runs properly. Lastly, (6) Launch: the prototype is embedded in the local system of the HEI, together with the maintenance procedure, upon full implementation of the system.


Figure 2. Agile methodology.

Results and discussion

Intelligent model

The results from this intelligent model are used for visualization in the super word vector and the histogram. The super word vector is presented as a word cloud map to visualize the frequency of the words in the corpus, and the histogram presents the relationships of the words per sentence in the form of line graphs. The extracted labels and the generated word vector and histogram are tagged and linked to the uploaded document, following the process shown in Figure 3. This model is implemented in the IQAS to assist categorization and searching in the file repository of accreditation documents.


Figure 3. Process of the intelligent model.

Prototype

The design prototype presented in this section is focused on the label extraction feature for automatic tagging of the archived documents used in accreditation.

Upload and clean

As shown in Figure 4, this phase allows the user to upload and clean the document through tokenization. Once the document is uploaded, the user may set the configuration for cleaning it. The options are removing numbers, symbols, and duplicates; adding and uploading additional stopwords; and showing and downloading the pre-processed data. Other useful features help manage the stopwords, such as showing the list of default stopwords and deleting the added and uploaded stopwords.


Figure 4. Phase I—upload and clean snapshot.
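
A minimal sketch of this cleaning step is given below, assuming regex-based tokenization and a small illustrative stopword list; the prototype’s actual tokenizer, default stopwords, and options may differ.

```python
# Sketch of Phase I cleaning: tokenize, drop numbers and symbols,
# remove stopwords, and optionally de-duplicate tokens.
import re

DEFAULT_STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to"}

def clean(text, extra_stopwords=(), remove_duplicates=False):
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps letters only
    stop = DEFAULT_STOPWORDS | set(extra_stopwords)
    tokens = [t for t in tokens if t not in stop]
    if remove_duplicates:                         # keep first occurrence
        tokens = list(dict.fromkeys(tokens))
    return tokens

print(clean("The faculty conducted 3 research activities in the library."))
# ['faculty', 'conducted', 'research', 'activities', 'library']
```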

Setting up parameters

Phase II is intended to set up the parameters for topic modeling, as presented in Figure 5. Right after uploading and cleaning the document, the user can set the topic modeling parameters to identify and extract the labels. The parameters included are the desired number of topics, the frequency of iteration, the number of words per topic to be generated, the optimization interval, and the model’s name. These parameters are the primary factors in modeling the topics and identifying labels for automatic tagging.


Figure 5. Phase II—setting up parameters snapshot.
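
The parameter set can be pictured as in the sketch below. The field names mirror the UI described above; mapping them onto gensim’s LdaModel is an assumption on our part, and the optimization interval (a MALLET-style option with no direct gensim equivalent) is carried only as metadata.

```python
# Sketch of the Phase II parameters feeding the topic model.
from gensim.models import LdaModel

params = {
    "num_topics": 5,              # desired number of topics
    "iterations": 1000,           # frequency of iteration
    "words_per_topic": 10,        # words to generate per topic
    "optimization_interval": 10,  # MALLET-style option, metadata here
    "model_name": "iqas-accreditation-v1",  # hypothetical name
}

def train(corpus, dictionary, p=params):
    """Train an LDA model from a bag-of-words corpus and dictionary."""
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=p["num_topics"],
                     iterations=p["iterations"])
    return model.print_topics(num_words=p["words_per_topic"])
```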

Extract label

This phase, as shown in Figure 6, provides the results for the corpus produced from the pre-processed document and the parameters set up in the previous phase. It shows the number of documents uploaded, the total number of words in the document, the number of unique words, the vocabulary density, the readability index, the average words per sentence, and, most importantly, the frequent words in the corpus. These frequently used words are extracted to become the labels for automatic tagging later. The user can also set how many of the most frequent words are shown.


Figure 6. Phase III—extract label snapshot.
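
The counting side of these statistics can be sketched as follows, working from the token list produced in Phase I; the readability index, which relies on sentence boundaries, is discussed separately in the computational analysis section.

```python
# Sketch of Phase III corpus statistics: totals, unique words,
# vocabulary density, and the most frequent words (candidate labels).
from collections import Counter

def corpus_stats(tokens, top_n=5):
    total = len(tokens)
    unique = len(set(tokens))
    return {
        "total_words": total,
        "unique_words": unique,
        "vocabulary_density": round(unique / total, 3),
        "frequent_words": Counter(tokens).most_common(top_n),
    }

print(corpus_stats(["faculty", "research", "faculty", "library"]))
# {'total_words': 4, 'unique_words': 3, 'vocabulary_density': 0.75,
#  'frequent_words': [('faculty', 2), ('research', 1), ('library', 1)]}
```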

Word cloud

A word cloud is also generated from the results of phase III. Phase IV, as depicted in Figure 7, is a super word vector view of the frequent words in the processed corpus. The most evident words in the word cloud are the frequently used words from the previous phase: faculty, activities, library, research, and materials. The font size of each word is based on how many times the word is used in the corpus.


Figure 7. Phase IV—word cloud snapshot.
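
A view like Figure 7 could be produced as sketched below, assuming the third-party wordcloud package, which sizes each word by its frequency; the prototype may render its super word vector differently.

```python
# Sketch of a Phase IV word cloud; font size tracks term frequency.
from wordcloud import WordCloud

def render_cloud(tokens, path="cloud.png"):
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(tokens))  # frequencies are counted internally
    wc.to_file(path)
```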

LDA visualization

With the results generated during phase III, this phase provides the histogram presentation of the sample processed corpus with the support of the LDA visualization, as shown in Figure 8. The line graph provides the relative frequencies of each generated label per document segment.


Figure 8. Phase V—LDA visualization snapshot.
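
A line graph of this kind can be reproduced with matplotlib, as sketched below using the faculty row of relative frequencies reported later in Table 2.

```python
# Sketch of the Phase V view: relative frequency per document segment.
import matplotlib.pyplot as plt

segments = range(1, 11)
faculty = [0.000353, 0.004589, 0.001059, 0.001765, 0.000353,
           0.000706, 0.000353, 0.000353, 0.0, 0.000353]  # Table 2 row

plt.plot(segments, faculty, marker="o", label="faculty")
plt.xlabel("Document segment")
plt.ylabel("Relative frequency")
plt.legend()
plt.show()
```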

Auto-tagging to uploaded document

After the five phases, the generated labels (faculty, activities, library, research, and materials) are automatically tagged to the document, as shown in Figure 9. The document is then stored in the IQAS file repository. The uploaded document carries corresponding metadata such as the filename, file size, user, date created, tags, and a link to the processed model. The filename can also be updated, and tags can be added or removed.


Figure 9. Phase VI—auto-tagging of labels in the uploaded document snapshot.
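
The metadata attached to a stored document might look like the sketch below; the field names follow the description above, but the record structure and the example values are illustrative assumptions rather than the IQAS storage schema.

```python
# Sketch of the metadata record kept with an auto-tagged document.
from datetime import date

record = {
    "filename": "area-iv-compliance-report.pdf",  # hypothetical file
    "file_size_kb": 512,
    "user": "accreditation-task-group",
    "date_created": date.today().isoformat(),
    "tags": ["faculty", "activities", "library", "research", "materials"],
    "model_link": "models/iqas-accreditation-v1",  # processed model
}

record["tags"].append("instruction")  # tags stay editable after upload
record["tags"].remove("instruction")
```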

Computational analysis

This section provides a computational analysis of the actual results from the processed document for better understanding.

In reference to the results of phase III, four significant results are evident in Figure 6. Vocabulary density is the ratio of the number of unique words to the total number of words in the corpus (Crane, 2023). To obtain the vocabulary density, the total number of unique words is divided by the total word count; the values used in the sample computation are derived from the results of the LDA algorithm embedded in the prototype, as seen in Figure 4. For the sample computation, see Equation 1.

$$\text{Vocabulary density}\ (VD) = \frac{\text{Number of unique words}\ (UW)}{\text{Total word count}\ (WC)}$$
$$VD = \frac{720}{2{,}833} = 0.254$$

Equation 1. Vocabulary density computation.

The vocabulary density of the processed corpus is 0.254, which implies that the corpus contains complex text with many unique words. Moreover, the readability index and the average words per sentence use Java’s BreakIterator, a locale-sensitive class with an imaginary cursor pointing to the current boundary in a string of natural language text. It handles different kinds of boundaries, such as characters, words, sentences, and potential line breaks. These boundaries are the basis for the readability index and the average words per sentence, which are 16.106 and 21.5, respectively. Frequently used words are identified based on the number of times each word is used in the processed corpus.
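
As a rough stand-in for the BreakIterator step, the sketch below approximates the average words per sentence by splitting on terminal punctuation; unlike BreakIterator, it is not locale-sensitive, so its figures will only approximate the prototype’s.

```python
# Naive average-words-per-sentence computation (BreakIterator analogue).
import re

def avg_words_per_sentence(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

print(avg_words_per_sentence("Faculty met today. The library added materials."))
# 2 sentences with 3 and 4 words -> 3.5
```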

The LDA visualization is presented by relating the relative frequency of each word to the document segments, as shown in Figure 8. To identify the relative frequency, it is necessary to decide on the number of document segments; for this study, the researchers used ten (10) segments. The grouping of words per segment is based on the total word count. The prototype then determines how often a particular word is used per segment, and the identified count is divided by the total word count. For the sample computations (Crane, 2023), see Equations 2 and 3.

$$\text{Words per segment}\ (WS) = \frac{\text{Total word count}\ (WC)}{\text{Desired number of segments}\ (DNS)}$$
$$WS = \frac{2{,}833}{10} = 283.3^{*}$$

* The first seven segments contain 283 words, while the last three segments contain 284 words.

Equation 2. Words per segment computation.

$$\text{Relative frequency}\ (RF) = \frac{\text{Word count per segment}\ (WCS)}{\text{Total word count}\ (WC)}$$
$$RF = \frac{2}{2{,}833} = 0.000706$$

Equation 3. Sample computation for relative frequency (Word: research|2nd Segment).
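
These two computations can be checked directly; the short script below reproduces Equations 2 and 3 from the reported corpus figures.

```python
# Reproducing Equations 2 and 3: 2,833 total words, 10 segments,
# and the word "research" occurring twice in the second segment.
total_words = 2833
segments = 10

words_per_segment = total_words / segments  # Equation 2
rf_research_seg2 = 2 / total_words          # Equation 3

print(round(words_per_segment, 1))  # 283.3
print(round(rf_research_seg2, 6))   # 0.000706
```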

For the overall results of the histogram, Tables 1 and 2 present the word counts and the relative frequencies of the labels per document segment.

Table 1. Word count of labels per document segment.

| Labels     | 1 | 2  | 3 | 4  | 5 | 6  | 7 | 8 | 9 | 10 | Total count |
|------------|---|----|---|----|---|----|---|---|---|----|-------------|
| Faculty    | 1 | 13 | 3 | 5  | 1 | 2  | 1 | 1 | 0 | 1  | 28          |
| Activities | 0 | 3  | 5 | 1  | 6 | 0  | 0 | 2 | 2 | 6  | 25          |
| Library    | 0 | 0  | 0 | 0  | 1 | 14 | 5 | 1 | 0 | 0  | 21          |
| Research   | 0 | 2  | 2 | 12 | 0 | 1  | 0 | 0 | 4 | 0  | 21          |
| Materials  | 3 | 6  | 0 | 3  | 0 | 0  | 0 | 0 | 3 | 3  | 18          |

Table 2. Relative frequency of labels per document segment.

| Labels     | 1        | 2        | 3        | 4        | 5        | 6        | 7        | 8        | 9        | 10       | Total    |
|------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| Faculty    | 0.000353 | 0.004589 | 0.001059 | 0.001765 | 0.000353 | 0.000706 | 0.000353 | 0.000353 | 0        | 0.000353 | 0.009884 |
| Activities | 0        | 0.001059 | 0.001765 | 0.000353 | 0.002118 | 0        | 0        | 0.000706 | 0.000706 | 0.002118 | 0.008825 |
| Library    | 0        | 0        | 0        | 0        | 0.000353 | 0.004942 | 0.001765 | 0.000353 | 0        | 0        | 0.007413 |
| Research   | 0        | 0.000706 | 0.000706 | 0.004236 | 0        | 0.000353 | 0        | 0        | 0.001412 | 0        | 0.007413 |
| Materials  | 0.001059 | 0.002118 | 0        | 0.001059 | 0        | 0        | 0        | 0        | 0.001059 | 0.001059 | 0.006354 |

Conclusions

CSPC is in an exploratory phase when it comes to solving this particular accreditation problem. It is evident that the organization encountered problems pertaining to the accreditation process. Therefore, the researchers devised a model that supports the organization’s accreditation. In addition, the researchers designed a prototype implementing the model to help the organization through the process. As a result, retrieving and classifying the data is easier, which was the main problem of the task group. Furthermore, other text classification patterns may also be integrated into the system, and the results may be compared under given parameters.

Software availability

Software available from: https://github.com/CraigList056/iqas/tree/v1.0.0-alpha

Source code available from: https://github.com/CraigList056/iqas

Archived source code at the time of publication: https://www.doi.org/10.5281/zenodo.7507492

License: MIT License

How to cite this article
Prianes F and Palaoag T. Developing an Application for Document Analysis with Latent Dirichlet Allocation: A Case Study in Integrated Quality Assurance System [version 3; peer review: 1 approved, 1 approved with reservations, 2 not approved]. F1000Research 2024, 12:105 (https://doi.org/10.12688/f1000research.130245.3)

Open Peer Review

Version 3 (published 09 Apr 2024, revised)

Reviewer Report, 29 Jul 2024 (https://doi.org/10.5256/f1000research.164614.r301532)
Sandhya Avasthi, CSE, ABES Engineering College, Ghaziabad, Uttar Pradesh, India
Status: Not Approved

1. The problem is stated clearly, but the use of topic modelling in it is not clear. For example, the research paper didn’t include how many documents were taken for processing.
2. As shown in results, the …

Reviewer Report, 11 Jul 2024 (https://doi.org/10.5256/f1000research.164614.r301531)
Tianbo Ji, Nantong University, Nantong, China
Status: Approved with Reservations

This paper proposes an application of leveraging LDA for document classification and labelling. It is a well-motivated paper as it is specifically designed for the accreditation of a certain academic institution. However, this paper looks like the introduction to a …

Version 2 (published 26 Jan 2024, revised)

Reviewer Report, 28 Mar 2024 (https://doi.org/10.5256/f1000research.161414.r254578)
Zbigniew H. Gontar, SGH Warsaw School of Economics, Warszawa, Poland
Status: Not Approved

The title of the article, "Modeling document labels using Latent Dirichlet Allocation for archived documents in Integrated Quality Assurance System," suggests that it would address the issue of modeling within a defined domain using the Latent Dirichlet Allocation (LDA) method. …

Author Response, 19 Jun 2024
Freddie Prianes, College of Computer Studies, Camarines Sur Polytechnic Colleges, Nabua, 4432, Philippines
Thank you for the feedback. The suggested title has been adopted for this research paper.
Competing Interests: No competing interests were disclosed.

Reviewer Report, 26 Mar 2024 (https://doi.org/10.5256/f1000research.161414.r241034)
Shahid Naseem, Department of Information Sciences, Division of Science & Technology, University of Education, Lahore, Pakistan
Status: Approved

The authors of the said paper have incorporated …

Version 1 (published 27 Jan 2023)

Reviewer Report, 29 Mar 2023 (https://doi.org/10.5256/f1000research.142987.r165836)
Shahid Naseem, Department of Information Sciences, Division of Science & Technology, University of Education, Lahore, Pakistan
Status: Approved with Reservations

By looking at the paper’s overall structure, presentation, and above all the provided contents, I would say the paper requires minor changes to be accepted for indexing.
1. In this study, the indexing of the …

Author Response, 13 Apr 2024
Freddie Prianes, College of Computer Studies, Camarines Sur Polytechnic Colleges, Nabua, 4432, Philippines
We would like to express our sincere appreciation for reviewing our paper. Your comments and suggestions are highly valued. These are our responses:
1. In this study, the indexing of …
