Progress and Challenges for the Application of Machine Learning for Neglected Tropical Diseases

Neglected tropical diseases (NTDs) continue to affect the livelihood of individuals in countries in the Southeast Asia and Western Pacific region. These diseases are long-standing and have caused devastating health problems and economic decline for people in low- and middle-income (developing) countries. An estimated 1.7 billion of the world's population suffer from one or more NTDs annually, which puts approximately one in five individuals at risk. In addition to their health and social impact, NTDs inflict a significant financial burden on patients and their close relatives, and are responsible for billions of dollars in lost revenue from reduced labor productivity in developing countries alone. There is an urgent need to improve control and elimination or eradication efforts against NTDs. This can be achieved by using machine learning tools to improve surveillance, prediction and detection programs, and to combat NTDs through the discovery of new therapeutics against these pathogens. This review surveys the current application of machine learning tools for NTDs and the challenges in advancing the state of the art of NTD surveillance, management, and treatment.


Introduction
Table 1 (fragment). vii. Venom*: 20. Snakebite envenoming. *Newly added disease conditions on the NTD list prior to the outcome of the 10th meeting of the Strategic and Technical Advisory Group for Neglected Tropical Diseases. Source: https://www.who.int/health-topics/neglected-tropical-diseases#tab=tab_1

Disability-adjusted life year (DALY) impact of NTDs
High incidences of NTDs are commonly reported from tropical countries, where humidity and climate are optimal for the pathogens to thrive. In low- and middle-income countries in Africa and Asia, the lack of proper access to clean water and waste management greatly contributes to the spread of NTDs among women and children. To measure the extent of devastation caused by NTDs, the disability-adjusted life year (DALY; one DALY represents the loss of the equivalent of one year of full health) metric was introduced as a means to quantify the overall burden of disease borne by individuals (Mitra & Mawson 2017). DALYs for a disease or health condition are the sum of the years of life lost due to premature mortality (YLLs) and the years of healthy life lost due to disability (YLDs) from prevalent cases of the disease or health condition in a population (Vinkeles Melchers et al. 2021). Based on data collected by WHO, we summarize the global burden of 14 of the 20 NTDs, as estimated by DALYs, in Table 2. The five NTDs with the highest estimated global DALY burden are soil-transmitted helminthiases (STHs) (2.748 million years), rabies (2.635 million years), dengue (1.952 million years), schistosomiasis (1.628 million years), and lymphatic filariasis (LF) (1.616 million years).
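The DALY definition above is a simple sum, and it can help to see it written out. The following is a minimal sketch in Python; the function names and all input numbers are hypothetical illustrations, not WHO or GBD estimates. The rate it computes is a crude rate, whereas the age-standardized rates quoted elsewhere in this review additionally weight age groups and are not reproduced here.

```python
def dalys(yll: float, yld: float) -> float:
    """DALYs = years of life lost (YLL) + years lived with disability (YLD)."""
    if yll < 0 or yld < 0:
        raise ValueError("YLL and YLD must be non-negative")
    return yll + yld


def rate_per_100k(total_dalys: float, population: int) -> float:
    """Express a DALY total as a crude rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_dalys * 100_000 / population


# Hypothetical inputs for one disease in one population (illustration only).
example_yll = 1_200.0          # years lost to premature death
example_yld = 800.0            # years of healthy life lost to disability
example_population = 2_000_000

total = dalys(example_yll, example_yld)
rate = rate_per_100k(total, example_population)
print(f"{total:.0f} DALYs, {rate:.1f} per 100,000")
```

Keeping YLL and YLD as separate inputs makes it easy to see whether a burden is driven mainly by mortality or by disability.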

NTDs in South East Asia (SEA)
The Southeast Asia region consists of 11 countries with tropical climates, of which 10 (all except Timor-Leste) make up the member states of the Association of Southeast Asian Nations (ASEAN) (https://asean.org/about-asean) (Table 3). Based on the DALY estimates by cause and WHO region in 2019, dengue has the highest burden in the WHO South-East Asia Region at 1.510 million DALYs, followed by LF (1.029 million years), STHs (0.616 million years), rabies (0.455 million years), and cysticercosis (0.109 million years). A different trend is seen in the WHO Western Pacific Region, where the leading burden is rabies (0.728 million years), followed by food-borne trematodes (0.721 million years), cysticercosis (0.376 million years), dengue (0.211 million years), and STH (0.222 million years), which complete the top five NTD DALY estimates. We further discuss NTDs of great public health concern in the Southeast Asia and Western Pacific regions, namely dengue, STH, rabies, cysticercosis, LF, and food-borne trematodes.

Dengue
Dengue is a mosquito-borne viral disease transmitted to humans through the bites of infected female mosquitoes mainly of the species Aedes aegypti (Simo et al. 2019). The disease is caused by members of the genus Flavivirus, within the Flaviviridae family (Molyneux 2019). There are four distinct, but closely related serotypes of the virus that causes dengue, namely DENV-1, DENV-2, DENV-3 and DENV-4 (Braack et al. 2018). These viruses are capable of causing illnesses of dengue fever (DF), dengue haemorrhagic fever (DHF), and dengue shock syndrome (DSS) (Wibawa & Satoto 2016). Recovery from dengue infection is believed to provide immunity against that particular serotype with partial and temporary cross-immunity against other serotypes. Hence, the recovered individual is still vulnerable to dengue infection caused by other serotypes, with increased risk of developing severe dengue (Tsheten et al. 2021). Prevalence of dengue was reported in tropical and subtropical areas of Africa, America, Southeast Asia, the Pacific Ocean, and Western Mediterranean (Simo et al. 2019).
The number of dengue cases reported to WHO increased more than 8-fold over the last two decades, from 505,430 cases in 2000 to over 2.4 million in 2010 and 5.2 million in 2019 (World Health Organization 2021). Subsequently, in October 2021 the WHO Regional Technical Advisory Group for dengue and other arbovirus diseases reported that approximately 3.5 billion people live in dengue-endemic countries, of whom 37% (1.3 billion people) live in dengue-endemic areas in 10 countries of the WHO South-East Asia Region. Factors contributing to the widespread distribution of dengue mosquito vectors and viruses include high rates of population growth, inadequate water supply and poor practices, poor sewage and waste management systems, and a surge in global commerce and tourism. GBD 2019 estimated 2.38 million DALYs lost and an age-standardized rate of 32.1 DALYs per 100,000 (95% UI 11.1-44.1) (Abbafati et al. 2020).
Currently, Dengvaxia® (CYD-TDV) is the first licensed dengue vaccine; it has been approved by the FDA and licensed in 20 countries. Dengvaxia® is a live attenuated, recombinant, tetravalent prophylactic viral vaccine that employs the attenuated Yellow Fever virus 17D strain as the replication backbone (Guirakhoo et al. 2001; Tully & Griffiths 2021). Unfortunately, the vaccine has a drawback. When given to seronegative trial recipients (individuals who have never been infected with any of the four dengue serotypes), the vaccinated seronegative group exhibited clinical features of severe dengue similar to those of unvaccinated seropositive groups (people who have had dengue infection in the past) (Thomas & Yoon 2019). Additionally, the vaccinated seronegative group exhibited a higher risk of plasma leakage and severe thrombocytopenia than unvaccinated seronegative trial participants (Sridhar et al. 2018). Because the vaccine induces an immune status that increases the risk of severe dengue at the time of the first natural dengue infection in vaccinated seronegative individuals, WHO does not recommend its use in seronegative individuals but endorses it in seropositive populations, based on the protection observed in vaccine trials (Tian et al. 2022). In clinical trials, the vaccine was efficacious and safe when given to seropositive recipients (World Health Organization 2018). Other than the vaccine, there is no specific antiviral treatment available for dengue illness, and the only approach to control or prevent dengue virus transmission is through interventions targeting mosquito vectors (Simo et al. 2019; World Health Organization 2022a).

Soil-transmitted Helminths (STH)
The Ancient Greek word 'hélmins' means intestinal worm. Helminths are a group of parasitic worms from the phyla Nematoda (roundworms) and Platyhelminthes (flatworms) and represent an infection category in the WHO list of NTDs (Akinsanya B, Adubi Taiwo, Macauley Adedamola 2021). Helminths can be further divided into three major groups, cestodes (tapeworms), nematodes (roundworms), and trematodes (flukes), based on the external and internal morphology of the egg, larval, and adult stages (Castro 1996).
Helminths are among the most widespread infectious agents causing debilitating illness in the human population. Based on DALY burden, the leading cause of human helminthiases in 2019 was soil-transmitted helminths, which typically infect the intestine, followed by schistosomiasis, LF, onchocerciasis, and food-borne trematodes (World Health Organization 2020).
The STH illnesses targeted by WHO comprise three distinct disease conditions, namely ascariasis, trichuriasis, and hookworm disease. Four main nematode species are responsible for these infections: roundworms (Ascaris lumbricoides) cause ascariasis, whipworms (Trichuris trichiura) cause trichuriasis, and anthropophilic hookworms (Necator americanus and Ancylostoma duodenale) cause hookworm disease (Mogaji et al. 2020; Zhu et al. 2020). Humans are infected through contact with parasite eggs or larvae in soil (Brooker, Clements & Bundy 2006), and all STHs parasitize the intestinal tract. Ascariasis and trichuriasis result from ingestion of fecally contaminated food or water, or any other form of fecal-oral transmission (Muñoz-Antoli et al. 2022). Hookworm infection is transmitted primarily by walking barefoot on contaminated soil, in which the larvae mature into a form that can penetrate human skin (Bethony et al. 2006).
Studies have shown that a high prevalence of STH is associated with poor sociodemographic and socioeconomic status, especially in rural areas with poor infrastructure, improper sewage and waste management, inadequate water supply, prolonged direct contact with soil (such as walking barefoot), and poor sanitation and personal hygiene (Alelign, Degarege & Erko 2015; Ali et al. 2020). Anthelmintic medications (drugs that remove parasitic worms from the body), such as albendazole or mebendazole in combination with either oxantel pamoate or ivermectin, are the drugs of choice for treatment regardless of the STH species involved (Casulli 2021). As of 2022, significant reductions in STH infections have been reported through mass drug administration (MDA) as a form of preventive chemotherapy (PC), improved water supplies and sanitation, and hygiene education programs (Zeynudin et al. 2022). GBD 2019 estimated figures similar to WHO GHE 2019, with a DALY burden of 1.970 million years and an age-standardized rate of 26.6 DALYs per 100,000 (95% UI 17-40.5). Similar trends were noted when the burden was broken down into the three respective helminth diseases: ascariasis, with a burden of 754,000 DALYs and an age-standardized rate of 10.4 DALYs per 100,000 (95% UI 6.6-15.6), and trichuriasis, with a burden of 236,000 DALYs and an age-standardized rate of 3.1 DALYs per 100,000 (95% UI 1.7-5.3). Hookworm diseases account for

Rabies
Rabies is a zoonotic disease caused by viruses of the genus Lyssavirus in the family Rhabdoviridae, which can infect all mammals (Condori et al. 2020). The disease has caused high human mortality and economic losses (World Organisation for Animal Health 2008). It is an acute, progressive encephalitis caused by a lyssavirus (Brown et al. 2016). It causes inflammation of the active tissues of the brain, which can lead to headache, stiff neck, sensitivity to light, mental confusion and seizures (Simon et al. 2013).
Bite wounds or entrance of Prior to an infection, the virus enters the eclipse phase and replicates in the surrounding muscle tissues. Transmitted viruses would attach themselves to target cells through G-protein receptors and amplify in muscle tissues (Tsiang et al. 1986). Then, the virus enters peripheral nerves to be transported to the CNS. Once it has disseminated into the CNS, the virus will infect the neurons and distribute further into highly innervated tissues via the peripheral nerves. Presence of the virus can then be found in saliva and cerebrospinal fluid (CSF), nervous tissues, and salivary glands (The Center for Food Security and Public Health 2012). Two forms of rabies disease may follow which are furious rabies and paralytic rabies (Division 2018). Patients with furious rabies are easily diagnosed as they exhibit hyperactivity, excited behavior and hydrophobia, whereas patients

Cysticercosis
Cestode infection involving Taenia solium results in two distinct illnesses. Taeniasis is an intestinal adult tapeworm infection which occurs when one consumes raw or undercooked contaminated pork.
Tapeworm eggs are spread when the carrier defecates in open areas (WHO 2014). Additionally, gravid proglottids, mature segments of the tapeworm that contain T. solium eggs, may detach, migrate to the anus and pass with the feces, and are considered another means of spreading the infection (Jansen et al. 2021). Human taeniasis is often asymptomatic, but mild symptoms may manifest, including abdominal pain, distention, diarrhea, and nausea (García et al. 2003).
Cysticercosis is an infection of both humans and pigs caused by the larval form of the parasite (cysticercus), acquired by consuming food or water contaminated with feces containing T. solium eggs (fecal-oral contamination) (Lustigman et al. 2012). Once ingested, the eggs hatch in the intestine, releasing oncospheres that invade the intestinal wall, enter the bloodstream, and migrate to multiple tissues and organs (muscles, skin, eyes, and central nervous system), where they mature into cysticerci (Galipó et al. 2021). Development of parasitic cysts in the brain or central

Lymphatic filariasis (LF)
LF is caused by a group of helminths (roundworms) from the superfamily Filarioidea that reside in the human lymphatic system (Wibawa & Satoto 2016). The majority of infections worldwide are caused by Wuchereria bancrofti, followed by Brugia malayi and Brugia timori (Mitra & Mawson 2017). A wide range of mosquito species are responsible for the spread of LF, belonging primarily to the genera Anopheles, Culex, Aedes, and Mansonia (Famakinde 2018). As with many mosquito-borne diseases, LF is transmitted when a mosquito picks up microfilariae during a blood meal from an infected individual; the parasite develops within the mosquito and infects the next person at a subsequent blood meal (Douglass et al. 2017).
Disease morbidity results from damage to one's lymphatic vessels by adult parasite nests and microfilaria released in the bloodstream. Functionally impaired lymphatic systems will lead to manifestation of lymphoedema (elephantiasis), hence the enlarged state of the patient's limbs.

Melioidosis
Melioidosis, or Whitmore's disease, is a bacterial infection caused by Burkholderia pseudomallei, a soil- and water-borne Gram-negative bacterium capable of causing illness ranging from an acute or chronic localized infection to a widespread septicemic infection in multiple organs (Cheng & Currie 2005). Cases of melioidosis are frequently reported in endemic regions such as Africa, Australia, China, India, the Middle East, and Southeast Asia (typically Malaysia, Singapore, and Thailand) (Cheng & Currie 2005; Galyov, Brett & Deshazer 2010). Since its discovery in 1912, this bacterium has remained a topic of discussion among researchers because it is zoonotic and therapeutic options are limited, with no vaccine available to date, making the etiological agent capable of causing economic crises through unpredicted outbreaks (Borlee et al. 2017). The United States Federal Select Agent Program designated B. pseudomallei as a Tier 1 agent due to its biothreat potential, including high morbidity and mortality rates at low infectious doses, multidrug antibiotic resistance, and amenability to aerosolization (Hatcher, Muruato & Torres 2015).
Melioidosis can be acquired through many routes, the leading ones being skin inoculation and inhalation or ingestion of contaminated water and air droplets (Larsen & Johnson 2009). Disease severity varies depending on the mode of infection, the bacterial strain, and the host's susceptibility and immunological state (Chakravorty & Heath 2019; Wiersinga et al. 2018; Wolff et al. 2021). Among the many clinical presentations of melioidosis, pulmonary infection confers greater lethality than the non-pneumonic type, and lung involvement was observed in approximately 50% of patients, particularly during the rainy season, due to direct inhalation of bacterial particles. Inhalational infection causes the most severe damage to the host because of rapid dissemination to other vital organs such as the spleen and brain (Cheng et al. 2008; Limmathurotsakul et al. 2005). Melioidosis poses a great occupational risk to farmers, who are prone to minor cuts on their feet and hands and have prolonged exposure to muddy water in paddy fields (Chaowagul et al. 1989). Before the first clinical manifestation of melioidosis, the bacteria may lie dormant for days to years, evading detection by the host immune system while awaiting the opportunity for relapse (Pal et al. 2022). Furthermore, melioidosis mimics the signs and

Leptospirosis
Leptospirosis is a zoonotic disease caused by potentially lethal bacteria of the genus Leptospira (Horwood et al. 2019). In the host, the bacteria reside in the kidney to complete their lifecycle and are then shed in the urine. A molecular serotyping study identified more than 20 Leptospira spp., which can be further divided into three phylogenetic clades: pathogenic, intermediate, and non-pathogenic (Levett 2015).
Various wild and domesticated mammals can act as host reservoirs for Leptospira spp. in cities; rodents are considered one of the most important sources of leptospirosis infection, as they can persistently shed pathogenic Leptospira spp. into the environment throughout their lives without any clinical manifestations (Urbanskas, Karvelienė & Radzijevskaja 2022). Rapid urbanization and population growth, along with poor sanitation and waste management by city councils, promote the prevalence and spread of Leptospira spp. by city rats (Dobigny et al. 2018).
Plantations are a rich source of food for rodents, which favours the presence of rats and thus increases the chance of transmission of Leptospira spp. through urine at places with high human mobility (Garba et al. 2018). Leptospira spp. adapt well both to the environment and to the host reservoir. Andre-Fontaine, Aviat & Thorin (2015) reported Leptospira survival for as long as 10 months at 4°C and up to 20 months at 30°C. Humans can contract the illness through direct contact with Leptospira-contaminated urine, water, and wet soil (Sun, Liu & Yan 2020). Individuals infected with pathogenic Leptospira spp. may be asymptomatic or may present with clinical manifestations ranging from acute febrile illness to severe disease characterized by dysfunction of multiple organs leading to death (Sykes et al. 2022).
Patients exhibit sudden onset of fever, chills, and headache, which mimics the symptoms of other serious diseases such as dengue, influenza, and malaria (Haake & Levett 2015). Patients may recover from these symptoms, but if the disease is left undetected, a second, more severe phase can occur, leading to kidney or liver failure and possibly meningitis (Abdullah et al. 2019). In countries or regions where laboratory confirmation of diagnosis is limited, leptospirosis is underreported and thus neglected. As a result, no accurate data are available from WHO GHE and GBD. To gauge the global burden of leptospirosis, a modelling study estimated approximately 1.03 million cases worldwide annually, of which 5.72% (58,900) result in death (Costa et al. 2015). Torgerson et al. (2015) used those figures to estimate the global burden of leptospirosis at 2.90 million DALYs annually, corresponding to 41.8 DALYs per 100,000 population (UI 18.1-65.5).
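The percentages quoted above can be checked with a few lines of arithmetic. The sketch below, in Python, recomputes the case-fatality proportion from the case and death counts cited from Costa et al. (2015). The implied-population figure is only a back-calculation from the DALY total and rate quoted above, under the assumption that the rate is a crude rate; it is not a number reported by the sources.

```python
# Figures quoted in the text above.
cases_per_year = 1_030_000      # estimated annual cases (Costa et al. 2015)
deaths_per_year = 58_900        # estimated annual deaths (Costa et al. 2015)

# Case fatality as a percentage of estimated cases.
case_fatality_pct = 100 * deaths_per_year / cases_per_year
print(f"case fatality: {case_fatality_pct:.2f}%")

# Back-calculated population consistent with the quoted DALY total and rate
# (an assumption-laden sanity check, not a reported value).
total_dalys = 2_900_000
dalys_per_100k = 41.8
implied_population = total_dalys / dalys_per_100k * 100_000
print(f"implied population: {implied_population / 1e9:.2f} billion")
```

The two quoted burden figures are mutually consistent only if the underlying population is on the order of the world population, which is what the back-calculation suggests.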

Malaria
Malaria is an ancient, life-threatening disease caused by parasites transmitted through the bites of infected female Anopheles mosquitoes (Christophers 1951). The causative agents of malaria are unicellular protozoan parasites of the genus Plasmodium (Sato 2021). Each Plasmodium species causes malaria only in a specific range of hosts; P. falciparum, P. vivax, P. malariae, P. ovale, and P. knowlesi are naturally capable of infecting humans (Lalremruata et al. 2017). Two of these species are a major research focus: P. falciparum is the deadliest and most prevalent malaria parasite on the African continent, whereas P. vivax is the dominant malaria parasite outside sub-Saharan Africa (Larson 2019; Liu et al. 2014).
There were an estimated 241 million cases of malaria in 2020, and the estimated number of malaria deaths stood at 627,000 (Singh et al. 2022). In the same report, nearly half of the world's population was at risk of malaria with the most cases and deaths reported in sub-Saharan Africa.
However, the WHO regions of South-East Asia, Eastern Mediterranean, Western Pacific, and the Americas also report significant numbers of cases and deaths.

Surveillance and Disease Management
The DALYs burden estimates for each of the aforementioned diseases underlines a pressing need for a clear guide and protocol by government authorities and international bodies in reaching elimination targets for these diseases. Here, we review surveillance programmes and disease management actions that are being implemented in response to these neglected diseases.
Dengue surveillance is crucial for detecting outbreaks and monitoring disease incidence. One approach is to increase the number of surveillance traps that capture eggs (ovitraps) and ovipositing females (gravid traps), treated with appropriate larvicide and mosquitocide (Selvarajoo et al. 2022). The treatment prevents eggs from hatching and any subsequent production of mosquitoes inside the trap. This is a two-pronged approach, allowing authorities to survey the incidence and population of mosquitoes while also controlling the vector. However, counting of both traps requires a group of individuals
In summary, nationally representative survey programs suited to the geographical and environmental etiological factors of each country, such as demographic and health surveys (DHS), may offer an appropriate platform for active disease surveillance. In keeping with the proverb "prevention is better than cure", core strategic interventions together with disease management will better help to reduce the prevalence and transmission of these diseases, and at the same time decrease the morbidity and mortality they inflict.

Application of Machine Learning Tools for NTDs.
The conventional approach to drug discovery is costly and time-consuming. Computational approaches to drug discovery using Artificial Intelligence (AI) have emerged as an alternative (Oguike et al. 2022). In this section, we explore the applications of ML tools in developing drugs for a selection of NTDs such as dengue, malaria, and leptospirosis. Next, we discuss how advances in the adjacent fields of protein- and antibody-language models, cancer research and computer vision can be leveraged for NTD research and disease management. Lastly, we discuss steps taken for regional collaboration and for data and infrastructure sharing within and around SEA.
The Dengue Virus (DENV) NS2b-NS3 protease complex is essential for viral replication, making it a good target for antiviral agents. However, available choices of inhibitors during that time were unsatisfactory due to weak activity or a low selectivity index towards the NS3 active site.
The NS3 protease domain is essential for processing the DENV polyprotein during replication. The cofactor NS2b is important for substrate recognition and for maintaining the stability of the NS2b-NS3 complex so that DENV replication can take place.
Aguilera-Pesantes et al. used ML methods to identify potential residues and sites for interaction with drug-like molecules, and bindable sites for drug development. They used four ML models, Random Forest (RF), Least Absolute Deviation tree (LAD Tree), voting feature intervals (VFI), and multilayer perceptron (MLP), to classify their data. They found that MLP models worked best in their study for classifying residues interacting with NS3 into those that would cause a major change in activity, a moderate change in activity, or activity similar to that of wild-type residues
Matthews correlation coefficient (MCC), iAMAP-SCM was reported to achieve scores of 0.957 and 0.834 respectively and outperformed the other three classifiers employed when the model was screened against independent test datasets for validation.
Mswahili et al. (2021) developed and compared five ML models for predicting antimalarial bioactivity against P. falciparum. They trained an artificial neural network (ANN), SVM, RF, extreme gradient boosting (XGB), and LR on a data set of 4,794 antimalarial drug candidate compounds (2,070 active and 2,724 inactive molecules). Two feature selection algorithms were chosen for performance examination and comparison: the filter-based K-best algorithm, which selects features according to a scoring function, and the wrapper-based Recursive Feature Elimination (RFE) algorithm, which treats feature selection as a search problem. K-best was adopted as an accuracy metric whereas RFE was viewed as an efficiency metric. Based on the two metrics, they found that the XGB, ANN, and RF models gave the three best accuracies in finding new antimalarial drug candidates without losing too much precision.
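To make the filter-based idea behind K-best concrete, the following is a minimal sketch in plain Python. It is not the pipeline of Mswahili et al.: the scoring function (absolute Pearson correlation with the activity label), the helper names, and the toy descriptor matrix are all assumptions chosen for illustration. In practice a library routine such as scikit-learn's SelectKBest would normally be used, and the wrapper-based RFE approach is not shown here.

```python
from statistics import mean, pstdev


def abs_correlation(feature, labels):
    """Absolute Pearson correlation between one feature column and 0/1 labels."""
    mx, my = mean(feature), mean(labels)
    sx, sy = pstdev(feature), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information about the label
    cov = mean([(x - mx) * (y - my) for x, y in zip(feature, labels)])
    return abs(cov / (sx * sy))


def select_k_best(X, y, k):
    """Score every column independently and keep the indices of the top k."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical descriptors for four compounds (rows) and their activity labels.
toy_X = [
    [0.1, 1.0, 0.5],
    [0.2, 1.0, 0.1],
    [0.9, 1.0, 0.4],
    [0.8, 1.0, 0.2],
]
toy_y = [0, 0, 1, 1]  # 0 = inactive, 1 = active

selected, scores = select_k_best(toy_X, toy_y, k=1)
print("selected feature indices:", selected)
```

Because each feature is scored on its own, a filter of this kind is cheap but blind to interactions between features, which is the gap a wrapper method such as RFE tries to close at higher computational cost.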
Suratanee and colleagues employed four ML classification algorithms, namely NB, NN, RF, and SVM, to investigate protein-protein interaction (PPI) networks for humans and malarial parasites obtained from the STRING database (version 11.0), with the aim of identifying new human proteins associated with malaria as a basis for additional drug development (Suratanee, Buaboocha & Plaimas 2021). The ML models were trained with a data set of 12,038 human proteins with 313,359 interactions, and 1,787 P. vivax proteins with 11,477 interactions. They constructed a heterogeneous network connecting human-human protein interactions and P. vivax-P. vivax protein interactions with human-P. vivax protein associations, and investigated five topological features: (i) betweenness centrality, (ii) closeness centrality, (iii) degree, (iv) eccentricity, and (v) Kleinberg's hub centrality. They applied ten repetitions of 10-fold cross-validation to each algorithm and evaluated performance by the area under the ROC curve (AUC); RF was the best classifier (AUC of 0.85), followed by NN (AUC of 0.79), SVM (AUC of 0.77), and NB (AUC of 0.74). Using the RF classifier, the authors obtained 411 human proteins through a top-ranking score calculation for each human protein. Subsequent functional annotation of these proteins revealed previously reported promising candidates for multistage targets for malaria therapy.

Leptospirosis
Abdullah et al. (2021) studied the identification of a suitable multiepitope-based vaccine candidate against Leptospira spp. using two ML programs, namely Vaxign-ML and C-ImmSim. In their study, all protein antigens had a protegenicity score greater than 90%, indicating that they are effective antigens for vaccine development, and simulations with C-ImmSim showed diverse immune reactions to the vaccine construct, indicating promising subunits of a multiepitope vaccine candidate for immunity against Leptospira spp. infections.
Vaxign-ML is a supervised ML classification program for reverse vaccinology (RV), trained to predict a rank score (termed protegenicity) for bacterial protective antigens (BPAgs) based on a training data set consisting of viral and bacterial antigens (Ong et al. 2020). In an evaluation using nested 5-fold cross-validation (N5CV) and leave-one-pathogen-out validation (LOPOV), extreme gradient boosting (XGB) performed better than the five other ML algorithms evaluated. Benchmarked against five other existing programs and methods, Vaxign-EGB-ML displayed satisfactory results, outperforming four programs. Final validation on external data sets of clinical trials or licensed vaccines reported ranked calculation of best top 10% BPAg candidates for 20 proteins. Next, C-ImmSim is an immune-simulation server that employs machine learning methods and position-specific scoring matrices to identify epitope peptides and other immune interactions (Rapin et al. 2010). The program combines a mesoscopic-scale simulator of the immune system with a set of agent-based computational models to predict molecular-level major histocompatibility complex-peptide binding interactions, and uses neural networks for epitope prediction.
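The leave-one-pathogen-out validation (LOPOV) used to evaluate Vaxign-ML can be pictured as a grouped data split: every antigen from one pathogen is held out for testing while the model is trained on antigens from all remaining pathogens. The sketch below, in plain Python, shows only the splitting logic; the function name and the pathogen labels are hypothetical, and it does not reproduce the Vaxign-ML code. A comparable grouped split is available in scikit-learn as LeaveOneGroupOut.

```python
from collections import defaultdict


def leave_one_pathogen_out(pathogens):
    """Yield (held_out, train_idx, test_idx), holding out one pathogen at a time."""
    groups = defaultdict(list)
    for idx, name in enumerate(pathogens):
        groups[name].append(idx)
    for held_out in sorted(groups):
        test_idx = groups[held_out]
        train_idx = [i for i, name in enumerate(pathogens) if name != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical pathogen label for each antigen in a toy data set.
antigen_pathogens = [
    "pathogen_A", "pathogen_A", "pathogen_B",
    "pathogen_C", "pathogen_B", "pathogen_C",
]

splits = list(leave_one_pathogen_out(antigen_pathogens))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by pathogen rather than by individual antigen tests whether a model generalizes to an organism it has never seen, which is the situation a reverse vaccinology tool faces when applied to a new pathogen.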
Comparisons with ML application in cancer research, computer vision, protein language models
The majority of studies employing ML models to discover novel drug candidates for NTDs have been published in the past two decades. Similarly, ML has been applied in cancer research since the early 2000s (Bertsimas & Wiberg 2020). Research domains in cancer biology where ML-based methods can be employed include genomics, proteomics, metabolomics, epigenetics, transcriptomics, and systems biology (Kourou et al. 2021; You et al. 2022). ML tools have been developed specifically for the molecular study of cancer, and the application of AI to identifying cancer targets and to drug discovery has been reviewed (Alqahtani 2022; Shao et al. 2022; Taylor 2020; You et al. 2022). Despite the advances in chemotherapy and immunotherapy, early detection of cancer still greatly increases a patient's chance of survival. At later stages, the cancer may have metastasized and spread to vital organs where surgery is no longer feasible, and the prognosis is poor. Thanks to technological innovations, a branch of AI known as computer vision (CV) can significantly lighten the burden on physicians and radiologists when interpreting an MRI scan or histology slide for the presence of a tumor.
In the past six years, an increasing array of ML tools has been developed in response to cancer diagnosis as well. Literature mining on ML-based studies on cancer diagnosis, patients' classification, and prognosis (excluding reviews and technical reports) between 2016 and 2020 in PubMed biomedical repository and Digital Bibliography and Library Project (DBLP) computer science bibliography yielded 921 and 165 studies respectively. Additionally, the total number of articles started with around 25 in 2016 and ending with approximately 625 studies published in 2020 (Kourou et al. 2021). Advances of AI and ML techniques, particularly Deep Learning (DL) which is a subset of ML technique, can be developed to mimic human-like capabilities for data processing to identify images, objects, process languages, improve drug discovery, upgrade precision medicines, improve diagnosis, and assists in decision making with or without human supervision (Davenport & Kalakota 2019). The multi-layered neural network architecture of DL enables models to grow at an alarming rate, provided with abundantly dimensional data (Lecun, Bengio & Hinton 2015). As reviewed by Kourou et al. (2021), most cancer ML-based studies on cancer detection and diagnosis centered around imaging data (input) from computed tomography (CT), magnetic resonance imaging (MRI), X-ray radiography, and positron-emission tomography (PET) to develop DL architectures of automated diagnostic models. Successfully early diagnosis of breast cancer using convolutional neural network (CNN) to analyze histopathological images were reported, with additional validation from other researchers of promising plus accurate diagnostic capabilities of deep CNN architectures by analyzing imaging slides. 
In efforts to develop an image-based lung cancer detection model, a region-based CNN model trained with 42,290 whole-CT lung scans outperformed the average radiologist at malignancy risk prediction and achieved an AUC score greater than 95% when validated with 1,139 clinical cases (Ardila et al. 2019).
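To make the AUC figure concrete, the following is a minimal sketch of how the area under the ROC curve can be computed from model risk scores, using the pairwise (Mann-Whitney) formulation. The function name and the toy labels and scores are hypothetical and are not taken from Ardila et al. (2019).

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: 0/1 ground-truth values (1 = malignant in this toy example).
    scores: model risk scores, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.92, 0.81, 0.40, 0.35, 0.10, 0.55, 0.20, 0.77]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.4f}")
```

An AUC of 1.0 means every positive case is scored above every negative case, while 0.5 corresponds to random ranking. The quadratic loop is chosen for readability; a rank-based implementation would be preferable for large validation sets.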
Advances in the field of AI, particularly ML and DL methods, have been used to develop language models for proteins. These models borrow algorithms developed for natural language processing (NLP). To develop a protein language model (PLM), a large text corpus (protein sequences from large databases) is given as input, and the model is trained to predict masked or missing amino acids (Bepler & Berger 2021; Ofer, Brandes & Linial 2021; Rives et al. 2021). After the sequence information has been processed through multi-dimensional vectors and hidden layers, the resulting vector representations of the proteins are referred to as embeddings (Elnaggar et al. 2021). Embeddings have shown strong performance in predicting secondary structure and subcellular location, comparable to methods that use evolutionary information from multiple sequence alignment (MSA) inputs, as well as in substituting sequence similarity for homology-based annotation transfer and in predicting mutational effects on protein-protein interactions (PPI) (Alley et al. 2019; Heinzinger et al. 2019; Littmann et al. 2021; Stärk et al. 2021; Zhou et al. 2020). Variant Effect Score Prediction without Alignments (VESPA) predicts sequence residue conservation and the effects of single amino acid variants (SAVs) almost as accurately as existing methods (ESM-1v, DeepSequence, and GEMME) without requiring an MSA (Marquet et al. 2022).
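The masked-residue training objective described above can be illustrated with a minimal sketch of how one training example might be prepared. The mask token, the 15% masking fraction (borrowed from common NLP practice), and the example sequence are all assumptions made for illustration; they do not reproduce the preprocessing of any specific PLM.

```python
import random

MASK = "<mask>"  # hypothetical mask token; real PLMs define their own vocabulary


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Return (masked_tokens, targets) for one protein sequence.

    targets maps each masked position to the residue the model must recover.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_masked = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets


# Hypothetical 20-residue sequence used only for illustration.
example = "MKTAYIAKQRQISFVKSHFS"
masked_tokens, targets = mask_sequence(example)
print(masked_tokens)
print(targets)
```

During training, a model would receive `masked_tokens` and be scored on how well it predicts the residues stored in `targets`; the embeddings are a by-product of learning this task.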
PLMs have also been extended to analyze proteins from MSA data. Homologous proteins descend from a common ancestral protein and thus share similar structure and function. Analyzing MSAs of homologous proteins provides valuable information about the function, structure, sequence conservation, and evolutionary history of the gene or organism. The breakthrough in atomic-resolution structure prediction by PLM-based structure prediction models, such as AlphaFold2 (AF2) and RoseTTAFold, relied on MSAs and templates of similar protein structures to achieve optimal prediction performance (Baek et al. 2021; Jumper et al. 2021). Lupo, Sgarbossa and Bitbol (2022) investigated how well MSA-based LMs isolate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations, using a set of synthetic MSAs generated from Potts models. Three attention-based MSA Transformer architectures were studied, namely AlphaFold2 (AF2), RoseTTAFold and RGN2. All programs performed well in distinguishing correlations arising from contacts from those arising from phylogeny; both are always present in natural data, and separating them has proven to be a fundamentally hard problem. The results demonstrated that contact inference by the MSA Transformer was less deteriorated by phylogenetic correlations and more accurate in identifying structural contacts than Potts models, even though the MSA Transformer had been pre-trained on a dataset of minimized diversity.
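As a rough illustration of the kind of information an MSA carries, the sketch below computes per-column conservation (Shannon entropy) and a naive covariation score (mutual information) between two columns. The toy alignment and function names are hypothetical. This raw mutual information does not separate contact-driven from phylogeny-driven correlations, which is precisely the hard problem discussed above; it is not the method used by Potts models or the MSA Transformer.

```python
import math
from collections import Counter


def column_entropy(msa, col):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(seq[col] for seq in msa)
    total = len(msa)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(msa, i, j):
    """Mutual information (bits) between columns i and j, a naive covariation score."""
    total = len(msa)
    pi = Counter(seq[i] for seq in msa)
    pj = Counter(seq[j] for seq in msa)
    pij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        (c / total) * math.log2((c / total) / ((pi[a] / total) * (pj[b] / total)))
        for (a, b), c in pij.items()
    )


# Hypothetical toy alignment: column 0 is conserved, columns 1 and 2 covary.
toy_msa = ["ADK", "ADK", "AER", "AER"]
print(column_entropy(toy_msa, 0))         # conserved column
print(column_entropy(toy_msa, 1))         # variable column
print(mutual_information(toy_msa, 1, 2))  # covarying pair
```

In this toy case the two variable columns change together, so their mutual information equals the entropy of either column, whereas the conserved column carries no covariation signal at all.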
The performance of AF2 in the recent CASP14 prediction challenge showed that combining MSA data with ML-based methods can predict unsolved protein structures with remarkable accuracy, complementing results from experimental methods (X-ray crystallography, cryo-EM, and NMR) (Jumper et al. 2021). It is important to note that this does not mean ML-based structure prediction models can entirely replace experimental methods; rather, they assist in understanding the biophysics of protein folding. To achieve high accuracy, these programs require large and diverse MSA databases from which their ML algorithms learn to predict protein structure based on the co-evolutionary relationships encoded within the MSAs (David et al. 2022). Hence, when tasked with predicting proteins that lack sequence homology data, these programs may perform less well (Pearson 2013). To address this, Chowdhury et al. (2021) developed an end-to-end differentiable recurrent geometric network (RGN2) that predicts structure from single protein sequences without using MSA data. The program employs two novel elements: (i) the AminoBERT PLM, which uses Transformer units to learn latent structural information from millions of unaligned proteins, and (ii) geometric modules representing the Cα backbone geometry. Chowdhury and colleagues reported that RGN2 surpasses AF2 and RoseTTAFold on proteins with no known homologs and is competitive on de novo designed proteins. The absence of an MSA step allows RGN2 to predict protein structure up to six-fold faster than programs that require one.
A subsequent study on the advances of PLMs in ML-based protein structure prediction since AF2 and RoseTTAFold was conducted by Lin et al. (2022). They developed ESMFold, which competes with AF2 and RoseTTAFold in atomic-level structure prediction accuracy using only the individual sequence of a protein. When given a single sequence as input, ESMFold outperforms both AF2 and RoseTTAFold, and it competes closely with RoseTTAFold even when the latter is given full MSA information. ESMFold was also reported to predict faster than the existing programs, which can help keep pace with the ever-growing volume of protein sequence data relative to the slower growth of structural databases. They concluded that the LM employed in ESMFold learns information similar to what AF2 obtains from MSA data, and that the LM contributes significantly to atomic-resolution structure prediction performance on rare proteins.
Turning to the complex quaternary structures of proteins, protein complexes such as antibodies are produced as an immune response to invading pathogens. Antibodies are made up of two pairs of heavy chains (one variable domain and three constant domains) and light chains (one variable domain and one constant domain), with three loops on each variable domain (L1, L2, L3 and H1, H2, H3) (Graves et al. 2020). These loops in the variable domains form the complementarity-determining regions (CDRs) that are crucial in determining the specific binding activity of an antibody (Akbar et al. 2021, 2022; Polonelli et al. 2008). Antibody studies have led to the discovery of a canonical set of structural conformations for five of the six CDRs (L1, L2, L3, H1, and H2). In contrast, no such set was observed for CDR H3, which is highly variable in length and amino acid sequence (Teplyakov et al. 2016). As antibodies represent a unique group of proteins, an antibody language model (ALM) developed specifically for them would be expected to outperform a protein language model trained on a broad range of proteins.
The ALM AbLang outperformed both IMGT germlines and the protein language model ESM-1b in restoring missing residues of antibody sequences, and did so with a faster completion time (Olsen, Moal & Deane 2022). In AbLang, two separate antibody models were trained, one for heavy and another for light chains, on an imbalanced data set of 14 million heavy sequences and 187 thousand light sequences retrieved from the Observed Antibody Space (OAS) database. Olsen and colleagues designed the program to generate three useful representations of antibody sequences, namely (i) res-codings for residue-specific predictions, (ii) seq-codings for sequence-specific predictions, and (iii) likelihoods of amino acids at each position in a given antibody sequence, which are handy for antibody engineering. When information extraction from B-cell sequences was compared between the ALM and ESM-1b, AbLang separated the sequences more finely, clustering them by V-gene into smaller groups. AbLang also achieved a clearer segregation between naïve and memory B-cells. Within the OAS database, more than 40% of the sequences are missing 15 residues at the N-terminus. Using the seq-codings representation to restore missing residues, AbLang performed very similarly to the IMGT germlines without needing any additional germline information.
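The idea of restoring missing residues from per-position amino acid likelihoods can be sketched as follows. This is a simplified illustration under stated assumptions: the likelihood values are made up rather than produced by a trained model, the `*` placeholder and the function name are hypothetical, and the code does not use or reproduce the AbLang interface.

```python
def restore_missing(sequence, position_likelihoods, missing="*"):
    """Fill each missing position with its most likely residue.

    position_likelihoods maps position -> {residue: probability}. In practice
    these would come from a trained model; here they are invented values.
    """
    restored = []
    for pos, residue in enumerate(sequence):
        if residue == missing:
            likelihoods = position_likelihoods[pos]
            residue = max(likelihoods, key=likelihoods.get)
        restored.append(residue)
    return "".join(restored)


# Hypothetical fragment with its first three residues missing.
fragment = "***LVESGGG"
toy_likelihoods = {
    0: {"E": 0.7, "Q": 0.3},
    1: {"V": 0.9, "L": 0.1},
    2: {"Q": 0.8, "K": 0.2},
}
restored = restore_missing(fragment, toy_likelihoods)
print(restored)
```

Taking the single most likely residue at each position is the simplest decoding choice; for antibody engineering one might instead inspect the full distribution to identify plausible alternative residues.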
Another antibody LM, Antibody-specific Bidirectional Encoder Representation from Transformers (AntiBERTa), was reported to outperform two existing PLMs (ProtBERT and Sapiens) and to represent B cell receptors (BCRs) better when compared with ProtBERT, which was assigned a smaller dataset (Choi 2022; Leem et al. 2022). The authors trained a 12-layer transformer model on 57 million human BCR sequences (a biased data set of 42 million heavy and 15 million light chains), based on the RoBERTa architecture, which allows a more direct comparison to be established (Liu et al. 2019). Performance validation showed that AntiBERTa distinguishes naïve and memory B-cells better than the two PLMs. With a self-attention mechanism that provides an informative embedding for each amino acid in the BCR sequence, AntiBERTa focuses on what is functionally important for specific binding, whereas ProtBERT attends to a disulfide bridge conserved across all antibodies. In paratope prediction, AntiBERTa surpassed all of the existing approaches against which it was compared (Parapred, ProABC-2, ProtBERT, and Sapiens). The authors attributed this to changes in self-attention, which allowed AntiBERTa to correctly predict paratope positions at both CDR and non-CDR positions. Interested readers are referred to the design methods of a linguistic-based formalization of the antibody language (Vu et al. 2022).
In summary, advances in cancer, computer vision, and protein language research have been primarily driven by the accumulation of large training datasets and the development of highly sophisticated deep learning architectures. Typically, large attention-based models are trained on datasets on the order of 10^6 to 10^7 data points.
In contrast, datasets in NTD research remain restricted to the scale of 10^2 to 10^3 data points (up to five orders of magnitude lower), and shallow machine learning methods such as RF, SVM, and LR, among others, are more prevalent. The lack of large datasets restricts the widespread application of large deep learning models to the discovery of new NTD therapeutics and thus hampers the potential for efficient management and eradication of these diseases.
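The size of this gap follows directly from the dataset scales quoted in the text, as the short calculation below shows. The function name is hypothetical; the four scales are those given above.

```python
import math


def orders_of_magnitude_gap(large, small):
    """Number of powers of ten separating two dataset sizes."""
    return math.log10(large / small)


# Largest attention-model scale versus smallest NTD scale.
max_gap = orders_of_magnitude_gap(1e7, 1e2)
# Smallest attention-model scale versus largest NTD scale.
min_gap = orders_of_magnitude_gap(1e6, 1e3)
print(f"gap ranges from {min_gap:.0f} to {max_gap:.0f} orders of magnitude")
```

Under these figures, NTD datasets are between three and five orders of magnitude smaller than those used to train large attention-based models.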
On regional collaboration, data, and infrastructure sharing
The Southeast Asian countries are strategically located and exceptionally diverse in culture. Apart from their geographic proximity as Member States of the ASEAN regional association, the countries share densely populated communities, mineral-rich economies that are open to the rest of the globe, and similar tropical and subtropical climates. Hence, when a disease outbreak is reported in any SEA country, the chance of the disease being imported into neighboring countries is very high. There is therefore a need to address the matter through regional collaboration and through data and infrastructure sharing among the SEA countries.
As previously described, the SEA region is endemic for vector-borne diseases (dengue, LF, and malaria) as well as leptospirosis, cysticercosis, and rabies. Controlling these diseases requires up-to-date, robust, and comprehensive information on the presence, species and strain diversity, ecology, and environmental and geographical distribution of the organisms that carry and transmit the infectious agents. One such resource is the Malaria Atlas Project (MAP) (https://malariaatlas.org/), an open-access database and WHO collaborating platform for geospatial disease modeling that projects the spatial limits, prevalence, and endemicity of malaria in all locations around the world. The European Centre for Disease Prevention and Control (ECDC) (https://www.ecdc.europa.eu/), governed by the European Union, provides an open-access database on dengue surveillance, threats, and outbreaks. The database covers almost all NTDs and other diseases of public health concern. The Global Atlas of Helminth Infections (GAHI) (https://www.thiswormyworld.org/) is an open-access database on the geographical distribution of neglected tropical diseases caused by worms: soil-transmitted helminthiasis, schistosomiasis, and lymphatic filariasis. All GAHI resources are available on an open-access basis, but only up to the year 2015.
Another publicly available resource, though of limited geographic relevance to the SEA region, is the WHO-driven Expanded Special Project for the Elimination of NTDs (ESPEN) (https://espen.afro.who.int/), whose portal contains survey data sets of NTDs in Africa only. In response to the neglect of melioidosis, Melioidosis.info (https://www.melioidosis.info/infobox.aspx?pageID=101) serves as an online platform for reporting melioidosis cases and for disseminating information on melioidosis to the public, researchers, and health policy makers. Two other notable and frequently visited databases for health metrics and disease-related data, both mentioned throughout this review, are WHO's Global Health Observatory (GHO) (https://www.who.int/data/gho) and the Global Health Data Exchange (GHDx) (https://ghdx.healthdata.org/), from which interested readers can retrieve the Global Health Estimates (GHE) 2019 data and the Global Burden of Disease Study (GBD) 2019 data, respectively.
The Global Alliance for Rabies Control (GARC) (https://rabiesalliance.org/) is the leading international rabies non-profit organization. It works with international stakeholders, governments, and local partners to raise awareness about rabies, encourage collaboration, and build the evidence needed to increase political commitment and funding to end dog rabies in every country. Its main team of nine members works across three established networks in the effort to end rabies: ARACON (Asian Rabies Control Network), MERACON (Middle East, Eastern Europe, Central Asia and North Africa Rabies Control Network), and PARACON (Pan-African Rabies Control Network). The international body responsible for infrastructure sharing in combating LF is the Global Alliance to Eliminate Lymphatic Filariasis (GAELF) (https://www.gaelf.org/). GAELF is a steering body that brings relevant partners together to support the GPELF established by WHO through the mobilization of political, financial, and technical resources. In response to outbreaks of leptospirosis, the Global Leptospirosis Environment Action Network (GLEAN) (https://sites.google.com/site/gleanlepto/home?authuser=0) was created to reduce the global impact of leptospirosis by improving understanding of the relationship between its occurrence and associated environmental, biological, ecological, economic, and demographic factors, by providing more timely warnings of outbreaks, and by identifying prevention and control strategies.
Data and infrastructure sharing is crucial for NTDs because the scale of publicly available datasets for NTD research is dwarfed by that of other fields such as cancer, computer vision, and protein/antibody research. The governing bodies maintain up-to-date open-access databases, infrastructure, and technical outreach organizations so that every country has the latest disease intelligence and technical skills to carry out effective surveillance, prevention, and disease management under its own leadership. The importance of centralized data sharing at a regional scale was highlighted in a study by Alemu et al. (2022). With access to publicly available standardized survey and treatment coverage data, which had previously been unavailable, probably because the country's Ministry of Health had not reported them to WHO, the authors were able to draw on ample evidence pointing to the advantages of school-based deworming programs and LF MDA campaigns.

Concluding remarks
NTDs affect nearly 2 billion people, especially in countries with developing economies such as those in the SEA and Western Pacific region, causing reduced productivity and a substantial accumulation of disability-adjusted life years (DALYs). Machine learning has been widely applied in fields such as cancer research, computer vision, and protein (language) modeling. However, its application in NTD research is hampered by the limited amount of data, the absence of a centralized and standardized collaborative framework, and a general lack of attention from public and private stakeholders alike. To unleash the full potential of machine learning to elevate the state of the art in NTD surveillance, management, and treatment, increased investment in research funding, public-private collaborative initiatives, and data accumulation and sharing is urgently needed.

Author Contributions
CYK gathered the data, performed the analyses, and wrote the manuscript. RA wrote the manuscript and supervised the work. NMA wrote the manuscript and jointly supervised the work.
awarded to Norfarhan Mohd-Assaad. The APC was partially funded by Universiti Kebangsaan Malaysia (GGPM-2019-042).