Uncovering host-microbiome interactions in global systems with collaborative programming: a novel approach integrating social and data sciences [version 1; peer review: 1 approved with reservations]

Microbiome data are undergoing exponential growth powered by rapid technological advancement. As the scope and depth of microbiome research increase, cross-disciplinary research is urgently needed for interpreting and harnessing the unprecedented data output. However, conventional research settings pose challenges to much-needed interdisciplinary research efforts due to barriers in scientific terminologies, methodology and research-culture

intellectual framework, we set the focus on the topics of microbiome

Introduction
OneHealth Codeathon: Genesis of a working model for applied, interdisciplinary problem-solving
The National Institutes of Health National Center for Biotechnology Information (NIH NCBI) model for codeathons (intensely collaborative, time-limited data workshops that encourage teams of participants to produce software prototypes solving problems related to a common biomedical topic) is an effective avenue for generating software prototypes in the biomedical informatics space. Our previous "Iron Hack" event 1 , centered on rare iron-related diseases, was a transdisciplinary twist on this NCBI model designed to complement and unite local University of South Florida (USF) research programs, inspiring participation from clinicians, genetic counsellors, and researchers from a diversity of biomedical fields at all career stages.
We set out to expand further on the more traditional foundation of codeathons for this year's event, working with the local research community to select challenges that would encourage and more heavily utilize skillsets less traditionally drawn to codeathons (e.g. social science researchers), while also supporting emerging USF research initiatives and addressing wider challenges in biomedical data science. This year's event (dubbed the USF OneHealth Codeathon) therefore focused on the fast-evolving field of host-microbiome interactions, with concepts for our team projects designed around data-centric problems encountered by our interdisciplinary participants in their research and practice. The event took place on USF's Tampa campus on February 26-28, 2020.
As a result of these intense collaborative efforts, teams developed resources that are relevant not only to microbiome studies, but also general bioinformatics problems. The objective of this report is to demonstrate the utility of a codeathon model to rapidly develop tools for human and environmental health research, with the added community-building benefits of (1) providing opportunities for meaningful, long-term, cross-departmental interactions that stimulate collaborations and creative project design, and (2) offering in-depth exposure to applied data-science for members of traditionally less-computational fields.

Critical gaps OneHealth Codeathon projects sought to address
We addressed challenges related to the host microbiome, including the great need for novel genomics tools to handle large, recently generated heterogeneous microbiome datasets. We established six OneHealth Codeathon teams to develop six computational-tool prototypes broadly focused on (1) power calculation for microbiome study design, (2) geographical information systems-analysis of microbiome data and associated risk factors, (3) mining archaeological microbiome data, and (4) searching for ecological drivers of earth microbiomes (Figure 1). These team efforts have led to the convergence of social science, ecology and medical communities with genomics data-science researchers to produce promising computational tools, strengthened through an iterative process of soliciting ideas and feedback from domain experts.
The remainder of this report is organized into subsections by project, beginning with a detailed description for the six projects, the motivations behind them, and the gaps they seek to fill. We next describe the methodologies and implementations of the projects into usable software applications, how to operate the software applications, and results produced using the software applications. Finally, we discuss the pros and cons of this new highly interdisciplinary and community-driven twist on more traditional hackathons.

Team 1 - MicroPower Plus
Project title: Microbiome power-calculation tool for biologists: towards rigorous, reproducible microbiome study-design
Rationale: Measured differences between sample groups can result from any number of experimental artifacts not reflective of actual biology, including differing definitions of what a clinical population signifies within different studies, how samples are prepared, and analytical decisions (e.g., bioinformatic and statistical tool-selection, parameter-settings 2-4 ). Statistical power calculations are a key part of quality study-design, informing the sample size required to have sufficient statistical power to detect differences between experimental groups. The size of this difference between groups (the effect size) should also be taken into account during experimental planning; smaller effects are more easily obscured by experimental noise. Sufficiently powered studies are critical for robust biological conclusions, and funding agencies increasingly require power and sample-size analyses to consider applications for support.
R-based software packages that enable power analyses, modeling relationships between sample size and detectable effect size using PERMANOVA-based methods, have been developed to estimate required samples for microbiome experimental design 5 , given input data from pilot studies. These handy tools are not generally accessible to biologists with limited computational experience and/or a more cursory grasp of statistics. We sought to build on these methods to create a more intuitive calculator/guide for biologists, who often need only a quick point-and-click reference for experimental planning.
Goal: To provide an intuitive power- and effect-size calculator-tool for biologists with limited computational experience.

Data-sources and processing
Predicted effect sizes detectable at a range of sample sizes and power-levels were precomputed on OTU tables from a variety of human body-site datasets from the Human Microbiome Project (HMP) using the R package micropower (v0.4) (Jaccard distance method) 5,6 . We used these precomputed data as a reference for quick and interactive power calculations for commonly used sample sizes by body-site.
We added additional functionality for calculating the effect size of the experimental intervention given a control group vs. an experimental group using linear modeling. Our tool computes the Bray-Curtis distance between all samples, then uses the Adonis function from the vegan package (v2.5.6) to calculate the correlation parameter Pearson's R 7 .
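The effect-size calculation described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it is not the MicroPower Plus code. It assumes that the reported statistic corresponds to the R² (between-group sum of squares divided by total sum of squares) that a PERMANOVA partitions from a distance matrix; the function names `bray_curtis` and `permanova_r2` are hypothetical.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def permanova_r2(samples, groups):
    """Share of total distance-based variation explained by group labels.

    samples: list of abundance vectors (one per sample)
    groups:  list of group labels, same order as samples
    Assumption: R^2 = SS_between / SS_total, with sums of squares
    computed from squared pairwise Bray-Curtis dissimilarities.
    """
    n = len(samples)
    d2 = {(i, j): bray_curtis(samples[i], samples[j]) ** 2
          for i, j in combinations(range(n), 2)}
    ss_total = sum(d2.values()) / n
    if ss_total == 0:
        return 0.0  # all samples identical: nothing to explain
    ss_within = 0.0
    for g in set(groups):
        idx = [i for i in range(n) if groups[i] == g]  # ascending order
        ss_within += sum(d2[(i, j)] for i, j in combinations(idx, 2)) / len(idx)
    return (ss_total - ss_within) / ss_total


# Toy data (hypothetical): two groups with no shared taxa
control = [[10, 0], [10, 0]]
treated = [[0, 10], [0, 10]]
r2 = permanova_r2(control + treated, ["ctrl", "ctrl", "trt", "trt"])
```

In this toy case the groups are perfectly separated, so all of the variation is between groups. A real analysis would also need the permutation test that a PERMANOVA uses to attach a p-value, which this sketch omits.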
A conceptual overview of MicroPower Plus functionality is provided in Figure 2.

Operation and Implementation
The MicroPower Plus 8 workflow is implemented in a user-friendly R-Shiny web application. RStudio and the R packages shiny, plotly and tidyverse are required to operate MicroPower Plus 9-12 . Further documentation and a tutorial are available at the GitHub repository as listed in the code-availability section.
After installation of required packages, all necessary tutorial files can be downloaded from GitHub onto the user's local computer, and MicroPowerPlus can be launched by opening the "app.R" file in RStudio.

Use cases
MicroPower Plus 8 is most useful as a statistical reference guide for biologists to make quick calculations to aid in experimental design of microbiome studies. As a proof of concept, we built a user interface with R Shiny 10 around the human gut microbiome reference dataset that allows the user to visualize the relationship between sample size, effect size and statistical power. The resulting effect size is reported as a bar graph, alongside effect sizes reported in the literature for comparison. We created an additional tool that allows users to input their own data, calculate the effect size from their experiment and report it as a bar graph. Future iterations of this tool will include interactive visualizations for the pre-computed reference data from other body sites.
The provided tutorial walks the user through an example power calculation (Figure 3) and effect size calculation (Figure 4) using the pre-computed human gut microbiome datasets.

Project title: Environmental Chemicals: Impact on Human Microbiomes
Rationale: Environmental exposures to chemicals have been a public health concern due to the ubiquitous nature of their effects on human health and the environment. Industries and manufacturing sectors contribute to chemical exposures by releasing these chemicals into the environment. Chemicals commonly found in commercial products, such as heavy metals and chlorinated hydrocarbon solvents, can persist in the environment for extended periods, increasing the latency of exposure 13 .

Figure 1. Scope of human holobiont interactions with microbiomes in various contexts explored through USF's OneHealth Codeathon. Two teams (Teams MicroPower Plus and Zero) focused on developing practical computational tools for microbiome study design and data analysis. Four teams (Teams Geo, Animal, Track and Yolo) focused on exploring different aspects of host-microbiome interactions, from environmental consequences to clinical presentations.
Input data are OTU or ASV tables selected from curated, published microbiome studies of various human body sites, from which effect size has been pre-calculated for several common sample sizes using complementary methodologies. The user can then use the interactive, graphical output to explore the relationships between effect size, sample size and statistical power as a quick reference for their own experimental planning.
The user selects the sample type, the sample size for each group and a distance measure. When the user moves the power slider, the estimated effect-size graph (red) changes to the minimum effect size required to attain the given power level. The gray bars reference effect sizes calculated from the indicated sources. By comparing the estimated effect size to the reference effect sizes, the user can get a sense of how large a difference would have to be between their samples to detect significance using different experimental designs.
A lack of information led to relatively few rules for handling and disposing of chemicals in the first part of the 20th century, which resulted in the random release of these hazardous chemicals and toxins into the environment. Knowledge of toxic waste dumps and their associated human health and environmental health consequences received national attention in the late 1970s 14 . In response to public outcry, Congress created "Superfund" in the 1980s to fund toxic waste clean-up at industrial sites 14,15 . Superfund sites require long-term remediation efforts, and sites are evaluated for eligibility on a point-based system requiring a preliminary assessment and site inspection (known as the Hazard Ranking System, or HRS) 16 . Reporting from the public or an agency is also considered in assessing whether a site qualifies. Superfund sites are prioritized by HRS score onto the National Priority List (NPL) 16 . Currently there are 1335 NPL sites around the U.S., each with specific chemical contaminants.
Human exposure to toxic chemicals has been shown to elicit different effects depending on the host's immune response, with long-term exposures associating with a range of serious maladies varying from cancers acting on various bodily tissues to neurological effects 17 . The gut microbiome is hypothesized to have a unique role in enhancing and maintaining host health through the microbiome-gut-brain axis and can impact endocrine, immunological and nutrient signals 18 . Microbiome dysbiosis can occur with exposure to toxic environmental contaminants via ingestion or inhalation and can lead to several chronic conditions. Due to its diverse functions in the body, the gut microbiome acts as an indicator for health, and there is a growing body of literature exploring the interactions of environmental contaminants with the host microbiome 13,17,18 .
Environmental contaminants present in Superfund sites around the U.S. can significantly affect the health of the population in the surrounding areas. To illustrate this effect, we created a tool for visualizing the impact of environmental toxicants on the gut microbiome.
Goals: 1) To illustrate the trends of environmental chemical exposures from U.S. Superfund sites over time. 2) To create a tool for visualizing the impact of exposure to environmental chemicals on the gut microbiome around the U.S.

Implementation and Operation
Data-sources and processing: We processed and combined datasets from the American Gut Project (AGP), census data, and EPA Superfund data to search for informative patterns using the R package phyloseq 1.30.0 19 . We identified the most abundant taxa by Superfund site/geographic location. We then performed basic association analyses to assess relationships between abundant/rare taxa, various Superfund sites and contaminants. Archived code is available; see Software availability 20 .

1) American Gut Project data:
The American Gut Project (AGP) is a large-scale, crowdsourced project (n = 29,778) of microbial sequence data with the aim of characterizing the human gut microbiome, including associated mitigating factors ranging from diet and lifestyle to overall health and the broader environment. The metadata file obtained from AGP sample information (file 04-meta) was reduced to responses from participants within the United States only. Variables previously found in published studies to be associated with differential phenotypes mediated by air pollution in microbial communities were also selected and included in subsequent testing for associations with Superfund-site proximity.
2) Superfund data: Superfund sites and associated contamination data for current NPL sites were retrieved from EPA data using the R package superfundr (v0.0.0.9000). The data were prepared and transformed using Statistical Analysis Software (SAS v 9.4, Cary, NC). We focused on 10 priority chemicals listed by the EPA.
3) Census data: Select data from the American Community Survey (ACS) were downloaded from the U.S. Census Bureau: American FactFinder website via the download center (U.S. Census Bureau, 2020). This population-based data source provides descriptive socio-demographic data (e.g., sex, race, ethnicity, economic indicators, etc.) by zip code across the nation. Once all datasets were downloaded for each variable, all variables were then merged by a linking variable (i.e., zip code) that each dataset had in common. After data-cleaning, percentages were calculated for each variable. All data-cleaning was conducted using Statistical Analysis Software (SAS v 9.4).
Loading and filtering OTU tables was memory-intensive, as the initial dataset is very large. Initial attempts to load the OTU table on a laptop with 16 GB of memory were unsuccessful. To solve the problem, we performed this filtering on a high-performance computing cluster with 180 GB of memory.
Merging data across disparate datasets: Several distinct datasets across the AGP, Superfund, and ACS provided unique information connected only by geographic location and could be merged by an appropriate linking variable (e.g., zip code). Data from all three sources were combined for a total of ~1000 samples. We further reduced the dataset to only samples that were directly related to the gut for downstream prediction using machine learning approaches.
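The zip-code linkage described above can be sketched as a simple inner join. This is an illustration in plain Python, not the SAS/R workflow the team used; the function name `merge_by_zip` and every field name (`zip`, `contaminant`, `sample_id`, `pct_female`) are hypothetical placeholders.

```python
def merge_by_zip(agp_rows, superfund_rows, acs_rows):
    """Inner-join three lists of dict records on a shared 'zip' field.

    Zip codes are kept as strings so that leading zeros survive.
    Only samples whose zip code appears in all three sources are kept.
    """
    acs_by_zip = {row["zip"]: row for row in acs_rows}

    # One zip code may contain several Superfund sites and contaminants
    contaminants_by_zip = {}
    for row in superfund_rows:
        contaminants_by_zip.setdefault(row["zip"], set()).add(row["contaminant"])

    merged = []
    for sample in agp_rows:
        z = sample["zip"]
        if z in acs_by_zip and z in contaminants_by_zip:
            record = {**sample, **acs_by_zip[z]}
            record["contaminants"] = sorted(contaminants_by_zip[z])
            merged.append(record)
    return merged


# Toy records (hypothetical values)
agp = [{"sample_id": "S1", "zip": "02139"}, {"sample_id": "S2", "zip": "33620"}]
superfund = [{"zip": "33620", "contaminant": "PCB"},
             {"zip": "33620", "contaminant": "PAH"}]
acs = [{"zip": "33620", "pct_female": 51.2}, {"zip": "02139", "pct_female": 50.1}]
combined = merge_by_zip(agp, superfund, acs)
```

An inner join of this kind discards every sample without a match in all sources, which is one plausible reason a combined dataset can end up much smaller than its inputs.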
ArcMap version 10.7 (2020) was used to create choropleth maps from the combined ACS and Superfund datasets to evaluate the association of chemicals found at EPA Superfund sites with select population-based socio-demographic data by zip code over time. QGIS, an open-source geographic information system from the Open Source Geospatial Foundation (http://qgis.org), can be used for the same work.
Machine learning analysis on data collected from individuals near Superfund sites: We selected individuals that were self-identified to be within 5 km of Superfund sites from the final combined dataset. We next performed a classification analysis using random forests implemented via the R package ranger 21 .
For each contaminant, we classified each individual as exposed or unexposed based on their proximity to a Superfund site with that contaminant. We then performed 10-fold cross-validation and reported the accuracy of the most and least informative contaminants with regard to the microbiome.
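The exposed/unexposed labelling can be sketched as a distance check. The source does not state how proximity was computed, so the great-circle (haversine) distance used here is an assumption, and the function names, record fields and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_exposed(person, sites, contaminant, radius_km=5.0):
    """True if any site listing the contaminant lies within radius_km."""
    return any(
        contaminant in site["contaminants"]
        and haversine_km(person["lat"], person["lon"], site["lat"], site["lon"]) <= radius_km
        for site in sites
    )


# Toy example (hypothetical coordinates)
sites = [{"lat": 28.00, "lon": -82.40, "contaminants": {"PCB", "lead"}}]
near = {"lat": 28.04, "lon": -82.40}   # roughly 4.4 km north of the site
far = {"lat": 28.05, "lon": -82.40}    # roughly 5.6 km north of the site
labels = [is_exposed(p, sites, "PCB") for p in (near, far)]
```

The resulting binary label per contaminant is the kind of variable a random forest classifier, such as the one fitted with ranger in the text, would then relate to microbiome composition.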

Results
Geographic distribution of select Superfund-site contaminants and abundance of Bacteroidetes OTUs are shown in Figure 5. We next explored a potential relationship between abundance of this bacterial phylum and individual contaminants, and further possible predictive efficacy of contaminants for certain OTUs, using proof-of-concept modeling. We restricted samples to those within 5 km of a Superfund site for these analyses. We constructed a random forest using each contaminant as a binary predictor variable. We found a strong relationship between several contaminants and microbial composition. The two most predictive contaminants were polycyclic aromatic hydrocarbons (PAH; 94% accuracy) and polychlorinated biphenyls (PCB; 81% accuracy). The contaminant with the lowest accuracy was lead (60%).
It is worth noting that PAH are known to bio-amplify as they go through food-webs. Other health outcomes linked to PAH exposure are various forms of cancer, as well as developmental impacts. PCB have been banned in the manufacturing process since 1979, yet they do not readily break down and remain a hazard over long periods of time. Because of these properties, they are commonly listed as Superfund contaminants of high concern.
In conclusion, we found that for several contaminants the microbial composition varied significantly among individuals living near Superfund sites with high or low levels of PAH and PCB, respectively.

Team 3 - ZERO
Project title: Creating a web app to study human gut microbiome variation across geographic regions of the world
Project Rationales, Descriptions and Goals

Rationale:
The human gut microbiome is one of the sites in the human body most densely populated by bacteria. It performs numerous functions, and its dysbiosis has been associated with several diseases. A major goal of microbiome researchers has been to understand the diversity of the gut microbiome across human populations. Although several studies have been undertaken for this purpose, these studies are limited in scope and comparative ability. Therefore, the rationale of the present work was to create a web tool equipped with reference databases, populations and the necessary scripts for users to upload, analyze and visualize their own microbiome data on the server, with additional options to compare against the reference populations. Results can subsequently be downloaded by the user. Finally, all the reference population data are to be made available for download, along with the scripts needed to run the program on a local computer without uploading raw data. Such a tool will be extremely useful to interdisciplinary researchers who have microbiome-related research questions but are not experienced in writing code or handling large microbiome datasets, or who do not have access to advanced computational facilities. The code, instructions and guidelines are available through a GitHub repository. The flowchart summarizing the approach is provided in Figure 6.
Figure 6. Proposed Team Zero web-app workflow. Users will be able to upload fastq files for analyses and choose reference datasets for comparison. The built-in pipeline will then generate the Amplicon Sequence Variants (ASVs), from which the most informative for differentiating populations will be chosen using a Gaussian-Mixture EM algorithm followed by unsupervised K-means clustering. Heatmaps and PCA plots describing the data will be generated and made available for download.
Goals: 1) To download raw microbiome data (V4 region of 16S rRNA gene) from various world populations and generate an amplicon sequence variant (ASV) table for comparison purposes. 2) Construct simple but informative plots, such as heatmaps and principal component analysis (PCA) plots, to visualize relationships/patterns in the data through the proposed web app.
3) Provide all raw sequencing data, bash scripts and R scripts to run all steps of the analyses, as well as appropriate documentation and guidelines for an easy and error-free run of the pipeline on the user's local computer.

Data sources and processing
We first mined microbiome data from various world populations by geographical region. We narrowed our focus to studies on the human gut microbiome involving the V4 region of the 16S rRNA gene. A total of 1428 samples meeting these criteria, spanning populations from China, the Indian subcontinent (Himalayan region), Brazil and Europe, were incorporated. Raw data were downloaded from the European Nucleotide Archive (Accessions: China, PRJNA396815; Indian subcontinent, PRJEB29137; Europe, PRJNA497734; Brazil, PRJEB19103) (Table 1).
Despite this initial filtration step, analysis time was still estimated to be too high to move forward under Codeathon time restrictions. Thus, in a second step to reduce data volume, 5000 sequences were subsampled using Seqtk 1.3-r115-dirty 22 from each of the forward and reverse fastq files for each of the samples. All downstream analyses were based on the subsampled reads. The fastq files were analyzed using the standard DADA2 1.14.1 pipeline 23 to generate the distribution of ASVs observed in this dataset. The corresponding classification of each ASV was obtained using the Silva database (v132) 24 . The bacterial count table was then used for downstream analysis.
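The subsampling step can be illustrated with a short sketch in plain Python. It is not a reimplementation of Seqtk; it only shows why the same random selection must be applied to the forward and reverse files so that read pairs stay matched (Seqtk achieves this when both files are sampled with the same seed). The function names and the seed value are hypothetical.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield tuple(line.rstrip("\n") for line in record)


def subsample_pairs(fwd_lines, rev_lines, k=5000, seed=11):
    """Pick the same k record positions from paired FASTQ inputs.

    Loads both inputs into memory, which is acceptable for a sketch
    but not for very large files.
    """
    fwd = list(read_fastq(fwd_lines))
    rev = list(read_fastq(rev_lines))
    if len(fwd) != len(rev):
        raise ValueError("forward and reverse files differ in record count")
    keep = sorted(random.Random(seed).sample(range(len(fwd)), min(k, len(fwd))))
    return [fwd[i] for i in keep], [rev[i] for i in keep]


# Toy paired input (hypothetical reads)
fwd_lines = [x for i in range(10) for x in (f"@read{i}/1\n", "ACGT\n", "+\n", "IIII\n")]
rev_lines = [x for i in range(10) for x in (f"@read{i}/2\n", "TGCA\n", "+\n", "IIII\n")]
fwd_sub, rev_sub = subsample_pairs(fwd_lines, rev_lines, k=3)
```

With real data, the two arguments would be open file handles for the forward and reverse fastq files of one sample.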
The resulting ASV table contained 1,428 samples with 2,655 bacterial taxa. Considering the very sparse data in the ASV table (only 1.231% of ASV elements exhibit reads numbers > 0), we used a Gaussian-mixture model to remove the bacteria with lower reads-coverage. A total of 1,783 taxa were removed and the remaining ASV table was normalized for each sample by the proportion of reads in each taxon using orders-of-magnitude multipliers (1-e 8 ). The distribution of standard deviation in reads-number was calculated, and taxa at the tail-ends of the distribution were eliminated, leaving 237 taxa. Similarly, individual samples at the extreme low-end of the reads-number distribution (365 samples) were also removed using the Gaussian-mixture model. Unfortunately, all Chinese-population samples were eliminated during this step, and all downstream analyses were performed only on the populations from Europe, Brazil and the Indian subcontinent.

Modeling relationships between population and bacterial taxa
We used the resulting filtered dataset to perform K-means clustering to determine the optimal number of categories, finding k=18 to be most informative for the data. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) were utilized to measure model robustness.
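For readers unfamiliar with K-means, the clustering step can be sketched with Lloyd's algorithm in plain Python. This is an illustration only: the team's workflow used TensorFlow and scored candidate models with AIC and BIC, whereas this sketch reports the within-cluster sum of squares as a simpler stand-in. The function names, the seed and the toy points are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = [None] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy data (hypothetical): two well-separated groups
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
labels, centroids, inertia = kmeans(points, k=2)
```

To choose the number of clusters, one would repeat the fit over a range of k and compare a model-selection score; the inertia alone always decreases as k grows, which is why penalized criteria such as AIC and BIC are used instead.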

Operation and implementation
We incorporated a set of unsupervised machine learning back-end computational methods to investigate the datasets for encoded geographical information. We used Python v3.6.9 along with the Django web framework and conda 4.7.12 to build our workflow. The machine learning components of the workflow, which identify ASVs distinguishing populations by geography, are implemented using TensorFlow 2 25 . Data preprocessing and data visualization are handled by R scripts (see Implementation and Software Availability).
Herein we implemented a web-based application 26 for the deposition and rapid analysis of microbiome data. Importantly, users are able to (1) download a prepared database along with the server source code, or (2) construct their own database for analysis. The web-based application source code, the preprocessing and data visualization scripts, and instructions for their usage are available online as listed in the Software availability section.

Results
The unsupervised classification algorithm indicates strong bacterial association with geographic populations
Our k-means parameter exploration indicated 18 classes within the sample ASV data. The result indicates that at least one or two bacterial groups are enriched for each class (Figure 7A). Classification further indicated differences in community composition by geographical location (Figure 7B). We performed a PCA to further characterize the relationship between sample categories detected via clustering. We found that the samples from classes 1, 6, 9 and 14 form clearly distinct clusters from each other (Figure 7C), further indicative of underlying geographic patterns. We identified important bacterial taxa contributing to sample classification (Figure 8) and plotted the relative contribution of each ASV (classified up to genus level) driving ordination (Figure 9). Differential relative abundance of these ASVs across all geographic populations indicated distinct geographical patterns, with several ASVs strongly associating with Indian, Brazilian, or (to a lesser extent) European populations (Figure 9). The classifications of the ASVs corresponding to Figure 9 are provided in Table 2.

Conclusions
Our work was aimed at creating a web app to study the geographical patterns of the human microbiome and selecting features which could be useful to distinguish the populations. Using publicly available resources, we were able to include different geographical populations and select features to identify differences across groups. The resources for our study are deposited in our GitHub repository (see Software availability). Limitations of this study include that factors such as age, gender and other participant phenotypes which could be contributing to geographical patterns were not included in these analyses. However, we were able to create a web-app prototype for identifying features from the composition of the human gut microbiome related to geographical population. In the future, this work can be extended to include other variable regions of the 16S rRNA gene, as well as including other body sites such as the oral cavity, skin, etc. Similarly, batch-effect correction-tools need to be incorporated for unbiased comparison across different studies.

Team 4 - YOLO
Project title: A web-based machine learning pipeline for disease prediction using microbial data
Table 2. Classification of ASVs displaying highest geographical patterns as shown in Figure 9. Classifications were obtained only up to genus level since the studied region was limited to V4 of the 16S rRNA gene. When two ASVs were affiliated with the same genus, they were distinguished by adding a serial number as a suffix. For example, Bacteroides_1 and Bacteroides_2 belong to the same genus.

Project Rationales, Descriptions and Goals
Rationale: High-throughput sequencing technologies have generated an increasing amount of microbial data, such as microbiome data. Applied to these data, machine learning methods are powerful tools for identifying functionally active microbes and predicting disease status. Although machine learning algorithms are popular approaches for investigating the microbiome, adopting these methods effectively usually requires specialized training. In addition, model selection and hyperparameter tuning can be time-consuming even for experienced practitioners. Thus, our project focused on using AI to solve big-data problems efficiently, freeing researchers for other cognitively demanding tasks, by developing a GUI-based pipeline for training machine learning algorithms on taxonomic microbiome data. Our pipeline expands access to computational tools for researchers in non-computational disciplines to improve cross-disciplinary study. As a proof of concept, we successfully utilized our pipeline to train a predictive algorithm for obesity based upon operational taxonomic units (OTUs), which may be applied toward generating health-related features from clinical, historical, or forensic samples. Our code utilizes three methods, K-nearest neighbors (KNN), support vector machine (SVM), and adaptive boosting (AdaBoost), which achieved accuracies near 84%, 91%, and 86%, respectively. Both KNN and SVM used 10-fold cross-validation to prevent overtraining. Under this method, training completed near-instantaneously on a 16 GB MacBook, demonstrating feasibility. Outputs are processed into interactive graphical visualizations to improve ease of use. Although previous projects have applied these computational techniques to microbiome data, our pipeline removes barriers to use for researchers without coding backgrounds while streamlining efficiency for all users.
Studies have revealed significant diversity in gut microbiome composition related to various phenotypes. Obesity has been associated with changes in the microbiota at phylum level, reduction in bacterial diversity, and different representations of bacterial genes. For example, studies of lean and obese mice suggest a strong relationship between gut microbiome and obesity. Phylogenetic marker genes uncovered by 16S rRNA gene amplicon sequencing have revolutionized the field of microbial ecology. This PCR-based method has the advantage of identifying difficult-to-culture bacterial organisms. Various bioinformatic pipelines can then group these sequences into clusters called OTUs. OTUs are based on sequence similarity to each other rather than to a reference taxonomic dataset, which may be biased towards existing taxonomic classification 27 .

Goals:
We were interested in finding out whether there is an association between gut microbiome OTUs and obesity. Additionally, we wanted to be able to use these data to distinguish between lean, overweight, and obese phenotypes in humans. We successfully developed a machine-learning based pipeline that shows the association between gut microbiome OTUs and obesity with high accuracy. Furthermore, this pipeline can predict whether sample OTU data come from a lean, overweight, or obese human phenotype. Our work is significant because a heavy coding background is not required to use high-accuracy machine learning tools.

Data preprocessing
To develop our pipeline 28 , sample microbiome data was retrieved from 29. First, we cleaned the data by removing duplicate entries, leaving 151 samples. Second, to deal with the sparsity of OTU count data, we added a small random positive number to all 0 entries. Third, data was normalized using the centered log-ratio (CLR) transformation 30 . Dimensionality reduction was then performed. We chose the Max-min Markov Blanket method to recursively select a small subset of features that are important to the outcome of interest (obese or lean in this case). A total of 10 highly informative OTUs were selected during this process, and various machine learning methods were explored based on a recent review article 31 .
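The zero-replacement and CLR steps can be sketched as follows. The CLR of a composition is each component's logarithm minus the mean logarithm across components, i.e. the log of each value divided by the sample's geometric mean. This is an illustrative plain-Python version, not the pipeline's R code; the function name `clr` and the pseudocount range are assumptions, since the source does not state the range used.

```python
import math
import random


def clr(counts, rng=None, max_pseudo=1e-3):
    """Centered log-ratio transform of one sample's OTU counts.

    Zeros are replaced by a small random positive number before taking
    logs. The range (max_pseudo/10, max_pseudo) is a hypothetical choice.
    """
    rng = rng or random.Random(0)
    filled = [c if c > 0 else rng.uniform(max_pseudo / 10, max_pseudo)
              for c in counts]
    logs = [math.log(v) for v in filled]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Toy sample (hypothetical counts), including a zero entry
sample = [120, 30, 0, 850]
transformed = clr(sample, rng=random.Random(42))
```

By construction, CLR values sum to zero within each sample, which is a quick sanity check after transformation.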

Principal component analysis (PCA) is an unsupervised dimensionality reduction technique that finds relationships in the dataset, then transforms and reduces them into principal components (i.e. uncorrelated features that embody the information contained within the dataset) that do not have redundant information.
Random forest describes a supervised machine learning strategy that splits samples into successively smaller groups based on specific features and associated threshold values. This method is in the planning phase for future versions.
SVM is a method of supervised machine learning that is useful for classification, regression, and detection of outliers. SVMs are effective in high-dimensional settings, including those where the number of features exceeds the number of samples. A linear Support Vector Machine (SVM) classifier was used to find a maximum-margin hyperplane separating the sample classes. Linear SVM was performed using 10-fold cross-validation with 3 repeats.
KNN is a machine learning algorithm that can be used for classification and regression. In our pipeline, KNN classifier was used for the classification of disease-status, with classification determined by majority-vote of close-by data points (n = K).
AdaBoost is a machine learning meta-algorithm that can be used to improve the performance of other machine learning algorithms. An AdaBoost classifier was used to train multiple tree classifiers (where each tree has a subset of the available features), iteratively adding more weight to misclassified samples in the next training loop. GitHub readme and description are available in the software accessibility section.
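As an illustration of the majority-vote KNN classifier and the 10-fold cross-validation with 3 repeats described above, the following is a minimal sketch. The pipeline itself uses the caret package in R, so this code is not the authors' implementation. The function names, the choice of K = 3, the simple non-stratified fold assignment and the two-feature synthetic data are all assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def repeated_kfold_accuracy(x, y, n_splits=10, n_repeats=3, k=3, seed=0):
    """Mean KNN accuracy over n_repeats reshuffled n_splits-fold partitions."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    fold_scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)                      # new partition for each repeat
        for fold in range(n_splits):
            test_idx = indices[fold::n_splits]    # every n_splits-th sample is held out
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            train_x = [x[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            correct = sum(
                knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx
            )
            fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / len(fold_scores)


# Synthetic, well-separated two-class data standing in for selected OTU features.
data_rng = random.Random(1)
lean = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
obese = [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(20)]
features = lean + obese
labels = ["lean"] * 20 + ["obese"] * 20
accuracy = repeated_kfold_accuracy(features, labels)
```

Averaging over 30 held-out folds (10 folds, 3 repeats) gives a more stable accuracy estimate than a single train/test split, which matters for datasets with only a few hundred samples.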

Operation and implementation
We implemented various machine learning models, namely k-nearest neighbors, AdaBoost, and Support Vector Machines, to predict disease from the pre-processed microbiome data. The pipeline includes three main steps. 1) Users prepare the OTU table for downstream analysis, such as PCA and machine learning. 2) The processed data can be used to perform PCA for exploratory analysis. 3) The data is fed into machine learning models to select the highly predictive features and for the final prediction of disease-status.
Feature selection and machine learning were implemented using MXM 1.4.5 32 and caret 6.0-85 R packages 33 , respectively, in R version 3.6. To make it easy for others to use this implementation, we designed a Shiny application with an intuitive graphical user interface (GUI). Users can plot, visualize, and download the results generated through the app.

Results
We show that machine learning can be used to differentiate disease from normal states using OTU information. We used pre-processed data from a twin study with 281 samples and 5462 OTUs 29 . For exploratory analysis we performed PCA ( Figure 10 and Figure 11); these analyses and plots were generated using our Shiny app. We performed feature selection to identify the most significant features; the 10 OTUs identified as important to obesity-status are listed in Table 3, and their abundance is shown in Figure 12. By using a set of predefined hyperparameters for each model, we achieved a 10-fold cross-validated accuracy of 0.936 using a linear support vector machine ( Figure 13). While we have not yet assigned functional annotations to these OTUs in the current development, future studies could use them as candidate functional groups to aid experimental design for validating clinical and public health microbiome findings.

Project Rationales, Descriptions and Goals
Rationale: As the collection of human microbiome data grows, developing user-friendly tools to search proteomics databases has become critical. Bridging the gap between computer science and biological science expertise will facilitate microbiome analysis for both explanatory and predictive purposes, making significant additions to general knowledge in this field. Such effective and convenient methods of sifting through vast datasets would be well-suited to the investigation of not only modern-day microbiome samples, but also preserved historical microbial and proteomic data retrieved from ancient populations at archaeological sites worldwide. Proteomic determination of the microbes of deceased individuals would provide another dimension to forensic analysis by uncovering the pathogens that might have been responsible for their death. The significance of this determination goes beyond simply detecting the presence of bacterial peptides, also extending to tracking the migration and virulence of diseases over time in human populations.
Studying ancient or paleolithic host-microbiome interactions is an emerging approach to investigating widespread microbial infectious diseases, and even pandemics, by identifying pathogen-expressed proteins in human dental calculus. This approach is supplemented by data from metabolomic analyses, anthropological and paleopathological data from the skeletal material, archaeological contexts, and archival data. Examining the protein content of dental calculus has typically given insight into the diet and oral health of past communities 34-36 . Because dental calculus forms as a result of bacterial plaque accumulation around the gingiva, it consists primarily of bacteria and lends itself well to oral microbiome analysis. For example, 85-95% of the calculus in a medieval sample was found to be composed of bacterial proteins 36 . This points to a novel method of examining the constituents of the oral microbiome and its variation across cultures, geographies, and historical periods.
The availability of a unique set of data from the first quarantine in the world will enable substantial focus on infectious diseases and the modeling of ancient epidemics ( Figure 14). Archival records show that all of the approximately 1500 individuals for this project died of an infectious disease. The addition of body responses to the environment and diseases (metabolites), as well as dietary data (stable isotopes to detect malnutrition), will be trialed, providing the best chance to recognize the pathogen responsible and its overall effects. In genetics and medicine, the combination of code, workflow, logic and available data will provide over 300 years of data on epidemics (especially bubonic plague) including the first influenza pandemic, dated 1580, and outbreaks of typhus and measles. It will be possible to reach ca. 600 years of data at one location using historical and medical records. The plague and other similar fever-inducing illnesses are replaced by smallpox, measles and flu in later times, as medicine provides therapies, mobility increases, and diet changes with many plants cultivated on continents other than those where they originated. Our TRACK prototypes will enable investigations related to pathogen evolution, microbiome adaptations and changes in human immune responses.
Figure caption (fragment): Figure 10 for different classes. This PCA plot shows more separation in the OTU clusters based on ancestry than by different disease classes (shown in Figure 10).

Goals:
To achieve the transdisciplinary goals inherent to the nature of this paleo-omics project, a central database able to contain different data types is required. Towards this objective, we created and implemented a paleo-omics workflow consisting of: 1) a search engine to query the multi-data database, 2) a retrieving pipeline for paleo proteins, and 3) a query gateway for microbiome-human host interactions ( Figure 15).
While mass spectrometry (MS), shotgun sequencing, and 16S rRNA sequencing data can be employed in paleo-omics, we focused on an MS-based meta-proteomics approach for proof-of-concept demonstration of our prototype within the time constraints of the Codeathon, which we applied to data derived from human dental calculus protein-samples taken from archaeological sites.

Methods
Data sources and processing
MS data and shotgun sequencing data obtained from ancient human dental calculus samples were used for these analyses 36,37 .
(1) MS data: peptides were identified from raw data files by comparing spectra from the second spectrometer of a tandem-MS (MS2) to reference spectra available in protein databases. Many existing proteomics software packages, such as MaxQuant, have been designed for analyzing large MS data sets, such as the MaxQB database, and thus can perform this task 38 .
(2) Shotgun sequencing data: to determine whether the resulting short reads (in FASTQ format) corresponded to human DNA sequences, reads were aligned to a human reference genome (Genome Reference Consortium Human Build 38) using Bowtie version 1.3.0 and BWA version 0.7.17 39,40 . Reads not aligning to the human reference genome were characterized as non-human.
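The separation of human from non-human reads can be sketched as follows. This is only an illustration under stated assumptions, not the prototype's code: it assumes the aligner output is available as SAM text, and it relies on the SAM flag bit 0x4, which marks a read as unmapped. The function name and the example records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit set when a read has no alignment


def split_reads_by_alignment(sam_lines):
    """Return (human_read_names, non_human_read_names) from SAM text lines."""
    human, non_human = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header records
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # Reads that did not align to the human reference are treated as non-human.
        (non_human if flag & UNMAPPED_FLAG else human).append(name)
    return human, non_human


# Made-up SAM records: read2 carries the unmapped flag.
example_sam = [
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t10468\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGA\tIIIIIIII",
    "read3\t16\tchr1\t20500\t37\t8M\t*\t0\t0\tGGCATTAC\tIIIIIIII",
]
human_reads, non_human_reads = split_reads_by_alignment(example_sam)
```

In this sketch the non-human read names would then be passed on to the targeted pathogen searches, while the human reads are set aside.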
All processed data were stored in a high-performance database for future analysis. A web user interface and a search/analysis engine 41 were developed to access these data.

Assessing presence of select pathogens
We performed targeted pathogen searches for sequences of oral pathogenic microbes and other human pathogens, including the major human malaria parasite Plasmodium falciparum. We identified pathogenic oral microbes similar to previously published results, but no significant hits to P. falciparum from these two test-sets were identified. We additionally searched for marker oral microbiome species for other human infectious diseases as reported in detail in the results section.

Operation and Implementation
Source-code for our prototype is available through our GitHub repository (see Software availability section). This implementation requires the following software packages to reproduce: Python version 3.

Results
To test our prototype 41 , we searched for pathogen sequences against the two archaeological samples in the database, one from Denmark 1100-1450 AD 36 and one from the United Kingdom 1770-1855 AD 34 . The medieval Danish samples were accompanied by a complete set of dental pathology characterizations and individual data. Consistent with the reported results 36 , bacteria associated with oral disease pathology, as well as bacteria normally found in the oral microbiome, could be recovered ( Figure 16). For instance, the species Porphyromonas gingivalis is frequently present in individuals with periodontal diseases, while Streptococcus sanguinis is present in both medieval and contemporary individuals with satisfactory oral health.
This approach can also be used to discover other bacteria linked to health and possibly reveal other correlations between microbiome bacteria and health status, as well as recent evolutionary changes. In archaeology, the current focus is on revealing specific pathogens, and there is no established reference material for investigating the past microbiome or its effects on health. Even in recent studies, any conclusions on medieval or older individuals are based on direct comparison with the contemporary microbiome. By using archaeological methods (chronological seriation) together with software developed from our code, it will be possible to investigate correlations between microbiome and health in individuals dating to older periods. Such work could provide a reference standard for archaeologists and evolutionary data for health professionals. For example, using the existing data, we found the opportunistic respiratory pathogen Haemophilus parainfluenzae 43,44 to be present less frequently in this set of medieval samples (Wilcoxon test, p < 0.05), raising interesting questions about transitions in human society and infectious diseases. This group appears in Neolithic agrarian human oral microbiomes (7440 BCE) 45 , but is at low levels in human groups practicing hunting and gathering (2000 BCE, modern-day South Africa). Questions of interest to both health professionals and archaeologists that could be answered by employing our code include when this pathogen became more frequent and why.
Understanding the origins and evolution of pathogens is very important for preparing for future pandemics. The only successful attempt to combine archaeology with genetics and health studies to investigate past pathogens, the reconstruction of the 1918 flu pathogen 46 , proved to be both technically challenging and costly, even though fewer than a hundred years had passed since the pandemic, because that work tried to reproduce an active virus that is now extinct. It was also very useful in demonstrating that the strong virulence reported in historical sources, but unconfirmed in medicine, was real. Since 1919, only COVID-19 has demonstrated a similar virulence, showing that data from the historical record can be critical in addressing new types of known viruses and pathogens, which can regain traits unseen for a century or more within that category of pathogens (respiratory viruses with flu-like symptoms in this case). That work also showed that the choice of suitable burial grounds is essential for such studies. Our work uses new -omics analyses that are providing new sources of data and could prove equally valuable, revealing the history of recent pathogens, characteristics that may have been present only occasionally, and their successes and failures. Future pathogens might reuse and recombine successful traits (symptoms, virulence) from past epidemics; our preparedness therefore depends on knowing what to expect and on learning from the past.
The results of our work are therefore limited to making future interdisciplinary research possible and opening up a path to answering new questions. Proteomic and metabolomic data from pre-modern individuals are still rare, and there is no existing database, besides data from a few academic papers, that our software could search. Yet, enabling new studies through a working proof-of-concept will accelerate the production of databases for ancient individuals. Existing archaeological studies were born out of early full sequencing of genomes and have been severely limited by such approaches. The benefits deriving from new -omics analyses combined with our approach can provide valuable information on older pathogens. Future work may focus on epidemics initially, but also has the potential to reveal and explain more subtle and complex relationships between the human microbiome and health.

Project Rationales, Descriptions and Goals
Rationale: One primary goal of host-microbiome studies is to capture and understand ecological and host drivers of microbial diversity. Research on host-microbiome associations across host species has been facilitated by the increasing accessibility of high-throughput sequencing techniques and the availability of integrated microbiome datasets, such as the Earth Microbiome Project dataset 47 . These have yielded useful insights on how host-microbiome associations are impacted by host diet 48 , host taxonomy or phylogeny 49 , host immune system 50 , and environmental factors 51 . However, host species traits vary immensely across species and such diversity has been undersampled in microbiome studies. As a result, the effects of other host factors, including body mass and life history, in relation to previously characterized host and environmental effects, on host-microbiome associations have been understudied.
Goal: In this project, we aim to investigate the effects of various host traits, including diet, host taxonomy, body mass, and longevity, in relation to environmental factors, on the intestine, fecal, foregut, and stomach microbiomes of Metazoan (animal) species. We first mined available microbiome and metadata datasets, then applied unsupervised learning directly on rarefied OTU abundance data to uncover clusters of microbial community similarity among animals.

Data sources and processing
Rarefied OTU

ANOVA F-test and correlation analysis
For feature selection, ANOVA F-tests were implemented in Python to identify quantitative metadata variables whose means differed significantly between clusters. Pearson correlation analysis was also performed in Python to evaluate linear relationships between metadata variables.
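The two statistics named above can be written out in a few lines. The following is a minimal sketch with hypothetical function names and made-up values, not the project's analysis code. It computes only the test statistics; obtaining p-values additionally requires the F and t distributions, for which library routines such as `scipy.stats.f_oneway` and `scipy.stats.pearsonr` are the usual choice.

```python
import math


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values of one metadata variable in three clusters.
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# Made-up paired metadata variables with a perfectly linear relationship.
r_value = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

A large F statistic indicates that the cluster means differ by more than the spread within clusters would suggest, which is the criterion used here to flag a metadata variable as informative.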

Operation and implementation
The analyses can be performed on a local computer or server with R and Python installed. A step-by-step tutorial of the unsupervised clustering approach is available at https://enterotype.embl.de/enterotypes.html. R Markdown and Python code used for the analyses are also available as listed in the Software availability section 58 .
ANOVA analysis indicated that clusters had the most significant mean differences in microbial alpha diversity, Simpson diversity, Shannon diversity, Faith's phylogenetic diversity, and Chao 1 diversity (Table 4). Digestive habitat type, host taxonomy/phylogeny, immune complexity, and life stage were also significantly different between clusters, along with DNA extraction methods and environmental variables. Notably, body mass and maximum longevity were also significantly different between clusters.
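For readers less familiar with the diversity measures listed above, two of them can be computed directly from OTU counts. The sketch below uses the natural-log Shannon index and the Gini-Simpson form (1 minus the sum of squared proportions); which variants were used in the analysis is not stated in the text, so both choices, along with the function names and example counts, are assumptions.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p_i^2); one of several Simpson variants."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


even_community = [10, 10, 10, 10]     # made-up counts, four equally abundant OTUs
skewed_community = [37, 1, 1, 1]      # made-up counts, one dominant OTU

even_shannon = shannon_index(even_community)
skewed_shannon = shannon_index(skewed_community)
even_simpson = simpson_index(even_community)
```

Both indices are highest when OTUs are evenly abundant and fall as a single OTU comes to dominate the sample.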
Cluster-specific correlation analyses showed that alpha diversity in clusters 1 and 2 was consistently positively correlated with host taxonomy, immune complexity, diet, maximum longevity and latitude. Body mass, vegetation index, terrain complexity, mean temperature of the driest quarter and precipitation of the warmest and coldest quarters showed positive correlations with alpha diversity in cluster 1, but not cluster 2. Latitude and country were positively correlated with alpha diversity in cluster 2, but not cluster 1. Alpha diversity in cluster 3, which comprised butterflies and moths, was positively correlated with environmental variables (terrain complexity, mean diurnal temperature range, precipitation seasonality, elevation) and host factors (digestive habitat type and diet).
The results support our premise that host traits, including but not limited to body mass and maximum longevity, are undersampled in microbial diversity studies. Understudied host traits could also shape animal internal microbiomes together with well-characterized host traits and environmental variables. Based on our results, we propose comprehensive sampling of host traits in future microbiome studies, which may yield new and unexpected patterns of microbial community organization serving as a baseline for deeper investigations.

Lessons learned
Throughout this process we identified several areas where improvements could be made for future disease-focused hackathons. A few of these are described below.
Collaboration across domains requires extensive communication with minimal use of jargon, and active learning from diverse backgrounds. We aimed to further expand on the traditional foundation of codeathons, and we generated novel tools by leveraging research strengths of the local community. However, the six teams faced some challenges in working together efficiently, with barriers in communicating the feasibility and significance of particular problems. In-depth yet succinct explanations of the technical problems are critical for successful operation.
The scalability of R was called into question during prototype development. For computations on large datasets, a more efficient implementation can be developed once the prototype has proven useful to the community. However, the granularity of solutions available in R makes it the preferred tool for designing and experimenting with different solutions.
Meticulous documentation of each analysis step remains crucial for effective dissemination of our approach and results. These necessary components of any project are also excellent opportunities to apply the skillsets of non-coders, as well as to heighten engagement of trainees by reinforcing project rationale. Good documentation, including simple flowcharts, is a very useful tool for keeping focus. Non-coding participants who want to gain some experience can often quickly learn the Markdown language and become vital contributors to repositories.

Conclusion and next steps
Interdisciplinary collaborations have proven to be very productive, as shown by our six working prototypes addressing broad microbiome-related challenges, ranging from power calculations and AI classifiers to GIS integration and large data set visualizations. Although working across fields has been a challenging task, we found that parsing a complex question into distinct parts can help different domain-experts to work together and accomplish tasks none of the individuals could accomplish in isolation. The codeathon workflow is thus a useful research model for many urgent societal problems that suffer from knowledge-transfer and communication issues. We have made all data and code publicly available for further exploration of these tools. Importantly, we are developing impactful projects to further pursue intersectional research spurred by this event, including microbiome-related machine learning, and data mining across archaeological time and geography.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.  License: GNU General Public License 3.0.