Keywords
Protein-protein interactions, interaction inference, microbe-host interactions, molecular mimicry, short linear motifs
This article is included in the Bioinformatics gateway.
This article is included in the Cell & Molecular Biology gateway.
The increasing incidence of emerging infectious diseases is posing serious global threats. Therefore, there is a clear need for developing computational methods that can assist and speed up experimental research to better characterize the molecular mechanisms of microbial infections.
In this context, we developed mimicINT, an open-source computational workflow for large-scale protein-protein interaction inference between microbe and human by detecting putative molecular mimicry elements mediating the interaction with host proteins: short linear motifs (SLiMs) and host-like globular domains. mimicINT exploits these putative elements to infer the interaction with human proteins by using known templates of domain-domain and SLiM-domain interaction templates. mimicINT also provides (i) robust Monte-Carlo simulations to assess the statistical significance of SLiM detection which suffers from false positives, and (ii) an interaction specificity filter to account for differences between motif-binding domains of the same family. We have also made mimicINT available via a web server.
In two use cases, mimicINT can identify potential interfaces in experimentally detected interaction between pathogenic Escherichia coli type-3 secreted effectors and human proteins and infer biologically relevant interactions between Marburg virus and human proteins.
The mimicINT workflow can be instrumental to better understand the molecular details of microbe-host interactions.
Protein-protein interactions, interaction inference, microbe-host interactions, molecular mimicry, short linear motifs
A typo was fixed. We added the Inserm as funding agency for articles processing charges.
See the authors' detailed response to the review by Ylva Ivarsson
See the authors' detailed response to the review by Sobia Idrees
Most pathogens interact with their hosts to reach an advantageous niche and ensure their successful dissemination. For instance, viruses interfere with important host-cell processes through protein-protein interactions to coordinate their life cycle.1 It has been shown that host cell networks subversion by pathogen proteins can be achieved through interface mimicry of endogenous interactions (i.e., interaction between host proteins).2,3 This strategy relies on the presence in pathogen protein sequences of host-like elements, such as globular domains and short linear motifs (SLiMs), that can mediate the interaction with host proteins.4–6
Over the last years, many computational methods have been developed to predict pathogen-host protein interactions, some of which are based on the detection of sequence or structural mimicry elements.7–9 Such approaches allowed, for instance, to suggest potential molecular mechanisms underlying the implication of gastrointestinal bacteria in human cancer10,11 or to discriminate between viral strains with different oncogenic potentials,12 thus showing that protein-protein interaction predictions can be instrumental in untangling microbe-host disease associations. Nevertheless, the source code of many of these tools is not freely available to the community (e.g., Refs. 11–13) providing the predictions through a database (e.g., Ref. 12), or can be only used through a web interface,14,15 thus limiting reproducibility and tool usability.
In this context, and inspired by our previous work,10 we have developed the mimicINT workflow, and its webserver companion mimicINTweb (https://mimicintweb.tagc.univ-amu.fr), to enable large-scale interaction inference between microbe and human proteins based on the detection of host-like elements and the use of experimentally identified interaction templates.16,17
mimicINT detects putative molecular mimicry elements in microbe sequences of interest that can mediate the interaction with host proteins ( Figure 1). mimicINT is written in Python (https://www.python.org/) and R (https://www.r-project.org/) languages and exploits the Snakemake workflow manager for automated execution.18 It consists of four main steps: (i) the detection of host-like elements in microbe sequences; (ii) the collection of domains on the host protein (iii); the interaction inferences between microbe and host proteins; and (iv) the functional enrichment analysis on the list of inferred host interactors.
(A) By providing a fasta file of protein sequences of the query species (e.g., microbe sequences), mimicINT allows identifying both the (B) domain- and (C) SLiM-mediated interfaces of interactions. (D) Using publicly available interaction templates, mimicINT infers the interactions between the proteins of the query and target (i.e., host) species. (E) Finally, it provides a list of functional annotations significantly enriched in inferred protein targets.
In the first step, mimicINT takes the FASTA-formatted sequences of microbe proteins (e.g., viral or other pathogen proteins susceptible to be found at the pathogen-host interface) as input to detect host-like elements: domains and SLiMs. The domain identification is performed by the InterProScan stand-alone version19 using the domain signatures from the InterPro database.20 By default, mimicINT retains InterProScan matches with an E-value below 10−5, a common threshold value used for detecting profile-based domain signatures in protein sequences in the context of interaction inference.21 The host-like SLiM detection exploits the motif definitions available in the ELM database17 and is carried out by the SLiMProb tool from the SLiMSuite software package.22 As SLiMs are usually located in disordered regions,23 SLiMProb uses the IUPred algorithm24 to compute the disorder propensity of each amino acid in the query sequences and generates an average disorder propensity score for every detected SLiM occurrence. For SLiM detection, the default IUPred disorder propensity threshold is set to 0.2, a value commonly used to limit false negatives,22,25 and the minimum size of the predicted disorder region is set to 5, which is the optimal size to detect true positive SLiM occurrences.26 Nevertheless, the user can choose all running parameters for the host-like element detection in the mimicINT configuration file.
In the second step, mimicINT gathers the domain annotations of the host proteins from the InterPro database through a REST API query.
In the third step, mimicINT infers the interactions between host and microbe proteins. This analysis takes as input the list of known interaction templates collected from two resources: (i) the 3did database,16 a collection of domain-domain interactions extracted from three-dimensional protein structures,27 and (ii) the ELM database17 that provides a list of experimentally identified SLIM-domain interactions in Eukaryotes. The inference procedure checks whether any of the microbe proteins contain at least one domain or SLIM for which an interaction template is available. In this case, it infers the interaction between the given protein and all the host proteins containing the cognate domain (i.e., the interacting domain in the template). As motif-binding domains of the same group, like SH3 or PDZ, show different interaction specificities,28 we have implemented a previously proposed strategy29 to take these differences into account (see below the sub-section “Computation of the motif-binding domain similarity scores”). This approach assigns a “domain score” that can be used to rank, or filter inferred SLiM-domain interactions. Once this step is completed, the inferred interactions are stored in both tab-delimited and JSON files to facilitate the import in other applications, such as Cytoscape.30
In the final step, to identify the host cellular functions potentially targeted by the pathogen proteins, mimicINT executes a functional enrichment analysis of host-inferred interactors. This analysis statistically assesses the over-representation of functional categories, such as Gene Ontology terms and biological pathways (e.g., KEGG and Reactome), using the g:Profiler R client.31
Given the degenerate nature of SLiMs,23 their detection is prone to generate false positive occurrences. For this reason, we implemented an optional sub-workflow that, using Monte-Carlo simulations, assesses the probability of a given SLiM to occur by chance in query sequences and, thus, can be used to filter out potential false positives5 (see below the sub-section “Statistical significance of the SLiMs detected on the microbe sequences”).
To ease deployment and ensure reproducibility and scalability on high-performance computing infrastructures, mimicINT is provided as a containerized application based on Docker and Singularity.32,33
To identify motif-binding domains that can be specifically associated with a given ELM motif class, we use the same strategy proposed by Weatheritt et al. in 2012,29 which assumes that a domain significantly similar to a known motif-binding domain should also bind the same motif. We first compiled a list of experimentally identified motif-binding domains from the original list from Weatheritt et al. complemented by more recent annotations from the ELM database17 (August 2020). Obsolete ELM class identifiers from Weatheritt et al. were mapped to current ELM identifiers using the “Renamed ELM classes” file (http://elm.eu.org/infos/browse_renamed.tsv) and duplicated domain annotations were removed. In total, we collected 538 domains in 415 human proteins known to bind 212 ELM motif classes (73% of the 290 motif classes present in ELM, August 2020). The sequences of these 415 annotated proteins were fetched from UniprotKB.34 We next fetched the sequences of 1452 reference Eukaryota proteomes (22,262,113 protein sequences in total) from UniprotKB (August 2020). We removed redundancy using the CD-HIT algorithm35 to generate a database of 21,414,544 non-identical sequences. We used the GOPHER tool36 from the SLiMSuite package22 to identify orthologous sequences of the annotated proteins in the database of non-identical eukaryotic sequences by reciprocal BLAST best hits. Selected orthologous proteins were aligned using the multiple sequence alignment algorithm Clustal Omega (v. 1.2.4).37 Once the position of the motif-binding domain was identified within the alignment, we removed aligned domains with indels covering >10% of the annotated domain sequence. We iteratively realigned the sequences until a set of proteins was identified with <10% indels coverage. In total, we selected 701 multiple sequence alignments used as input for generating domain-specific HMM profiles with the hmmbuild program from the HMMER package v.3.1.1.38 Subsequently, we scanned a representative set of the human proteome (20,350 “reviewed” sequences from UniprotKB) with the domain-specific HMMs using the hmmsearch program. We used an E-value cutoff of 0.01 to select the best hits and we rejected those hits covering less than 90% of the annotated motif-binding domain sequence length. Finally, the E-value of the best-scoring domain was converted into a domain similarity score using the iELM script downloaded from http://elmint.embl.de/program_file/.29 Doing so, we computed at least one motif-binding domain similarity score for 1,461 human proteins.
To assess the probability of a given motif to occur by chance in microbe sequences, we implemented a previously proposed approach5 to randomly shuffle the disordered regions of each sequence of a microbe of interest to generate a large set of randomized microbe proteins. The number of shuffled sequences to be generated by mimicINT can be chosen by the user in the corresponding configuration file (see the mimicINT online documentation for more details). By default, mimicINT creates two sets of 100,000 randomly shuffled proteins (one set for each IUPred disorder propensity prediction mode, i.e. short and long), with the assumption that the input sequences belong to the same microbe species or strain. Once the shuffled sequences are generated, the occurrences of each detected motif are compared in each microbe input sequence to the occurrences observed in the corresponding set of shuffled sequences. To compute the probability (P) of each detected motif to occur by chance, mimicINT counts the number of times (m) out of the shuffled sequences (N) where there is at least the same number of instances of the given motif in the input sequence:
For example, if a given motif occurs twice in the input sequence, the methods count how many times the same motif is detected at least twice in the corresponding set of randomly shuffled sequences. The lower the value of P, the rarer the instances, thus suggesting that the given motif can be likely functional. In this work, we set the significant threshold equal 0.1, as reported in Ref. 5.
The mimicINTweb server allows users not familiar with the command-line interface to run the mimicINT workflow through an easy-to-use web interface. The number of input sequences is limited to 50. A step-by-step tutorial is available on the mimicINTweb site (https://mimicintweb.tagc.univ-amu.fr/tutorial). The mimicINTweb server uses the Django framework (version 2.2.1 under Python 3.12.4) as web app core to manage URL routing, HTML rendering, authentication, administration, and backend logic. The Django component has been complemented with two additional application layers to guarantee server performances and security: Gunicorn (version 22.0.0) as web server gateway interface, and Nginx (version 1.25) as reverse proxy server.
The mimicINT workflow can be run on a Linux-based computer with at least 32 GB RAM and it has been successfully used on Ubuntu (16.04 and higher) and CentOS (7.4) distributions. The following software is required: Python (3.6 or higher), Snakemake (6.5 or higher), Docker (18.09 or higher) and IUPred (version 1.0). The workflow can be also deployed on high-performance computing (HPC) clusters. In this case, the Singularity application (2.5 or higher) is required. More detailed information can be found on the mimicINT GitHub repository (https://github.com/TAGC-NetworkBiology/mimicINT). The mimicINTweb server can be accessed from Linux, Windows or Mac OS based systems, and it has been tested with the following browsers: Chrome, Firefox and Safari.
We sought to evaluate the ability of mimicINT to correctly infer SLiM-domain interactions, as this inference can generate many false positives,29 using the default parameters for SLiM detection (see Implementation). To do so, we used as controls two datasets of established motif-mediated interactions (MDI) from the ELM database17: (i) 103 interactions between 87 viral and 44 human proteins (vMDI); (ii) 31 interactions between 16 bacterial and 23 human proteins (bMDI). We were able to correctly infer most of these interactions (91 vMDI, true positive rate = 88.3%; 21 bMDI, true positive rate = 67.7%). Notably, almost all the correctly inferred interactions have a domain score above 0.4 (87 out of the 91 vMDI, 19 out of 21 bMDI). As the availability of negative SLiM-mediated interaction datasets is very limited,17,29,39 we estimated the false positive rate (FPR) by applying mimicINT to two sets of randomly generated interaction sets (degree-controlled, vMDIrnd and bMDIrnd, respectively). Thirty-four vMDIrnd and 7 bMDIrnd were inferred as motif-mediated (FPR = 33% and FPR = 23%, respectively). We next annotated the human proteins in the two random sets with domain scores. We kept only interactions for which the domain score was above 0.4,29 thereby reducing the number of random interactions predicted as motif-mediated to 9 (FPR = 8.7%) for vMDIrnd and 2 (FPR = 6.4%) for bMDIrnd. Finally, we tested mimicINT on two sets of experimentally verified negative protein interactions from the Negatome 2.0 database40: 37 viral-human and 4 bacterial-human interactions. Only two virus-human negative interactions (5.4%) were inferred as motif-mediated by mimicINT.
In light of these results, we used mimicINT in two tasks: (i) the identification of putative interfaces in experimentally identified interactions between secreted effectors from the enteropathogenic Escherichia coli serotype O157:H7 (EHEC) and human proteins; (ii) the inference of interactions between human and the Marburg virus (MARV) proteins, an emerging infectious agent for which experimental protein interaction data is scarce.
We collected 83 interactions between 24 EHEC secreted effectors and 74 human proteins by querying (January 2022) the IMEx consortium databases41 via the PSICQUIC interface.42 We gathered the sequences of EHEC effectors from43 and ran mimicINT with default parameters. We computed the motif probabilities using the dedicated sub-workflow by performing 100,000 randomizations. We were able to identify a putative interaction interface for 26 of the 83 experimental EHEC-human interactions (31.3%) ( Figure 2A), which is higher than the number of interactions with identified putative interfaces in a degree-controlled randomized network (3 interactions, 3.6%). Most of the putative interfaces were identified using motif-domain interaction templates (MDI), namely 24 interactions, whereas the putative interfaces for 9 interactions were identified with domain-domain interaction templates (DDI). Interestingly, we identified putative interfaces with both MDI and DDI templates for 7 interactions ( Figure 2A). Among the interactions with MDI interfaces, almost all have a motif probability below 0.1 (23 interactions, see Supplementary File 1). Seven interactions have a domain score above 0.4 (29.2%) and their cognate motifs show a motif probability lower than 0.1 ( Figure 2B). This suggests that most of the identified putative interfaces can be considered as high confidence. To further support these inferences, we sought to verify whether the 26 putative interfaces corresponded to experimentally identified binding regions. To do so, we collected the biological features (i.e. “binding-associated region”, “necessary binding region”, “sufficient binding region”)43 reported in the interaction records downloaded via the PSICQUIC interface, and we found that for half of the interactions with an inferred interface (13 interactions, 11 MDI and 2 DDI) there is supporting experimental evidence for at least one of the interaction partners ( Figure 2C). For 7 interactions (27%), mimicINT inferred correctly the interface elements of both EHEC effectors and human proteins. For the other 4 interactions, the experimental evidence supports the EHEC effector interface element only (see Supplementary File 1). Importantly, the 11 MDI inferences can be considered of high confidence as they have either a motif probability < 0.1 or a domain score > 0.4. Overall, these results indicate that high confidence mimicINT inferred interaction can identify bona fide interaction interfaces.
(A) Proportion of experimentally determined interaction between EHEC secreted effectors and human proteins with at least one putative interface inferred by mimicINT (left). Split proportion of EHEC-human interactions with a putative interface according to the interaction templates: motif-domain (MDI) and domain-domain (DDI). (B) Proportion of EHEC-human interactions with high-confidence MDI-inferred interfaces based on the computed motif probability (mp) and domain score (ds). See main text for more details. (C) Network representations of the interactions between EHEC secreted effectors (circle nodes) and human proteins (square nodes). Edges represented as parallel lines indicate interactions with experimentally identified binding regions in at least one of the interaction partners. Coloured edges represent interactions with at least one putative inferred interface by mimicINT. The network was generated using Cytoscape.30
We downloaded MARV protein sequences (7 proteins, Proteome ID: UP000180448, January 2022) from UniprotKB in FASTA format and ran mimicINT with default parameters. We also computed the motif probabilities using the dedicated sub-workflow by performing 100,000 randomizations.
In total, we inferred 11,431 interactions between 7 MARV and 2757 human proteins (see Supplementary File 2). Most of the inferred interactions, namely 10,101, are motif-domain interactions (MDI) between 7 MARV and 2324 human proteins, and the remaining 1,339 are domain-domain interactions (DDI) between 5 MARV and 479 human proteins (9 interactions were inferred with both MDI and DDI templates). The functional enrichment analysis performed by mimicINT on the full list of inferred host interactors returned 975 enriched annotations at FDR<0.01 (see Supplementary File 2). We further filtered out the functional categories annotating less than 5 or more than 500 proteins, obtaining a list of 763 enriched annotations (241 GO biological processes, 63 GO Cellular components, 6 CORUM complexes, 130 KEGG and 237 Reactome pathways, see Supplementary File 2), which points towards cellular processes and pathways related to viral infection and immune system ( Table 1). By applying the default thresholds on motif probabilities and domain scores on inferred MDI, we defined a set of 535 high-confidence MDI interactions between 7 MARV and 419 human proteins. We combined this set with the inferred interaction using DDI templates and ran a functional enrichment analysis on a list of 891 human interactors returning 908 enriched annotations at FDR<0.01. As above, after filtering on the size of functional categories, we obtained 743 enriched annotations (287 GO biological processes, 57 GO Cellular components, 1 CORUM complexes, 141 KEGG and 257 Reactome pathways, see Supplementary File 2). Interestingly, 27% of the enriched GO biological process annotations (77 out of 287) are related to infection and immunity,44 and notably 8 out of the 10 most enriched.
Overall, these results reinforce the biological relevance of the inferred interactions, particularly those considered of higher confidence.
The top 10 most enriched terms are shown for Gene Ontology Biological Process (BP) and Cellular Component (CC) terms. For each enriched term the following information is reported: term identifier, term name, adjusted P-value, number of human proteins annotated with the given term in the statistical background (term size), number of inferred interactors annotated with the given term (intersection size). The terms reported in italic are related to viruses, infection and immunity according to Garcia-Moreno and colleagues.44
We have developed mimicINT, an open-source computational workflow enabling large-scale interaction inference between microbe and host proteins. In the first use case presented here, we show that mimicINT can identify bona fide interaction interfaces in an experimentally generated interaction network between secreted pathogenic bacterial effectors and human proteins. Notably, we also successfully used it to identify interaction interfaces between commensal bacterial effectors and human proteins in a large-scale interaction dataset generated by yeast two-hybrid.45 In the second use case, we used mimicINT to infer the interactions between viral and human proteins which are biologically relevant given the results of the functional enrichment analysis.
Although we developed mimicINT as a tool to infer protein interactions between microbe and human proteins, it can be used on any organisms whose proteins bear either domains or motifs with known interaction templates (e.g., human, mouse or fruit fly). For instance, we have recently used mimicINT to generate the first interactome of small human peptides encoded by short Open Reading Frames (sORFs).46 Nevertheless, the only limitation of the workflow is the availability of motif-domain and domain-domain templates, which depends on the curation efforts done by teams maintaining the corresponding source database (i.e., ELM and 3did).
Finally, compared to other similar tools,47 mimicINT provides two functionalities to define high-confidence inferred interactions based on motif-domain templates, that is the computation of (i) motif probabilities and of (ii) motif-binding domain similarity scores. As shown in the use cases, the application of these two strategies supports the identification of bona fide interaction interfaces in the EHEC-human interaction network and the biological relevance of the inferred MARV-human interactions.
All in all, given the increasing frequency of (re-)emerging infectious diseases and the accumulating evidence on the fundamental role played by microbes in chronic diseases,48–50 there is no doubt that mimicINT will be useful to better understand the molecular details of the microbe-host relationships.
Supplementary data
The data analyzed and produced in this manuscript, including the protein sequences mentioned in the use cases, are available are available from:
Zenodo: mimicINT workflow: Use cases for interaction interface identification and protein interaction inference. DOI: https://doi.org/10.5281/zenodo.14614802.51
Sequence data
MARV protein sequences available from https://www.uniprot.org/proteomes/UP000180448
Short linear motif data
ELM motif class definitions are available from http://elm.eu.org/downloads.html#classes
ELM renamed class names are available from http://elm.eu.org/infos/browse_renamed.tsv
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Sequencing data
Workflow source code available from: https://github.com/TAGC-NetworkBiology/mimicINT
Archived workflow source code at time of publication: https://doi.org/10.5281/zenodo.12078119 52
Webserver available from: https://mimicintweb.tagc.univ-amu.fr/
Webserver source code available from: https://github.com/TAGC-NetworkBiology/mimicINTweb
Archived webserver source code at time of publication: https://doi.org/10.5281/zenodo.14548031
The iELM script is available from: from http://elmint.embl.de/program_file/
License: GNU General Public License, V3.
The authors thank Paul de Boissier for helping in the early development of the workflow and Fabrice Lopez for technical advice. The authors are also grateful to the members of the DIME project for fruitful scientific discussions. Centre de Calcul Intensif d’Aix-Marseille is acknowledged for granting access to its high-performance computing resources.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Short Linear Motifs, Protein-Protein Interactions, Host-Pathogen Interactions, Bioinformatics
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expert on the experimental identification and charachterization of SLiM-based interactions.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: SLiM mediated interactions.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Expert on the experimental identification and charachterization of SLiM-based interactions.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 2 (revision) 28 Mar 25 |
read | read | |
Version 1 27 Jan 25 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)