LiftoffTools: a toolkit for comparing gene annotations mapped between genome assemblies

Alaina Shumate; Steven Salzberg

doi:10.12688/f1000research.124059.2

Software Tool Article

Revised

[version 2; peer review: 2 approved]

Alaina Shumate¹,², Steven Salzberg¹⁻⁴

PUBLISHED 29 Apr 2024

Author details

¹ Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21211, USA
² Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
³ Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, 21205, USA
⁴ Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA

Alaina Shumate
Roles: Conceptualization, Methodology, Software, Writing – Original Draft Preparation

Steven Salzberg
Roles: Funding Acquisition, Supervision, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

This article is included in the Genomics and Genetics gateway.

Abstract

In 2020 we published Liftoff, which was the first standalone tool specifically designed for transferring gene annotations between genome assemblies of the same or closely related species. While the gene content is expected to be very similar in closely related genomes, the differences may be biologically consequential, and a computational method to extract all gene-related differences should prove useful in the analysis of such genomes. Here we present LiftoffTools, a toolkit to automate the detection and analysis of gene sequence variants, synteny, and gene copy number changes. We provide a description of the toolkit and an example of its use comparing genes mapped between two human genome assemblies.

Keywords

Bioinformatics, Genome annotation, Genomics

Corresponding author: Alaina Shumate

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by NIH grant: HG006677
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2024 Shumate A and Salzberg S. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Shumate A and Salzberg S. LiftoffTools: a toolkit for comparing gene annotations mapped between genome assemblies [version 2; peer review: 2 approved]. F1000Research 2024, 11:1230 (https://doi.org/10.12688/f1000research.124059.2) First published: 28 Oct 2022, 11:1230 (https://doi.org/10.12688/f1000research.124059.1) Latest published: 29 Apr 2024, 11:1230 (https://doi.org/10.12688/f1000research.124059.2)

Revised Amendments from Version 1

The differences in this version are just minor updates to the text to further elaborate on points that were unclear to the reviewers. No data, methods, or conclusions have changed.

See the authors' detailed response to the review by Mark Borodovsky
See the authors' detailed response to the review by Adam Frankish

Introduction

Liftoff (Shumate and Salzberg, 2021) is a computational tool specifically designed for mapping gene annotations from a reference assembly to a target assembly of the same or closely related species. Liftoff uses sequence alignment software to align the complete exon-intron structure of each annotated transcript from a source to a target, and it can also map virtually any other feature specified as an interval along the genome. It also includes a method to find additional copies of genes that might be present in higher copy numbers in the target genome. After lifting genes over, one of the first questions that many researchers have is how the sets of genes compare between the reference assembly and the target, and in particular whether any of the differences are biologically consequential.

Here we introduce LiftoffTools, a toolkit to compare genes mapped from one assembly to another. LiftoffTools includes three modules. The first identifies sequence changes in protein-coding genes and their effects on the translated protein sequences, ranging from simple amino acid changes to more serious alterations. The second compares gene synteny (i.e., the preservation of gene order along the chromosomes), and the third clusters genes into groups of paralogs to evaluate gene copy number gain and loss. While LiftoffTools is designed to analyze the output of Liftoff, it is also compatible with the output of other annotation transfer tools, such as UCSC liftOver (Kuhn et al., 2013), that preserve the feature IDs between annotations. Here we provide a description of each module as well as results comparing genes in the GRCh38 human reference genome mapped onto CHM13, the first truly complete human genome (Nurk et al., 2022).

Methods

The inputs required for all three modules of LiftoffTools are the sequences of the reference and target assemblies (in FASTA format), and the annotations of the reference and target assemblies (in GFF3 or GTF format). The target annotation can be derived from other lift-over tools besides Liftoff, as long as the feature IDs in the reference and target annotations are the same. All three modules can be run with the following command:

liftofftools all -r reference.fasta -t target.fasta -rg reference.gff3 -tg target.gff3

Each module can also be run separately as detailed on GitHub.
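Because all three modules rely on the reference and target annotations sharing feature IDs, it can be useful to check this before running the toolkit. The following is a minimal sketch of such a check, not part of LiftoffTools itself; the function names and the toy annotation lines are hypothetical, and the parser assumes GFF3-style attributes (ID=...) rather than GTF.

```python
import io


def read_feature_ids(handle):
    """Collect the ID attribute of every feature line in a GFF3 stream."""
    ids = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.strip().partition("=")
            if key == "ID":
                ids.add(value)
    return ids


def compare_feature_ids(reference_handle, target_handle):
    """Return (shared, reference_only, target_only) sets of feature IDs."""
    reference_ids = read_feature_ids(reference_handle)
    target_ids = read_feature_ids(target_handle)
    return (reference_ids & target_ids,
            reference_ids - target_ids,
            target_ids - reference_ids)


# Toy annotations with made-up coordinates, standing in for real GFF3 files.
reference_gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=A\n"
    "chr1\tdemo\tgene\t2000\t2900\t.\t-\t.\tID=gene2;Name=B\n"
)
target_gff3 = io.StringIO(
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t150\t950\t.\t+\t.\tID=gene1;Name=A\n"
)

shared, reference_only, target_only = compare_feature_ids(reference_gff3, target_gff3)
print(f"shared={sorted(shared)} reference_only={sorted(reference_only)}")
```

In this toy case gene2 appears only in the reference, which is what an unmapped feature would look like. A large number of target-only IDs would instead suggest that the transfer tool renamed features.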

Operation

LiftoffTools is implemented in Python 3 (version 3.6 or higher is required) and can be installed from PyPI (pip install liftofftools) or Bioconda (conda install -c bioconda liftofftools). Details on how to run LiftoffTools are available on the GitHub page.

Variants

The variants module calculates the sequence identity between mRNA transcripts in the reference genome and the corresponding transcripts in the target genome. For protein-coding genes, the module identifies variants that have a neutral or deleterious effect on the translated amino acid sequences in the target genome. The first step of the module globally aligns the nucleotide sequences of the reference transcripts to the target transcripts using the Needleman-Wunsch algorithm implemented by Parasail (Daily, 2016), which is a single instruction/multiple data (SIMD) C library for sequence alignment. If the transcript has an annotated protein-coding sequence (a CDS feature), we also align the protein sequences, again using Parasail. We then identify mismatches and gaps in the alignments and evaluate their effects on the protein sequence. The potential effects we look for are synonymous mutations, nonsynonymous mutations, in-frame deletions, in-frame insertions, start codon loss, 5′ truncations, 3′ truncations, frameshifts, and stop codon gain. For all transcripts we output the percent identity at the nucleotide level, and for protein-coding transcripts we also output the protein percent identity and the variant effect if applicable. While there may be multiple variants within a transcript, the intent of this module is to summarize the functional consequences of variation; therefore, if there is more than one variant, we report only the most severe. For example, if a transcript has a synonymous mutation and a frameshift mutation, we output ‘frameshift’ for that transcript as this would be more disruptive to gene function. Combining the sequence identity information with the variant effect can provide further insight into the severity of the variant. For example, a gene with a frameshift near the 3′ end or a gene with a second, compensatory frameshift nearby will have a high percent identity at the amino acid level and may still retain function.
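The alignment and identity calculation described above can be illustrated with a small pure-Python sketch. It is a stand-in for the Parasail call, not the LiftoffTools implementation: the scoring values, the linear gap penalty, and the definition of percent identity as matching columns divided by alignment length are all assumptions made for illustration.

```python
def needleman_wunsch(ref, tgt, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences and return the two gapped strings."""
    rows, cols = len(ref) + 1, len(tgt) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if ref[i - 1] == tgt[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_tgt = [], []
    i, j = len(ref), len(tgt)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if ref[i - 1] == tgt[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_ref.append(ref[i - 1])
                aligned_tgt.append(tgt[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_tgt.append("-")  # base missing from the target
            i -= 1
        else:
            aligned_ref.append("-")  # base missing from the reference
            aligned_tgt.append(tgt[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_tgt))


def percent_identity(aligned_ref, aligned_tgt):
    """Matching columns as a percentage of all alignment columns."""
    if not aligned_ref:
        return 0.0
    matches = sum(a == b and a != "-" for a, b in zip(aligned_ref, aligned_tgt))
    return 100.0 * matches / len(aligned_ref)


# Toy transcripts: the target lacks three bases present in the reference.
reference_transcript = "ATGGCCAAATAA"
target_transcript = "ATGAAATAA"
aligned_reference, aligned_target = needleman_wunsch(reference_transcript, target_transcript)
deleted_bases = aligned_target.count("-")
print(aligned_reference)
print(aligned_target)
print(percent_identity(aligned_reference, aligned_target), deleted_bases)
```

In the toy example the gap columns all fall in the target and their total length is a multiple of three, which is the situation the module would report as an in-frame deletion. A production tool would use affine gap penalties and a vectorized library such as Parasail rather than this quadratic-time loop.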

Synteny

The synteny module compares the gene order in the reference annotation to the order in the target annotation. The genes present in both annotations are sorted first by chromosome and then by start coordinate in each annotation. Each gene is then plotted as a point on a 2D plot where the x-coordinate is the ordinal position (1st, 2nd, 3rd, etc.) in the reference genome and the y-coordinate is the ordinal position in the target genome. The color of the point corresponds to the sequence identity between the corresponding genes, where green indicates higher identity and red indicates lower identity. Note that this color feature is only available for target annotations created by Liftoff, which include the sequence identity information in the GTF/GFF3 file. The plot and a file with the ordinal position and sequence identity of each gene are output. The user also has the option to calculate the edit distance between the reference order and the target order.
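The ordinal-position calculation can be sketched as follows. This is an illustrative outline rather than the module's code: the input dictionaries and function names are hypothetical, chromosomes are sorted lexicographically (so "chr10" sorts before "chr2", which a real tool would need to handle), and the edit distance shown is a plain Levenshtein distance over gene IDs, which is an assumption because the text does not specify the exact definition.

```python
def ordinal_positions(genes):
    """Map gene ID -> 1-based rank after sorting by (chromosome, start)."""
    ordered = sorted(genes, key=lambda gene_id: genes[gene_id])
    return {gene_id: rank for rank, gene_id in enumerate(ordered, start=1)}


def synteny_points(reference, target):
    """Return (x, y, gene_id) tuples for genes present in both annotations."""
    shared = reference.keys() & target.keys()
    reference_rank = ordinal_positions({g: reference[g] for g in shared})
    target_rank = ordinal_positions({g: target[g] for g in shared})
    return sorted((reference_rank[g], target_rank[g], g) for g in shared)


def edit_distance(first, second):
    """Levenshtein distance between two sequences of gene IDs."""
    previous = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        current = [i]
        for j, b in enumerate(second, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (a != b)))  # substitution
        previous = current
    return previous[-1]


# Toy annotations: gene ID -> (chromosome, start). g5 failed to map, and
# g2 and g3 have swapped order in the target.
reference = {"g1": ("chr1", 100), "g2": ("chr1", 500), "g3": ("chr1", 900),
             "g4": ("chr2", 50), "g5": ("chr2", 700)}
target = {"g1": ("chr1", 120), "g3": ("chr1", 480), "g2": ("chr1", 950),
          "g4": ("chr2", 60)}

points = synteny_points(reference, target)
reference_order = [gene for _, _, gene in points]
target_order = [gene for _, _, gene in sorted(points, key=lambda p: p[1])]
distance = edit_distance(reference_order, target_order)
print(points, distance)
```

Collinear genes fall on the diagonal (x equal to y), while the swapped pair appears as two off-diagonal points.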

Clusters

This module clusters the genes into paralogous groups to evaluate gene copy number gain and loss. LiftoffTools first invokes MMseqs2 (Steinegger and Söding, 2017) to cluster the reference gene sequences. MMseqs2 clusters the amino acid sequences of the protein-coding genes, and the nucleotide sequences of noncoding genes. For each gene we select only the longest isoform to be included in the clustering. For genes to be considered copies and be clustered together, they must be at least 90% identical across 90% of both of their lengths, although these parameters can be adjusted by the user. After clustering the reference genes, we create the target gene clusters by first iterating through each reference cluster and removing any gene absent in the target genome. Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog. If Liftoff was run without the -copies option, no extra gene copies will be present in the target annotation, and thus the clusters module will only report instances of copy number loss. For each cluster, we output the number of reference genes and the number of target genes belonging to that cluster as well as the gene IDs of the cluster members.
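The derivation of target clusters from reference clusters can be sketched as below. This is a simplified illustration under stated assumptions: the reference clusters are taken as given (in practice they come from MMseqs2), the cluster and gene names are invented, and the mapping from each extra copy to its closest paralog is supplied as a hypothetical dictionary rather than read from a Liftoff annotation.

```python
def build_target_clusters(reference_clusters, target_gene_ids, extra_copies):
    """Derive target clusters from reference clusters.

    reference_clusters: dict of cluster ID -> list of reference gene IDs
    target_gene_ids:    set of reference gene IDs that mapped to the target
    extra_copies:       dict of extra copy ID -> gene ID of its closest paralog
    """
    gene_to_cluster = {gene: cluster
                       for cluster, members in reference_clusters.items()
                       for gene in members}
    # Step 1: drop genes that are absent from the target genome.
    clusters = {cluster: [g for g in members if g in target_gene_ids]
                for cluster, members in reference_clusters.items()}
    # Step 2: place each extra copy in the cluster of its closest paralog.
    for copy_id, paralog in extra_copies.items():
        clusters[gene_to_cluster[paralog]].append(copy_id)
    return clusters


def copy_number_changes(reference_clusters, target_clusters):
    """Target minus reference copy number for every cluster."""
    return {cluster: len(target_clusters[cluster]) - len(members)
            for cluster, members in reference_clusters.items()}


# Toy data: A3 failed to map; B1 has two extra copies and C1 has one.
reference_clusters = {"c1": ["A1", "A2", "A3"], "c2": ["B1", "B2"], "c3": ["C1"]}
target_gene_ids = {"A1", "A2", "B1", "B2", "C1"}
extra_copies = {"B1_copy1": "B1", "B1_copy2": "B1", "C1_copy1": "C1"}

with_copies = build_target_clusters(reference_clusters, target_gene_ids, extra_copies)
changes = copy_number_changes(reference_clusters, with_copies)

# Without extra copies (Liftoff run without -copies) only losses can appear.
without_copies = build_target_clusters(reference_clusters, target_gene_ids, {})
changes_without = copy_number_changes(reference_clusters, without_copies)
print(changes, changes_without)
```

The second call shows why an annotation produced without the -copies option can only reveal copy number loss: no cluster can ever grow beyond its reference size.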

Results

To illustrate the use of these tools, we used them to compare the human annotation on the current reference genome, GRCh38, to the same annotation when mapped onto the first-ever complete human genome, CHM13 (Nurk et al., 2022). We first mapped the human annotation onto CHM13 by running Liftoff v1.6.3 (with options -copies -sc 0.95 -polish) to map genes from RefSeq release 110 (O’Leary et al., 2016) from GRCh38 onto CHM13v2.0. (This annotation is available on the Johns Hopkins Center for Computational Biology website.) We then ran each module of LiftoffTools on the resulting CHM13 annotation.

Variants

Running the variants module on GRCh38 and CHM13, we found that out of 130,316 protein-coding transcripts in GRCh38, 77,109 CHM13 transcripts were identical, 421 failed to map, and 52,669 had variants with the effects shown in Table 1. The vast majority of these effects were either simple amino acid changes or insertion/deletions (indels) that preserved the reading frame; only 932 of the variants had a major effect on the translated protein sequence.

Table 1. Effects of sequence differences on protein-coding transcripts and the number of transcripts affected in CHM13 identified by the LiftoffTools variants module.

In-frame changes refer to insertions or deletions that are a multiple of 3 in length. Truncations are variants that shorten the protein sequence by removing either the 5′ or 3′ end of the transcript including the start or stop codon. Start codon loss variants are point mutations in the start codon, and stop codon gain variants are point mutations that result in a premature stop codon.

Variant effect	Number of transcripts
None (synonymous)	21,823
Non-synonymous	28,507
In-frame deletion	744
In-frame insertion	663
Start codon lost	117
5’ truncation	1
3’ truncation	7
Frameshift	718
Stop codon gained	206
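The count of 932 major-effect transcripts quoted in the text can be reproduced from Table 1 if "major effect" is taken to mean truncations, frameshifts, and gained stop codons. That grouping is an inference from the numbers rather than something stated explicitly, so the short tally below should be read as a consistency check under that assumption.

```python
# Counts copied from Table 1 (number of CHM13 transcripts per variant effect).
table_1 = {
    "None (synonymous)": 21823,
    "Non-synonymous": 28507,
    "In-frame deletion": 744,
    "In-frame insertion": 663,
    "Start codon lost": 117,
    "5' truncation": 1,
    "3' truncation": 7,
    "Frameshift": 718,
    "Stop codon gained": 206,
}

# Assumed definition of a "major effect" on the translated protein.
major_effects = ("5' truncation", "3' truncation", "Frameshift", "Stop codon gained")

major_total = sum(table_1[effect] for effect in major_effects)
table_total = sum(table_1.values())
major_share = 100.0 * major_total / table_total

print(f"major-effect transcripts: {major_total}")
print(f"share of all transcripts listed in Table 1: {major_share:.2f}%")
```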

Synteny

We ran the synteny module to compare the gene order of CHM13 to GRCh38. The dot plot in Figure 1 shows that the vast majority of genes were collinear and nearly identical in sequence, as expected. The small number of genes which were not collinear generally mapped with a lower sequence identity, suggesting they may have been mapped to a different (non-syntenic) copy of a gene in a multi-gene family.

Figure 1. Dot plot showing the ordinal position of each gene in GRCh38 on the x-axis and the ordinal position in CHM13 on the y-axis.

The color of each point indicates the sequence identity, and the gray lines separate the chromosomes.

Clusters

The clusters module found 5,213 genes in GRCh38 with at least one paralog that met the 90% sequence identity and alignment length minimums. These 5,213 genes were grouped into 1,629 clusters with copy numbers ranging from 2 to 66. In CHM13, 8,356 genes had at least one paralog. These copies were grouped into 2,089 clusters with copy numbers ranging from 2 to 228. (Note that the ribosomal DNA gene is the largest cluster, and most copies of this gene are not present in the GRCh38 assembly.) Among clusters with a copy number of at least 2 in GRCh38, 134 clusters had fewer gene copies in CHM13, resulting in a total loss of 188 gene copies. A total of 715 clusters had more copies in CHM13, resulting in a total gain of 3,035 gene copies.

Conclusions

Liftoff gave us the ability to easily map genes between closely related genomes, but further analysis is required to identify similarities and differences between the genes in each assembly that may be biologically important. LiftoffTools enables this analysis by automating the comparison of protein-coding variants, gene synteny, and gene copy loss and gain. Here we provided an example demonstrating the use of LiftoffTools to compare genes mapped between two human assemblies, and we hope this set of tools will be useful for a wide diversity of assembled genomes from species across all domains of life.

Software availability

Source code available from: https://github.com/agshumate/LiftoffTools

Archived source code as at time of publication: https://doi.org/10.5281/zenodo.6967163 (Shumate, 2022)

License: GNU GPL v3

Data availability

Underlying data

GRCh38 sequence: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz

CHM13 sequence: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz

CHM13 annotation: https://ccb.jhu.edu/T2T.shtml or ftp://ftp.ccb.jhu.edu/pub/data/T2T-CHM13/chm13v2.0_RefSeq_Liftoff_v3.gff3

(Note: The CHM13 annotation has been updated to v4 since the submission of this manuscript)

References

Daily J: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 2016; 17: 1–11.
Kuhn RM, et al.: The UCSC genome browser and associated tools. Brief Bioinform. 2013; 14: 144–161.
Nurk S, et al.: The complete sequence of a human genome. Science. 2022; 376: 44–53.
O’Leary NA, et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44: D733–D745.
Shumate A, Salzberg SL: Liftoff: accurate mapping of gene annotations. Bioinformatics. 2021; 37: 1639–1643.
Shumate A: agshumate/LiftoffTools: (v0.4.3.2). [software] Zenodo. 2022.
Steinegger M, Söding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017; 35: 1026–1028.

Open Peer Review

Version 2, published 29 Apr 2024

Reviewer Report 29 May 2024

Adam Frankish, European Bioinformatics Institute, Wellcome Genome Campus, European Molecular Biology Laboratory, Cambridge, UK

Approved

https://doi.org/10.5256/f1000research.164429.r270095

Thanks to the authors for their clear ...

Version 1, published 28 Oct 2022

Reviewer Report 29 Nov 2023

Mark Borodovsky, School of Computational Science and Engineering, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA

Approved

https://doi.org/10.5256/f1000research.136230.r212791

The manuscript is well written and presents a useful computational tool for comparison of gene annotations between genome assemblies.

I have two minor comments for the Methods section.

Variants

Do transcripts include UTRs?

If yes, the applicability is limited to genomes with annotated UTRs, if no – the term transcript should be defined as such.

Clusters

In the sentence:

“Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog.”

The meaning of the “-copies” option, the difference with the default run, was not described.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, Genome Analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 29 Apr 2024

Alaina Shumate, Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, USA

Comment 1
Variants
Do transcripts include UTRs? If yes, the applicability is limited to genomes with annotated UTRs, if no – the term transcript should be defined as such.

Response 1
The mRNA transcripts are aligned which we derive by concatenating all of the ‘exon’ features in the GFF3 annotation. Some annotations include UTRs in the 5’ or 3’ exons. In these cases, yes, the UTRs are included. Others annotate them separately as their own feature upstream or downstream of exons. In these cases, they are not included. While the full transcript is used to calculate sequence identity, variants are only called within the amino acid sequence, so the module can be applied to genomes with or without annotated UTRs with no effect on the number or types of variants reported.

Comment 2
Clusters

In the sentence:

“Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog.”

The meaning of the “-copies” option, the difference with the default run, was not described.

Response 2
I have updated the manuscript to include the text below for clarification:

“Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog. If Liftoff was run without the -copies option, no extra gene copies will be present in the target annotation, and thus the clusters module will only report instances of copy number loss.”
Competing Interests: No competing interests were disclosed.

Reviewer Report 01 Dec 2022

Adam Frankish, European Bioinformatics Institute, Wellcome Genome Campus, European Molecular Biology Laboratory, Cambridge, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.136230.r154534

This manuscript describes a set of tools to analyze gene annotation that has been mapped between one genomic sequence and another closely related genomic sequence. The methods described build on the widely adopted Liftoff annotation transfer tool developed by the same group.

The development of LiftoffTools is clearly relevant and timely as we enter an era of rapidly increasing numbers of high quality genome sequences suitable for comparison to species reference genome assemblies that are generally the focus of gene annotation effort. The description of the tools is sound and we were able to run the code with the instructions provided and generate the correct/expected output. The code is written entirely in Python and looks clear and well-organized.

There is broadly sufficient information to allow interpretation of the results, although a little more guidance might be useful, for example, adding annotated examples of real data from Variant and Cluster analysis to guide users.

Comments:

1. "...as long as the feature IDs in the reference and target annotations are the same. All three modules can be run with the following command" - Is this somewhat inflexible? Not all transfer methods will preserve feature ID identically across different assemblies as they seek to avoid storing two or more features with identical IDs but different properties (sequence/length/etc). Is it possible to accommodate methods that take this approach?
2. It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.md how LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?
3. While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.
4. It would also be useful to specifically state the ranking of consequences in the /README.md file for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.
5. Similarly, as only one variant is reported, does LiftoffTools identify (and/or flag) corrective variation, e.g., a second frameshift that compensates for an earlier frameshift and restores the CDS with a small aa change?
6. In the calculation of cluster gain/loss, are haplotypic duplicated pseudogenes considered? i.e. is loss only deletion/absence of the gene or is loss (or gain) of function included as well? An example with real data in /README.md could be helpful.
7. What is reported for variants in genes that are missing or partial on GRCh38 where it is used as a reference? An example with real data in /README.md could be helpful.
8. The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Gene annotation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 29 Apr 2024

Alaina Shumate, Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, USA

Comment 1
Is this somewhat inflexible? Not all transfer methods will preserve feature ID identically across different assemblies as they seek to avoid storing two or more features with identical IDs but different properties (sequence/length/etc). Is it possible to accommodate methods that take this approach?

Response 1
While it would be great to accommodate all lift-over tools, without the preservation of feature IDs, identifying the reference/target gene equivalents for comparison becomes a non-trivial task. LiftoffTools was specifically designed as a post-processing toolkit for the output of Liftoff, and in our experience, other commonly used tools such as UCSC liftOver do indeed preserve feature IDs. Furthermore, the GFF3 specification states that all feature IDs must be unique. If a transfer method is changing IDs to eliminate duplicates, this is an issue of the reference annotation failing to adhere to the GFF3 specification and should be addressed prior to mapping annotations.

Comment 2
It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.md how LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?

Response 2
Currently, LiftoffTools does not identify ‘rescuing’ variants. In our particular case, we used the RefSeq annotation as a reference, which requires a valid open reading frame to annotate a coding sequence; therefore, there are no LoF coding sequences annotated in the reference to be mapped. We recognize that this is not true for all reference annotations; many include annotations of coding sequences without a valid open reading frame. There are varying schools of thought on whether this is acceptable, but regardless, we do agree that identifying rescuing or gain of function variants would be useful, and we will consider implementing that feature in the future. We have updated the manuscript to say “the module identifies variants that have a neutral or deleterious effect on the translated amino acid sequences in the target genome.” The README has also been updated.

Comment 3
While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.

Response 3
We do agree that splice site variants would be useful to include, but we are currently only aligning and looking at variants in the mRNA due to computational limitations. Performing Smith-Waterman alignment on transcript sequences including splice sites and introns would require significant computational resources for human and other eukaryotic genomes. Even with a very fast implementation of the Smith-Waterman alignment, aligning just the mRNA and amino acid sequences is the computational bottleneck of the LiftoffTools pipeline. While faster alignment methods could alleviate this, they would not have the same accuracy as Smith-Waterman, which we feel is necessary for accurately identifying the position and type of variants. We have edited the manuscript to specifically state that we are aligning mRNA sequences.

Comment 4
It would also be useful to specifically state the ranking of consequences in the /README.md file for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.

Response 4
Thank you for the suggestion. I have updated the README.md accordingly.

Comment 5
Similarly, as only one variant is reported, does LiftoffTools identify (and/or flag) corrective variation, e.g. a second frameshift that compensates for an earlier frameshift and restores the CDS with a small aa change?

Response 5
We do not report this explicitly. The intent of the variants module is to provide high-level summary information about how many genes were disrupted by variants rather than identifying every variant in every gene. We do, however, include the amino acid-level sequence identity information, so a compensatory frameshift can be inferred if a sequence is reported to have a frameshift but also retains a high sequence identity to the reference protein. We have added the following text to the manuscript to capture these points.

“While there may be multiple variants within a transcript, the intent of this module is to summarize the functional consequences of variation; therefore, if there is more than one variant, we report only the most severe. For example, if a transcript has a synonymous mutation and a frameshift mutation, we output ‘frameshift’ for that transcript as this would be more disruptive to gene function. Combining the sequence identity information with the variant effect can provide further insights into the severity of the variant. For example, a gene with a frameshift near the 3’ end or a gene with a compensatory frameshift nearby will have a high percent identity at the amino acid level and may still retain function.”
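The "most severe only" rule in the quoted text can be illustrated with a short sketch. The ranking below is hypothetical and serves only as an example; the ranking actually used is the one documented in the LiftoffTools README. The identity threshold in `may_retain_function` is likewise an assumed value, not one taken from the tool.

```python
# Hypothetical ranking from least to most severe (illustration only;
# the real order is the one stated in the LiftoffTools README).
SEVERITY_ORDER = ["synonymous", "nonsynonymous", "inframe_indel", "frameshift"]


def most_severe(effects, order=SEVERITY_ORDER):
    """Return the single most severe effect for a transcript, or None if it has none."""
    if not effects:
        return None
    rank = {name: i for i, name in enumerate(order)}
    # An effect that is not in `order` raises KeyError rather than being ranked silently.
    return max(effects, key=lambda effect: rank[effect])


def may_retain_function(effect, protein_identity, threshold=0.95):
    """Flag a frameshift whose protein identity stays high (assumed threshold).

    High identity despite a frameshift is consistent with a frameshift near
    the 3' end or with a nearby compensatory frameshift.
    """
    return effect == "frameshift" and protein_identity >= threshold
```

For a transcript carrying both a synonymous change and a frameshift, `most_severe(["synonymous", "frameshift"])` returns `"frameshift"`, matching the example in the quoted text.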

Comment 6
In the calculation of cluster gain/loss, are haplotypic duplicated pseudogenes considered? i.e. is loss only deletion/absence of the gene or is loss (or gain) of function included as well? An example with real data in /README.md could be helpful.

Response 6
This is a good point. Liftoff intentionally avoids annotating extra gene copies in the target genome that are processed pseudogenes by including introns in the initial alignment step. Therefore, when working with the output of Liftoff, it is not something we need to consider. If a different annotation tool was used that does annotate pseudogenes, they will get clustered with their paralogs even if they are not functional. As previously mentioned, LiftoffTools was designed to be used in conjunction with Liftoff, so we have not considered a strategy for removing non-functional pseudogene copies from the clusters.

Comment 7
What is reported for variants in genes that are missing or partial on GRCh38 where it is used as a reference? An example with real data in /README.md could be helpful.

Response 7
Variants are only reported for genes that are in both GRCh38 and CHM13. We state in the manuscript:

“The variants module calculates the sequence identity between mRNA transcripts in the reference genome and the corresponding transcripts in the target genome…”

A limitation of lifting over gene annotations is that a partial reference gene will likely also be annotated as partial in the target annotation. There are various lift-over algorithms/strategies; however, they generally rely on converting the start and end coordinates of the gene from reference to target. If the start-end range is only a partial gene, only that part of the gene will be lifted over. In these cases, they will be reported as either a 5’ truncation or a 3’ truncation based on the presence or absence of start and stop codons.
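The start/stop codon criterion described above can be sketched as follows. This is an illustrative assumption about how such a check could be written, not the LiftoffTools implementation: it assumes the standard genetic code, treats only ATG as a start codon, and takes the coding sequence as a plain string.

```python
# Stop codons of the standard genetic code (assumption: no alternative codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def truncation_status(cds):
    """Return the truncation labels implied by missing start or stop codons."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = len(cds) >= 3 and cds[-3:] in STOP_CODONS
    labels = []
    if not has_start:
        labels.append("5' truncation")
    if not has_stop:
        labels.append("3' truncation")
    return labels
```

A coding sequence lifted from a partial reference gene that lacks its first exon would begin without ATG and be labelled as a 5' truncation under this check.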

Comment 8
The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.

Response 8
Thank you for bringing this to our attention. The link to the FASTA file has been replaced with a link to a file that uses the same chromosome names.

Competing Interests: No competing interests were disclosed.

Version 2

VERSION 2 PUBLISHED 28 Oct 2022

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 2 (revision) 29 Apr 24	read
Version 1 28 Oct 22	read	read

Adam Frankish, European Molecular Biology Laboratory, Cambridge, UK
Mark Borodovsky, Georgia Institute of Technology, Atlanta, USA

Reviewer Report

29 May 2024 | for Version 2

Adam Frankish, European Bioinformatics Institute, Wellcome Genome Campus, European Molecular Biology Laboratory, Cambridge, UK

Approved

Thanks to the authors for their clear responses to questions and updates to the manuscript.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Gene annotation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

29 Nov 2023 | for Version 1

Mark Borodovsky, School of Computational Science and Engineering, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA

Approved

Is the rationale for developing the new software tool clearly explained?
Yes

Is the description of the software tool technically sound?
Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, Genome Analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Responses (1)

Author Response

29 Apr 2024

Alaina Shumate, Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, USA

Comment 1
Variants
Do transcripts include UTRs? If yes, the applicability is limited to genomes with annotated UTRs, if no – the term transcript should be defined as such.

Response 1
We align the mRNA transcripts, which we derive by concatenating all of the ‘exon’ features in the GFF3 annotation. Some annotations include UTRs in the 5’ or 3’ exons; in these cases, the UTRs are included. Others annotate UTRs separately as their own features upstream or downstream of the exons; in these cases, they are not included. While the full transcript is used to calculate sequence identity, variants are only called within the amino acid sequence, so the module can be applied to genomes with or without annotated UTRs with no effect on the number or types of variants reported.
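Deriving a transcript by concatenating exon features can be sketched as below. This is a simplified illustration rather than the LiftoffTools code: the helper names are hypothetical, exon coordinates are assumed to be 1-based and inclusive as in GFF3, and the chromosome is assumed to be available as an in-memory string.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_transcript(chrom_seq, exons, strand):
    """Concatenate exon sequences into a transcript.

    exons: list of (start, end) pairs, 1-based inclusive (GFF3 convention).
    strand: "+" or "-"; minus-strand transcripts are reverse-complemented.
    """
    # Join exons in genomic order, converting to 0-based half-open slices.
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    return reverse_complement(seq) if strand == "-" else seq
```

Whether UTR bases end up in the result depends only on whether the annotation places them inside the ‘exon’ features, which is the distinction drawn in the response.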

Comment 2
Clusters

In the sentence:

“Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog.”

The meaning of the “-copies” option, and how it differs from the default run, was not described.

Response 2
I have updated the manuscript to include the text below for clarification:

“Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog. If Liftoff was run without the -copies option, no extra gene copies will be present in the target annotation, and thus the clusters module will only report instances of copy number loss.”
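The consequence stated in the added text (only losses are reported when no extra copies are annotated) follows from a simple per-cluster count. The sketch below assumes a hypothetical input layout, a mapping from cluster ID to the list of member gene IDs in each genome, and does not reproduce the clustering step itself.

```python
def copy_number_changes(ref_clusters, tgt_clusters):
    """Return {cluster_id: target_count - reference_count} for clusters that changed.

    ref_clusters / tgt_clusters: dict mapping a cluster ID to the list of
    gene IDs in that cluster (hypothetical layout used for illustration).
    A negative value is a copy number loss; a positive value is a gain.
    """
    changes = {}
    for cluster_id, ref_members in ref_clusters.items():
        tgt_members = tgt_clusters.get(cluster_id, [])
        delta = len(tgt_members) - len(ref_members)
        if delta != 0:
            changes[cluster_id] = delta
    return changes
```

If the target annotation contains no extra copies, every target cluster is a subset of its reference cluster, so no value in the result can be positive.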

Competing Interests

No competing interests were disclosed.

Reviewer Report

01 Dec 2022 | for Version 1

Adam Frankish, European Bioinformatics Institute, Wellcome Genome Campus, European Molecular Biology Laboratory, Cambridge, UK

Approved With Reservations

"...as long as the feature IDs in the reference and target annotations are the same. All three modules can be run with the following command" - Is this somewhat inflexible? Not all transfer methods will preserve feature ID identically across different assemblies as they seek to avoid storing two or more features with identical IDs but different properties (sequence/length/etc). Is it possible to accommodate methods that take this approach?
It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.md how LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?
While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.
It would also be useful to specifically state the ranking of consequences in the /README.md file for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.
Similarly, as only one variant is reported, does LiftoffTools identify (and/or flag) corrective variation, e.g. a second frameshift that compensates for an earlier frameshift and restores the CDS with a small aa change?
In the calculation of cluster gain/loss, are haplotypic duplicated pseudogenes considered? i.e. is loss only deletion/absence of the gene or is loss (or gain) of function included as well? An example with real data in /README.md could be helpful.
What is reported for variants in genes that are missing or partial on GRCh38 where it is used as a reference? An example with real data in /README.md could be helpful.
The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Gene annotation

Responses (1)

Author Response

29 Apr 2024

Alaina Shumate, Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, USA

Comment 1
Is this somewhat inflexible? Not all transfer methods will preserve feature ID identically across different assemblies as they seek to avoid storing two or more features with identical IDs but different properties (sequence/length/etc). Is it possible to accommodate methods that take this approach?

Response 1
While it would be great to accommodate all lift-over tools, without the preservation of feature IDs, identifying the reference/target gene equivalents for comparison becomes a non-trivial task. LiftoffTools was specifically designed as a post-processing toolkit for the output of Liftoff, and in our experience, other commonly used tools such as UCSC liftOver do indeed preserve feature IDs. Furthermore, the GFF3 specification states that all feature IDs must be unique. If a transfer method is changing IDs to eliminate duplicates, this is an issue of the reference annotation failing to adhere to the GFF3 specification and should be addressed prior to mapping annotations.

Comment 2
It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.md how LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?

Response 2
Currently, LiftoffTools does not identify ‘rescuing’ variants. In our particular case, we used the RefSeq annotation as a reference, which requires a valid open reading frame to annotate a coding sequence; therefore, there are no LoF coding sequences annotated in the reference to be mapped. We recognize that this is not true for all reference annotations; many include annotations of coding sequences without a valid open reading frame. There are varying schools of thought on whether this is acceptable, but regardless, we do agree that identifying rescuing or gain of function variants would be useful, and we will consider implementing that feature in the future. We have updated the manuscript to say “the module identifies variants that have a neutral or deleterious effect on the translated amino acid sequences in the target genome.” The README has also been updated.

Comment 3
While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.

Response 3
We do agree that splice site variants would be useful to include, but we are currently only aligning and looking at variants in the mRNA due to computational limitations. Performing Smith-Waterman alignment on transcript sequences including splice sites and introns would require significant computational resources for human and other eukaryotic genomes. Even with a very fast implementation of Smith-Waterman alignment, aligning just the mRNA and amino acid sequences is the computational bottleneck of the LiftoffTools pipeline. While faster alignment methods could alleviate this, they would not have the same accuracy as Smith-Waterman, which we feel is necessary for accurately identifying the position and type of variants. We have edited the manuscript to specifically state that we are aligning mRNA sequences.

Comment 4
It would also be useful to specifically state the ranking of consequences in the /README.md file for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.

Response 4
Thank you for the suggestion. I have updated the README.md accordingly.

Comment 5
Similarly, as only one variant is reported, does LiftoffTools identify (and/or flag) corrective variation, e.g. a second frameshift that compensates for an earlier frameshift and restores the CDS with a small aa change?

Response 5
We do not report this explicitly. The intent of the variants module is to provide high-level summary information about how many genes were disrupted by variants rather than identifying every variant in every gene. We do, however, include the amino-acid-level sequence identity information, so compensatory frameshifts can be inferred if a sequence is reported to have a frameshift but also retains a high sequence identity to the reference protein. We have added the following text to the manuscript to capture these points:

“While there may be multiple variants within a transcript, the intent of this module is to summarize the functional consequences of variation; therefore, if there is more than one variant, we report only the most severe. For example, if a transcript has a synonymous mutation and a frameshift mutation, we output ‘frameshift’ for that transcript as this would be more disruptive to gene function. Combining the sequence identity information with the variant effect can provide further insights into the severity of the variant. For example, a gene with a frameshift near the 3’ end or a gene with a compensatory frameshift nearby will have a high percent identity at the amino acid level and may still retain function.”
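
A minimal sketch of this "report only the most severe effect" rule is given below. It is illustrative rather than the module's implementation: the effect labels other than ‘synonymous’ and ‘frameshift’, the ordering in `SEVERITY_ORDER`, the function names, and the 0.95 identity threshold are all assumptions made for the example; the ranking actually used is the one stated in the README.

```python
from typing import Iterable, Optional

# Hypothetical ranking, least to most severe (assumption for illustration).
SEVERITY_ORDER = ["synonymous", "nonsynonymous", "inframe_indel", "frameshift"]
_RANK = {effect: rank for rank, effect in enumerate(SEVERITY_ORDER)}


def most_severe(effects: Iterable[str]) -> Optional[str]:
    """Return the single most severe effect, or None if there are none.

    Raises KeyError for an effect label missing from SEVERITY_ORDER.
    """
    effects = list(effects)
    if not effects:
        return None
    return max(effects, key=_RANK.__getitem__)


def may_retain_function(
    effect: str, aa_identity: float, min_identity: float = 0.95
) -> bool:
    """Flag a frameshift transcript whose protein is still highly similar.

    min_identity is an arbitrary illustrative cutoff, not a recommended value.
    """
    return effect == "frameshift" and aa_identity >= min_identity
```

Combining the two functions mirrors the reasoning in the quoted text: a transcript labelled ‘frameshift’ that still has high amino acid identity is a candidate for a late or compensated frameshift.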

Comment 6
In the calculation of cluster gain/loss, are haplotypic duplicated pseudogenes considered? i.e. is loss only deletion/absence of the gene or is loss (or gain) of function included as well? An example with real data in /README.md could be helpful.

Response 6
This is a good point. Liftoff intentionally avoids annotating processed pseudogenes as extra gene copies in the target genome by including introns in the initial alignment step. Therefore, when working with the output of Liftoff, this is not something we need to consider. If a different annotation tool that does annotate pseudogenes is used, the pseudogenes will be clustered with their paralogs even if they are not functional. As previously mentioned, LiftoffTools was designed to be used in conjunction with Liftoff, so we have not considered a strategy for removing non-functional pseudogene copies from the clusters.

Comment 7
What is reported for variants in genes that are missing or partial on GRCh38 where it is used as a reference? An example with real data in /README.md could be helpful.

Response 7
Variants are only reported for genes that are in both GRCh38 and CHM13. We state in the manuscript:

“The variants module calculates the sequence identity between mRNA transcripts in the reference genome and the corresponding transcripts in the target genome…”

A limitation of lifting over gene annotations is that a partial reference gene will likely also be annotated as partial in the target annotation. There are various lift-over algorithms/strategies; however, they generally rely on converting the start and end coordinates of the gene from reference to target. If the start-end range covers only part of a gene, only that part of the gene will be lifted over. In these cases, the genes will be reported as having either a 5’ truncation or a 3’ truncation based on the presence or absence of start and stop codons.
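
The codon-based check mentioned at the end of this response can be sketched as follows. This is an assumption-laden illustration, not the variants module: `classify_truncation` is a hypothetical name, the sketch assumes the standard genetic code and an in-frame coding sequence, and it does not try to distinguish a truncation from a point mutation that removes the start or stop codon.

```python
from typing import List

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_truncation(cds: str, start_codon: str = "ATG") -> List[str]:
    """Label a coding sequence by which terminal codons are missing."""
    cds = cds.upper()
    labels = []
    if not cds.startswith(start_codon):
        labels.append("5' truncation")  # no start codon at the 5' end
    if cds[-3:] not in STOP_CODONS:
        labels.append("3' truncation")  # no stop codon at the 3' end
    return labels
```

For example, `classify_truncation("AAATAA")` returns a single 5’ label because the sequence ends in a stop codon but does not begin with ATG.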

Comment 8
The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.

Response 8
Thank you for bringing this to our attention. The link to the fasta file has been replaced with a link to a file that uses the same chromosome names as the GFF.

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, or sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Daily J: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 2016; 17: 1–11. Publisher Full Text

[2] Kuhn RM, et al.: The UCSC genome browser and associated tools. Brief Bioinform. 2013; 14: 144–161. PubMed Abstract | Publisher Full Text

[3] Nurk S, et al.: The complete sequence of a human genome. Science. 2022; 376: 44–53. Publisher Full Text

[4] O’Leary NA, et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44: D733–D745. PubMed Abstract | Publisher Full Text

[5] Shumate A, Salzberg SL: Liftoff: accurate mapping of gene annotations. Bioinformatics. 2021; 37: 1639–1643. PubMed Abstract | Publisher Full Text

[6] Shumate A: agshumate/LiftoffTools: (v0.4.3.2). [software] Zenodo. 2022. Publisher Full Text

[7] Steinegger M, Söding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017; 35: 1026–1028. PubMed Abstract | Publisher Full Text

Table 1. Effects of sequence differences on protein-coding transcripts and the number of transcripts affected in CHM13 identified by the LiftoffTools variants module.

Figure 1. Dot plot showing the ordinal position of each gene in GRCh38 on the x-axis and the ordinal position in CHM13 on the y-axis.
