LiftoffTools: a toolkit for comparing gene annotations mapped between genome assemblies

In 2020 we published Liftoff, which was the first standalone tool specifically designed for transferring gene annotations between genome assemblies of the same or closely related species. While the gene content is expected to be very similar in closely related genomes, the differences may be biologically consequential, and a computational method to extract all gene-related differences should prove useful in the analysis of such genomes. Here we present LiftoffTools, a toolkit to automate the detection and analysis of gene sequence variants, synteny, and gene copy number changes. We provide a description of the toolkit and an example of its use comparing genes mapped between two human genome assemblies.


Introduction
Liftoff (Shumate and Salzberg, 2021) is a computational tool specifically designed for mapping gene annotations from a reference assembly to a target assembly of the same or closely related species.Liftoff uses sequence alignment software to align the complete exon-intron structure of each annotated transcript from a source to a target, and it can also map virtually any other feature specified as an interval along the genome.It also includes a method to find additional copies of genes that might be present in higher copy numbers in the target genome.After lifting genes over, one of the first questions that many researchers have is how the sets of genes compare between the reference assembly and the target, and in particular whether any of the differences are biologically consequential.
Here we introduce LiftoffTools, a toolkit to compare genes mapped from one assembly to another.LiftoffTools includes three different modules.The first identifies changes in protein-coding genes and their effects on the corresponding genes, including simple amino acid changes as well as more-serious alterations.The second compares the gene synteny (e.g., the preservation of gene order along the chromosomes), and the third clusters genes into groups of paralogs to evaluate gene copy number gain and loss.While LiftoffTools is designed to analyze the output of Liftoff, it is also compatible with the output of other annotation transfer tools such as UCSC liftOver (Kuhn et al., 2013) that preserve the feature IDs between annotations.Here we provide a description of each module as well as results comparing genes in the GRCh38 human reference genome mapped onto CHM13, the first truly complete human genome (Nurk et al., 2022).

Methods
The inputs required for all three modules of LiftoffTools are the sequences of the reference and target assemblies (in FASTA format), and the annotations of the reference and target assemblies (in GFF3 or GTF format).The target annotation can be derived from other lift-over tools besides Liftoff, as long as the feature IDs in the reference and target annotations are the same.All three modules can be run with the following command: liftofftools all -r reference.fasta-t target.fasta-rg reference.gff3-tg target.gff3 Each module can also be run separately as detailed on GitHub.

Operation
LiftoffTools is designed and implemented in Python 3 (requires 3.6 or higher) and is easily installable with PyPi (pip install liftofftools) and bioconda (conda install -c bioconda liftofftools).Details on how to run LiftoffTools are available on the GitHub page.

Variants
The variants module calculates the sequence identity between mRNA transcripts in the reference genome and the corresponding transcripts in the target genome.For protein-coding genes, the module identifies variants that have a neutral or deleterious effect on the translated amino acid sequences in the target genome.The first step in the module will globally align the nucleotide sequences of the reference transcripts to the target transcripts using the Needleman-Wunsch algorithm implemented by Parasail (Daily, 2016), which is a single instruction/multiple data (SIMD) C library for sequence alignment.If the transcript has an annotated protein-coding sequence (a CDS feature), we align the protein sequences again using Parasail.We then identify mismatches and gaps in the alignments and evaluate the effects on the protein sequence.The potential effects we look for are synonymous mutations, nonsynonymous mutations, in-frame deletions, in-frame insertions, start codon loss, 5 0 truncations, 3 0 truncations, frameshifts, and stop codon gain.For all transcripts we output the percent identity at the nucleotide level, and for protein-coding transcripts we also output the protein percent identity and the variant effect if applicable.While there may be multiple variants within a transcript, the intent of this module is to summarize the functional consequences of variation; therefore, if there is more than one variant, we report only the most severe.For example, if a transcript has a synonymous mutation and a frameshift mutation, we output 'frameshift' for that transcript as this would be more disruptive to gene function.Combining the sequence identity information with the variant effect can provide further insights into the severity of the variant.For example, a gene with a frameshift near the 3 0 end or a gene with a second, compensatory frameshift nearby will have a high percent identity at the amino acid level and may still retain function.

REVISED Amendments from Version 1
The differences in this version are just minor updates to the text to further elaborate on points that were unclear to the reviewers.No data, methods, or conclusions have changed.
Any further responses from the reviewers can be found at the end of the article

Synteny
The synteny module compares the gene order in the reference annotation to the order in the target annotation.The genes present in both annotations are sorted first by chromosome and then by start coordinate in each annotation.Each gene is then plotted as a point on a 2D plot where the x-coordinate is the ordinal position (e.g., 1 st , 2 nd , 3 rd , etc.) in the reference genome and the y-coordinate is the ordinal position in the target genome.The color of the point corresponds to the sequence identity between the corresponding genes, where green indicates higher identity and red indicates lower identity.Note this color feature is only available for target annotations created by Liftoff which have the sequence identity information in the GTF/GFF3.The plot and a file with the ordinal positions and sequence identities of each gene is output.The user also has the option to calculate the edit distance between the reference order and the target order.

Clusters
This module clusters the genes into paralogous groups to evaluate gene copy number gain and loss.LiftoffTools first invokes MMSeqs2 (Steinegger and Söding, 2017) to cluster the reference gene sequences.MMSeqs2 clusters the amino acid sequences of the protein-coding genes, and the nucleotide sequences of noncoding genes.For each gene we select only the longest isoform to be included in the clustering.For genes to be considered copies and be clustered together, they must be at least 90% identical across 90% of both of their lengths, although these parameters can be adjusted by the user.After clustering the reference genes, we create the target gene clusters by first iterating through each reference cluster and removing any gene absent in the target genome.Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog.If Liftoff was run without the -copies option, no extra gene copies will be present in the target annotation, and thus the clusters modules will only report instances of copy number loss.For each cluster, we output the number of reference genes and the number of target genes belonging to that cluster as well as the gene IDs of the cluster members.

Results
To illustrate the use of these tools, we used them to compare the human annotation on the current reference genome, GRCh38, to the same annotation when mapped onto the first-ever complete human genome, CHM13 (Nurk et al., 2022).We first mapped the human annotation onto CHM13 by running Liftoff v1.6.3 (with options -copies -sc 0.95 -polish) to map genes from RefSeq release 110 (O'Leary et al., 2016) from GRCh38 onto CHM13v2.0.(This annotation is available on the Johns Hopkins Center for Computational Biology website) We then ran each module of LiftoffTools on the resulting CHM13 annotation.

Variants
Running the variants module on GRCh38 and CHM13, we found that out of 130,316 protein-coding transcripts in GRCh38, 77,109 CHM13 transcripts were identical, 421 failed to map, and 52,669 had variants with the effects shown in Table 1.The vast majority of these effects were either simple amino acid changes or insertion/deletions (indels) that preserved the reading frame; only 932 of the variants had a major effect on the translated protein sequence.
Table 1.Effects of sequence differences on protein-coding transcripts and the number of transcripts affected in CHM13 identified by the LiftoffTools variants module.In-frame changes refer to insertions or deletions that are a multiple of 3 in length.Truncations are variants that shorten the protein sequence by removing either the 5 0 or 3 0 end of the transcript including the start or stop codon.Start codon loss variants are point mutations in the start codon, and stop codon gain variants are point mutations that result in a premature stop codon.

Stop codon gained 206 Synteny
We ran the synteny module to compare the gene order of CHM13 to GRCh38.The dot plot in Figure 1 shows that the vast majority of genes were collinear and nearly identical in sequence, as expected.The small number of genes which were not collinear generally mapped with a lower sequence identity, suggesting they may have been mapped to a different (nonsyntenic) copy of a gene in a multi-gene family.

Clusters
The clusters module found 5,213 genes in GRCh38 with at least one paralog that met the 90% sequence identity and alignment length minimums.These 5,213 genes were grouped into 1,629 clusters with copy numbers ranging from two to 66.In CHM13, 8,356 genes had at least one paralog.These copies were grouped into 2,089 clusters with copy numbers ranging from two to 228.(Note that the ribosomal DNA gene is the largest cluster, and most copies of this gene are not present in the GRCh38 assembly.)Among clusters with a copy number of at least 2 in GRCh38, 134 clusters had fewer gene copies in CHM13 resulting in a total loss of 188 gene copies.A total of 715 clusters had more copies in CHM13 resulting in a total gain of 3,035 gene copies.

Conclusions
Liftoff gave us the ability to easily map genes between closely related genomes, but further analysis is required to identify similarities and differences between the genes in each assembly that may be biologically important.LiftoffTools enables this analysis by automating the comparison of protein-coding variants, gene synteny, and gene copy loss and gain.Here we provided an example demonstrating the use of LiftoffTools to compare genes mapped between two human assemblies, and we hope this set of tools will be useful for a wide diversity of assembled genomes from species across all domains of life.I have two minor comments for the Methods section.
Variants Do transcripts include UTRs?
If yes, the applicability is limited to genomes with annotated UTRs, if no -the term transcript should be defined as such.

Clusters
In the sentence: "Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog." The meaning of the "-copies" option, the difference with the default run, was not described.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound?Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?Yes Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, Genome Analysis I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Alaina Shumate
Comment 1 Variants Do transcripts include UTRs?If yes, the applicability is limited to genomes with annotated UTRs, if no -the term transcript should be defined as such.

Response 1
The mRNA transcripts are aligned which we derive by concatenating all of the 'exon' features in the GFF3 annotation.Some annotations include UTRs in the 5' or 3' exons.In these cases, yes, the UTRs are included.Others annotate them separately as their own feature upstream or downstream of exons.In these cases, they are not included.While the full transcript is used to calculate sequence identity, variants are only called within the amino acid sequence, so the module can be applied to genomes with or without annotated UTRs with no effect on the number or types of variants reported.

Comment 2 Clusters
In the sentence: "Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog." The meaning of the "-copies" option, the difference with the default run, was not described.
Response 2 I have updated the manuscript to include the text below for clarification: "Next, if Liftoff was run with the -copies option to identify extra gene copies in the target genome, we add the extra copies to the same cluster as their closest paralog.If Liftoff was run without the -copies option, no extra gene copies will be present in the target annotation, and thus the clusters module will only report instances of copy number loss." Competing Interests: No competing interests were disclosed.

Adam Frankish
European Bioinformatics Institute, Wellcome Genome Campus, European Molecular Biology Laboratory, Cambridge, UK This manuscript describes a set of tools to analyze gene annotation that has been mapped between one genomic sequence and another closely related genomic sequence.The methods described build on the widely adopted Liftoff annotation transfer tool developed by the same group.
The development of LiftoffTools is clearly relevant and timely as we enter an era of rapidly increasing numbers of high quality genome sequences suitable for comparison to species reference genome assemblies that are generally the focus of gene annotation effort.The description of the tools is sound and we were able to run the code with the instructions provided and generate the correct/expected output.The code is written entirely in Python and looks clear and well-organized.
There is broadly sufficient information to allow interpretation of the results, although a little more guidance might be useful, for example, adding annotated examples of real data from Variant and Cluster analysis to guide users.
Comments: "...as long as the feature IDs in the reference and target annotations are the same.All three modules can be run with the following command" -Is this somewhat inflexible?Not all transfer methods will preserve feature ID identically across different assemblies as they seek to avoid storing two or more features with identical IDs but different properties (sequence/length/etc).Is it possible to accommodate methods that take this approach?○ It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.mdhow LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?

○
While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.

○
It would also be useful to specifically state the ranking of consequences in the /README.mdfile for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.

○
Similarly, as only one variant is reported, does LiftoffTools identify (and/or flag) corrective variation e.g a second frameshift that compensates for an earlier frameshift and restores the CDS with a small aa change?

○
In the calculation of cluster gain/loss, are haplotypic duplicated pseudogenes considered?i.e. is loss only deletion/absence of the gene or is loss (or gain) of function included as well?An example with real data in /README.mdcould be helpful.

○
What is reported for variants in genes that are missing or partial on GRCh38 where it is used ○ as a reference?An example with real data in /README.mdcould be helpful.
The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.Reviewer Expertise: Gene annotation I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
IDs must be unique.If a transfer method is changing IDs to eliminate duplicates, this is an issue of the reference annotation failing to adhere to the GFF3 specifications and should be addressed prior to mapping annotations.

Comment 2
It is not clear from the manuscript or the supporting information in https://github.com/agshumate/LiftoffTools/blob/master/README.mdhow LiftoffTools handles genes that are LoF on genome 1 but functional on genome 2. Can these genes and their 'rescuing' variation be identified?
Response 2 Currently, LiftoffTools does not identify 'rescuing' variants.In our particular case, we used the RefSeq annotation as a reference, which requires a valid open reading frame to annotate a coding sequence; therefore, there are no LoF coding sequences annotated in the reference to be mapped.We recognize that this is not true for all reference annotations; many include annotations of coding sequences without a valid open reading frame.There are varying schools of thought on whether this is acceptable, but regardless, we do agree that identifying rescuing or gain of function variants would be useful, and we will consider implementing that feature in the future.We have updated the manuscript to say "the module identifies variants that have a neutral or deleterious effect on the translated amino acid sequences in the target genome."The README has also been updated.

Comment 3
While the list of variant consequences is comprehensive for the annotated CDS, it would be useful to add other LoF consequences such as disruption of core splice site to the analysis.

Response 3
We do agree that splice site variants would be useful to include, but we are currently only aligning and looking at variants in the mRNA due to computational limitations.Performing Smith-Waterman alignment on transcript sequences including splice-sites and introns would require significant computational resources for human and other eukaryotic genomes.Even with a very fast implementation of the Smith-Waterman alignment, aligning just the mRNA and amino acid sequences is the computational bottleneck of the LiftoffTools pipeline.While faster alignment methods could alleviate this, they would not have the same accuracy as Smith-Waterman, which we feel is necessary for accurately identifying the position and type of variants.We have edited the manuscript to specifically state that we are aligning mRNA sequences.

Comment 4
It would also be useful to specifically state the ranking of consequences in the /README.mdfile for genes with multiple transcript-affecting variants as only the most significant is provided in the variation output file.
Variants are only reported for genes that are in both GRCh38 and CHM13.We state in the manuscript "The variants module calculates the sequence identity between mRNA transcripts in the reference genome and the corresponding transcripts in the target genome…" A limitation of lifting over gene annotations is that a partial reference gene will likely also be annotated as partial in the target annotation.There are various lift-over algorithms/strategies; however, they generally rely on converting the start and end coordinates of the gene from reference to target.If the start-end range is only a partial gene, only that part of the gene will be lifted over.In these cases, they will be reported as either a 5' truncation or a 3' truncation based on the presence or absence of start and stop codons.

Comment 8
The CHM13 GFF and FNA (fasta) files appear to have different chromosome names, which threw an error when running the code.

Response 8
Thank you for bringing this to our attention.The link to the fasta file has been replaced to a file with the same chromosome names.
Competing Interests: No competing interests were disclosed.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com

Figure 1 .
Figure 1.Dot plot showing the ordinal position of each gene in GRCh38 on the x-axis and the ordinal position in CHM13 on the y-axis.The color of each point indicates the sequence identity, and the gray lines separate the chromosomes.

○
Is the rationale for developing the new software tool clearly explained?Yes Is the description of the software tool technically sound?YesAre sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?YesIs sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?PartlyAre the conclusions about the tool and its performance adequately supported by the findings presented in the article?Yes Competing Interests: No competing interests were disclosed.
This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.