HGNChelper: identification and correction of invalid gene symbols for human and mouse

Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion by spreadsheets to date format. Official gene symbol resources such as HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and correction of symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN.


Introduction
Gene symbols are widely used in biomedical research because they provide descriptive and memorable nomenclature for communication. However, gene symbols are constantly updated through the discovery and re-identification of genes, resulting in new names or aliases. For example, GCN5L2 (General Control of amino acid synthesis protein 5-Like 2) was later discovered to function as a histone acetyltransferase and was therefore renamed KAT2A (K(lysine) Acetyl Transferase 2A) 1. In addition to the rapid and constant updates to valid gene symbols, commonly used spreadsheet software, such as Microsoft Excel, modifies some gene symbols, converting them into dates or floating-point numbers 2, 3. For example, 'DEC1', a symbol for the 'Deletion in Esophageal Cancer 1' gene, can be exported in date format as '1-DEC'. There have been attempts to rectify gene symbol issues, but they have largely been limited to Excel-modified gene symbols. Also, the suggested solutions often reference static files with the corrections curated at the time of publication 3 or comprise scripts for detecting the existence of Excel-modified gene symbols without correction 2. In recognition of the importance of the spreadsheet modification issues, HGNC offers its own symbol correction tool, the Multi-symbol checker, and also recently announced that all symbols that auto-convert to dates in Excel have been changed 4. However, much literature and public data still contain outdated and incorrect gene symbols, motivating a convenient method of systematic detection and correction. To systematically identify historical aliases, correct for capitalization differences, and simultaneously correct spreadsheet-modified gene symbols, we built the HGNChelper R package. HGNChelper maps different aliases and spreadsheet-modified gene symbols to approved gene symbols maintained by the HUGO Gene Nomenclature Committee (HGNC) database 5.
HGNChelper also supports mouse gene symbol correction based on the Mouse Genome Informatics (MGI) database 6 .

Implementation
Source data. Human gene symbols are accessed from the HGNC Database FTP site (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) 7 and mouse gene symbols are acquired from the MGI Database (http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt) 6. These URLs, and their access and processing, are handled by HGNChelper so the user does not interact directly with them.
Algorithm. Human gene symbol correction is processed in three steps. First, capitalization is fixed: all letters are converted to upper-case, except the open reading frame (orf) nomenclature, which is written in lower-case. Second, dates or floating-point numbers generated via Excel-modification are corrected using a custom index generated by importing all human gene symbols into Excel, exporting them in all available date formats, and collecting any gene symbols that are different from the originals. In the last and most commonly applied step, aliases are updated to approved gene symbols in the HGNC database. Mouse gene symbol correction follows the same three steps as human gene symbol correction, except that the capitalization step differs: mouse gene symbols begin with an uppercase character, followed by all lowercase.
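The three steps above can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: the lookup tables excel_map, alias_map, and approved are toy assumptions built from the examples in the text, and the helper names fix_capitalization and correct_symbols are hypothetical.

```r
# Toy lookup tables (assumptions for illustration; not the package's real maps).
excel_map <- c("1-DEC" = "DEC1")     # Excel-modified symbol -> original symbol
alias_map <- c("GCN5L2" = "KAT2A")   # outdated alias -> approved symbol
approved  <- c("KAT2A", "DEC1", "TP53", "C19orf71")

# Step 1: upper-case everything, then restore the lower-case "orf".
fix_capitalization <- function(x) {
  up <- toupper(x)
  sub("^(C[0-9XY]+)ORF([0-9]+)$", "\\1orf\\2", up)
}

correct_symbols <- function(x) {
  s <- fix_capitalization(x)

  # Step 2: undo spreadsheet date conversion using the toy index.
  hit <- s %in% names(excel_map)
  s[hit] <- unname(excel_map[s[hit]])

  # Step 3: replace known aliases with approved symbols.
  hit <- s %in% names(alias_map)
  s[hit] <- unname(alias_map[s[hit]])

  # Anything still not approved has no suggestion.
  s[!(s %in% approved)] <- NA_character_

  data.frame(x = x,
             Approved = x %in% approved,
             Suggested.Symbol = s,
             stringsAsFactors = FALSE)
}

res <- correct_symbols(c("tp53", "C19ORF71", "1-Dec", "GCN5L2", "NOTAGENE"))
print(res)
```

The order matters in this sketch: capitalization is normalized first so that the Excel and alias lookups only need upper-case keys.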
User interface. The user interface of HGNChelper does not include any local input or output files; instead it uses R data structures as function arguments and output. Base R data export functions such as write.

species: A required character vector of length 1, either "human" (default) or "mouse".
checkGeneSymbols returns an R data.frame with one row per input gene and three columns:
1. The first column of the data frame shows the input gene symbols.
2. The second column indicates whether the input symbols are valid.
3. The third column provides a corrected gene symbol where possible.
A message is printed indicating when the package's built-in map was last updated. Because the gene symbol databases are updated as frequently as every day, we provide the getCurrentHumanMap and getCurrentMouseMap functions for updating the reference map without requiring an HGNChelper software update. These functions fetch the most up-to-date version of the map from HGNC and MGI, respectively, and users can provide the output of these functions through the map argument of the checkGeneSymbols function. However, fetching a new map requires internet access and takes longer than using the package's built-in index.
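A typical call might look like the sketch below. It uses only the functions and arguments named in the text (checkGeneSymbols with species and map, and getCurrentHumanMap); the input vector is an arbitrary example, and the call is guarded so the snippet still runs where HGNChelper is not installed. The map refresh is left commented out because it needs internet access.

```r
res <- NULL
input <- c("FN1", "TP53", "UNKNOWNGENE", "7-Sep", "C19ORF71")

if (requireNamespace("HGNChelper", quietly = TRUE)) {
  # Uses the built-in map; a message reports when it was last updated.
  res <- HGNChelper::checkGeneSymbols(input, species = "human")
  print(res)

  # Optional refresh of the reference map (requires internet, slower):
  # new_map <- HGNChelper::getCurrentHumanMap()
  # res_new <- HGNChelper::checkGeneSymbols(input, map = new_map)
}
```

Saving a downloaded map to disk with saveRDS and passing it back through map on later runs is one way to keep results reproducible across sessions.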

Amendments from Version 1
This revision addresses comments raised by reviewers, with the most significant changes being 1) addition of a Limitations section, 2) comparison to the limma package's alias2Symbol and alias2SymbolTable functions, and 3) improvement of the readability of the figure.
Any further responses from the reviewers can be found at the end of the article

Operation
HGNChelper is an R package installable from CRAN on Linux, Windows, and OSX. It requires a base installation of R (> 3.5.0) and no other dependencies, and has minimal hardware requirements that should be met by any computer capable of installing the R dependency.

Results
To evaluate the performance of HGNChelper, we quantified the extent of invalid gene symbols present in platform annotation files in the Gene Expression Omnibus (GEO) database from 2002 to 2020. We downloaded 20,716 GEO platform annotation (GPL) files using GEOquery::getGEO 8 , of which 2,044 platforms were suspected to contain gene symbol information based on matching to valid symbols. There is a clear trend of increasing proportion of invalid gene symbols with age of platform submission (Figure 1), ranging from an average of ~3% for recent platforms and increasing with age to ~20% in 2010 and 30-40% in the earliest platforms from 2002-03. The overall proportion of valid gene symbols was 79%, increasing to 92% after HGNChelper correction. We also checked the validity of gene symbols in the Molecular Signatures Database (MSigDB 7.0) 9 . Out of 38,040 gene symbols used in MSigDB version 7.0, 850 were invalid, and this number reduces to 453 after HGNChelper correction, of which the majority were lncRNA and a few withdrawn symbols.
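The MSigDB counts reported above can be turned into proportions with a few lines of R. The sketch uses only the three numbers stated in the text; the variable names are arbitrary.

```r
total          <- 38040  # gene symbols used in MSigDB 7.0
invalid_before <- 850    # invalid before correction
invalid_after  <- 453    # still invalid after correction

pct_before <- 100 * invalid_before / total   # about 2.23%
pct_after  <- 100 * invalid_after / total    # about 1.19%
fixed      <- invalid_before - invalid_after # 397 symbols corrected
pct_fixed  <- 100 * fixed / invalid_before   # about 46.7% of invalid symbols

cat(sprintf("invalid before: %.2f%%, after: %.2f%%, corrected: %d (%.1f%%)\n",
            pct_before, pct_after, fixed, pct_fixed))
```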
The limma 10 Bioconductor package provides related functionality; however, limma::alias2Symbol and limma::alias2SymbolTable are intended only to translate known gene aliases, whereas HGNChelper is intended for heterogeneous input that may include aliases, valid symbols, Excel-modified symbols, incorrect capitalization, and unmappable symbols, and to provide a map between input and output. limma::alias2SymbolTable keeps the output vector the same length as the input, but if there are multiple aliases, it displays only the one with the lowest Entrez ID number, whereas HGNChelper returns a delimited vector of all aliases.
Figure 1. Older entries show a smaller fraction of valid gene symbols than more recent entries (Before, white box), but many of these are successfully corrected by HGNChelper (After, grey box).

Discussion
Gene symbols are error-prone and unstable, but remain in common use for their memorability and interpretability. Our analysis of public databases containing gene symbols emphasizes the need for gene symbol correction particularly when using symbols from older datasets and reported results. Such correction should be routinely done when gene symbols are part of high-throughput analysis, such as re-analysis of targeted gene panels for precision medicine, which tend to be annotated with gene symbols (e.g. 11), in Gene Set Enrichment Analysis using the gene symbol versions of popular databases such as MSigDB 9 or GeneSigDB 12 , or when performing systematic review or meta-analysis of published multi-gene signatures (e.g. 13). HGNChelper implements a programmatic and straightforward approach to the routine identification and correction of invalid gene symbols.

Limitations
We reduced the fraction of invalid gene symbols in GPL files using HGNChelper (Figure 1), but 8% of gene symbols remain invalid. We further investigated the cases that HGNChelper failed to fix and identified the following situations:
1. Long non-coding RNAs (e.g. "lnc-ARMCX4-1", "lnc-SOX11-1")
2. Withdrawn symbol (e.g. "OCLM")
3. Uncharacterized gene (e.g. "LOC644669"): Symbols beginning with LOC. When a published symbol is not available, and orthologs have not yet been determined, this may be represented as 'LOC' + the GeneID.
4. Non-human gene symbol
5. Missing data
6. Commercial product name (e.g. Probe ID)
Another limitation of HGNChelper is that it cannot always provide the correct answer for which gene a symbol refers to. For example, FHL1 is both an approved symbol and an alias of CFH, so unless the chromosome of CFH is specified, FHL1 will simply be returned as a valid symbol. Thus, we recommend that users provide as much information as possible and still be cautious in interpreting the output.
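Some of the categories above can be recognized from the symbol text alone, which helps when triaging what remains unmapped. The following base R sketch is an illustration under stated assumptions: the function name classify_unmapped and the regular expressions are hypothetical, and categories that need a database lookup (withdrawn symbols, non-human symbols, product names) fall through to "other".

```r
classify_unmapped <- function(x) {
  ifelse(is.na(x) | x == "", "missing data",
  ifelse(grepl("^lnc-", x), "long non-coding RNA",
  ifelse(grepl("^LOC[0-9]+$", x), "uncharacterized gene (LOC + GeneID)",
         "other")))
}

unmapped <- c("lnc-ARMCX4-1", "LOC644669", "OCLM", NA)
categories <- classify_unmapped(unmapped)
print(table(categories, useNA = "ifany"))
```

Here "OCLM" lands in "other" because its withdrawn status cannot be inferred from the string itself.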

Software availability
Package
The paper is a short description of the motivation and implementation, with a short study detailing the evolution of gene symbols.
What I miss in the paper is an assessment of HGNChelper failures (to increase my confidence in the tool). For example, how often are symbols converted incorrectly because the same gene symbol was used, at different times, to denote different genes?

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
To address this point we manually reviewed many cases where HGNChelper correction efficiency is low in Figure 1 and almost all unmapped symbols fell into one of the following categories:
1. Long non-coding RNAs (e.g. "lnc-ARMCX4-1", "lnc-SOX11-1")
2. Uncharacterized gene (e.g. "LOC644669"): Symbols beginning with LOC. When a published symbol is not available, and orthologs have not yet been determined, this may be represented as 'LOC' + the GeneID.
We have added this to the manuscript section "Limitations".

Susan Tweedie
HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
The paper describes an R package that checks whether human and mouse symbols match an HGNC or MGI approved symbol and, if not, suggests a replacement by correcting capitalization, correcting Excel date and floating-point transformations, and matching to alias symbols.
The rationale for developing the new software tool is generally clearly explained. As the authors point out, symbols do change (although the HGNC are now committed to making as few symbol changes as possible) and there is a need to check symbols are valid. As the human symbols that convert to dates in Excel have all been changed, this should be less of a problem going forward. However, these mangled symbols persist in historic data sets and some authors will undoubtedly continue to use problematic aliases such as OCT3 so that aspect of the tool is helpful. It may be worth adding that conversion of gene symbols to floating-point numbers in Excel is more of an issue for mouse genes with RIKEN identifiers than human gene symbols.
The authors should also address whether there are any other R packages that have similar functionality. The name HGNChelper could lead some to think this is an HGNC endorsed tool; given that a symbol checking tool is already available from the HGNC ( https://www.genenames.org/tools/multi-symbol-checker/) (albeit not an R package) the authors should mention this exists and ideally compare the functionality of their tool versus the HGNC tool.
While this is likely to be a useful tool for R users it should come with a few words of caution given that you cannot always be completely sure which gene a symbol refers to in the absence of confirmation via an ID or other additional information. For example, FHL1 is both an approved symbol and an alias of CFH so while FHL1 is a valid symbol the input data may refer to CFH. There are also cases where a symbol is an alias for several genes but not an approved symbol itself e.g. NIP (or Nip) which is not an approved symbol but is an alias of GIPC1, DUOXA1, and CRPPA. The authors should clarify how the algorithm deals with cases where an input symbol matches more than one gene. This would address my concerns about whether there is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool. I note that the input contains optional chromosome information but there is no mention of how this is used. Does the algorithm take this information into account when determining whether a symbol is valid or not?
The authors note that some symbols in their test that could not be updated were lncRNAs and pseudogenes. As both of these classes of gene are named by HGNC it would be good to expand on why the tool failed with these genes. Do these particular genes lack an HGNC symbol?

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: The HGNC has a symbol checking tool with some of the functionality of the tool described in the paper.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 27 Apr 2022
Levi Waldron, Graduate School of Public Health and Health Policy, City University of New York, USA
Thank you for reviewing our manuscript and for your constructive comments. Below are our responses to the individual comments.

Comment 1: It may be worth adding that conversion of gene symbols to floating-point numbers in Excel is more of an issue for mouse genes with RIKEN identifiers than human gene symbols.
The reviewer is correct that human gene symbols prone to Excel conversion have now been changed, but many still exist in the literature and public databases as demonstrated in Figure 1. HGNChelper does not currently fix RIKEN identifiers, so we don't draw this comparison in the manuscript.
Comment 2: The authors should also address whether there are any other R packages that have similar functionality. The name HGNChelper could lead some to think this is an HGNC endorsed tool; given that a symbol checking tool is already available from the HGNC (https://www.genenames.org/tools/multi-symbol-checker/) (albeit not an R package) the authors should mention this exists and ideally compare the functionality of their tool versus the HGNC tool.
We now compare HGNChelper with the function alias2Symbol from the limma package. This is described in the response to comment 1 from Reviewer 1. Thank you for pointing us to the HGNC's own tool, the Multi-symbol checker. We understand that our package name, HGNChelper, can potentially imply endorsement from HGNC, so we clarified in the manuscript that the Multi-symbol checker is the tool supported by HGNC. We also compared HGNChelper with the Multi-symbol checker from HGNC. Here are the major points that differentiate these tools:
Implementation: The Multi-symbol checker is a web-based UI tool. Users can provide input as a comma- or space-separated list of gene symbols, either by typing it in directly or by uploading a file. Outputs are displayed in the interface as a sortable table, and users can choose to download it as a csv file. HGNChelper is an R package, which takes a character vector as input and outputs the result as a data frame, which can be saved and exported in a different format, such as csv, tsv, rds, etc.
symbol.
Chromosome location: The Multi-symbol checker provides the chromosome location as part of the default output if an approved symbol is available for a given input. HGNChelper provides the chromosome information only if it is provided with the input gene symbol: it validates whether the input chromosome information is correct and, if it is wrong, gives the correct chromosome location.

Comment 3: While this is likely to be a useful tool for R users it should come with a few words of caution given that you cannot always be completely sure which gene a symbol refers to in the absence of confirmation via an ID or other additional information. For example, FHL1 is both an approved symbol and an alias of CFH so while FHL1 is a valid symbol the input data may refer to CFH.
Comment 4: There are also cases where a symbol is an alias for several genes but not an approved symbol itself e.g. NIP (or Nip) which is not an approved symbol but is an alias of GIPC1, DUOXA1, and CRPPA. The authors should clarify how the algorithm deals with cases where an input symbol matches more than one gene. This would address my concerns about whether this is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool.
○ If there is only one valid gene symbol matched with the input, HGNChelper simply evaluates whether the provided chromosome information is correct or not, and if it's incorrect, outputs the correct chromosome location under the 'Correct.chromosome' column. For example, HGNChelper::checkGeneSymbols("NIP", chromosome = 1) will return "CRPPA /// DUOXA1 /// GIPC1" as the suggested symbol and "7 /// 15 /// 19" as the correct chromosome.
○ If the input matches more than one gene, the chromosome information is used to specify the suggested gene symbol. For example, HGNChelper::checkGeneSymbols("NIP", chromosome = 7) will return "CRPPA" as the suggested symbol and "7" as the correct chromosome.
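The chromosome-based behavior described in the two NIP examples can be mimicked with a small base R sketch. It is a toy model and not HGNChelper's code: the table alias_table holds only the three NIP candidates and chromosomes quoted above, and the function name suggest_symbol is hypothetical.

```r
# Toy alias table built from the NIP example in the text.
alias_table <- data.frame(
  alias      = c("NIP", "NIP", "NIP"),
  symbol     = c("CRPPA", "DUOXA1", "GIPC1"),
  chromosome = c("7", "15", "19"),
  stringsAsFactors = FALSE
)

suggest_symbol <- function(alias, chromosome = NULL, table = alias_table) {
  cand <- table[table$alias == alias, ]
  if (!is.null(chromosome)) {
    # Narrow the candidates only when the chromosome matches at least one.
    match_chr <- cand[cand$chromosome == as.character(chromosome), ]
    if (nrow(match_chr) > 0) cand <- match_chr
  }
  list(Suggested.Symbol   = paste(cand$symbol, collapse = " /// "),
       Correct.chromosome = paste(cand$chromosome, collapse = " /// "))
}

no_match  <- suggest_symbol("NIP", chromosome = 1)
one_match <- suggest_symbol("NIP", chromosome = 7)
str(no_match)
str(one_match)
```

With chromosome = 1 no candidate matches, so all three symbols are returned joined by " /// "; with chromosome = 7 only CRPPA remains.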
lncRNAs and pseudogenes can be updated as long as they are not 'uncharacterized genes', whose symbols start with 'LOC'. Based on NCBI, when a published symbol is not available and orthologs have not yet been determined, Gene will provide a symbol that is constructed as 'LOC' + the GeneID. HGNChelper cannot update them because there are no approved gene symbols for them.
Comment 5: Figure 1 - Improve the clarity of the figure
We apologize for the confusing color display. The color schema for Figure 1 is fixed in the updated manuscript.
Minor comments about the manuscript include:
○ The limma R package has the alias2Symbol function with similar functionality. How does the functionality of HGNChelper differ from or improve upon this function?
○ Figure 1 - Before/after boxplots look identical in the black and white version. Please, correct.

The following comments about the package interface are suggestive. The current output of the checkGeneSymbols() function returns a data frame with three columns (x, Approved, Suggested.Symbol). Suggesting including an argument "simplify" (TRUE by default) that will return one vector of the same length and order as the original vector of gene symbols, with NAs replacing non-mappable symbols. The rationale is to use this function as a wrapper around the original vector of gene symbols, e.g., checkGeneSymbols(my_genes), returning a drop-in replacement vector of corrected gene symbols. An example is the p.adjust() function that, given a vector of p-values, returns a vector of p-values corrected for multiple testing.
incorrect capitalization, and unmappable symbols, and to provide a map between input and output. limma::alias2SymbolTable keeps the output vector the same length as the input, but if there are multiple aliases, it displays only the one with the lowest Entrez ID number, whereas HGNChelper returns a delimited vector of all aliases. The following example demonstrates these differences:
> library(HGNChelper)
> input = c("FN1", "TP53", "UNKNOWNGENE",
+ "7-Sep", "9/7", "1-Mar",
+ "Oct4", "4-Oct", "OCT4-PG4",
+ "C19ORF71", "C19orf71", "NIP")
> checkGeneSymbols(input)
Maps last updated on: Mon Sep 28 18:31:21 2020
x Approved "TP53" "C19orf71" "GIPC1" "DUOXA1"
> alias2SymbolTable(alias = input)
[1] "FN1" "TP53" NA NA NA NA
[7] NA NA NA NA "C19orf71" "GIPC1"
Warning message:
In alias2SymbolTable(alias = input) :
Multiple symbols ignored for one or more aliases
Additionally, limma::alias2Symbol uses Bioconductor org*.db packages to map aliases for multiple organisms. org*.db packages in turn pull data from NCBI and update it with each Bioconductor release. HGNChelper is a CRAN package with no dependency on non-base packages, and instead downloads aliases directly from the HUGO and MGI projects. limma::alias2Symbol, however, has the advantage of supporting any organism for which an org*.db package is available, whereas HGNChelper supports only human and mouse.
Comment 2: Figure 1 -Before/after boxplots look identical in the black and white version. Please, correct.
We apologize for the confusing color display. The color schema for Figure 1 is fixed in the updated manuscript.
Comment 3: The current output of the checkGeneSymbols() function returns a data frame with three columns (x, Approved, Suggested.Symbol). Suggesting including an argument "simplify" (TRUE by default) that will return one vector of the same length and order as the original vector of gene symbols, with NAs replacing non-mappable symbols. The rationale is to use this function as a wrapper around the original vector of gene symbols, e.g., checkGeneSymbols(my_genes), returning a drop-in replacement vector of corrected gene symbols. An example is the p.adjust() function that, given a vector of p-values, returns a vector of p-values corrected for multiple testing.
This is a good use case, but we are reluctant to allow a function argument to change the class (and the contract) of what the function returns. Motivated by arguments for "Type consistency" such as by Gillespie and Lovelace (Efficient R programming, https://csgillespie.github.io/efficientR/, section 3.5.2), we think it is less error-prone to require a simple but explicit step to change data class. We've added an example to the checkGeneSymbols help page to provide a straightforward solution in this use case:
result in restrictions on the bulk downloads we rely on. To maintain reproducibility by default, we require the user to download and save the map if they want a version newer than what HGNChelper has, an approach also compatible with caching programs like BiocFileCache. We have added the following explanation to the vignette under the title, "Updating maps of aliased gene symbols": We intentionally avoid automatic update of the map to maintain reproducibility, because the same code from the same version of HGNChelper could produce different results at any time with automatic map update.
Competing Interests: No competing interests were disclosed.