Software Tool Article
Revised

GARCOM: A user-friendly R package for genetic mutation counts

[version 2; peer review: 2 approved, 1 approved with reservations]
PUBLISHED 17 May 2024

This article is included in the RPackage gateway.

This article is included in the Cell & Molecular Biology gateway.

Abstract

Next-generation sequencing (NGS) has enabled analysis of rare and uncommon variants in large study cohorts. A common approach to overcome their low frequencies and/or small effect sizes is collapsing, i.e. binning variants within genes/regions. Several tools are now available for advanced statistical analyses; however, tools to perform basic tasks such as obtaining allelic counts within defined gene/region boundaries are unavailable or require complex coding. The GARCOM (“Gene And Region Count Of Mutations”) library, an open-source, freely available R package, returns a matrix of allelic counts within genes/regions per sample. GARCOM accepts input data in PLINK or VCF formats, with additional options to subset data for refined analyses.

Keywords

mutation, plink, allele, genetics, VCF

Revised Amendments from Version 1

We improved the quality of the manuscript and fixed typos and errors from the previous version.

See the authors' detailed response to the review by Ettore Mosca

Introduction

Genome-wide association studies (GWAS) have led to the identification of several common genomic variants associated with complex diseases,1 yet missing heritability remains extensive. Importantly, most disease-causing variants are rare2 (minor allele frequency < 1%), with common variants serving as proxies. The rapid decline in sequencing costs has enabled in-depth analysis of rare variants (RVs) through Whole-Genome Sequencing (WGS) and Whole-Exome Sequencing (WES). Furthermore, large-scale reference panels have allowed for RV imputation.3-5 Power to identify statistically significant RVs decreases as the minor allele frequency decreases; therefore, an ideal method to overcome this limitation is to group RVs at the gene/region level, usually via a collapsing test.

Despite the availability of sophisticated tools for annotation, quality-control and association analyses, tools to perform basic tasks, for instance, obtaining allelic count within defined genetic boundaries (genes and/or regions) are lacking, to our knowledge. R libraries such as BEDMatrix and bigsnpr6 provide allelic counts for each SNP per individual but algorithms to extract information within genetic boundaries in a collapsed fashion are unavailable.

Here we introduce a user-friendly R package, GARCOM (“Gene And Region Count Of Mutations”) that computes allelic counts per individual within user-provided gene or genomic region boundaries.

Methods

GARCOM is written and developed in the open-source R7 statistical programming language. GARCOM imports the data.table,8 vcfR,9 bigstatsr, bigsnpr and stats libraries for internal data transformation and processing. A stable version is released and publicly available on the CRAN repository.

install.packages("GARCOM")

Operation

GARCOM was developed on R (≥4.0) (RRID:SCR_017299) with the following dependencies and minimum versions: data.table (≥1.12.8), vcfR (≥1.12.0), bigsnpr (≥1.4.11). Full documentation of dependencies and installation is available at the GARCOM GitHub repository. There is no minimum memory (RAM) requirement, but memory needs may vary according to the nature and size of the input genetic data. GARCOM was developed on a Unix platform but can also be used on other platforms (e.g. Windows, Ubuntu).

Implementation

GARCOM operates through two main functions: “gene_pos_counts” accepts PLINK10 (RRID:SCR_001757) input data, whereas “vcf_counts_SNP_genecoords” accepts VCF11 input format. After reading in the data, these functions perform operations to count variants within genes/genomic regions for each individual included in the file.

output <- gene_pos_counts(recoded_genetic_data, gene_boundaries, snp_locations)

output <- vcf_counts_SNP_genecoords(recoded_genetic_data, gene_boundaries, snp_locations)

where “output” is the object returned by GARCOM after a successful run of the function; “recoded_genetic_data” is the main input file in PLINK or VCF format; and “gene_boundaries” and “snp_locations” are additional input files for gene/region and SNP information, respectively.

A typical workflow is shown in Figure 1. In brief, the “gene_pos_counts” function processes genetic input data (“recoded_genetic_data”) generated by the PLINK software with the --recode A option. Data are read in standard matrix format using the data.table R library. For VCF files, the “vcf_counts_SNP_genecoords” function reads the VCF input file using the extract.gt function from the vcfR library. Genotype values are read from the “GT” field.
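As a rough illustration of what reading the “GT” field involves, the sketch below converts genotype strings such as "0/1" into alternate-allele counts in base R. The helper name gt_to_count is hypothetical and this is not GARCOM's or vcfR's internal code; it assumes biallelic sites with alleles coded 0/1 and treats any missing call as NA.

```r
# Hypothetical helper (not part of GARCOM or vcfR): convert VCF "GT" strings
# to alternate-allele counts, assuming biallelic sites coded 0/1.
gt_to_count <- function(gt) {
  vapply(gt, function(g) {
    if (is.na(g)) return(NA_integer_)
    alleles <- strsplit(g, "[/|]")[[1]]           # unphased "/" or phased "|"
    if (any(alleles == ".")) return(NA_integer_)  # missing call
    sum(as.integer(alleles))                      # 0, 1 or 2 alternate alleles
  }, integer(1), USE.NAMES = FALSE)
}

gt <- c("0/0", "0/1", "1|1", "./.", NA)
counts <- gt_to_count(gt)
print(counts)
```

Multi-allelic sites would need a different rule (for example, counting any non-reference allele), which this sketch does not attempt.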


Figure 1. Workflow for standard GARCOM functions.

In addition to the genetic data recoded with the --recode A option, GARCOM requires gene/region boundary information and SNP information, as shown in Table 2 and Table 3, respectively.

The output generated by GARCOM is a matrix with M rows and N columns, where M represents the genes/genomic regions with at least one allele count and N represents the individuals. Genes/regions with zero allelic counts across all individuals are excluded from the final output. GARCOM allows missing values (NA) in the input data; these are counted as zero in the final output. When no allelic counts are present in any of the user-defined genes, a NULL value is returned.
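The output rules described above can be sketched in a few lines of base R. This is only an illustration on toy values, not GARCOM's implementation: the function name collapse_counts and the snp_to_gene lookup vector are hypothetical, and the SNP-to-gene assignment is assumed to be already known.

```r
# Illustrative only: collapse per-SNP allele counts into per-gene counts.
# Rows of `geno` are individuals, columns are SNPs (toy values, NA = missing).
geno <- matrix(c(1, 1, 0,
                 0, 1, NA,
                 0, 0, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ID1", "ID2", "ID3"),
                               c("SNP1", "SNP2", "SNP3")))
snp_to_gene <- c(SNP1 = "GENE1", SNP2 = "GENE1", SNP3 = "GENE2")

collapse_counts <- function(geno, snp_to_gene) {
  geno[is.na(geno)] <- 0                          # missing values count as zero
  genes <- unique(snp_to_gene[colnames(geno)])
  out <- t(sapply(genes, function(g) {
    rowSums(geno[, snp_to_gene[colnames(geno)] == g, drop = FALSE])
  }))
  out <- out[rowSums(out) > 0, , drop = FALSE]    # drop genes with no counts
  if (nrow(out) == 0) return(NULL)                # nothing left to report
  out
}

result <- collapse_counts(geno, snp_to_gene)
print(result)
```

In this toy input GENE2 carries no alleles in any individual, so only GENE1 remains in the result, with one column per individual.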

In addition to the functions described above, GARCOM provides several options for user flexibility. For instance, GARCOM can be run restricting analyses to 1) a list of genes, or 2) a subset of SNPs and individuals of interest. Users can provide a list of individuals using the “keep_indiv” parameter; similarly, genes can be filtered in using the “filter_gene” parameter.

output <- gene_pos_counts(recoded_genetic_data, gene_boundaries, snp_locations, keep_indiv=mylist.txt)

output <- gene_pos_counts(recoded_genetic_data, gene_boundaries, snp_locations, filter_gene=mysetofgenes.txt)
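To make the effect of these two options concrete, the base R sketch below applies the same kind of subsetting to toy inputs. It does not call GARCOM and is not its internal code; the object names keep_ids and keep_genes are hypothetical stand-ins for the user-supplied lists.

```r
# Illustrative only: the kind of subsetting the two options describe,
# applied to toy inputs in base R (not GARCOM's internal code).
recoded <- data.frame(IID = c("ID1", "ID2", "ID3"),
                      SNP1_A = c(1, 0, 2),
                      SNP2_T = c(0, 1, 0),
                      stringsAsFactors = FALSE)
genes <- data.frame(GENE = c("GENE1", "GENE2"),
                    START = c(100, 200),
                    END = c(180, 400),
                    stringsAsFactors = FALSE)

keep_ids   <- c("ID1", "ID3")   # hypothetical list of individuals to keep
keep_genes <- c("GENE2")        # hypothetical list of genes to keep

recoded_subset <- recoded[recoded$IID %in% keep_ids, ]
genes_subset   <- genes[genes$GENE %in% keep_genes, ]

print(recoded_subset)
print(genes_subset)
```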

Use cases

The input PLINK file has a matrix structure of N rows with M columns, where N rows represent individuals (one for each ID). The first six columns are family ID, individual ID, paternal ID, maternal ID, sex, and phenotype (standard output from PLINK (Table 1)). The following columns consist of the variants included in the analyses.

Table 1. Sample rows and columns for input genetics data recoded from PLINK software (--recode A).

FID   IID          PAT  MAT  SEX  PHENOTYPE  SNP1_A  SNP2_T  SNP3_G  SNP4_C  SNP5_C
FID1  IID_sample1  0    0    1    NA         1       1       0       NA      NA
FID2  IID_sample2  0    0    1    NA         0       1       0       NA      0
FID3  IID_sample3  0    0    1    1          0       0       1       0       0
FID4  IID_sample4  0    0    1    1          0       0       1       0       0
FID5  IID_sample5  0    0    1    1          0       0       1       0       0
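As a minimal sketch of how a file with this layout might be loaded, the base R snippet below parses a few rows mirroring Table 1 and separates the six metadata columns from the per-SNP allele counts. GARCOM itself reads data with data.table; base R is used here only to keep the example dependency-free, and stripping the counted-allele suffix from column names is an assumption of this sketch rather than documented GARCOM behaviour.

```r
# Toy PLINK --recode A output mirroring Table 1 (whitespace-delimited text).
raw_text <- "FID IID PAT MAT SEX PHENOTYPE SNP1_A SNP2_T SNP3_G SNP4_C SNP5_C
FID1 IID_sample1 0 0 1 NA 1 1 0 NA NA
FID2 IID_sample2 0 0 1 NA 0 1 0 NA 0
FID3 IID_sample3 0 0 1 1 0 0 1 0 0"

raw <- read.table(text = raw_text, header = TRUE, stringsAsFactors = FALSE)

# The first six columns are sample metadata; the rest are per-SNP allele counts.
geno <- as.matrix(raw[, -(1:6)])
rownames(geno) <- raw$IID

# Strip the counted-allele suffix (e.g. "_A") to recover SNP identifiers.
colnames(geno) <- sub("_[^_]+$", "", colnames(geno))
print(geno)
```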

Table 2. Sample data for gene/region boundaries.

Data must contain “GENE”, “START” and “END” column names.

GENE   START  END
GENE1  100    180
GENE2  200    400

Table 3. Sample data for SNP information where SNP and BP column names must be present in input data, where SNP is the single nucleotide polymorphism identifier and BP is base pair location.

SNP   BP
SNP1  100
SNP2  101
SNP3  201
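Using the values from Table 2 and Table 3, the following base R sketch shows one way SNPs could be assigned to genes by base-pair position. It is an illustration, not GARCOM's algorithm; in particular, treating START and END as inclusive bounds and taking the first matching gene are assumptions of this example.

```r
# Toy inputs copied from Table 2 and Table 3.
genes <- data.frame(GENE = c("GENE1", "GENE2"),
                    START = c(100, 200),
                    END = c(180, 400),
                    stringsAsFactors = FALSE)
snps <- data.frame(SNP = c("SNP1", "SNP2", "SNP3"),
                   BP = c(100, 101, 201),
                   stringsAsFactors = FALSE)

# Assign each SNP to the first gene whose [START, END] interval contains its BP
# (inclusive bounds are an assumption of this sketch).
snps$GENE <- vapply(snps$BP, function(bp) {
  hit <- genes$GENE[bp >= genes$START & bp <= genes$END]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(snps)
```

Here SNP1 and SNP2 fall within GENE1 (100 to 180) and SNP3 falls within GENE2 (200 to 400).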

Table 4. Sample output, where GENE column identifies the gene/region names with corresponding allele counts per individual.

Individual_ID1, Individual_ID2 and Individual_ID3 are sample individual IDs; values represent allelic counts within each gene for each individual.

GENE   Individual_ID1  Individual_ID2  Individual_ID3
GENE1  10              2               1
GENE2  2               1               0

The input VCF file follows the standard VCF formats (please refer to the vcfR library documentation).

Toy data (gene and SNP coordinates) are shared within the package as “genecoord” and “snppos”, respectively.

Simulation

We performed simulations on real data for chromosome 1 (“CHR1”, # of variants = 23,456) and CHR22 (# of variants = 4,814) on randomly sampled individuals (N = 100, 200, 500, 1000, 5000, 10,000) extracted from the whole-exome sequencing dataset described in the study by Tosto et al.12 Genetic data were recoded using the PLINK --recode A flag. For both chromosomes, memory consumption and processing time increased (Figure 2) as we increased the number of individuals included in the simulation. Memory consumption for CHR22 was substantially lower owing to its smaller number of variants and genomic boundaries. Simulations were performed with 16GB memory (RAM) requested on a computing cluster node.


Figure 2. Comparison of memory (in MB) and CPU time (in seconds) for CHR1 and CHR22 on different sample sizes.

Graph A represents memory consumption; graph B shows processing time in seconds for various sample sizes on CHR1 and CHR22.

All simulations were conducted on R (v4.0), data.table (v1.13.6) with default 16 threads, GARCOM (v1.40), bigsnpr (v1.6.1).

Discussion

GARCOM is easy to use, although basic knowledge of the R programming language is desirable. GARCOM builds on existing libraries, such as data.table, that allow efficient handling of large data. Data processing is independent of the reference genome build, and GARCOM can be used on several platforms (e.g., Unix, Windows). GARCOM has certain limitations: genomic boundaries and variant locations need to be specified, as mentioned in the package documentation. GARCOM collapses variants based on base-pair (BP) location; hence, processing multiple chromosomes at once adds burden (time and memory) and is error-prone, because identical BP positions occur on different chromosomes. For large studies, e.g. UK Biobank (N ≥ 200K), processing data per chromosome is highly recommended because of the memory limitations mentioned above. Lastly, GARCOM depends on public and freely available R packages.
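The risk of mixing chromosomes can be seen in a small base R sketch. The CHR column and all values below are hypothetical toy data added for illustration (the input tables described in this article do not include a chromosome column), and the per-chromosome loop is only one possible way to organise such processing.

```r
# Toy SNP and gene tables with an assumed extra chromosome column (CHR).
snps <- data.frame(CHR = c(1, 1, 22),
                   SNP = c("rsA", "rsB", "rsC"),
                   BP = c(150, 300, 150),
                   stringsAsFactors = FALSE)
genes <- data.frame(CHR = c(1, 22),
                    GENE = c("GENE1", "GENE2"),
                    START = c(100, 100),
                    END = c(180, 180),
                    stringsAsFactors = FALSE)

# Matching on BP alone ignores the chromosome, so rsC (chromosome 22) is
# wrongly placed in GENE1 (chromosome 1) together with rsA.
in_gene1_bp_only <- snps$SNP[snps$BP >= 100 & snps$BP <= 180]

# Splitting by chromosome first keeps each lookup within one chromosome.
per_chr <- lapply(split(snps, snps$CHR), function(chr_snps) {
  chr_genes <- genes[genes$CHR == chr_snps$CHR[1], ]
  hits <- lapply(seq_len(nrow(chr_genes)), function(i) {
    chr_snps$SNP[chr_snps$BP >= chr_genes$START[i] &
                 chr_snps$BP <= chr_genes$END[i]]
  })
  names(hits) <- chr_genes$GENE
  hits
})

print(in_gene1_bp_only)
print(per_chr)
```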

Future

The VCF format can accommodate locus annotation performed by software such as ANNOVAR.13 To this end, we plan to support annotation-based filters in addition to the existing ones. One challenge associated with annotated VCF files is their large size; we will try to add this functionality while keeping RAM limitations and processing time in mind. We also plan to add support for the bgen format, which stores large amounts of genetic data, using an appropriate R library (http://www.well.ox.ac.uk/~gav/resources/rbgen_v1.1.5.tgz).

Data and software availability

Sample data associated with the package where applicable are provided within the library with proper documentation. No additional source data are required. We distribute the package under the MIT license. GARCOM can be downloaded from CRAN and GitHub from https://cran.r-project.org/web/packages/GARCOM/index.html and https://github.com/sariya/GARCOM respectively.

Reporting guidelines: Bugs and suggestions are welcome at the GitHub repository.

Author Contribution: SS, GT

Ethical Statement: Informed consent was obtained from all participants. For the whole-exome sequencing, the study protocol was approved by the Institutional Review Board (IRB) of Columbia University Medical Center (CUMC) (Approval number: AAAP0477). The study was conducted according to the principles expressed in the Declaration of Helsinki.

how to cite this article
Sariya S and Tosto G. GARCOM: A user-friendly R package for genetic mutation counts [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2024, 10:524 (https://doi.org/10.12688/f1000research.53858.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Version 2
PUBLISHED 17 May 2024
Revised
Reviewer Report 12 Mar 2025
Tao Wang, School of Computer Science, Northwestern Polytechnical University, Xi’an, China 
Approved
In this work, the authors developed an R package which could retrieve the allelic counts within a genomic region. I have the following suggestions:
As the genomic files are usually very large, it is necessary to take care of ...
HOW TO CITE THIS REPORT
Wang T. Reviewer Report For: GARCOM: A user-friendly R package for genetic mutation counts [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2024, 10:524 (https://doi.org/10.5256/f1000research.166261.r365266)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 14 Jun 2024
Ettore Mosca, National Research Council, Institute of Biomedical Technologies, Segrate (Milan), Italy 
Approved
The authors amended the manuscript and answered ...
HOW TO CITE THIS REPORT
Mosca E. Reviewer Report For: GARCOM: A user-friendly R package for genetic mutation counts [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2024, 10:524 (https://doi.org/10.5256/f1000research.166261.r279700)
Version 1
PUBLISHED 01 Jul 2021
Reviewer Report 29 Nov 2021
Stephen M. Pederson, Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, University of Adelaide, Adelaide, SA, Australia 
Approved with Reservations
The authors have provided four primary functions for summarising SNP counts using regions or genes as mapping architecture. The data.table package is known for its speed and this may lend a significant performance advantage to these functions, particularly when dealing ...
HOW TO CITE THIS REPORT
Pederson SM. Reviewer Report For: GARCOM: A user-friendly R package for genetic mutation counts [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2024, 10:524 (https://doi.org/10.5256/f1000research.57282.r98677)
Reviewer Report 18 Oct 2021
Ettore Mosca, National Research Council, Institute of Biomedical Technologies, Segrate (Milan), Italy 
Approved with Reservations
The authors present GARCOM, a tool for quantifying allelic counts within defined genetic boundaries. Overall, the article is well-written; the software is available in CRAN and GitHub, and it is well-documented. At the same time, the article is quite short.
...
HOW TO CITE THIS REPORT
Mosca E. Reviewer Report For: GARCOM: A user-friendly R package for genetic mutation counts [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2024, 10:524 (https://doi.org/10.5256/f1000research.57282.r95650)
  • Author Response 17 May 2024
    Sanjeev Sariya, Taub Institute for Research on Alzheimer’s Disease and the Aging Brain, Columbia University Medical Center, New York, 10032, USA
    17 May 2024
    Author Response
    We would like to thank the reviewer for providing feedback and reviewing our manuscript.

    1) The authors present GARCOM, a tool for quantifying allelic counts within defined genetic boundaries. ...