Software Tool Article

Improving rigor and reproducibility in chromatin immunoprecipitation assay data analysis workflows with Rocketchip

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 25 Jun 2025

Abstract

Background

As genome sequencing technologies advance, the growing accumulation of sequencing data in public databases highlights the need for more robust and adaptable analysis workflows for chromatin immunoprecipitation assays. These workflows must promote reproducibility and replicability of results while effectively leveraging publicly available data to enhance biological insights.

Methods

Here, we present Rocketchip, a Python-based command-line tool that integrates existing software through Snakemake, enabling efficient, automated, and reproducible analysis of chromatin immunoprecipitation data, whether newly generated or drawn from existing datasets, including chromatin immunoprecipitation followed by sequencing (ChIP-seq), cleavage under targets and release using nuclease (CUT&RUN), and cleavage under targets and tagmentation (CUT&Tag).

Results

Rocketchip can analyze chromatin immunoprecipitation data using any available software combination for any reference genome. It enhances the utilization of existing datasets, promotes replicability in analyses, and serves as a platform for benchmarking algorithms.

Conclusions

By facilitating the reanalysis of existing data and allowing for flexible analysis of newly generated data, Rocketchip allows researchers to identify and apply the most accurate and efficient analytical approaches to their data.

Keywords

ChIP-seq, CUT&RUN, CUT&Tag, bioinformatics, workflow, snakemake

Introduction

As genome sequencing technologies and their applications continue to rapidly evolve to include epigenomic information, a vast amount of sequencing data is accumulating.1 Journals now commonly mandate the deposition of raw sequence data to public databases, such as the International Nucleotide Sequence Database Collaboration Sequence Read Archive (SRA), generating a substantial volume of sequencing data.2 These mandates are further supported by funding agencies, such as the National Institutes of Health (NIH), which requires NIH-funded research to publish sequence data as part of their Genomic Data Sharing Policy.3 Therefore, there is an increasing need for more comprehensive and biologically relevant data analysis workflows that promote reproducibility of results and allow for increased leverage and comparisons of publicly available data. Although mandated availability of data is beneficial to science, data analysis pipelines are often complicated, with divergent results due to variation in analysis steps, parameter usage, software used, and software version. These problems can partly be solved by workflow managers, which control for analysis step order, software versions via virtual environments, software parameters, etc. This is especially important for workflows requiring multiple analysis steps, such as sequence data produced by chromatin immunoprecipitation assays, where small differences in analytical steps can yield different results.

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique commonly used to identify protein binding sites in the genome.4,5 Briefly, DNA-associated proteins are chemically fixed onto DNA with a crosslinking agent. The resulting chromatin is then fragmented by either sonication or enzymatic digestion into 100–500 base pair fragments. Next, chromatin fragments bearing the protein of interest are immunoprecipitated using protein-specific antibodies. Typically, ChIP-seq experiments utilize two controls: (1) an input control to correct for differences in sonication and genomic DNA sequence bias and (2) a mock IP to account for nonspecific interactions of the antibody used.6–8 Ultimately, the chemical cross-link is reversed, allowing the DNA bound by the protein to be purified and then sequenced. In addition to ChIP-seq, other chromatin immunoprecipitation assays, such as cleavage under targets and release using nuclease (CUT&RUN) and cleavage under targets and tagmentation (CUT&Tag), are similarly utilized to assess protein binding sites.9,10 In CUT&RUN, antibody binding occurs directly on protein-bound DNA fragments within intact nuclei, with DNA fragmentation accomplished enzymatically using a fusion protein containing Protein A and micrococcal nuclease (MNase), in contrast with ChIP-seq’s reliance on sonication for DNA fragmentation. In CUT&Tag, a fusion protein combines Protein A and Tn5 transposase, allowing for simultaneous tagging and binding of sequencing adapters to protein-bound DNA fragments.

Following the unique approaches of ChIP-seq, CUT&RUN, and CUT&Tag in isolating protein-bound DNA, the generated DNA sequence data is aligned to a reference genome, and a peak caller is typically used to identify regions of interest. These peaks are “broad” for proteins with large areas of interaction with DNA, such as histones and DNA methylation interacting proteins (e.g. H3K27me3, H3K36me3, H3K9me3, H3K4me1, etc.). Conversely, proteins that act as transcription factors with sequence-specific regions of interaction and select histone marks produce “narrow” peaks (e.g. c-Myc, CTCF, REST, H3K27ac, H3K4me3). Analyzing these different types of DNA binding proteins can sometimes call for different computational controls, requiring researchers to identify analysis software best suited for analyzing their data. Given the variation in how peaks are generated, there is a significant amount of statistical noise in these experiments, which can hinder the efficacy of peak calling algorithms. This is further complicated by sequencing issues associated with strand bias, GC content, PCR amplification, library preparation, primer choice, sequencing platform, and antibody choice.8,11–20 Consequently, it is increasingly difficult to reproduce or replicate experimental findings. From this point forward, “reproducibility” refers to an individual’s or group’s ability to generate the same findings as another study, whereas “replicability” represents an individual’s or group’s ability to recreate findings multiple times using the same data input. Overall, this highlights the need for consistent and controlled data analysis practices, ensuring reproducibility and replicability of results.

Here, we present Rocketchip, available at https://github.com/vhaghani26/rocketchip, to address key aspects of these problems. Rocketchip reduces variation in data analysis methodologies, increases reproducibility and replicability of experimental results, and encourages greater usage of publicly available sequence data.

Methods

Implementation

Rocketchip is an automated bioinformatics workflow written in the Python-based (v3.10.12) workflow manager, Snakemake (v7.32.4)21 (Figure 1). Rocketchip takes full advantage of Snakemake’s capabilities to ensure a highly flexible and adaptable workflow. By consolidating Snakemake rules from various software tools into a single file tailored to the user’s data, Rocketchip supports the seamless incorporation of new tools and updates. This design is crucial in the rapidly evolving field of bioinformatics, allowing Rocketchip to stay current with the latest advancements without requiring a complete overhaul of the existing pipeline. Additionally, Rocketchip efficiently handles large-scale genomic studies by leveraging Snakemake’s ability to process hundreds of files and utilize parallelization based on available computational resources. This makes it suitable for handling extensive datasets on high-performance computing (HPC) systems, ensuring that users can perform complex analyses with optimal resource allocation. Furthermore, Rocketchip utilizes Snakemake’s error-handling capabilities, including a log of standard error outputs per sample, enabling users to investigate and address specific data processing issues and to resume workflows from the point of failure rather than re-running the entire pipeline. This error-handling capability is further enhanced by the inclusion of two FastQC steps in Rocketchip’s workflow, one at the beginning and one after deduplication, to provide additional checkpoints for assessing sequence quality and detecting potential issues.


Figure 1. Rocketchip pipeline for ChIP-seq data analysis.

Rocketchip begins by downloading raw sequencing data and the reference genome, followed by an initial quality control check. Next, the raw data is aligned to the reference genome using the user’s choice of alignment software, and PCR duplicates are optionally removed. A second quality control check is performed after deduplication to assess alignment quality and any remaining artifacts. Aligned data is then converted into bigwig files for visualization, and peak calling is performed using customizable options for peak type and control usage. The final outputs include processed data ready for downstream analysis or visualization in genome browsers.

Operation

Rocketchip downloads ChIP-seq data directly from the SRA, the largest publicly available sequence database, using the SRA Toolkit (sra-tools, v3.0.9).2,22 The SRA Toolkit downloads data via the prefetch and fasterq-dump functions. For paired-end (PE) data, the raw DNA sequence read file is split into its respective read files using the flag --split-files and converted into FASTQ format; for single-end (SE) data, the SRA Toolkit is used solely to download and convert the raw read data into FASTQ format. Rocketchip also presents the option for users to use local (i.e. non-SRA sourced) ChIP-seq data by providing their own FASTQ files. In parallel, Rocketchip downloads and processes a reference genome of the user’s choice from the UCSC Genome Browser.23 After download and conversion, files are stored in the local file system in FASTA and FASTQ formats. Users also have the option of storing their own custom genomes and sequencing reads that are not yet publicly available.

In Rocketchip, raw sequence data undergoes a quality control step using FastQC (v0.12.1) with default parameters to assess potential problems that could impact downstream analysis, including levels of data duplication, base pair quality, adapter contamination, and overall sequence quality.24 This initial FastQC step provides users with the ability to investigate samples if downstream analysis fails, which can occur if there are problems involving sequence quality, contamination, or high levels of data duplication. Raw sequence data is then aligned to the reference genome using the user’s choice of alignment software from BWA-MEM (v0.7.17), Bowtie2 (v2.5.2), or STAR (v2.7.11a), each with default parameters.25–27 Intermediate data files are processed as necessary for deduplication. PCR duplicates are removed from sequence data using the user’s choice of deduplication software from Samtools (v1.18) with --mode s (i.e. standard PCR duplicate detection), Picard (v3.1.1) using MarkDuplicates, Sambamba (v1.0.0) with default parameters, or no deduplication.28–30 Deduplicated data files are subject to another quality control step using FastQC to provide further information, such as alignment quality, validating processing steps (i.e. success of deduplication), and detecting any remaining biases or artifacts (e.g. GC content, adapter contamination, etc.) to ensure overall data integrity. Next, Deeptools’ (v3.5.4) bamCoverage function with default parameters is used to convert data from the BAM file format to the bigwig file format, which can be used for visualization of ChIP-seq data in the UCSC Genome Browser or other visualization tools.31 Finally, in Rocketchip, peaks are called with the user’s choice of software from MACS3 (v3.0.0b3) with the --bdg option, Genrich (v0.6.1) using default parameters, PePr (v1.1.24) using default parameters, or CisGenome (v2.0).32–35 For CisGenome, if a control is used, the seqpeak command is employed with default options. Without a control, CisGenome runs two rounds of peak calling using the hts_peakdetectorv2 command with options “-w 100 -s 25 -c 10 -br 1 -brl 30 -ssf 1.”

Peak-calling options in Rocketchip include narrow versus broad peak calling and use of a control where appropriate. Software version control is handled using Conda to ensure reproducibility of results and compatibility between software versions.36 Software options were chosen based on command-line availability, ease of installation, and standard use in the field.

Users interested in using Rocketchip can install the source code via PyPI (i.e. `pip install Rocketchip`). Rocketchip requires Python >= 3.6. Associated dependencies are Conda installable and are stored in an environment file (https://github.com/vhaghani26/rocketchip/blob/main/environment.yml) with recommended software versions compatible with Rocketchip. This allows user flexibility in installing only dependencies of interest (e.g. only MACS3 instead of MACS3, Genrich, and PePr) and changing dependency versions if desired for algorithm benchmarking purposes or to ensure compatibility with the user’s environment during installation. After installation, the user must create a project YAML file containing relevant project details, including which genome to use for alignment, sample names and paths if using local data, and software selections for the workflow to execute. Rocketchip will use this input to generate a Snakefile tailored specifically to the user’s data, project details, and requested software combination with appropriate parameters. Then, the user will run the Snakefile using Snakemake’s command-line interface arguments, which are briefly detailed in the Rocketchip GitHub repository (https://github.com/vhaghani26/rocketchip). Overall, this approach offers users flexibility to efficiently test various software combinations for their data. It simplifies the execution of an entire workflow into just a few steps: installation, creating a project file, running the Rocketchip command, and executing the workflow with Snakemake. Additionally, it enhances the reproducibility of results by documenting all inputs, outputs, and software versions, while leveraging the automated nature of Snakemake’s workflow execution.
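For illustration only, a project YAML of the kind described above might look like the following sketch. Every field name here is an assumption made for this example, not the authoritative Rocketchip schema; consult the Rocketchip repository for the exact format.

```yaml
# Hypothetical project file: all field names below are illustrative
# assumptions, not the documented Rocketchip schema.
project: mecp2_mouse_chipseq
genome: mm10                # UCSC genome to download and align against
samples:
  treatment: SRRXXXXXXX     # SRA accession or path to a local FASTQ
  control: SRRXXXXXXX       # optional input/IgG control
aligner: bwa_mem            # bwa_mem | bowtie2 | star
deduplicator: samtools      # samtools | picard | sambamba | no_deduplication
peak_caller: macs3          # macs3 | genrich | pepr | cisgenome
peak_type: narrow           # narrow | broad
```

Once the Snakefile has been generated from a project file like this, the workflow itself is executed with Snakemake’s standard command-line interface (e.g. `snakemake -j 4 --use-conda` to run with four cores and Conda-managed environments).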

Use cases

Conducting ChIP-seq data analysis via all software combinations

In order to ensure that all combinations of software can be integrated seamlessly and used for ChIP-seq data analysis, we selected a deeply sequenced experimental ChIP-seq study targeting a transcription factor, MeCP2, in mouse main olfactory epithelium conducted by Rube et al. 2016, hereafter referred to as “Rube”,37 and a ChIP-seq study targeting NRF2 in human non-small cell lung cancer cells conducted by Namani et al. 2019, hereafter referred to as “Namani”.38 To facilitate this comprehensive assessment, these data sets were run through Rocketchip using all four peak callers, namely MACS3, CisGenome, Genrich, and PePr. This was used in combination with each of the three aligners, BWA-MEM, Bowtie2, and STAR, and four deduplication options, Samtools, Sambamba, Picard, and no deduplication. Additionally, to validate that Rocketchip can be run with or without a control (i.e. input or IgG control), this analysis was conducted with both the usage and omission of the corresponding control for each data set, with the exceptions of CisGenome and PePr, which can only be run if a control is used in the analysis. Each test was run three times to assess Rocketchip’s ability to replicate experimental results.

Each algorithm demonstrated varying peak-calling efficiency, influenced by several key factors (Figure 2, Supplementary Table 1). Most notably, the source of the data (i.e., Namani vs. Rube) revealed dramatic differences in the performance of peak-calling algorithms. For the Rube data, MACS3 consistently yielded the highest number of called peaks. All methods of deduplication produced comparable peak counts. Bowtie2 and BWA-MEM performed similarly, identifying slightly more peaks compared to STAR. CisGenome called the fewest peaks, significantly influenced by the deduplicator and aligner used. Without deduplication, CisGenome identified the fewest peaks across any software combination, while peak counts increased with any deduplication method. When STAR was used as the aligner with CisGenome, peak counts were the lowest. BWA-MEM yielded a higher peak count than Bowtie2 but exhibited more non-deterministic behavior compared to Bowtie2 and STAR. For Genrich, peak counts increased when the control was omitted during peak calling. In contrast to CisGenome, Genrich had the highest peak counts when no deduplication was used and when STAR was used for alignment. The other deduplicators and aligners showed negligible differences in peak counts. When using PePr for peak calling on the Rube data, no deduplication yielded the highest peak count, with other deduplication methods yielding similar results. Unlike the other algorithms, each aligner performed differently with PePr, with STAR identifying the most peaks, followed by BWA-MEM and then Bowtie2. Both the narrow- and broad-peak-calling algorithms yielded negligible differences, except for CisGenome, which had less deterministic results when peaks were defined as narrow and run through Rocketchip.


Figure 2. Peak counts for all software combinations.

ChIP-seq data from Namani et al. 2019 and Rube et al. 2016 were run through all software combinations in Rocketchip three times each. The data source (i.e. Rube vs. Namani) is yellow, with the light yellow corresponding to the Namani data and darker yellow corresponding to the Rube data. A subset of analyses was run with a control vs. no control. This is represented by light pink for a control being used during peak-calling and dark pink if a control was omitted during peak-calling. The data was run through the software combinations as both a broad peak (light orange) and narrow peak (dark orange) to evaluate algorithm performances, though the Namani data uses NRF2, which is characterized best by narrow peaks, and the Rube data uses MeCP2, which is characterized best by broad peaks. The aligners are represented in red, with Bowtie2 being a light red, BWA-MEM being a medium shade of red, and STAR being a dark red. The peak-caller is highlighted in green with shades of green getting progressively darker for CisGenome, Genrich, MACS3, and PePr. Similarly, the deduplicator is represented by shades of purple, with no deduplication (“None”) being the lightest, followed by Picard, Sambamba, and then Samtools, which is the darkest. Because each combination was run three times, each of the three trials is plotted by shades of blue. The first trial is a light shade of blue, the second trial is a medium shade, and the third trial is the darkest shade. Raw peak counts were log2 transformed and plotted in the heatmap. Darker red corresponds to higher peak counts while darker blue corresponds to lower peak counts. Gray corresponds to “NA” values, as PePr and CisGenome cannot be run without a control. Raw peak counts can be found in Supplementary Table 1. 
The heatmap was created using R (v4.2.3) with a kernel (r-irkernel v1.3.2) in Jupyter Notebook (v1.0.0) using the following packages: ComplexHeatmap (v2.14.0), dplyr (v1.1.4), tidyr (v1.3.1), reshape2 (v1.4.4), stringr (v1.5.1), and MASS (v7.3.60.0.1).

For the Namani data, notable contrasts in peak-calling outcomes were observed across various algorithms. Genrich consistently yielded the highest number of peaks, contrasting with the Rube data where MACS3 showed higher peak counts. Interestingly, PePr consistently produced the lowest peak counts for the Namani dataset despite performing second best for the Rube data. The choice of aligner did not significantly impact peak counts overall; however, omitting deduplication generally resulted in slightly higher peak counts across all peak-callers. Unlike the Rube data, where the distinction between narrow- and broad-peak calling showed minimal differences across algorithms, the Namani data exhibited variability based on this distinction. Specifically, MACS3 identified higher peak counts under the assumption of broad peaks compared to narrow peaks.

In addition to assessing the results of the different algorithms, we also assessed Rocketchip’s ability to replicate experimental findings. Both the Namani and Rube data sets were run through every software combination in Rocketchip three times, totaling 288 software combinations with three trials per combination. Among the 288 software combinations tested, 274 combinations (95.14%) demonstrated perfect replication of peak counts across all three trials (i.e. every trial yielded identical peak counts). Surprisingly, despite identical software versions, computational resources, and inputs, certain combinations of data and software were non-deterministic, with 4.86% (14 out of 288) exhibiting variability in peak counts (Table 1). Given that non-deterministic results were observed exclusively for the Rube data, this data was downsampled to 5,000 reads per sample using Seqtk (v1.4) and rerun through every combination 100 times to further assess the source of variation. While variation in peak counts persisted, the maximum range between the minimum and maximum observed peaks was limited to one across all combinations (see Supplementary Table 2). Notably, the software combinations that produced non-deterministic results differed from those in the initial analysis. Therefore, we hypothesize that the low level of variation observed in our results may stem from runtime errors.

Table 1. Variation in called peaks.

| Project | Control | Peak type | Aligner | Peak caller | Deduplicator | Trial 1 peak count | Trial 2 peak count | Trial 3 peak count | Difference in range of called peaks |
|---|---|---|---|---|---|---|---|---|---|
| Rube | No Control | Broad | STAR | Genrich | Sambamba | 252158 | 252160 | 252158 | 2 |
| Rube | No Control | Broad | STAR | MACS3 | Sambamba | 1157882 | 1157876 | 1157881 | 6 |
| Rube | No Control | Narrow | STAR | Genrich | Sambamba | 252160 | 252158 | 252159 | 2 |
| Rube | No Control | Narrow | STAR | MACS3 | Sambamba | 1168346 | 1168345 | 1168346 | 1 |
| Rube | With Control | Broad | STAR | Genrich | Sambamba | 127543 | 127542 | 127542 | 1 |
| Rube | With Control | Broad | STAR | MACS3 | No Deduplication | 1002670 | 1002671 | 1002670 | 1 |
| Rube | With Control | Broad | STAR | MACS3 | Sambamba | 1004045 | 1004044 | 1004045 | 1 |
| Rube | With Control | Broad | STAR | PePr | Sambamba | 568556 | 568556 | 568554 | 2 |
| Rube | With Control | Narrow | STAR | Genrich | Sambamba | 127542 | 127543 | 127543 | 1 |
| Rube | With Control | Narrow | STAR | MACS3 | Sambamba | 1143542 | 1143541 | 1143542 | 1 |
| Rube | With Control | Narrow | STAR | PePr | No Deduplication | 1713483 | 1713484 | 1713484 | 1 |
| Rube | With Control | Narrow | STAR | PePr | Sambamba | 721309 | 721307 | 721309 | 2 |
| Rube | With Control | Narrow | BWA-MEM | CisGenome | Picard | 175097 | 15553 | 15553 | 159544 |
| Rube | With Control | Narrow | BWA-MEM | CisGenome | Sambamba | 17690 | 15553 | 35909 | 20356 |

Overall, these findings underscore four critical points. First, the choice of software combination, including the algorithm, deduplication method, and aligner, has a significant impact on peak-calling outcomes. Second, even with strict control of factors impacting the analysis, we still observe some variability in peak calling, albeit in a small subset of cases. Third, dataset-specific nuances significantly impact the performance of different software combinations, resulting in a differing consensus on what the “best” software combination is. Finally, the results validate that all available software combinations can be successfully executed within Rocketchip, ensuring flexibility and robustness in ChIP-seq analyses.

Benchmarking peak detection on simulated data and evaluating algorithm performance using F1 scores, precision, and sensitivity

Benchmarking algorithm performance on real biological data is challenging, as true binding sites are not known. To illustrate Rocketchip’s ability to benchmark algorithm performance, six simulated test data sets were designed for each combination of PE and SE reads with narrow and broad peaks. Each data set has two replicates and a synthetic genome composed of uniformly distributed base pairs. Each data set is modeled to contain precisely 1000 peaks, with the number of 150 base pair reads per peak varied across data sets to represent peak coverage depths of 4X, 8X, 16X, 32X, 64X, and 128X. Peaks were evenly distributed 2500 base pairs apart for narrow peaks and 5000 base pairs apart for broad peaks. Read locations within peaks were simulated with a Gaussian distribution centered on the peak summit, spanning 400 base pairs for narrow peaks and 1500 base pairs for broad peaks. Reads were dispersed with a standard deviation of 0.1 to ensure variability around the peak summit. A study assessing PCR duplication in ChIP-seq found that PCR duplication ranged from 2.94–55.38% among their tested ChIP-seq data sets, ultimately concluding that a rate of 20% duplication or less is most effective for read mapping.39 Thus, PCR duplicates were simulated at a 20% rate. Input controls, which represent sonicated DNA fragments, were modeled as randomly distributed genomic reads.
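As a rough illustration of the simulation design described above, the sketch below places reads with a Gaussian offset around each evenly spaced peak summit and then appends PCR duplicates at a 20% rate. This is a minimal sketch, not the authors’ simulation code: the default of four reads per peak (standing in for the 4X case) and the interpretation of the 0.1 standard deviation as a fraction of the peak span are assumptions made for this example.

```python
import random

def simulate_peak_reads(n_peaks=1000, spacing=2500, span=400,
                        reads_per_peak=4, read_len=150,
                        dup_rate=0.20, seed=1):
    """Sketch of the simulated ChIP-seq design: Gaussian read placement
    around evenly spaced peak summits, plus 20% PCR duplicates.

    Assumptions (not from the article): sigma = 0.1 * span, and
    reads_per_peak=4 represents the 4X coverage case.
    """
    rng = random.Random(seed)
    reads = []
    for i in range(n_peaks):
        summit = i * spacing + spacing // 2      # peaks evenly spaced apart
        for _ in range(reads_per_peak):
            offset = rng.gauss(0, 0.1 * span)    # Gaussian around the summit
            start = max(0, int(summit + offset) - read_len // 2)
            reads.append((start, start + read_len))
    # simulate PCR duplicates at a fixed rate by re-sampling existing reads
    reads.extend(rng.choices(reads, k=int(len(reads) * dup_rate)))
    return reads
```

With the defaults above, the function yields 4000 unique read placements plus 800 duplicates; an input control could be modeled the same way by drawing read starts uniformly across the synthetic genome instead of around summits.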

These simulated data sets were run through all combinations in Rocketchip, both including and excluding the input control, which accounts for genomic noise and is believed to improve accuracy. CisGenome and PePr require a control, so they were only tested with the input control. We varied read type (PE and SE) and peak type (narrow and broad) to systematically evaluate software performance across all combinations. The objective of this experiment was to assess the accuracy, precision, and sensitivity of each combination: accuracy, measured using the F1 score, reflects how well called peaks match true binding sites; precision indicates the proportion of true peaks among identified peaks; and sensitivity measures the proportion of true peaks detected.

F1, precision, and sensitivity scores were calculated based on true positive (TP), false positive (FP), and false negative (FN) base pairs within peaks. TP represents overlapping base pairs between expected and observed peaks, FP represents base pairs in observed peaks not matching expected peaks, and FN represents base pairs in expected peaks not observed. F1 scores were calculated as [2TP/(2TP + FP + FN)], precision as [TP/(TP + FP)], and sensitivity as [TP/(TP + FN)]. For CisGenome and PePr, TP, FP, and FN were set to zero, as they cannot be run without a control, indicating an inability to be accurate, precise, or sensitive without using a control.
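The definitions above translate directly into code. The following is a minimal sketch of this scoring scheme, not the authors’ scoring script: `confusion_bp` counts base-pair-level TP/FP/FN between expected and observed peak intervals (assuming half-open [start, end) coordinates), and `peak_scores` applies the stated formulas.

```python
def confusion_bp(expected, observed):
    """Count TP/FP/FN base pairs between expected and observed peak
    intervals, given as (start, end) half-open coordinate pairs."""
    exp, obs = set(), set()
    for s, e in expected:
        exp.update(range(s, e))
    for s, e in observed:
        obs.update(range(s, e))
    tp = len(exp & obs)   # base pairs shared by expected and observed peaks
    fp = len(obs - exp)   # observed base pairs outside any expected peak
    fn = len(exp - obs)   # expected base pairs missed by observed peaks
    return tp, fp, fn

def peak_scores(tp, fp, fn):
    """F1, precision, and sensitivity from base-pair TP/FP/FN counts,
    using the formulas given in the text (zero when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * tp) / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return f1, precision, sensitivity
```

For example, an observed peak covering half of a 100 bp expected peak and extending 50 bp beyond it yields TP = FP = FN = 50, and all three scores equal 0.5. The set-based overlap count is simple but memory-hungry for genome-scale intervals; an interval-arithmetic version would be used in practice.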

The results of this experiment revealed that, generally, narrow peak-calling had higher accuracy, precision, and sensitivity than broad peak-calling (Figure 3). Interestingly, the peak callers with the highest F1, precision, and sensitivity scores, Genrich and MACS3, performed better when run without an input control compared to with a control. MACS3 and Genrich had the highest accuracy, precision, and sensitivity, followed by PePr and CisGenome, which is consistent with the analyses using the Rube and Namani data. No major differences were observed based on the aligner or deduplicator used in any of the analyses. Overall, this analysis provides further insights into the algorithm performances, allowing for more robust and informed decisions in ChIP-seq data analysis.


Figure 3. Accuracy, precision, and sensitivity measures for Rocketchip software combinations.

Simulated ChIP-seq data was run through all software combinations in Rocketchip. For each heatmap, the endedness of the data (PE vs. SE) is pink, with the light pink corresponding to PE data and dark pink corresponding to SE data. The data was run using broad peak (light orange) and narrow peak (dark orange) algorithms depending on whether broad or narrow peaks were modeled, respectively. The aligners are represented in red, with Bowtie2 being a light red, BWA-MEM being a medium shade of red, and STAR being a dark red. The peak-caller is highlighted in green with shades of green getting progressively darker for Cisgenome, Genrich, MACS3, and PePr. Similarly, the deduplicator is represented by shades of purple, with no deduplication (“None”) being the lightest, followed by Picard, Sambamba, and then Samtools, which is the darkest. The peak coverage, ranging from 4X-128X is depicted using shades of blue, with the shade getting darker as peak coverage increases for the given data set. Darker red shading in the plotted heatmap values corresponds to higher accuracy, precision, or sensitivity, while darker blue corresponds to lower accuracy, precision, or sensitivity. (A) The plotted values represent the calculated F1 scores when incorporating a control during peak-calling. (B) The plotted values represent the calculated F1 scores when omitting a control during peak-calling. (C) The plotted values represent the calculated precision scores when incorporating a control during peak-calling. (D) The plotted values represent the calculated precision scores when omitting a control during peak-calling. (E) The plotted values represent the calculated sensitivity scores when incorporating a control during peak-calling. (F) The plotted values represent the calculated sensitivity scores when omitting a control during peak-calling. 
The heatmaps were created using R (v4.2.3) with a kernel (r-irkernel v1.3.2) in Jupyter Notebook (v1.0.0) using the following packages: ComplexHeatmap (v2.14.0), dplyr (v1.1.4), tidyr (v1.3.1), reshape2 (v1.4.4), stringr (v1.5.1), and MASS (v7.3.60.0.1).

Demonstrating Rocketchip’s compatibility with experimental ChIP-seq data from varying genomes

Rocketchip was tested for compatibility with a range of genomes from the UCSC Genome Browser, which includes the seven commonly used genomes displayed by default in the “Genomes” tab: human (hg38), mouse (mm10), rat (rn6), zebrafish (danRer11), fruitfly (dm6), worm (ce11), and yeast (sacCer3). Three published ChIP-seq data sets on the SRA with varying read coverage were selected per genome and run through Rocketchip (Figure 4).40–51 All experiments used PE data and were conducted using BWA-MEM for alignment, Samtools for deduplication, and MACS3 for peak-calling. These experiments were run on an HPC with 64 CPUs and 250 GB of memory available, running one job at a time without parallelization. For consistency, genome copies were deleted between runs using the same genome so that the run time accounts for the full workflow. The results of this experiment validated Rocketchip’s compatibility with the seven primary genomes from the UCSC Genome Browser and its use of sequence data hosted on the SRA. Additionally, it provides users with estimates of how long Rocketchip should run for different genomes to allow for appropriate computational resource requests. Run times ranged from 0.55 hours to 20.54 hours depending on the genome size and read data used.


Figure 4. Rocketchip execution times per genome.

This is a horizontal bar plot representing how long (in hours) each sample per genome took to run. The label on the bars represents the time in strictly hours (i.e. not hours and minutes) that each sample took to run through the pipeline from beginning to end assuming no prior storage of any data required of the analysis. The color of the bars corresponds to which organism the sample comes from and which genome it was run with: red for dm6, orange for hg38, green for mm10, turquoise for rn6, blue for ce11, purple for sacCer3, and pink for danRer11. The bar plot was created using R (v4.2.3) with a kernel (r-irkernel v1.3.2) in Jupyter Notebook (v1.0.0) using the following packages: ggplot2 (v3.4.4) and dplyr (v1.1.4).

Conducting CUT&RUN and CUT&Tag data analysis with Rocketchip

As of June 18, 2024, there are 363,213 ChIP-seq, 494,128 CUT&RUN, and 19,492 CUT&Tag data sets available on the SRA. Due to the increasing use of CUT&RUN and CUT&Tag, we wanted to assess Rocketchip’s ability to effectively process data generated by these mapping techniques. Therefore, we applied Rocketchip to CUT&RUN and CUT&Tag data generated by Akdogan-Ozdilek et al. that sought to characterize the zebrafish epigenome during embryogenesis.52 This data set was chosen due to the thorough documentation of the results and high sequence data quality.

In evaluating the performance of Rocketchip for CUT&RUN and CUT&Tag data analysis, we assessed several key metrics: the percentage of reads aligned, sequence duplication percentages, and peak counts ( Table 2). Alignment percentage was chosen because it reflects overall data processing efficiency and alignment accuracy, serving as a key indicator of how effectively Rocketchip handles the unique characteristics of CUT&RUN and CUT&Tag data sets and ensuring that a substantial proportion of reads are appropriately mapped to the reference genome and available for further analysis. Sequence duplication percentages were evaluated because they indicate Rocketchip’s ability to reduce data redundancy while maintaining data quality throughout the pipeline. Additionally, peak counts provide insight into Rocketchip’s ability to identify and quantify expected protein binding sites. Comparisons between processed read counts and peak counts demonstrate that Rocketchip scales appropriately with the number of reads remaining after data processing, reflecting its capacity to accurately assess and interpret the genomic data.
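As a rough sanity check, these metrics are internally consistent: for the single-end samples, the read count after processing approximately equals the raw reads scaled by the alignment rate and the fraction of reads surviving deduplication. The sketch below uses the SRR14850825 row from Table 2; the result is approximate only because the table rounds the percentages (for the paired-end rows, the processed counts additionally count both mates of each pair).

```python
# Sanity check on Table 2 (SRR14850825, single-end):
# reads after processing ~ raw reads x aligned fraction x (1 - duplication).
raw_reads = 31_343_063
aligned = 0.9568        # "% Aligned (Rocketchip)"
duplication = 0.4094    # "Sequence duplication"

surviving = raw_reads * aligned * (1 - duplication)
print(round(surviving))  # within ~0.005% of the reported 17,710,734
```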

Table 2. Evaluation of Rocketchip using CUT&RUN and CUT&Tag data.

| SRA ID | Origin | Endedness | Metadata | Raw reads | % Aligned (Akdogan-Ozdilek et al.) | % Aligned (Rocketchip) | Sequence duplication | Read counts after processing | Peak counts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SRR14850825 | CUT&RUN | Single | 6hpf_H3K4me3_rep1 | 31,343,063 | 77.19% | 95.68% | 40.94% | 17,710,734 | 144732 |
| SRR14850826 | CUT&RUN | Single | 6hpf_H3K4me3_rep2 | 70,960,918 | 76.16% | 95.41% | 53.75% | 31,313,977 | |
| SRR14850827 | CUT&RUN | Single | 6hpf_H3K27me3_rep1 | 26,960,471 | 70.93% | 95.97% | 59.18% | 10,560,814 | 71402 |
| SRR14850828 | CUT&RUN | Single | 6hpf_H3K27me3_rep2 | 27,851,277 | 70.16% | 95.34% | 58.47% | 11,026,320 | |
| SRR14850829 | CUT&RUN | Single | 6hpf_H3K9me3_rep1 | 22,214,067 | 22.03% | 96.25% | 42.26% | 12,345,779 | 139743 |
| SRR14850830 | CUT&RUN | Single | 6hpf_H3K9me3_rep2 | 43,462,568 | 23.13% | 96.19% | 48.13% | 21,685,002 | |
| SRR14850831 | CUT&RUN | Paired | 6hpf_pol2_rep1 | 9,842,850 | 73.38% | 92.91% | 6.85% | 17,037,027 | 29723 |
| SRR14850832 | CUT&RUN | Paired | 6hpf_pol2_rep2 | 5,707,930 | 72.58% | 92.45% | 4.78% | 10,050,071 | |
| SRR14850833 | CUT&RUN | Paired | 6hpf_IgG | 8,094,912 | 59.14% | 78.45% | 12.87% | 11,065,472 | |
| SRR14870792 | CUT&Tag | Paired | 6hpf_H2AZ_Rep1 | 1,233,216 | 91.34% | 84.53% | 49.67% | 1,049,096 | 45588 |
| SRR14870793 | CUT&Tag | Paired | 6hpf_H2AZ_Rep2 | 1,447,412 | 90.22% | 82.65% | 53.82% | 1,104,820 | |
| SRR14870794 | CUT&Tag | Paired | 6hpf_H2AZ_Rep3 | 3,135,635 | 90.80% | 78.47% | 40.58% | 2,923,649 | |
| SRR14870795 | CUT&Tag | Paired | 24hpf_H2AZ_Rep1 | 23,376,962 | 93.17% | 85.90% | 61.00% | 15,657,482 | 74207 |
| SRR14870796 | CUT&Tag | Paired | 24hpf_H2AZ_Rep2 | 28,652,092 | 94.05% | 85.68% | 72.26% | 13,612,305 | |
| SRR14870797 | CUT&Tag | Paired | 24hpf_H2AZ_Rep3 | 23,810,168 | 93.61% | 85.51% | 66.74% | 13,536,403 | |

Peak counts are reported once per replicate group (cells merged in the published table).

The CUT&RUN data set consisted of nine samples. Six were SE reads corresponding to two replicates each for detection of H3K4me3, H3K27me3, and H3K9me3 (SRR14850825 and SRR14850826, SRR14850827 and SRR14850828, and SRR14850829 and SRR14850830, respectively). The remaining three were PE read samples corresponding to two replicates for detection of RNA polymerase II and a mock IP control using the IgG antibody (SRR14850831 and SRR14850832, and SRR14850833, respectively). The study that originally produced these data sets employed Bowtie2 to align sequences, Samtools to filter aligned sequences, and HOMER for peak-calling. Rocketchip was run using Bowtie2, Samtools, and MACS3 for broad peak-calling. Because the data were partially SE and partially PE reads, the two sets were analyzed separately, with the IgG control used in the PE analysis, where it is compatible. Sequence duplication percentages varied across samples, with some samples exhibiting high duplication rates indicative of data redundancy, which Rocketchip managed effectively. Original alignment percentages were compared to those obtained via Rocketchip. A paired t-test conducted using each SRA input as an observation yielded a p-value of 0.00302, with Rocketchip consistently producing significantly better alignment percentages for the CUT&RUN data than the original analysis; the mean difference in alignment percentages was 32.66%. We also noted that Rocketchip detected a substantial number of peaks relative to the processed reads, indicating its ability to effectively separate signal (peaks) from biological noise.
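This paired comparison can be reproduced directly from the Table 2 alignment percentages with only the standard library; the reported p-value of 0.00302 corresponds to this t statistic at 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# "% Aligned" values for the nine CUT&RUN samples, in Table 2 order
original   = [77.19, 76.16, 70.93, 70.16, 22.03, 23.13, 73.38, 72.58, 59.14]
rocketchip = [95.68, 95.41, 95.97, 95.34, 96.25, 96.19, 92.91, 92.45, 78.45]

diffs = [r - o for r, o in zip(rocketchip, original)]
mean_diff = mean(diffs)                            # 32.66 percentage points
t = mean_diff / (stdev(diffs) / sqrt(len(diffs)))  # paired t statistic, df = 8

print(f"mean difference = {mean_diff:.2f}, t = {t:.2f}")
```

The mean difference comes out to 32.66, matching the value reported above.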
We hypothesize that the alignment accuracy differences are related to the use of different Bowtie2 versions: the original study used Bowtie 2.4.1 to align their CUT&RUN data, whereas Rocketchip used version 2.5.2. The GitHub change log for Bowtie2 version 2.5.1, one version earlier than the one Rocketchip uses, notes: “fixed an issue affecting bowtie2 alignment accuracy” and “fixed a segmentation fault that would occur while aligning SRA data.” This highlights the value of revisiting publicly available data, as software updates can improve data processing and thus yield more accurate results. It also highlights Rocketchip’s ability to accurately and effectively process CUT&RUN data.

The CUT&Tag data set used for testing Rocketchip comprised six PE read samples: three replicates each of H2A.Z at 6 hours and 24 hours post fertilization (SRR14870792, SRR14870793, and SRR14870794, and SRR14870795, SRR14870796, and SRR14870797, respectively). The original study used Bowtie2 for alignment, Samtools for filtering, Picard for deduplication, and MACS2 for peak-calling. We ran Rocketchip using Bowtie2, Samtools, and MACS3. As with the CUT&RUN data, sequence duplication percentages varied, and Rocketchip processed the data efficiently without encountering any errors. Original alignment percentages were compared to those obtained via Rocketchip. A paired t-test conducted using each SRA input as an observation yielded a p-value of 0.00015, with Rocketchip producing lower alignment percentages for the CUT&Tag data than the original analysis. The magnitude of the difference, however, was smaller than for the CUT&RUN data: the mean difference was −8.41% for CUT&Tag as opposed to +32.66% for CUT&RUN. Moreover, the alignment percentages achieved in our Rocketchip CUT&Tag analysis still surpassed every alignment percentage reported in the original study for their CUT&RUN data, suggesting that Rocketchip is suitable for analyzing CUT&Tag data. We hypothesize that the alignment difference may be due to the use of both Samtools and Picard in the original CUT&Tag analysis, as opposed to Samtools alone in Rocketchip.

Discussion

Rocketchip is a novel tool due to its approach to automating, and allowing flexibility in, ChIP-seq, CUT&RUN, and CUT&Tag data analysis workflows. Unlike traditional methods, which often require manual intervention and lack reproducibility, Rocketchip provides a straightforward solution by integrating existing software to automatically run analyses for large-scale datasets. Researchers can easily interchange analysis components and rerun their analysis to identify the most appropriate software options for their data. Additionally, Rocketchip was designed to be user-friendly, making it more accessible to researchers with limited bioinformatics expertise than traditional methods that require users to navigate software installation, parameter determination, and command inputs from scratch, thereby promoting broader utilization of publicly available sequence data.

In our analyses of Rocketchip using published experimental ChIP-seq, CUT&RUN, and CUT&Tag data, we demonstrated Rocketchip’s ability to handle diverse software combinations seamlessly, which is critical given the variability observed in peak-calling efficiency across different datasets and software tools. We found that the choice of peak caller, aligner, and deduplication method significantly influenced peak-calling outcomes in a data-specific manner. These disparities persisted even when identical software combinations were employed, highlighting the significant impact of dataset-specific factors on the results. This can likely be attributed to NRF2 exhibiting a narrow-peak pattern, while MeCP2 exhibits a broad-peak pattern with a low signal-to-noise ratio.37,38,53 These variations underscore the importance of selecting a software combination tailored to the specific experimental context.

Rocketchip demonstrated a notable ability to replicate results across multiple runs: 95.14% of the 288 software combinations tested (274/288) yielded identical peak counts in all three trials conducted for each combination. However, the remaining 4.86% of combinations (14/288) exhibited variability in peak counts, highlighting instances of non-deterministic behavior despite using the same software versions, computational resources, and input data. This inconsistency underscores the complexity of peak-calling algorithms and their sensitivity to specific data and software combinations. In our follow-up analysis with the downsampled Rube data, peak counts once again showed non-deterministic behavior, but variation was limited to a maximum range of one peak across all combinations. Interestingly, the software combinations yielding these inconsistent results differed from those in the initial analysis, suggesting that runtime or concurrency errors related to multithreading may contribute to the observed variability. Ultimately, the persistence of any variation in results using Rocketchip highlights the need for further investigation of determinism in algorithms commonly used for analyzing genomic data. Furthermore, there is a seemingly constant push to standardize pipelines and tools, but these analyses demonstrate that standardization is likely not possible due to differences in genomic data. There is, therefore, an increasing need for flexible workflows that provide a streamlined approach to facilitate robust data analysis.
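A replication check of this kind is straightforward to automate: group the peak counts from repeated trials by software combination and flag any combination whose trials disagree. The sketch below uses invented combination names and peak counts purely for illustration.

```python
from collections import defaultdict

# Hypothetical (combination, trial, peak_count) records from repeated runs;
# names and counts are illustrative, not results from the study.
runs = [
    ("bwa_mem+samtools+macs3",   1, 1200),
    ("bwa_mem+samtools+macs3",   2, 1200),
    ("bwa_mem+samtools+macs3",   3, 1200),
    ("bowtie2+picard+cisgenome", 1, 843),
    ("bowtie2+picard+cisgenome", 2, 845),
    ("bowtie2+picard+cisgenome", 3, 843),
]

counts = defaultdict(set)
for combo, _trial, peaks in runs:
    counts[combo].add(peaks)

# A combination is deterministic only if all trials produced one peak count
nondeterministic = sorted(c for c, s in counts.items() if len(s) > 1)
print(nondeterministic)
```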

In our analyses of Rocketchip using simulated data, we evaluated the accuracy, precision, and sensitivity of various software combinations. Notably, our results indicate that peak-calling performance varied significantly with different configurations and peak types, particularly at lower coverage levels. Consistent trends observed across both simulated and experimental datasets, including the Rube and Namani data, highlight the importance of careful selection of software in the analysis of ChIP-seq data. While synthetic datasets provide a controlled environment for testing algorithm performance, they inherently lack the complexity of real biological samples, necessitating further investigation in more diverse biological contexts.
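The kind of evaluation described here can be sketched in a few lines: score a called peak as a true positive when it overlaps a simulated truth interval, then compute precision and recall. The intervals below are invented for illustration, and the actual scoring criteria used in the study may differ (e.g. minimum-overlap thresholds).

```python
# Toy precision/recall for peak calls against simulated truth intervals.
# A called peak is a true positive if it overlaps any truth interval.
truth  = [(100, 200), (500, 600), (900, 1000)]   # simulated peaks (invented)
called = [(110, 190), (480, 610), (1200, 1300)]  # calls from one combination

def overlaps(a, b):
    """Half-open intervals overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

tp_calls = sum(any(overlaps(c, t) for t in truth) for c in called)
precision = tp_calls / len(called)
recall = sum(any(overlaps(t, c) for c in called) for t in truth) / len(truth)
print(precision, recall)
```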

For the CUT&RUN analysis, Rocketchip significantly improved the percentage of mapped reads, likely due to the use of an updated Bowtie2 version. For CUT&Tag, however, Rocketchip yielded lower read alignment that could not be easily explained by differences in the aligner or deduplicator used prior to peak-calling. This unresolved discrepancy highlights the necessity of documenting software versions and parameters used in analyses to enable replication of results.

When using Rocketchip, a few limitations should be considered. First, broad- and narrow-peak calling must be done in separate Rocketchip runs, due to the inherent variability in how peaks are represented via read counts. Similarly, SE and PE data sets must be run separately, as these data types are processed differently at the start of the analysis. Additionally, updating software may yield incompatibilities between dependencies; however, Rocketchip ensures version control via Conda to eliminate potential problems with version incompatibilities.

Future goals for Rocketchip include packaging the Rocketchip source code and all dependencies to be Conda-installable. We are also interested in updating Rocketchip to facilitate cloud computing options on a per-rule basis (i.e., submitting each rule as its own job). We also look forward to expanding Rocketchip’s selection of software, including other peak-calling algorithms, to provide further user customization options. For instance, WASP54 has recently become a leading method for deduplication of sequence data. It takes an allele-aware approach to mapping reads back to the reference genome, discarding reads that fail to map to the same region of the genome when the complementary read is considered. This approach reduces false positives for allele imbalance and can help improve peak quality in ChIP-seq data. The algorithm has already been incorporated into the STAR aligner. Thus, we aim to release a future version of Rocketchip with the option of using WASP, thereby circumventing the subsequent deduplication step in the pipeline.

Currently, Rocketchip’s core function is to provide researchers with flexibility across different avenues for analyzing their data, ultimately resulting in the identification of enriched regions (i.e., binding sites). However, differences in experimental design pose challenges for downstream analysis, because ChIP-seq can be used for various purposes, such as motif finding, differential binding analysis, or characterizing the epigenomic landscape. For instance, experimental uses of ChIP-seq range from characterizing the motif of a newly identified transcription factor to understanding protein binding behavior across different levels of protein expression. This diversity in ChIP-seq applications highlights the difficulty of creating a one-size-fits-all solution for post-peak-calling analysis. Nevertheless, Rocketchip’s adaptability allows it to serve as a robust tool for a wide range of research goals, leaving downstream analysis customizable to the specific needs of researchers. Should researchers choose to implement motif-finding for their data, many options are available, including HOMER,55 HOCOMOCO,56 MEME SUITE,57 and XXmotif.58 Other options for downstream analysis include differential binding analysis using tools such as BEDTools59 or GenomicRanges60 and gene ontology/enrichment analysis using tools such as Enrichr61 and GOnet.62

Future goals for analyses using Rocketchip include conducting a meta-analysis of all published data sets for a specific transcription factor. With increased sample sizes and varying coverage, this may yield improved characterization of transcription factor motifs and in vivo binding properties across a variety of tissue types and conditions. Additionally, we hope to conduct tests on simulated ChIP-seq data to better understand which factors impact ChIP-seq data analysis and how. This includes modeling narrow vs. broad peak regions, as well as varying peak density and coverage, GC-rich regions, overlapping and bimodal peak regions, and levels of PCR duplication. Synthetic data with these modeled characteristics can be run through the various software combinations within Rocketchip to better understand which tools are best suited to different types of data. Ultimately, this would further researchers’ ability to tailor specific analysis tools to their data.

Software availability

Ethics and consent

Ethical approval and consent were not required.

How to cite this article: Haghani V, Goyal A, Zhang A et al. Improving rigor and reproducibility in chromatin immunoprecipitation assay data analysis workflows with Rocketchip [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2025, 14:625 (https://doi.org/10.12688/f1000research.164319.1)
Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 25 Jun 2025
Views
9
Cite
Reviewer Report 11 Aug 2025
Karol Bomsztyk, University of Washington, Seattle, USA
Approved

The authors developed a useful tool for analysis of ChIP-seq, CUT&RUN and CUT&Tag data. The work was carefully done, the tool is well described. Importantly, we tested the Rocketchip tool and it works well. We plan to …

How to cite this report: Bomsztyk K. Reviewer Report For: Improving rigor and reproducibility in chromatin immunoprecipitation assay data analysis workflows with Rocketchip [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2025, 14:625 (https://doi.org/10.5256/f1000research.180810.r395202)
Reviewer Report 08 Aug 2025
Zhengyao Shao, Pharmacology, University of California San Diego, San Diego, California, USA
Approved with Reservations

In this protocol by Haghani et al, the authors developed an automated bioinformatics workflow, named Rocketchip, which analyses local ChIP-seq data or ChIP-seq data from NCBI and SRA. Overall, this protocol is well written and the GitHub repository is well …

How to cite this report: Shao Z. Reviewer Report For: Improving rigor and reproducibility in chromatin immunoprecipitation assay data analysis workflows with Rocketchip [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2025, 14:625 (https://doi.org/10.5256/f1000research.180810.r398558)