DirectRepeateR: An R package for annotating direct repeats in genome assemblies

Megan Copeland; Andres Barboza; Joseph S. Romanowski; Zach N. Adelman; Heath Blackmon

doi:10.12688/f1000research.170810.1

Software Tool Article

[version 1; peer review: awaiting peer review]

Megan Copeland¹, Andres Barboza², Joseph S. Romanowski²,³, Zach N. Adelman²,³, Heath Blackmon¹,²,⁴

PUBLISHED 21 Oct 2025

Author details

¹ Biology, Texas A&M University, College Station, TX, 77843, USA
² Interdisciplinary Program in Genetics and Genomics, Texas A&M University, College Station, Texas, 77843, USA
³ Entomology, Texas A&M University, College Station, Texas, 77843, USA
⁴ Interdisciplinary Program in Ecology and Evolutionary Biology, Texas A&M University, College Station, Texas, 77843, USA

Megan Copeland
Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation

Andres Barboza
Roles: Software, Writing – Review & Editing

Joseph S. Romanowski
Roles: Conceptualization, Writing – Review & Editing

Zach N. Adelman
Roles: Conceptualization, Writing – Review & Editing

Heath Blackmon
Roles: Conceptualization, Formal Analysis, Methodology, Project Administration, Resources, Software, Supervision, Writing – Review & Editing

This article is included in the Genomics and Genetics gateway.

This article is included in the RPackage gateway.

Abstract

Background

Direct repeats in close proximity are targets of the single-strand annealing (SSA) pathway, a mutagenic DNA repair process that impacts genome integrity. Understanding the evolution and consequences of these sequences is a critical part of understanding eukaryotic genome evolution.

Methods

We present DirectRepeateR, an open-source R package that scans FASTA assemblies for exact, co-oriented repeats within a user-defined spacer window. We illustrate the utility of our software in an analysis of the Aedes aegypti genome by testing whether the distribution of direct repeats is consistent with selection acting against repeats in genic regions.

Results

Our results suggest that selection has acted against direct repeats that flank or overlap with protein-coding DNA sequences.

Conclusion

Our software provides an accurate, computationally efficient, user-friendly, and tailorable approach for detecting direct repeats.

Keywords

Direct repeats, DNA repair mechanisms, R package, repeat annotation

Corresponding author: Heath Blackmon

Competing interests: No competing interests were disclosed.

Grant information: MC, AB, and HB were supported by the National Institute of General Medical Sciences at the National Institutes of Health (R35GM138098). JSR and ZNA were supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health (AI148787 to ZNA).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2025 Copeland M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Copeland M, Barboza A, Romanowski JS et al. DirectRepeateR: An R package for annotating direct repeats in genome assemblies [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:1147 (https://doi.org/10.12688/f1000research.170810.1)
First published: 21 Oct 2025, 14:1147 (https://doi.org/10.12688/f1000research.170810.1)
Latest published: 21 Oct 2025, 14:1147 (https://doi.org/10.12688/f1000research.170810.1)

Introduction

The discovery of repetitive DNA sequences through reassociation experiments marked a foundational moment in genome biology.¹ However, the annotation of repetitive elements remains technically challenging, and despite their widespread presence, our understanding of their functional impacts continues to lag behind that of better-characterized genomic features. In part, repeat annotation is hampered by the sheer diversity of repetitive sequences. Some repeats occur as long tandem arrays, others as dispersed copies, and they span a wide range of sizes, sequence identities, and evolutionary origins.²

These repeat architectures have variable impacts on genomes and are now recognized for their roles in genome stability and instability, evolution, and disease.³⁻⁵ Repeats can influence mutation and recombination rates,⁶,⁷ and their expansion and contraction underlie human disorders such as Huntington’s disease and fragile-X syndrome.⁸ Additionally, microsatellites have been shown to evolve rapidly and vary considerably in abundance even among closely related taxa, underscoring their dynamic evolution⁹; they also contribute to epigenetic regulation and can influence gene expression by activating promoters or serving as transcription factor binding sites.¹⁰,¹¹

Proximal direct repeat pairs are of particular interest. These are two identical sequences in the same orientation, separated by a spacer region. In both bacteria and mammals, the distribution and abundance of long direct repeats appear to be shaped to minimize genome instability, with constrained chromosomal positioning in bacteria¹² and a potential role for natural selection in reducing repeat-mediated mutagenesis in the mitochondrial DNA of longer-lived mammals.¹³ This mutagenic potential is also evident at the molecular level, where direct repeat pairs facilitate the DNA repair pathway single-strand annealing (SSA) (Figure 1). After a double-strand break occurs, the 5’ ends of each strand are resected, and complementary repeats are used to anneal the strands. The repair mechanism is considered mutagenic because the intervening DNA between repeats and the downstream repeat is lost during the repair process.¹⁴

Figure 1. Single-strand annealing during DNA repair.

This pathway is initiated by a double-strand break between homologous DNA repeats followed by the resection of the 5' end strands, creating 3' overhangs. These overhangs find and anneal to complementary sequences, resulting in the loss of intervening DNA and the downstream repeat. The final step is ligation, where the DNA strands are rejoined, completing the repair process. Created in BioRender. Copeland, M. (2025) https://BioRender.com/d9g3lm9.
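The net effect of SSA on a sequence can be illustrated with a small Python sketch. It is a deliberately idealised toy model, not part of DirectRepeateR: the function name `ssa_repair` and the example sequence are hypothetical, and resection, annealing, and ligation are collapsed into a single string operation that removes the spacer and one repeat copy.

```python
def ssa_repair(sequence: str, repeat: str) -> str:
    """Idealised SSA outcome between the first two non-overlapping copies
    of `repeat`: the spacer and one repeat copy are lost."""
    first = sequence.find(repeat)
    if first == -1:
        return sequence  # repeat absent: nothing to anneal
    second = sequence.find(repeat, first + len(repeat))
    if second == -1:
        return sequence  # only one copy: nothing to anneal
    # Keep everything through the first copy, then resume after the second copy.
    return sequence[: first + len(repeat)] + sequence[second + len(repeat):]


# Toy example: flank + repeat + spacer (lower case) + repeat + flank.
before = "AAAA" + "GATTACA" + "ccccc" + "GATTACA" + "TTTT"
after = ssa_repair(before, "GATTACA")
print(after)                      # AAAAGATTACATTTT
print(len(before) - len(after))   # 12 bases lost: 5 bp spacer + 7 bp repeat copy
```

Any exon lying in the spacer or overlapping the lost copy would be disrupted in this model, which is why the position of repeat pairs relative to genes matters.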

Tools such as RepeatMasker,¹⁵ HelitronScanner,¹⁶ ab initio programmes like PILER,¹⁷ and combinatorial pipelines such as RepeatModeler¹⁸ are widely used to perform repeat annotation or classification. More recently, Repeater¹⁹ introduced fast, alignment-free profiling of diverse repeat classes and produces informative whole-chromosome visualisations. While these tools are well suited for identifying broad repeat families or consensus-based repeat structures, they cannot pinpoint exact, spatially localized repeat pairs or provide detailed annotation for each pair.

Here we present DirectRepeateR, an open-source R package that fills this gap. DirectRepeateR performs de novo identification of exact direct repeat pairs within a user-specified distance and returns analysis-ready CSV files or optional GFF files. DirectRepeateR is implemented in R with a C++ backend, and all user interaction remains within the R environment, making it easily accessible to researchers with minimal coding experience. By focusing specifically on exact direct repeat pairs at user-defined spatial and length resolutions, DirectRepeateR complements existing tools. This fine-scale, length-consistent detection is particularly useful for studying localized structural features, such as repeat-mediated recombination or deletion events.

To illustrate its use, we map direct repeat pairs in the Aedes aegypti genome and, using a Monte Carlo simulation, test whether the distribution of direct repeat pairs in Aedes aegypti is consistent with selection against their presence flanking exons.

Methods

Implementation

The DirectRepeateR package

DirectRepeateR is an R package designed to identify, annotate, and visualize nearby direct repeat sequences within genome assemblies. It offers a flexible solution for detecting direct repeats using a de novo approach, making it useful for both model and non-model organisms. The package provides three functions: GetRepeats, ConvertToGFF, and PlotRepeats. A vignette (S1) provides guidance on how to use each function.

GetRepeats function

The GetRepeats function identifies direct repeat sequences in genome assemblies, using R for a simple user interface and C++ for efficient processing. Although the package incorporates a C++ backend for performance, all user interaction takes place through the R interface, keeping the software accessible to researchers with basic R skills. The function takes a genome assembly in FASTA format as input, along with three parameters: query_length (the length of the substring used to search for repeats), maxdist (the maximum distance between repeat copies, which sets the window in which repeats are searched for), and minlength (the minimum length for a match to be reported as a repeat). If these parameters are not provided, default values are used (query_length = 25, maxdist = 20,000, and minlength = 50). These defaults are based on common expectations about the repeat structures targeted by the SSA pathway.¹⁴

The function begins by calling the C++ backend through the Rcpp package.²⁰ For each sequence in the FASTA file, the C++ routine extracts the sequence name and length and uses these to organize the repeat information. The algorithm slides across the sequence in steps equal to the query length. For each query, it searches for all exact matches within the range defined by maxdist, the maximum allowable distance between repeat copies, and repeats this until the end of the sequence is reached. When a match is found, the start and end positions of both the query and its match are recorded. As each sequence is completed, a data frame containing the position information is written out as a temporary per-chromosome CSV file. After the C++ routine finishes, GetRepeats reads these temporary files in R, filters them using the minlength argument, and combines them into the final list of detected direct repeats, including their start and end positions along with the positions of their corresponding matches.
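The stepping search described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the package's C++ code: the function names `find_seed_matches` and `merge_seeds` are hypothetical, maxdist is interpreted as the largest allowed gap between the end of the query and the start of its match, and the step that chains adjacent seed matches before applying minlength is our assumption about how seeds shorter than minlength could be combined.

```python
from typing import List, Tuple

# (query_start, query_end, match_start, match_end), 1-based inclusive
Hit = Tuple[int, int, int, int]


def find_seed_matches(seq: str, query_length: int = 25, maxdist: int = 20000) -> List[Hit]:
    """Step across seq in increments of query_length and record every exact
    downstream match that starts within maxdist of the query's end."""
    hits: List[Hit] = []
    for q_start in range(0, len(seq) - query_length + 1, query_length):
        q_end = q_start + query_length          # exclusive, 0-based
        query = seq[q_start:q_end]
        limit = q_end + maxdist + query_length  # match must lie fully before this
        pos = seq.find(query, q_end, limit)
        while pos != -1:
            hits.append((q_start + 1, q_end, pos + 1, pos + query_length))
            pos = seq.find(query, pos + 1, limit)
    return hits


def merge_seeds(hits: List[Hit], minlength: int = 50) -> List[Hit]:
    """Assumed post-processing: chain seeds that are contiguous in both
    copies, then keep chains of at least minlength bases."""
    merged: List[Hit] = []
    # Sorting by offset, then position, puts chainable seeds next to each other.
    for hit in sorted(hits, key=lambda h: (h[2] - h[0], h[0])):
        if merged and merged[-1][1] + 1 == hit[0] and merged[-1][3] + 1 == hit[2]:
            prev = merged[-1]
            merged[-1] = (prev[0], hit[1], prev[2], hit[3])
        else:
            merged.append(hit)
    return [h for h in merged if h[1] - h[0] + 1 >= minlength]


# Toy sequence: an 8 bp unit, a 6 bp spacer, the same unit again, and a 2 bp tail.
toy = "ACGTTGCA" + "GGGGGG" + "ACGTTGCA" + "CC"
seeds = find_seed_matches(toy, query_length=4, maxdist=10)
repeats = merge_seeds(seeds, minlength=8)
print(repeats)  # [(1, 8, 15, 22)]
```

The toy parameters (query_length = 4, minlength = 8) are scaled down from the defaults so that the example fits on one line; the logic is unchanged at query_length = 25 and minlength = 50.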

ConvertToGFF function

The ConvertToGFF function converts repeat data into a GFF (General Feature Format) file, a standard format for describing genes and other features in genomes. The user supplies the data frame of identified direct repeats returned by GetRepeats. The function preallocates vectors to store the GFF entries, including fields for chromosome, source, feature type (repeat_region for the full length and repeat_unit for individual copies), start and end positions, and attributes. Each repeat in the data frame generates three GFF entries: one for the full repeat region and two for the individual copies.
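The three-entries-per-repeat layout can be sketched as follows. This Python fragment is only an illustration of the idea: the function name `repeat_to_gff`, the argument names, the source label, the unstranded "." strand value, and the ID/Parent attribute scheme are all assumptions and may differ from what ConvertToGFF actually writes. Only the feature types repeat_region and repeat_unit are taken from the description above.

```python
from typing import List


def repeat_to_gff(chrom: str, first_start: int, first_end: int,
                  second_start: int, second_end: int,
                  repeat_id: str, source: str = "DirectRepeateR") -> List[str]:
    """Build three tab-delimited, nine-column GFF lines for one repeat pair."""

    def row(feature: str, start: int, end: int, attributes: str) -> str:
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes
        return "\t".join([chrom, source, feature, str(start), str(end),
                          ".", ".", ".", attributes])

    return [
        # Full region: start of the first copy through the end of the second.
        row("repeat_region", first_start, second_end, f"ID={repeat_id}"),
        # The two individual copies, linked to the region (hypothetical attributes).
        row("repeat_unit", first_start, first_end, f"ID={repeat_id}.1;Parent={repeat_id}"),
        row("repeat_unit", second_start, second_end, f"ID={repeat_id}.2;Parent={repeat_id}"),
    ]


gff_lines = repeat_to_gff("chr1", 1, 8, 15, 22, "DR1")
print("\n".join(gff_lines))
```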

PlotRepeats function

The PlotRepeats function uses the ggplot2 package²¹ and generates visualizations of repeat densities across chromosomes using a sliding window approach. This function uses the data generated by the GetRepeats function and allows users to specify window and step sizes (defaulting to 200 Kb for both if not provided). The function processes each chromosome by first calculating the midpoint of each repeat in the file and then sliding windows across the chromosome length. For each window, it counts the number of repeats in the window. It then generates a plot of repeat density along the chromosome.
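The windowed counting that underlies the density plot can be sketched in Python as shown below. The plotting itself is omitted, the function name `window_counts` is hypothetical, and we assume that the midpoint of a repeat means the midpoint of the full span from the start of the first copy to the end of the second; the package may define it differently.

```python
from typing import List, Tuple


def window_counts(repeat_spans: List[Tuple[int, int]], chrom_length: int,
                  window: int = 200_000, step: int = 200_000) -> List[Tuple[int, int, int]]:
    """Count repeat midpoints in windows slid along one chromosome.

    repeat_spans holds (start, end) of each repeat, 1-based inclusive.
    Returns (window_start, window_end, count) for every window.
    """
    midpoints = [(start + end) / 2 for start, end in repeat_spans]
    counts = []
    for w_start in range(1, chrom_length + 1, step):
        w_end = min(w_start + window - 1, chrom_length)  # last window may be short
        n = sum(w_start <= m <= w_end for m in midpoints)
        counts.append((w_start, w_end, n))
    return counts


# Three toy repeats on a 500 kb chromosome, using the default 200 Kb window and step.
density = window_counts([(100, 300), (150_000, 150_100), (250_000, 260_000)], 500_000)
print(density)  # [(1, 200000, 2), (200001, 400000, 1), (400001, 500000, 0)]
```

Setting step smaller than window gives overlapping windows and a smoother profile, at the cost of counting each repeat in more than one window.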

Operation

The DirectRepeateR package requires Rcpp (>= 1.1.0), and therefore requires a version of R >= 4.4.1 (available from www.r-project.org). Users can use the devtools package to install DirectRepeateR from GitHub (https://github.com/coleoguy/DirectRepeateR).

Calculation of observed vs. expected counts of Flanked Exons

To illustrate the use of our package, we analyzed the A. aegypti genome (GCF_002204515.2) with DirectRepeateR. The A. aegypti genome is approximately 1.3 Gb in size, with 65% being repetitive content.²² We used the DirectRepeateR package to map the location of direct repeats within this genome. For this, we define direct repeats as sequences that have a minimum element size of 50 bp, are oriented in the same direction, and are separated by no more than 20 Kb.

We then used the output from DirectRepeateR to explore whether the distribution of direct repeats was consistent with the hypothesis that selection should limit their proximity to exons. For this project, we considered an exon flanked if it was either spanned or overlapped by direct repeats. Essentially, any orientation where single-strand annealing would be expected to lead to the loss of exonic DNA was considered a flanking arrangement.
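One way to express this flanking criterion in code is as an interval-overlap test. The Python sketch below reflects our reading of the definition (an exon counts as flanked if it shares at least one base with the span running from the start of the first repeat copy to the end of the second); the function names are hypothetical, and the analysis scripts in the project repository may implement the rule differently.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def is_flanked(exon: Interval, pair_span: Interval) -> bool:
    """True if the exon overlaps the region from the first copy's start
    to the second copy's end, i.e. it is spanned or overlapped by the pair."""
    exon_start, exon_end = exon
    span_start, span_end = pair_span
    return exon_start <= span_end and exon_end >= span_start


def count_flanked(exons: List[Interval], pair_spans: List[Interval]) -> int:
    """Number of exons flanked by at least one repeat pair."""
    return sum(any(is_flanked(exon, span) for span in pair_spans) for exon in exons)


exons = [(100, 200), (500, 600), (900, 950)]
pair_spans = [(150, 400), (480, 700)]
print(count_flanked(exons, pair_spans))  # 2: the first is overlapped, the second is spanned
```

This quadratic loop is adequate for a toy example; at genome scale an interval tree or a sorted sweep would be the usual replacement.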

We calculated the observed and expected numbers of exons flanked by direct repeats in the A. aegypti genome. To get the observed counts, we used the direct repeat information produced by DirectRepeateR and the annotation file from NCBI (GCF_002204515.2). The GFF file was filtered to retain only unique protein-coding exons, and these exon locations were then compared with repeat locations to identify exons flanked by repeats. To determine expected counts, we employed a Monte Carlo simulation to create a null distribution of the number of flanked exons. We generated direct repeat positions by randomly sampling locations based on the length of each chromosome while maintaining the size of each repeat copy and the distance between copies. When the gap between repeats was under 26 bp, the subsequent repeat was placed by offsetting from the repositioned end of the prior repeat by the same gap, maintaining clustering. After positioning the repeats randomly, we used the same method described above to evaluate the number of flanked exons. This process was repeated 100 times to generate a null distribution. While not included in the DirectRepeateR package, the scripts used for this analysis are provided in a GitHub repository for researchers to use and adapt for their own needs. AI was used to revise software and analysis code to maximize efficiency.
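A simplified version of this randomization is sketched below in Python. It is not the authors' script: the function names are hypothetical, each repeat pair is relocated as a rigid unit to a uniformly chosen position on a single chromosome (which preserves copy sizes and spacing), and the special handling of pairs separated by fewer than 26 bp is left out for brevity.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def count_flanked(exons: List[Interval], pair_spans: List[Interval]) -> int:
    """Exons overlapping the span of at least one repeat pair."""
    return sum(any(es <= pe and ee >= ps for ps, pe in pair_spans) for es, ee in exons)


def null_flanked_counts(exons: List[Interval], pair_spans: List[Interval],
                        chrom_length: int, n_reps: int = 100, seed: int = 1) -> List[int]:
    """Relocate every repeat pair at random n_reps times and count flanked exons."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    null = []
    for _ in range(n_reps):
        shuffled = []
        for start, end in pair_spans:
            span = end - start  # moving the whole span keeps copy sizes and spacer
            new_start = rng.randint(1, chrom_length - span)
            shuffled.append((new_start, new_start + span))
        null.append(count_flanked(exons, shuffled))
    return null


exons = [(100, 200), (500, 600), (900, 950), (4000, 4200)]
pair_spans = [(150, 400), (480, 700), (7000, 7300)]
observed = count_flanked(exons, pair_spans)
null = null_flanked_counts(exons, pair_spans, chrom_length=10_000)
print(observed, sum(null) / len(null))  # observed count vs. mean of the null
```

The observed count is then compared with the spread of the null list, as in Figure 3.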

Results

Performance

We assessed the runtime of our repeat detection algorithm across a range of species with varying genome sizes. We used the genomes of Caenorhabditis elegans (GCF_000002985.6), Anopheles gambiae (GCF_943734735.2), Vitis vinifera (GCF_030704535.1), Oryzias latipes (GCF_002234675.1), Gallus gallus (GCF_016699485.2), Aedes aegypti (GCF_002204515.2), Mus musculus (GCF_000001635.27), and Homo sapiens (GCF_000001405.40), which span genome sizes from 100.3 Mb to over 3.1 Gb (Figure 2). Runtime scaled linearly with genome size (slope = 0.0677, R² = 0.96, p = 2.33 × 10⁻⁵), indicating a strong relationship with minimal residual variation. This supports the consistent performance of the method across a wide range of genome sizes. Smaller genomes, such as C. elegans (100.3 Mb), completed in under five minutes, while the largest genome, H. sapiens (3.1 Gb), required approximately 3.6 hours.

Figure 2. Runtime of the repeat detection algorithm across genomes of varying sizes.

The plot shows the relationship between genome size (in megabases, Mb) and runtime (in minutes) for eight species. The blue trend line indicates linear regression (R² = 0.96).

Observed vs. expected counts of Flanked Exons

In the A. aegypti genome, we observed that 5,782 out of 80,498 exons were flanked by direct repeats. The expected number, based on simulations, was calculated as just under 40,000, highlighting a significant deviation from what our null model predicted (Figure 3). When we analyzed the data by chromosome, chromosome 1 had 1,247 flanked exons, chromosome 2 had 2,427, and chromosome 3 had 2,108 (Figure 3). In all cases, we found that observed values were markedly lower than the expected numbers, with the simulations predicting nearly 14,000 flanked exons for chromosomes 2 and 3, and approximately 10,000 for chromosome 1. These results reveal a consistent pattern across the genome, where the number of exons flanked by direct repeats is far below random expectation.

Figure 3. Empirical vs. null counts of flanked exons in the Aedes aegypti genome.

The null distribution (colored points) represents the expected number of flanked exons under a random model, while the empirical values (colored points with black border) show the actual observed counts. Each chromosome and the genome-wide totals are presented separately. The left y-axis (blue) corresponds to chromosome-level comparisons, and the right y-axis (orange) represents genome-wide values.
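The observed counts reported above can be checked with a few lines of arithmetic. The Python snippet below uses only numbers stated in the text; the variable names are ours.

```python
# Observed flanked exons per chromosome, as reported in the text.
per_chromosome = {"chr1": 1247, "chr2": 2427, "chr3": 2108}
total_exons = 80498

observed = sum(per_chromosome.values())
percent_flanked = round(100 * observed / total_exons, 2)

print(observed)         # 5782, matching the genome-wide total
print(percent_flanked)  # 7.18 percent of exons are flanked
```

By comparison, a simulated expectation of just under 40,000 corresponds to roughly half of all exons.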

To investigate the relationship between gene density and repeat content, we analyzed the distribution of genes and direct repeats in 200 Kb windows along the three chromosomes of Aedes aegypti. For each chromosome, gene and direct repeat counts were calculated within each sliding window. To visualize the data, we created a scatterplot of repeat counts on a log scale against gene counts. Fitting a linear model to these data showed a statistically significant but weak negative relationship (slope = −10.34, R² = 0.006, p = 1.9 × 10⁻⁹), suggesting that regions with higher gene densities tend to have fewer repeats (Figure 4).

Figure 4. Scatterplot of repeat counts against gene counts in 200 Kb windows across the Aedes aegypti genome.

To improve visualization, repeat counts were log-transformed prior to plotting; however, the vertical axis was back-transformed so that values remain directly interpretable. Each point represents a window, with the horizontal axis showing the gene count per window and the vertical axis showing the repeat count. The blue trend line indicates linear regression (R² = 0.006).

Discussion

Purifying selection on repeats

Our analysis shows that exons are flanked by proximal direct repeat pairs less frequently than expected in A. aegypti, and repeat density decreases in gene-rich regions. Since direct repeats can cause deletions via the SSA pathway, their depletion near genes is likely due to purifying selection acting to minimize mutational risk. Across multiple species, purifying selection appears to act on repeats. Studies have suggested that selection may limit TE accumulation near genes,²³ restrict TEs to AT-rich, gene-poor regions,²⁴ and constrain microsatellite variation due to its potential to disrupt gene function.²⁵ Together, these findings support the view that purifying selection acts broadly across genome landscapes to restrict repeats from genic regions where their expansion or instability would be most harmful.

Limitations & Future work

The current version of DirectRepeateR is intentionally streamlined for one very specific task: detecting exact, proximal direct repeat pairs. As such, several broader repeat-analysis features are outside its present scope. First, the algorithm currently searches only for perfect matches in the same 5′→3′ orientation. The package does not yet identify degenerate, inverted, or mirror repeats. Second, DirectRepeateR identifies exact matches by stepping across the genome in increments equal to the specified query length (e.g., 25 base pairs), which can miss part of the start or end of a repeat if the window does not align perfectly with the repeat’s boundaries. Specifically, this approach may miss anywhere from zero base pairs to the query length minus one at both the beginning and the end of each repeat. Because repeats can start at any position along the genome, the number of base pairs lost from each end is uniformly distributed over this range. This limitation creates a trade-off between runtime and completeness. A smaller query length increases the number of iterations and the runtime but improves completeness by aligning the window more closely with repeat boundaries. A larger query length decreases runtime by reducing iterations but may miss more base pairs. These constraints reflect deliberate design choices that prioritise computational efficiency and ease of use for direct repeat studies, but they also highlight clear avenues for future development, most notably adding support for near-identical repeats and inverted orientations.
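The size of this boundary effect follows directly from the stepping scheme and can be worked through in a short Python sketch. The function name `bases_missed_at_start` is hypothetical, coordinates are 0-based so that query windows start at multiples of the query length, and the calculation concerns only the copy that is scanned by the stepping windows.

```python
def bases_missed_at_start(repeat_start: int, query_length: int) -> int:
    """Bases between a repeat's first base and the next window start,
    when windows begin at 0, query_length, 2 * query_length, ..."""
    return (-repeat_start) % query_length


query_length = 25
# One repeat start for every possible alignment relative to the window grid.
losses = [bases_missed_at_start(start, query_length) for start in range(query_length)]
mean_loss = sum(losses) / len(losses)

print(sorted(losses) == list(range(query_length)))  # True: every loss from 0 to 24 occurs once
print(mean_loss)                                    # 12.0 bases missed per end on average
```

Under this model the expected loss per end is (query_length − 1) / 2, so halving the query length roughly halves the trimmed sequence while doubling the number of windows to search.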

DirectRepeateR offers a targeted and accessible approach for identifying exact, proximal direct repeat pairs in genome assemblies. Researchers can fine-tune resolution and spatial sensitivity using adjustable parameters, such as maximum repeat distance, to detect repeat structures relevant to genome stability. With all user interaction handled in R, the package provides a lightweight interface suitable for users with minimal coding experience. DirectRepeateR complements existing repeat-annotation tools by filling a practical gap in the detection of spatially localized, length-consistent repeat pairs.

Code availability

Code for analysis available from: https://github.com/coleoguy/DRproj²⁶

Archived source at time of publication: 10.5281/zenodo.17073598

This project contains:

• scripts/* (R scripts for repeat annotation and observed flanking exon analysis)
• results/* (Outputs from the R scripts including repeat counts and flanked exon results)
• figures/* (Visualizations generated for the manuscript)

License: Creative Commons Attribution 4.0 International (CC-BY 4.0)

Software availability

Source code available from: https://github.com/coleoguy/DirectRepeateR²⁷

Archived source at time of publication: 10.5281/zenodo.16275409

License: MIT (an OSI-approved open-source license)

Data availability

Underlying data

Reference genomes and annotation files used for repeat annotation analysis were downloaded from NCBI:

• Aedes aegypti: GCF_002204515.2

Acknowledgements

Not applicable.

References

1. Britten RJ, Kohne DE: Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science. 1968; 161: 529–540.
2. Liao X, Zhu W, Zhou J, et al.: Repetitive DNA sequence detection and its role in the human genome. Commun Biol. 2023; 6: 954.
3. Palazzo AF, Gregory TR: The case for junk DNA. PLoS Genet. 2014; 10: e1004351.
4. Neguembor MV, Gabellini D: In junk we trust: repetitive DNA, epigenetics and facioscapulohumeral muscular dystrophy. Epigenomics. 2010; 2: 271–287.
5. Makalowski W: Not Junk After All. Science. 2003; 300: 1246–1247.
6. Ellegren H: Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 2000; 24: 400–402.
7. Wahls WP, Wallace LJ, Moore PD: The Z-DNA motif d(TG)30 promotes reception of information during gene conversion events while stimulating homologous recombination in human cells in culture. Mol. Cell. Biol. 1990; 10: 785–793.
8. Mirkin SM: DNA structures, repeat expansions and human hereditary disorders. Curr. Opin. Struct. Biol. 2006; 16: 351–358.
9. Jonika M, Lo J, Blackmon H: Mode and Tempo of Microsatellite Evolution across 300 Million Years of Insect Evolution. Genes. 2020; 11: 11.
10. Gebhardt F, Zänker KS, Brandt B: Modulation of epidermal growth factor receptor gene transcription by a polymorphic dinucleotide repeat in intron 1. J. Biol. Chem. 1999; 274: 13176–13180.
11. Contente A, Dittmer A, Koch MC, et al.: A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat. Genet. 2002; 30: 315–320.
12. Malhotra N, Seshasayee ASN: Replication-dependent organization constrains positioning of long DNA repeats in bacterial genomes. Genome Biol. Evol. 2022; 14: evac102.
13. Khaidakov M, Siegel ER, Shmookler Reis RJ: Direct repeats in mitochondrial DNA and mammalian lifespan. Mech. Ageing Dev. 2006; 127: 808–812.
14. Bhargava R, Onyango DO, Stark JM: Regulation of Single-Strand Annealing and its Role in Genome Maintenance. Trends Genet. 2016; 32: 566–575.
15. Nishimura D: RepeatMasker. Biotech Software & Internet Report. 2000; 1: 36–39.
16. Xiong W, He L, Lai J, et al.: HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proc. Natl. Acad. Sci. 2014; 111: 10263–10268.
17. Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005; 21 Suppl 1: i152–i158.
18. Flynn JM, Hubley R, Goubert C, et al.: RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA. 2020; 117: 9451–9457.
19. Kalendar R, Kairov U: Genome-wide tool for sensitive de novo identification and visualisation of interspersed and tandem repeats. Bioinform Biol Insights. 2024; 18: 11779322241306391.
20. Eddelbuettel D, Balamuta JJ: Extending R with C++: A brief introduction to Rcpp. Am. Stat. 2018; 72: 28–36.
21. Wickham H, Chang W, Wickham MH: Package “ggplot2.” Create elegant data visualisations using the grammar of graphics. Version. 2016; 2: 1–189.
22. Matthews BJ, Dudchenko O, Kingan SB, et al.: Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature. 2018; 563: 501–507.
23. Horvath R, Minadakis N, Bourgeois Y, et al.: The evolution of transposable elements in Brachypodium distachyon is governed by purifying selection, while neutral and adaptive processes play a minor role. eLife. 2024; 12: 12.
24. Rouxel T, Grandaubert J, Hane JK, et al.: Effector diversification within compartments of the Leptosphaeria maculans genome affected by Repeat-Induced Point mutations. Nat. Commun. 2011; 2: 202.
25. Press MO, McCoy RC, Hall AN, et al.: Massive variation of short tandem repeats with functional consequences across strains of Arabidopsis thaliana. Genome Res. 2018; 28: 1169–1178.
26. Copeland M, Blackmon H: coleoguy/DRproj: Initial release. Zenodo. 2025.
27. Copeland M, Blackmon H: coleoguy/DirectRepeateR: DirectRepeateR - Initial Release. Zenodo. 2025.

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 21 Oct 2025

Author details Author details

¹ Biology, Texas A&M University, College Station, TX, 77843, USA
² Interdisciplinary Program in Genetics and Genomics, Texas A&M University, College Station, Texas, 77843, USA
³ Entomology, Texas A&M University, College Station, Texas, 77843, USA
⁴ Interdisciplinary Program in Ecology and Evolutionary Biology, Texas A&M University, College Station, Texas, 77843, USA

Megan Copeland
Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation

Andres Barboza
Roles: Software, Writing – Review & Editing

Joseph S. Romanowski
Roles: Conceptualization, Writing – Review & Editing

Zach N. Adelman
Roles: Conceptualization, Writing – Review & Editing

Heath Blackmon
Roles: Conceptualization, Formal Analysis, Methodology, Project Administration, Resources, Software, Supervision, Writing – Review & Editing

Competing interests

No competing interests were disclosed.

Grant information

MC, AB, and HB were supported by the National Institute of General Medical Sciences at the National Institutes of Health R35GM138098. JSR and ZNA were supported by the National Institute of Allergies and Infectious Diseases, National Institutes of Health (AI148787 to ZNA).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (1)

version 1

Published: 21 Oct 2025, 14:1147

https://doi.org/10.12688/f1000research.170810.1

Copyright

© 2025 Copeland M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download

Export To

metrics

	Views	Downloads
F1000Research	-	-
PubMed Central Data from PMC are received and updated monthly.	-	-

Citations

0

SEE MORE DETAILS

CITE

how to cite this article

Copeland M, Barboza A, S. Romanowski J et al. DirectRepeateR: An R package for annotating direct repeats in genome assemblies [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:1147 (https://doi.org/10.12688/f1000research.170810.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

track

receive updates on this article

Track an article to receive email alerts on any updates to this article.

Open Peer Review

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 21 Oct 2025

Open Peer Review

Reviewer Status

AWAITING PEER REVIEW

Comments on this article

All Comments(0)

Add a comment

Sign up for content alerts

Browse by related subjects

1. Britten RJ, Kohne DE: Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science. 1968; 161: 529–540.

2. Liao X, Zhu W, Zhou J, et al.: Repetitive DNA sequence detection and its role in the human genome. Commun Biol. 2023; 6: 954.

3. Palazzo AF, Gregory TR: The case for junk DNA. PLoS Genet. 2014; 10: e1004351.

4. Neguembor MV, Gabellini D: In junk we trust: repetitive DNA, epigenetics and facioscapulohumeral muscular dystrophy. Epigenomics. 2010; 2: 271–287.

5. Makalowski W: Not Junk After All. Science. 2003; 300: 1246–1247.

6. Ellegren H: Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 2000; 24: 400–402.

7. Wahls WP, Wallace LJ, Moore PD: The Z-DNA motif d(TG)30 promotes reception of information during gene conversion events while stimulating homologous recombination in human cells in culture. Mol. Cell. Biol. 1990; 10: 785–793.

8. Mirkin SM: DNA structures, repeat expansions and human hereditary disorders. Curr. Opin. Struct. Biol. 2006; 16: 351–358.

9. Jonika M, Lo J, Blackmon H: Mode and Tempo of Microsatellite Evolution across 300 Million Years of Insect Evolution. Genes. 2020; 11: 11.

10. Gebhardt F, Zänker KS, Brandt B: Modulation of epidermal growth factor receptor gene transcription by a polymorphic dinucleotide repeat in intron 1. J. Biol. Chem. 1999; 274: 13176–13180.

11. Contente A, Dittmer A, Koch MC, et al.: A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat. Genet. 2002; 30: 315–320.

12. Malhotra N, Seshasayee ASN: Replication-dependent organization constrains positioning of long DNA repeats in bacterial genomes. Genome Biol. Evol. 2022; 14: evac102.

13. Khaidakov M, Siegel ER, Shmookler Reis RJ: Direct repeats in mitochondrial DNA and mammalian lifespan. Mech. Ageing Dev. 2006; 127: 808–812.

14. Bhargava R, Onyango DO, Stark JM: Regulation of Single-Strand Annealing and its Role in Genome Maintenance. Trends Genet. 2016; 32: 566–575.

15. Nishimura D: RepeatMasker. Biotech Software & Internet Report. 2000; 1: 36–39.

16. Xiong W, He L, Lai J, et al.: HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proc. Natl. Acad. Sci. 2014; 111: 10263–10268.

17. Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005; 21 Suppl 1: i152–i158.

18. Flynn JM, Hubley R, Goubert C, et al.: RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA. 2020; 117: 9451–9457.

19. Kalendar R, Kairov U: Genome-wide tool for sensitive de novo identification and visualisation of interspersed and tandem repeats. Bioinform Biol Insights. 2024; 18: 11779322241306391.

20. Eddelbuettel D, Balamuta JJ: Extending R with C++: A brief introduction to Rcpp. Am. Stat. 2018; 72: 28–36.

21. Wickham H, Chang W, Wickham MH: Package “ggplot2.” Create elegant data visualisations using the grammar of graphics. Version. 2016; 2: 1–189.

22. Matthews BJ, Dudchenko O, Kingan SB, et al.: Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature. 2018; 563: 501–507.

23. Horvath R, Minadakis N, Bourgeois Y, et al.: The evolution of transposable elements in Brachypodium distachyon is governed by purifying selection, while neutral and adaptive processes play a minor role. eLife. 2024; 12: 12.

24. Rouxel T, Grandaubert J, Hane JK, et al.: Effector diversification within compartments of the Leptosphaeria maculans genome affected by Repeat-Induced Point mutations. Nat. Commun. 2011; 2: 202.

25. Press MO, McCoy RC, Hall AN, et al.: Massive variation of short tandem repeats with functional consequences across strains of Arabidopsis thaliana. Genome Res. 2018; 28: 1169–1178.

26. Copeland M, Blackmon H: coleoguy/DRproj: Initial release. Zenodo. 2025.

27. Copeland M, Blackmon H: coleoguy/DirectRepeateR: DirectRepeateR - Initial Release. Zenodo. 2025.

