target: an R package to predict combined function of transcription factors [version 1; peer review: 2 approved with reservations]

Researchers use ChIP binding data to identify potential transcription factor binding sites. Similarly, they use gene expression data from sequencing or microarrays to quantify the effect of the factor overexpression or knockdown on its targets. Therefore, the integration of the binding and expression data can be used to improve the understanding of a transcription factor function. Here, we implemented the binding and expression target analysis (BETA) in an R/Bioconductor package. This algorithm ranks the targets based on the distances of their assigned peaks from the factor ChIP experiment and the signed statistics from gene expression profiling with factor perturbation. We further extend BETA to integrate two sets of data from two factors to predict their targets and their combined functions. In this article, we briefly describe the workings of the algorithm and provide a workflow with a real dataset for using it. The gene targets and the aggregate functions of transcription factors YY1 and YY2 in HeLa cells were identified. Using the same datasets, we identified the shared targets of the two factors, which were found to be, on average, more cooperatively regulated. and functions of (TF). factor the gene targets of a Ahmed and Kim developed an R package to implement the binding and expression target analysis (BETA) package and extend the application to cases involving two transcription factors. The package predicts the potential target genes for binding sites from individual TFs or shared binding sites from two factors. There are some major concerns that need to be addressed.


Introduction
The binding of a transcription factor to a genomic region (e.g., gene promoter) can have the effect of inducing or repressing its expression Latchman 1 . The binding sites can be identified using ChIP experiments. High through-put ChIP experiments produce hundreds or thousands of binding sites for most factors Johnson et al. 2 . Therefore, methods to determine which of these sites are true targets and whether they are functional or not are needed Ucar et al. 3 . On the other hand, perturbing the transcription factor by over-expression or knockdown and measuring the gene expression changes provide valuable information on the function of the factor Tran et al. 4 . Methods exist to integrate the binding data and the factor perturbation gene expression to predict the real target regions (e.g., genes) 5,6 . This article presents a workflow for using the target package to integrate binding and expression data to predict the shared targets and the combined function of two transcription factors.
To illustrate the utility of this workflow, we applied it to binding and expression data of the transcription factors YY1 and YY2. We asked whether the two factors cooperate or compete on their shared targets in HeLa cells.

Implementation
We developed an open-source R/Bioconductor package target to implement BETA for predicting direct transcription factor targets from binding and expression data. The details of the algorithm were described here Wang et al. 6 . In addition, our implementation extends BETA to apply for factor combinations (Ahmed et al. 7 ). Briefly, factor potential binding sites are identified by ChIP-sequencing and gene expression under factor perturbation by microarrays or sequencing. The distances between the peaks and the transcription start sites are used to calculate the peak scores. The sum of the scores of the individual peaks in a certain region of interest is the region's regulatory potential. A signed statistics (fold-change or t-statistics) from the differential gene expression of the factor perturbation is used to estimate the factor function. The product of the ranks of the regulatory potential and the signed statistics is the final rank of the regions.
To predict the combined function of two factors, two sets of data are required. The overlapping peaks are the potential binding sites. The product of the two signed statistics is the factor function. When the two factors agree in the direction of the regulation of a region where they both bind, they could be said to cooperate on this region. When the sign is opposite, they could be said to regulate that region competitively.
The package leverages the Bioconductor data structures such as GRanges and DataFrame to provide fast and flexible computation on the data Huber et al. 8 . Similar to the original python implementation, the input data are the identified peaks from the ChIP-Seq experiment and the expression data from RNA-Seq or microarrays perturbation experiment. The final output is the peaks associated with the factor binding and the predicted direct targets. We use the terms peaks to refer to the GRanges object that contains the coordinates of the peaks. We use the term region to refer to a similar object that contains the information on the regions of interest; genes, transcripts, promoter regions, etc. In both cases, any number of additional information on the ranges can be added to the object as metadata.

Operation
The algorithm was implemented in R (>=3.6) and should be able to run on any operating system. Similarly, it is a zinc finger protein with both activation or repression functions on its targets. Using the target analysis, we will attempt to answer the following questions. Do the two transcription factors share the same target genes? What are the consequences of the binding of each factor on its targets? If the two factors share binding sites, what is the function of the binding of the two factors to these sites?
To answer these questions, we use publicly available datasets to model the binding and gene expression under the transcription factors perturbations (Table 1). This dataset was obtained in the form of differential expression between the two conditions from KnockTF. The first dataset is gene expression profiling using microarrays of YY1/YY2 knockdown and control HeLa cells. The binding sites of the factors in HeLa cells were determined using two ChIP-Seq datasets. The ChIP peaks were obtained in the form of bed files from ChIP-Atlas. Finally, we used the USSC hg19 human genome to extract the genomic annotations.
Briefly, we first prepared the three sources of data for the target analysis. Then we predict the specific targets for each individual factor. Third, we predict the combined function of the two factors on the shared target genes. Finally, we show an example of a motif analysis of the competitively and cooperatively regulated targets.

Preparing the binding data
The ChIP peaks were downloaded in the form of separate bed files for each factor. We first locate the files in the data/ directory and load the files using import.bed. Then the data is transformed into a suitable format, GRanges. The resulting object, peaks, is a list of two GRanges items, one for each factor.

Predicting gene targets of individual factors
The standard target analysis includes identifying associated peaks using associated_peaks and direct targets using direct_targets. The inputs for these functions are the objects peaks and regions from the previous steps in addition to the column names for regions regions_col or the region and the statistics column stats_col, which is the fold-change in this case. The resulting objects are GRanges for the identified peaks assigned to the regions ap or the ranked targets. Several columns are added to the metadata objects of the GRanges to save the calculations.
# get associated peaks ap <-map2(peaks, regions, ~associated_peaks(peaks=.x, regions = .y, regions_col = 'tx_id')) # get direct targets dt <-map2(peaks, regions, ~direct_targets(peaks=.x, regions = .y, regions_col = 'tx_id', stats_col = 'fc')) To determine the dominant function of a factor, we divide the targets by the direction of the effect of knock-down of the factor on the expression of the target and show the regulatory potential of the target on these groups. We use the empirical distribution function (ECDF) to show the fraction of targets with a specified regulatory potential or less. Because the ranks rather than the absolute value of the regulatory potential are used, the lower the value, the higher the potential. Then the groups of targets can be compared to each other or a theoretical distribution.
# Table 2 # test individual factor functions map2(dt, groups, ~test_predictions(.x$rank, group = .y, compare = c('Down', 'Up'))) To formally test these observations, we use the Kolmogorov-Smirnov (KS) test. The distributions of the two groups are compared for equality. If one lies on either side of the other, then they must be drawn from different distributions. Here, we compared the up and down-regulated functions for both factors (Table 2). In both cases, the distributions of the two groups were significantly different from one another.
Predicting the shared targets of two factors Using target to predict the shared target genes and the combined function of the two factors is a variation of the previous analysis. First, the shared/common peaks are generated using the overlap of their genomic coordinates, subsetByOverlaps. Second, Instead of one, two columns for the differential expression statistics, one for each factor is needed; these are supplied to the argument stats_col in the same way. Here, common_peaks and both_regions are the main inputs for the analysis functions. The output of associated_peaks is the same as before. direct_targets is the same, but the stat and the stat_rank carry the product of the two statistics provided in the previous step and the rank of that product.
The output can also be visualized the same way. The targets are divided into three groups based on the statistics product. When the two statistics agree in the sign, the product is positive. This means the knockdown of either transcription factor results in the same direction change in the target gene expression. Therefore, the two factors would cooperate if they bind to the same site on that gene. The reverse is true for targets with oppositely signed statistics. On these targets, the two factors would be expected to compete for inducing opposing changes in the expression.
# Figure 3 par(mfrow = c(1, 2)) # plot distiace by score for associated peaks plot(common_ap$distance, common_ap$peak_score, xlab = 'Distance', ylab = 'Peak Score') title('(A)') # make labels, colors and gorups labs <-c('Competitive', 'None', 'Cooperative') cols <-c('green', 'gray', 'red') # make three groups by quantiles common_groups <-cut(common_dt$stat, breaks = 3, labels = labs) # plot predicted function plot_predictions(common_dt$score_rank, group = common_groups, colors = cols, labels = labs, xlab = 'Regulatory Interaction', ylab = 'ECDF') title('(B)') The common peak distances and scores take the same shape ( Figure 3A). The two factors seem to cooperate on more of the common target than any of the two other possibilities ( Figure 3B). This observation can be tested using the KS test. The curve of the cooperative targets lies above that of none and competitively regulated targets (Table 3).  Binding motif analysis Any number of downstream analyses can be performed on the final output. For example, we could apply binding motif analysis to the groups of regulated targets. In this example, all the motif analysis itself is handled by the BCRANK package Ameur et al. 12 . Here, we explain how to prepare the input from the shared peaks and target objects produced in the last step.
First, we extract the transcript IDs of the targets in their respective groups. Then the peaks assigned to these targets are ordered and sliced.
# group peaks by their assigned targets peak_groups <-split(common_dt$tx_id, common_groups) # reorder peaks and get top n peaks peak_groups <-lapply(peak_groups, function(x) { # get peaks in x targets group p <-common_ap[common_ap$assigned_region %in% unique(x)] # order peaks by score p <-p[order(p$peak_score, decreasing = TRUE)] # get n top peaks p[seq_len(ifelse(length(p) > 50, 50, length(p)))] }) The input for bcrank is a fasta file with the sequence of the regions to look for frequent motifs. We used the BSgenome.Hsapiens.UCSC.hg19 to extract the sequences of the common peaks in the competitive and cooperative target groups. The sequences are first written to a temporary file and feed to the search function. The sequences in the search path of the regions of interest are shown in (Figure 4). In the competitively regulated regions, one sequence was more frequent than all other sequences. By contrast, no sequence was uniquely frequent in the regions of cooperative targets. The most frequent motifs in the two groups are shown as seq logos using the seqLogo package ( Figure 5).

Summary
In this article, we present a workflow for predicting the direct targets of a transcription factor by integrating binding and expression data. The target package implements the BETA algorithm ranking gene targets based on the distance of the ChIP peaks of the transcription factor in the genes and the differential expression of the factor perturbation. To predict the combined function of two factors, two sets of data are used to find the shared peaks and the product of their differential expression.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.

Endocrine Regulatory Genomics, Department of Experimental & Health Sciences, Pompeu Fabra University, Barcelona, Spain
In this paper, Ahmed and Kim present the target R package, which implements the BETA algorithm and extends its functionality to predict combined targets and functions of two different transcription factors (TF). By using transcription factor binding data (ChIP-seq) and gene expression data when the TF is perturbed, they are able to predict the gene targets of a single or a pair of TFs. This paper raised some major concerns that need to be addressed by the authors: Regarding the actual package code, the distance calculation doesn't measure the actual distance between peaks and TSS. The code of your find_distance() function subtracts the peak center from the region center to obtain the distance between these two features. This is not the same as the distance from the peak to the TSS, which is what this variable should be measuring according to your text "The scores of individual peaks are decreasing function of the distance from the transcription start sites", your vignette "find_distance: calculate the distance between the peaks and the regions of interest, e.g. transcription start sites (TSS)." and the original method publication "∆ is the exact distance between a binding site and the TSS". An easy way to fix this is to provide different arguments for peak and regions "how", so the user can for example select how_peaks="center" and how_regions="end". In the specific case of the code you show in this paper, those parameters would return the actual distances between peaks and TSS.

1.
Related to the previous point, the parameter downstream in the function promoters() is set to 200 by default, so the width of the regions you are generating in the first chunk of code in page 6 actually have a width of 100,200bp. You should set this argument to 0 to actually obtain 100kb windows upstream of TSS.

2.
YY1 and YY2 might not be the best examples to use for extracting conclusions on gene targets and the combined action of both TFs. Besides YY1 activity as a TF, it can also interact 3.
with chromatin modifiers and direct them to specific regions of the genome 1 . It has also been identified as a structural factor that regulates the formation and DNA loops 2 . Thus, the changes in gene expression observed when perturbing this TF might not all be associated with its activity as a TF, which is the main focus of this package.
The section on the binding motif analysis is quite interesting in terms of what to do after performing the analysis with the R target package. However, I think it would be interesting to develop it a little bit more, maybe associate the sequences present in the different groups of regulated targets to actual transcription factors to see if there is a common regulatory pathway to these targets.

4.
Regarding the general text, and specifically the section "Predicting gene targets of individual factors", I feel that the description of the main package functionality is too technical and not very informative. The authors describe all the arguments that can be provided to the different functions and the object classes that come out of them, but this description is already in the package manual. Instead of talking about the arguments and object classes, I would briefly describe what they do and how they do it, so readers can easily follow the methods without the need to read the original BETA publication or the package vignette.

5.
Related to point 3, the datasets used in this paper are different from the ones used in the vignettes and included within the package and I wasn't able to find it on the GEO site either. I would recommend providing this data either within the package or in the docker image they already created. This would facilitate the reproduction of the results presented in this paper.

6.
The authors keep referring to transcription factors as "factors", which might induce confusion when reading the article. They write "Transcription Factors" (or the abbreviation TF) to differentiate them from the broad and diverse meanings of "factors".

7.
When revising the vignette I noticed that it's missing the steps for preparing the data gene expression data, specifically the set to create windows upstream of TSS. When they load the gene expression object with (data("real_transcripts")) the windows are already present. This is misleading for users that are following the vignette as they might miss this specific step and they will not be able to get the correct results when reproducing it with their own data.

8.
In page 5, the authors mention that the changes in expression resulting from separate KD of YY1 and YY2 are correlated, but they do not provide any statistical test to confirm this. They should at least perform a correlation test and show the p-value to make this affirmation.

9.
There are some sentences in the text that are difficult to understand. The authors should rewrite them to ensure that the readers can follow the text. Some examples are: 10.
Missing citations for "KnockTF" and "ChIP-Atlas" in p.4. The solution referred to by the reviewer is already implemented in the find_distance function as an argument called how which defaults to 'center'. This is a link to the code ( https://github.com/MahShaaban/target/blob/9c6f869d794cfaff63310c5f79cea1e1095e2198/R/function ). I chose this as default since it would be neutral to the peak width. But of course, in a different use case, the user might be interested in the distance from the 'start' or 'end' of the peak.

1.
We revised the text to state the correct downstream and upstream distances as pointed out by the reviewer.

2.
We are aware of these facts. We would like to argue that using these transcription factors is still suitable. In fact, one of the main goals of the package is to distinguish between the direct targets of transcription factors and the ones that change indirectly, hence the reliance on both gene expression and peak binding. However, the definition of target here might be broadened to include cases like the ones pointed to by the reviewer.

3.
The goal of this section, as correctly pointed out by the reviewer, is to give an example of further analysis of the target package output. There is a number of further analysis that could be performed using the found motifs, but it is beyond the purpose of this workflow article to get into.

4.
We revised the section to briefly describe what each function does and how. 5.
We added to the revised version of the manuscript a chunk of code to download the dataset.

6.
We revised the text to use "transcription factor" instead of just "factor", when appropriate.

7.
The data object real_transcripts is made up of the test data provided by BETA. The original files are too large to be included in the R package. The processing is script is part of the package though, `inst/extdata/make-data.R` 8.
We calculated the correlation coefficient for the fold-change of YY1 and 2 and added it to the text.

9.
We revised the text to correct the errors and rephrase difficult sentences. 10.
two TFs. How about the gene targets of the factor-unique binding sites?
The authors set the maximum peak-to-gene TSS distance at 100kb, which is fine for the purpose of demonstration. Practically, the authors may want to provide recommendations or suggestions to the external users, since this is a very critical parameter. Based on the chromatin interaction data and co-accessibility data, peaks can target genes over a much larger distance. Alternatively, provide the option to incorporate topologically associating domains data, which will improve the detection of regulatory interactions.

5.
p6: Because the ranks rather than the absolute value of the regulatory potential are used, the lower the value, the higher the potential Based on the original paper of the BETA package [ Figure 2], genes were ranked based on their regulatory potential score (from high to low), it should be "the lower the rank, the higher the regulatory potential"?, Please check Similarly on P7: The scores of the individual peaks are a decreasing function of the distance from the transcription start sites-the closer the factor binding site from the start site, the lower the score.
Based on the original paper of the BETA package [Table], this should be "the closer the factor binding site to the start site, the higher the score"? Please check. p12: In Summary section: based on the distance of the ChIP peaks of the transcription factor in the genes and the differential expression of the factor perturbation.

7.
based on the distances of the ChIP peaks of the transcription factor relative to the TSSs of the genes two sets of data are used to find the shared peaks and the product of their differential expression two sets of data are used to find the shared peaks and the rank product of their differential expression statistics? Table 2. two-sided [two.sided] Figure 2 legend: Figure 2A: the same color was used to represent both YY1 and YY2 data. Figure 2C should be Figure 2B. Figure 2D should be Figure 2C YY1 and YY2 targets are shown at each regulatory potential value. the x-axis is the rank, not the regulatory potential value itself the same for Figure 3, the x-axis is the rank of regulatory interaction Figure 3 legend: what are the transformed distances of the shared peaks? Need to explain whether it represents % of distance occurances [occurrences] Figure 5 legend: weight matrices, position weight matrices seq logos, sequence logos the letter occure at that position, occurs at that position factors. Therefore we used the distance recommended by the original paper and left the decision to the user to make depending on their case. Users can define their regions of interest in any way they like, for example, using TADs. Here, we used the simplest case of extracting TSSs and including stretches of the up and down streams.
We corrected the sentence referred to above. 6.
We corrected the sentence referred to above and revised the manuscript for typos and grammatical errors.

7.
Competing Interests: No competing interests were disclosed.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com