Seqpare: a novel metric of similarity between genomic interval sets

Searching genomic interval sets produced by sequencing methods has been widely and routinely performed; however, existing metrics for quantifying similarities among interval sets are inconsistent. Here we introduce Seqpare, a self-consistent and effective metric of similarity and tool for comparing sequences based on their interval sets. With this metric, the similarity of two interval sets is quantified by a single index, the ratio of their effective overlap over the union: an index of zero indicates unrelated interval sets, and an index of one means that the interval sets are identical. Analysis and tests confirm the effectiveness and self-consistency of the Seqpare metric.


Introduction
Functional genomic data are often summarized as interval sets and deposited in public repositories (e.g., UCSC, ENCODE, Roadmap, GEO, SRA etc.).Identifying relationships among sequences and searching through widely available sequence data are routine tasks in genomic research.A fundamental operation in genomic/epigenomic analysis is comparing two interval sets, and many algorithms and tools have been developed for this purpose (Alekseyenko & Lee, 2007;Cormen et al., 2001;Feng et al., 2019;Giardine et al., 2005;Jalili et al., 2019;Kent et al., 2002;Li, 2011;Neph et al., 2012;Quinlan & Hall, 2010;Richardson, 2006).These methods are based on computing the total number of intersections (overlaps) between the two interval sets.To compare a query interval set with multiple interval sets in a genomic sequence database, searching tools LOLA (Sheffield & Bock, 2016) and GIGGLE (Layer et al., 2018) calculate two values -Fisher's exact p-value and the odds-ratio based on the total number of intersections -and use them as the similarity score to rank the search results.These similarity metrics have proven useful for determining relationships among interval sets, but also have some flaws.First, calculating the Fisher's exact test results requires building a contingency table, but determining its values is not straightforward.The p-value and odds-ratio for two intervals sets (with number of intervals N 1 and N 2 ) are calculated from four numbers, namely, the number of intersections between the two sets, n, the number of intervals in set 1 that do not overlap an interval in set 2, N 1n, the number of intervals in set 2 that do not overlap an interval in set 1, N 2n, and the number of intervals that are not present in either set, m.Determining the value of the fourth number m is not straightforward; in LOLA, it depends on the definition of a "universe set" that is not objectively defined, whereas GIG-GLE estimates m from the two interval sets.Second, the total number of overlaps n does not necessarily reflect similarity since intervals can have very different lengths (often in the range of 1 to 10 5 base pairs) and two very different intervals may intersect by only a few base pairs.This can result in inconsistency of the metrics: a comparison between two identical interval sets may have a larger p-value or smaller odds-ratio than a comparison between two different interval sets (see example cases and analysis in the next section).More strikingly, since one interval may contain or cover other intervals in an interval set, depending on how the overlaps are computed, n can be larger than N 1 and/or N 2 , i.e., N 1 -n and/or N 2 -n can be negative, which leads to both the p-value and odds-ratio being undefined-another potential source of inconsistency.Third, the Fisher's exactbased metrics require two values (p-value and odds-ratio) but neither is a direct measurement of the similarity: p-values are sensitive to the total number of regions and can range as low as 10 -200 for large genomic interval sets, and odds-ratios are sensitive to small numbers; and neither metric directly informs on how similar the two sets are.Last, the p-value calculation is computationally expensive for genomic interval sets, particularly when the number of intervals is large (up to 10 9 ).To overcome these weaknesses of the Fisher's exact-based metrics, we developed Seqpare, a self-consistent metric for quantifying the similarity among genomic interval sets.

Seqpare metric
The Seqpare metric uses a single index to quantify the degree of similarity S of two interval sets with number of intervals N 1 and N 2 .Similar to the Jaccard index, the Seqpare metric is directly defined as the ratio of the total effective overlap O of the two interval sets over the union N 1 +N 2 -O: For two intervals v 1 in set 1 and v 2 in set 2, the similarity s is defined as: where o is the length of the intersection and l 1 and l 2 are the lengths of v 1 and v 2 respectively.Definition 2 is the Jaccard index for individual intervals: o represents the effective overlap of the two intervals and s takes values in the range of [0, 1]: s = 0 indicates that there is no overlap between the two intervals, and s = 1 means that the union equals the overlap so v 1 and v 2 are identical.Then the total effective overlap O for the two interval sets can be calculated by adding up the similarities of all mutual best matching (MBM) pairs: A MBM pair is defined as a pair of intervals v 1 and v 2 that fulfill the following conditions: among all intervals in set 2 that intersect v 1 , v 2 matches v 1 the best, i.e., the similarity s between v 1 and v 2 is the highest among those intersections; and among all intervals in set 1 that intersect v 2 , v 1 matches v 2 the best.
Clearly, if two intervals only intersect each other, then they form an MBM pair.In Figure 1a, the two long intervals (the 1st in set 1 and the 1st in set 2) only intersect each other (intersection pair ip 1,1 ), so they form an MBM pair; similarly, the two short intervals ip 2,2 form another MBM pair.For intervals that involve multiple intersections, we define a relatively simple and strict rule to find the MBM pairs: find and choose the first MBM pair that has the highest s among all involved intersection pairs, then find and choose the next MBM pair that has the highest s from the rest of the intersection pairs (excluding all pairs that involve the intervals that are already chosen), and so on until there are no more intersection pairs.In Figure 1b, there are 3 intersection pairs: ip 1,2 with s=1/5, ip 1,3 with s=1/9, and ip 2,3 with s=1/5.So the first MBM pair is either ip 1,2 or ip 2,3 depending on which one is found first.If ip 1,2 is chosen as the first MBM pair, then ip 1,3 will not be considered since interval 1 in set 1 is already chosen, and then we have only ip 2,3 left, which is the second MBM pair.We get the same result if ip 2,3 is chosen as the first MBM pair.In Figure 1d, there are 6 intersection pairs and two MBM pairs: ip 1,1 and ip 3,2 both with s=1, where interval 2 in set 1 (i 2,1 ) does not have a match.Note that interval i 2,1 matches best with interval i 1,2 , but i 1,2 does not match best with i 2,1 , so they are not the mutual best matching pair.
Since the number of total matching pairs ≤ Min(N 1 , N 2 ) -the minimum of N 1 and N 2 -and s is in range of [0, 1], we obtain O ≤ Min(N 1 , N 2 ), and S takes a value in the range of [0, 1].If S is zero, then there is no matching pair, and vice versa; if S = 1, then N 1 = N 2 = O (the two sets are equivalent), and vice versa.And, because each s is the mutual best match, O is symmetric (the amount of overlap between set 1 and set 2 is the same as that between set 2 to set 1) and so is S.
In Figure 1a, O a =2 and S a =1, which is correct because the two sets are identical.In Figure 1b, O b =2/5 and S b =1/14, which is expected since the two sets are very different.The Fisher's exact approach is inconsistent here: the p-value in 1b is smaller than that in 1a although the two sets in 1b are very different while those in 1a are equivalent.Assuming that the number of intervals N in the 'universe set' is 100, then Fisher's exact test contingency table is [(2, 0), (0, 98)] in 1a and [(3, 0), (0, 97)] in 1b, which gives p a = 2.02×10 -4 and p b = 6.18×10 - 6 respectively.The odds-ratio is ∞ in both cases.In Figure 1c and Figure 1d, N 1n and N 2n are all negative, so it is not conceptually appropriate to use Fisher's exact test to calculate the p-value and odds-ratio.

Implementation
The implementation of the Seqpare metric is simple.The searching for MBM pairs is deterministic and it can be implemented by directly following the description in the above section.
The Seqpare code is built on top of the AIList v0.0.1 (Feng et al., 2019) software written in C.

Operation
The Seqpare software (Feng & Feng, 2020) was tested on Linux machines and the minimum required memory is 8GB.The interval set file should be in the format of bed or bed.gz.

A test with real genomic interval sets
To test Seqpare and compare it with the Fisher's exact-based metrics, we took 100 interval sets from a UCSC database and used one interval set, affyGnf1h, as a query to search over the database.Because the database contains the query set, affyGnf1h should have the highest similarity score.Table 1 (Feng & Feng, 2020) shows part of the result.Interval set affyGnf1h indeed ranks first with maximum similarity 1 when using Seqpare, but it ranks 94 th out of 100 when using the p-value and ranks last when using the odds-ratio.This happens because N 1 -n and N 2 -n are both negative (n=16686, N 1 =N 2 =12158).Given this inconsistency, GIGGLE sets negative N 1 -n and N 2 -n to zero to calculate the p-value, and to one to calculate the oddsratio.The Seqpare indices for other interval sets are all small (<0.03) because the average effective overlap of an intersection pair in those sets is about 0.1 or less, i.e., they are very different from the query set affyGnf1h; however, all of the p-values are so small (e -200 ), which suggests that the p-value is not a meaningful similarity index for these genomic interval sets.This search takes 6m30s for Seqpare and 15m32s for GIGGLE.All computations were carried out on a computer with a 2.8GHz CPU, 16GB memory, and an external SSD hard disk.The complete results can be found at the same site as the software.

Conclusion
We have shown that the Fisher's exact test approach may be not the most appropriate test statistic for comparing similarity among interval sets.While the approach has been shown to be successful for many questions, we have demonstrated how it can break down for a variety of reasons, such as very similar interval sets, within-set containment, widely varying interval lengths among sets, or small effective overlaps.In contrast, Seqpare is a self-consistent metric for quantifying the similarity of two interval sets that addresses these concerns.Seqpare is the first rigorously defined metric for comparing two sequences based on their interval sets.In addition to the metric itself, our Seqpare software tool provides functions for both searching  and mapping large-scale interval datasets.We anticipate that this approach will contribute to novel results for interval set searching.
I wonder what is the time complexity of MBM calculation?Does this metric work when there are self-contained intervals (overlapping intervals) in each interval set?Why you choose this metric instead of simply using jaccard index?
In the text, "6 ips" is written.It would be better to explain "ips" before its usage.
Is the rationale for developing the new software tool clearly explained?Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?Yes Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
similarities of these intersection pairs and then finds the MBM pairs one by one.The time complexity of MBM depends on the cases.If all intersection pairs are simple (Figure 1a), then the complexity is O(m); in the worst case (Figure 1c or Figure 1d), all intersection pairs are correlated and the complexity is O(m2).For most real genomic interval sets, the time complexity is close to O(m).

2.
Seqpare is a general metric of similarity and it works when there are self-contained intervals.Figure 1c and 1d are two cases of self-contained interval sets, where seqpare works well but the p-value based metric breaks down.Also, most of the UCSC interval sets used in this paper contain large number of overlapping intervals.
3. Jaccard index is defined as the ratio of Intersection over Union and it takes value in the range of [0, 1], i.e., the Intersection is never larger than the Union and the Union is never zero or negative.Jaccard index can be strictly used for two individual intervals, as explained in the Metric section of the paper.However, it cannot be used directly as similarity metric for two interval sets.
For two interval sets, the simple Jaccard metric would be J = N/(N1+N2-N), where N1 and N2 are the number of intervals in the two interval sets, and N is the number of intersections between them.However, the 'Intersection' N can be larger than the 'Union' (N1+N2-N), and (N1+N2-N) can also be zero or negative: in Figure 1c, N=4, N1=2, N2=2, and (N1+N2-N) is zero; in Figure 1d, N=6, N1=2, N2=3, (N1+N2-N) is -1.Therefore, just like the p-value, simple Jaccard index is not a rigorous similarity metric for interval sets.This also suggests that using p-value or simple Jaccard index to compare interval sets can be misleading and defining a rigorous similarity metric for interval sets is challenging.
provide a better explanation of this term and why this is important for such a problem.
Genomic intervals, in my opinion, should come with genomic coordinates, to indicate where does it come from in terms of the reference genome.Of course, this will not be the case if the data does not come from a model organism.I am surprised that such information is not used in their method.I think it is a disadvantage if this is not used.Since the problem can be much easier if order is being considered.I am interested to know what do the authors think about this.

2.
Figure 1C is not mentioned in the paper.Not sure why. 3.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound?Yes Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?Partly with the paper.Response: Thank you for your helpful comments and suggestions.We have revised the paper according to the specific comments, and we have also added a subset of the test data and instructions to the software page to ease the verification of the result.
Comment: I never get what exactly does "self-consistent" really mean.I do not think this is the same as the consistent used in statistics which is an asymptotic property.I think the authors should provide a better explanation of this term and why this is important for such a problem.Response: The word "self-consistent" is used here with its literal meaning of "not having parts or aspects which are in conflict or contradiction with each other" (google dictionary).Being selfconsistent is necessary for a definition, concept, metric or theory.We think it's a useful word and we have explained it in the Introduction.Nevertheless, we did remove it from the title since it may cause confusion without context.
Comment: Genomic intervals, in my opinion, should come with genomic coordinates, to indicate where does it come from in terms of the reference genome.Of course, this will not be the case if the data does not come from a model organism.I am surprised that such information is not used in their method.I think it is a disadvantage if this is not used.Since the problem can be much easier if order is being considered.I am interested to know what do the authors think about this.Response: Genomic intervals in the standard BED file format contain essential genomic information such as Chromosome, start coordinate, end coordinate and strand, etc.Like all other tools mentioned in the paper, Seqpare uses BED file format.Seqpare does not require the intervals to be sorted, which is an advantage over tools that do require intervals to be sorted.
Comment: Figure 1C is not mentioned in the paper.Not sure why.Response: Figure 1 shows four cases and Figure 1c is mentioned as case c in the last paragraph of the Methods section "Seqpare metric".We replaced 'case c' with 'Figure 1c'.

The benefits of publishing with F1000Research:
Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com

Figure 1 .
Figure 1.Example cases for illustrating the Seqpare similarity metric.The length ratio of the short interval to the longer intervals are 1: 5 in a), 1: 3: 5 in b), and 2: 3 :4 in d).The total number of overlaps n is 2 for a), 3 for b) (interval 1 in Set 1 intersects two intervals in Set 2), 4 for c) and 6 for d).The p-value in case b is smaller than that in case a. N 1 -n and N 2 -n are both negative in cases c and d.