Improving the support for XML dynamic updates using a hybridization labeling scheme (ORD-GAP)

Background: Because the eXtensible Markup Language (XML) is the standard for exchanging data over the World Wide Web, an XML database must support not only efficient query processing but also the frequent update operations caused by dynamic changes to Web content. Most existing XML annotation approaches rely on a labeling scheme to identify the hierarchical position of each XML node. This computation is costly because any update can cause the whole XML tree to be re-labelled, and the impact is clearly observable on large datasets. Therefore, a robust labeling scheme that avoids re-labeling is crucial. Method: Here, we present ORD-GAP (named after Order Gap), a robust and persistent XML labeling scheme that supports dynamic updates. ORD-GAP assigns unique identifiers with gaps in between XML nodes, from which the level and the Parent-Child (P-C), Ancestor-Descendant (A-D) and sibling relationships can easily be identified. ORD-GAP adopts the ORDPath labeling scheme for any future insertion. Results: We demonstrate that ORD-GAP is robust enough for dynamic updates, and have implemented it in three use cases: (i) left-most, (ii) in-between and (iii) right-most insertion. Experimental evaluations on the DBLP dataset demonstrated that ORD-GAP outperformed existing approaches such as ORDPath and ME Labeling with respect to database storage size, data loading time and query retrieval. On average, ORD-GAP has the best storing and query retrieval time. Conclusion: The main contributions of this paper are: (i) a robust labeling scheme named ORD-GAP that assigns a certain gap between nodes to support future insertion, and (ii) an efficient mapping scheme, built upon the ORD-GAP labeling scheme, that transforms XML into RDB effectively.


Introduction
Extensible Markup Language (XML) was introduced in the 1990s by the World Wide Web Consortium (W3C) as the standard for information exchange because it is self-descriptive. Like Hypertext Markup Language (HTML), XML uses a tag-based syntax, yet it can represent data within its context and is readable by both machines and humans because it uses natural language. 1,2 Since the emergence of XML, many approaches to map XML into a Relational DataBase (RDB) have been proposed. 3,4 The Dynamic Prefix-based Labeling Scheme (DPLS) 5 extended the Dewey scheme 6,7 and is based on a two-stage approach: (i) constructing the initial DPLS labeling and (ii) handling any updates. Alsubai and North 8 proposed a Child Prime Label (CPL) based on prime numbers over the XML tree. The trees are traversed depth-first and annotated with labels of the form (start, end, level, CPL). Khanjari and Gaeini 9 proposed the FibLSS encoding scheme, which uses binary bit values (0 and 1) to assign node labels. The authors evaluated their approach experimentally against Improved Binary String Labeling (IBSL), 10 and the results indicated that FibLSS can support insertion without the need for relabeling.
More recently, Taktek and Thakker 11 introduced the Pentagonal Scheme, a dynamic XML labeling scheme. Their algorithms support dynamic updates without redundant labels or relabeling. Their evaluations showed that the Pentagonal Scheme can handle several insertions while achieving a better initial labeling time than some existing schemes, especially on large XML datasets. Azzedin et al. 12 proposed the RLP-Scheme, which enriches Dewey labeling 6 with more information. With the RLP-Scheme, an ancestor node can be computed easily, and the storage space and central processing unit (CPU) time can be minimised for XML with many identical sub-trees.
In the literature, most existing approaches support only static query processing by assuming that the structural information will not change over time. 13 This assumption is impractical because the data exchanged over the Web is subject to very frequent updates. For this reason, we propose a mapping scheme called ORD-GAP that supports updates dynamically. Update and delete operations are simple because they do not change the existing labeling; thus, the focus of this paper is on the insert operation, since an insertion generates new labels or modifies existing ones. Figure 1 depicts the architecture of our proposed approach, which consists of three main components, namely the XML Parser, the XML Encoder and the XML Mapper. The XML document is the input, while the output is stored in the RDB. The XML Parser validates the XML to ensure it is well-formed before any processing takes place. The XML Encoder annotates the XML tree via a labeling scheme so that the structural relationships among the XML nodes can be identified easily even after transformation into another underlying storage. Subsequently, the XML Mapper maps or transforms the annotated XML tree into RDB storage. Queries are then issued via Structured Query Language (SQL).

Tree annotation
Tree annotation in the proposed method includes both labeling and mapping schemes that work together to transform the XML tree into RDB storage. The approach adopts the node indexing of range labeling and prefix-based labeling as the initial annotation. Subsequently, we adopt the ORDPath 14 labeling scheme for any dynamic update operations. Hence, the proposed approach is named ORD-GAP.

Methods
This labeling is in the format (s-e)l, where s denotes the start of the range, e denotes the end of the range, and l expresses the level of the node. The values of s and e are generated based on the gap g, which is calculated with the formula g = Σ (max fan-out + max depth). Figure 2 illustrates a snippet of the SIGMOD Record dataset 15 labelled with the ORD-GAP scheme. This dataset is commonly used for benchmarking purposes. It was chosen because it contains various fan-outs (the number of children each node has) and many levels, which better demonstrates how our proposed approach works. First, we need to find the value of g, for which we need the max fan-out and the max depth. From the dataset, we observed that the maximum fan-out and the maximum level are 4 and 6, respectively. As such, the gap value calculated by our algorithm (see Figure 3(a)) is 10. The root always starts with s = 1. The value of each following node is obtained from the gap value and the previous node's value. In this case, since the gap is 10 and the value of the previous node (the root) is 1, the node "issue" is assigned 11, followed by the node "author" with 21 for s. The e value of a node is assigned once s has reached a leaf node. In this case, if the s label is 31 and the node is a leaf, the e label is assigned 41 (by adding the gap value to the s value, that is, 31 + 10), followed by the node "issue" with 51 as its e. Figure 3(a) shows the calculation of g, which is formulated as Σ (max fan-out + max depth) of the tree, while Figure 3(b) shows the algorithm that assigns a label. In the function GetGap, the parent node and the next level of the current node are the inputs used to obtain g. The max fan-out is the maximum number of children of a node, while the max depth is the deepest level of the tree.
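The gap calculation and the (s-e)l assignment described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the algorithm of Figure 3: the `Node` class and all function names are hypothetical, the root is assumed to be at level 1, and a single counter is assumed to advance by g after every start and every end label.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical in-memory tree node; not a structure taken from the paper.
    name: str
    children: List["Node"] = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def max_fan_out(node: Node) -> int:
    """Largest number of children of any node in the tree."""
    return max([len(node.children)] + [max_fan_out(c) for c in node.children])


def max_depth(node: Node, level: int = 1) -> int:
    """Deepest level of the tree (assumption: the root is level 1)."""
    return max([level] + [max_depth(c, level + 1) for c in node.children])


def get_gap(root: Node) -> int:
    """g = max fan-out + max depth, as stated in the text."""
    return max_fan_out(root) + max_depth(root)


def assign_labels(root: Node, gap: int) -> None:
    """Depth-first labeling: s on entry, e on exit, counter stepping by gap."""
    counter = 1  # the root always starts with s = 1

    def visit(node: Node, level: int) -> None:
        nonlocal counter
        node.start, node.level = counter, level
        counter += gap
        for child in node.children:
            visit(child, level + 1)
        node.end = counter  # assigned once the subtree below has been labelled
        counter += gap

    visit(root, 1)


# Small illustrative tree: max fan-out 2, max depth 3, so g = 5.
tree = Node("root", [Node("a", [Node("x"), Node("y")]), Node("b")])
assign_labels(tree, get_gap(tree))
```

In this sketch the leaf "x" receives (11-16)3 and its parent "a" receives (6-31)2, so every child range lies strictly inside its parent's range and unused values remain between consecutive labels.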

Structural relationship determination
The mapping scheme of ORD-GAP uses two tables to map the XML data into the RDB: an internal table and a text table. The internal table, called iTable, stores the nodes that do not contain a text value. The text table, called tTable, stores the leaf nodes. Both tables consist of the attributes Start, End, Level, PStart and Value: Start keeps the s value of a node, End keeps the e value of a node, and Level keeps the depth of a node from the root. Tables 1 and 2 show partial views of the iTable and tTable that result from the labeling scheme (see Figure 2).
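To make the two-table mapping concrete, the following sketch creates both tables in an in-memory SQLite database and retrieves the children of a node through PStart. It is an illustration only: the column types, the use of SQLite, the helper names `build_store` and `children_of`, the sample rows, and the choice to keep the element name in the Value column of iTable are all assumptions that the text does not specify.

```python
import sqlite3

# Attribute names follow the text; the types are assumed.
# The names are quoted because END is an SQL keyword.
COLUMNS = (
    '"Start" INTEGER PRIMARY KEY, "End" INTEGER, "Level" INTEGER, '
    '"PStart" INTEGER, "Value" TEXT'
)


def build_store(internal_rows, text_rows):
    """Create iTable and tTable and load (Start, End, Level, PStart, Value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE iTable ({COLUMNS})")
    conn.execute(f"CREATE TABLE tTable ({COLUMNS})")
    conn.executemany("INSERT INTO iTable VALUES (?, ?, ?, ?, ?)", internal_rows)
    conn.executemany("INSERT INTO tTable VALUES (?, ?, ?, ?, ?)", text_rows)
    return conn


def children_of(conn, parent_start):
    """Return (Start, Value) of every node whose PStart equals parent_start."""
    sql = (
        'SELECT "Start", "Value" FROM iTable WHERE "PStart" = ? '
        "UNION ALL "
        'SELECT "Start", "Value" FROM tTable WHERE "PStart" = ? '
        "ORDER BY 1"
    )
    return conn.execute(sql, (parent_start, parent_start)).fetchall()


# Made-up sample rows, not data from Tables 1 and 2.
internal = [(1, 46, 1, None, "root"), (6, 31, 2, 1, "a")]
text = [(11, 16, 3, 6, "x"), (21, 26, 3, 6, "y"), (36, 41, 2, 1, "b")]
store = build_store(internal, text)
print(children_of(store, 6))  # [(11, 'x'), (21, 'y')]
```

Storing PStart in every row lets a child lookup run as an equality filter on one column instead of a range comparison over Start and End.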
For the P-C relationship, a node P is the parent of a node C if the following conditions hold:
• (P(s) < C(s) < P(e)) and (C(level) - P(level) = 1)
• PStart of C == Start of P (mapping scheme)
The level difference must equal one because the parent is exactly one level above the child. The second condition requires the PStart value of the child to equal the Start value of the parent.
Lastly, nodes that have the same PStart value in the table are siblings.
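The P-C and sibling conditions above translate directly into predicates over the stored attributes. The sketch below is a minimal illustration; the `Label` tuple and the function names are hypothetical, and the sample labels are made up rather than taken from Tables 1 and 2.

```python
from typing import NamedTuple, Optional


class Label(NamedTuple):
    # Mirrors the Start, End, Level and PStart attributes of the tables.
    start: int
    end: int
    level: int
    pstart: Optional[int]


def is_parent_child(p: Label, c: Label) -> bool:
    """P(s) < C(s) < P(e), level difference of one, and C.PStart == P.Start."""
    return (
        p.start < c.start < p.end
        and c.level - p.level == 1
        and c.pstart == p.start
    )


def are_siblings(a: Label, b: Label) -> bool:
    """Two distinct nodes are siblings when they share the same PStart."""
    return a != b and a.pstart is not None and a.pstart == b.pstart


root = Label(1, 46, 1, None)
a = Label(6, 31, 2, 1)
x = Label(11, 16, 3, 6)
y = Label(21, 26, 3, 6)

print(is_parent_child(root, a))  # True
print(is_parent_child(root, x))  # False: x is two levels below the root
print(are_siblings(x, y))        # True
```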

Results
The dynamic update of ORD-GAP was adapted from ORDPath. 14 ORDPath encodes the P-C relationship by extending the parent's ORDPath label with a component for the child. In ORDPath, even numbers are reserved for further node insertions. Generally, this approach works well because all four relationships can be determined easily. However, we observed that the label size grows uncontrollably as the tree grows, so the scheme may not scale to a huge dataset. We also observed that the volume of dynamic insertions is small compared to the initial tree labeling. This motivated us to use ORDPath labeling to support insertion updates, while keeping ORD-GAP as the initial tree labeling.
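The role of the reserved even numbers can be illustrated with a simplified ORDPath-style insertion routine. The sketch below only shows the general ORDPath idea: initial siblings carry odd ordinals, and an even ordinal followed by a new odd component is used when no odd ordinal is free between two neighbours. It is not the ORD-GAP update algorithm. The function name is hypothetical, labels are modelled as tuples of integers, and the neighbours are assumed to be plain odd ordinals under the same parent.

```python
from typing import Optional, Tuple


def ordpath_insert(
    parent: Tuple[int, ...],
    left: Optional[int] = None,
    right: Optional[int] = None,
) -> Tuple[int, ...]:
    """Label for a new child of `parent`.

    `left` and `right` are the last ordinals of the neighbouring siblings
    (assumed odd, with left < right), or None when no neighbour exists.
    """
    if left is None and right is None:
        return parent + (1,)          # first child
    if left is None:
        return parent + (right - 2,)  # left-most: next lower odd ordinal
    if right is None:
        return parent + (left + 2,)   # right-most: next higher odd ordinal
    if right - left > 2:
        return parent + (left + 2,)   # a free odd ordinal exists in between
    # Neighbours are consecutive odds: use the reserved even ordinal
    # between them and append a new odd component.
    return parent + (left + 1, 1)


print(ordpath_insert((1,), 3, 5))     # (1, 4, 1), i.e. 1.4.1
print(ordpath_insert((1,), None, 1))  # (1, -1),   i.e. 1.-1
print(ordpath_insert((1,), 5, None))  # (1, 7),    i.e. 1.7
```

None of the three cases changes an existing label, which is the property that makes this style of labeling suitable for insert-only updates.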

Insertion scenario with ORD-GAP
Insertion covers three cases: left-most, right-most and in-between insertion. Each insertion includes an additional node, known as the medium node, which represents the insertion made by the dynamic update. Thus, this method allows an unlimited number of insertions into the XML tree while avoiding node relabeling. The resulting internal and leaf nodes are mapped into the iTable (internal table) and tTable (leaf nodes), as depicted in Tables 3 and 4, respectively.
We implemented ORD-GAP using Java Development Kit (JDK) 8.0.510.16 with NetBeans IDE 8.0.2. Experimental evaluations were conducted to measure the performance of ORD-GAP against the ORDPath 14 and ME Labeling 16 approaches. These two existing approaches were chosen for comparison because neither requires node re-labeling.
In the first part of the evaluation, the XML document is transformed and stored in RDB storage. The data insertion time and the database storage size are recorded for all three approaches. After storing is completed, we perform query retrieval to measure the performance of ORD-GAP, ORDPath and ME Labeling. Lastly, our proposed approach, ORD-GAP, is evaluated on dynamic update operations. All experiments were performed on an i7-3770 processor at 3.4 GHz with 16 GB of RAM running Windows 7. In the subsequent evaluations, we used the DBLP dataset 17 to demonstrate that larger datasets can be supported.

Data storing evaluation time
In this evaluation, the insertion time was recorded four times. We discarded the first reading to exclude the buffering effect and keep the execution times consistent. The reported results are the average of the remaining three consecutive readings. Table 5 shows the insertion time of ORD-GAP, ORDPath 14 and ME Labeling. 16 ORD-GAP is the fastest, followed by ME Labeling and ORDPath.
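The measurement protocol (four runs, the first discarded, the remaining three averaged) can be expressed as a short helper. This is a generic sketch in Python rather than the Java harness used in the experiments; the function names and the `task` callable are hypothetical.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def timed_runs(task: Callable[[], None], runs: int = 4) -> List[float]:
    """Execute `task` `runs` times and return each elapsed time in seconds."""
    readings = []
    for _ in range(runs):
        began = time.perf_counter()
        task()
        readings.append(time.perf_counter() - began)
    return readings


def average_after_warmup(readings: Sequence[float], warmup: int = 1) -> float:
    """Drop the first `warmup` readings and average the rest."""
    if len(readings) <= warmup:
        raise ValueError("need more readings than warm-up runs")
    return mean(readings[warmup:])


# Made-up readings: the first run is slow because of buffering.
print(average_after_warmup([10.0, 4.0, 5.0, 6.0]))  # 5.0
```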

Storage space evaluation
Database storage consumption was evaluated to determine the storage space required by the ORD-GAP, ORDPath and ME Labeling approaches. Our experimental observations (see Table 6) show that ME Labeling requires more storage space than ORD-GAP and ORDPath because its labels grow larger as the depth of the XML tree increases.
As depicted, ORD-GAP reserves a gap between nodes, which delays the initial node labelling because ORD-GAP requires some calculation to obtain the initial node labels. ORDPath, in contrast, uses dot-separated components encoded byte by byte, so each node label is derived from the parent's label down the depth of the XML tree. ME Labeling uses multiplication, which increases the label size, and the multiplication requires more computation time as the label size increases. Thus, both ORDPath and ME Labeling take less time for node labeling.
Table 7 displays the query nodes in tree representation and XPath notation for each query. Figure 5 shows the query execution performance of the various approaches. ORD-GAP leads, followed by ME Labeling and ORDPath. ORDPath requires more time than ORD-GAP and ME Labeling because of the number of elements per node in DBLP. Although the DBLP tree contains only three levels, each node has many siblings, so the data model grows horizontally. ORDPath is a prefix-based labeling scheme that traverses the tree breadth-first. Consequently, ORDPath did not perform well: as the number of siblings increases, the label size increases, and more time is required to retrieve data from the database.

Conclusion
In this paper, we propose a labeling scheme named ORD-GAP that enables dynamic insertion by adopting ORDPath techniques. ORDPath allows unrestricted insertion into large XML trees. We carried out evaluations to compare ORD-GAP with ORDPath and ME Labeling. The performance of ORD-GAP was evaluated based on database size, insertion time, query retrieval and dynamic updates. We showed that ORD-GAP performs better than ORDPath and ME Labeling. However, we were not able to test ORD-GAP on datasets larger than 1.2 GB due to hardware limitations such as the processor and the available RAM. In our future work, we will look into XML compression and optimization to further reduce the label size.