Research Article

Improving the support for XML dynamic updates using a hybridization labeling scheme (ORD-GAP)

[version 1; peer review: 2 approved]
PUBLISHED 09 Sep 2021

This article is included in the Research Synergy Foundation gateway.

Abstract

Background: As the standard for the exchange of data over the World Wide Web, the eXtensible Markup Language (XML) needs database support that not only processes queries efficiently but also withstands the frequent update operations caused by dynamically changing Web content. Most existing XML annotation approaches rely on a labeling scheme to identify the hierarchical position of each XML node. This computation is costly because an update can cause the whole XML tree to be re-labelled, and the impact is most noticeable on large datasets. Therefore, a robust labeling scheme that avoids re-labeling is crucial.
Method: Here, we present ORD-GAP (named after Order Gap), a robust and persistent XML labeling scheme that supports dynamic updates. ORD-GAP assigns unique identifiers with gaps in between XML nodes, so that the level, Parent-Child (P-C), Ancestor-Descendant (A-D) and sibling relationships can be identified easily. ORD-GAP adopts the ORDPath labeling scheme for any future insertion.
Results: We demonstrate that ORD-GAP is robust enough for dynamic updates, and have implemented it in three use cases: (i) left-most, (ii) in-between and (iii) right-most insertion. Experimental evaluations on the DBLP dataset demonstrated that ORD-GAP outperformed existing approaches such as ORDPath and ME Labeling with respect to database storage size, data loading time and query retrieval. On average, ORD-GAP has the best storing and query retrieval times.
Conclusion: The main contributions of this paper are: (i) a robust labeling scheme named ORD-GAP that assigns a gap between nodes to support future insertion, and (ii) an efficient mapping scheme, built upon the ORD-GAP labeling scheme, that transforms XML into RDB effectively.

Keywords

XML-RDB mapping, mapping scheme, XML databases, dynamic updates, XML labeling scheme.

Introduction

Extensible Markup Language (XML) was introduced in the 1990s by the World Wide Web Consortium (W3C) as a self-descriptive standard for information exchange. Like Hypertext Markup Language (HTML), XML uses a tag-based syntax, yet it can represent data within its context and is readable by both machines and humans because it utilizes natural language.1,2 Since the emergence of XML, many approaches to mapping XML into a Relational DataBase (RDB) have been proposed.3,4

The Dynamic Prefix-based Labeling Scheme (DPLS)5 extended the Dewey scheme6,7 and is based on a two-stage approach: (i) constructing the initial DPLS labeling and (ii) handling any updates. Alsubai and North8 proposed a Child Prime Label (CPL) based on prime numbers assigned over the XML tree. The tree is traversed depth-first and annotated with labels of the form (start, end, level, CPL). Khanjari and Gaeini9 proposed the FibLSS encoding scheme, which uses binary bit values (0 and 1) to assign node labels. The authors evaluated their approach experimentally against Improved Binary String Labeling (IBSL),10 and the results indicated that FibLSS is capable of supporting insertion without the need for relabeling.

More recently, Taktek and Thakker11 introduced the Pentagonal Scheme, a dynamic XML labeling scheme. Their algorithms support dynamic updates without producing redundant labels or requiring relabeling. Their evaluations showed that the Pentagonal Scheme can handle several insertions and achieves a better initial labeling time than some existing schemes, especially on large XML datasets. Azzedin et al.12 proposed the RLP-Scheme, which enriches Dewey labeling6 with more information. With the RLP-Scheme, an ancestor node can be computed easily, and the storage space and central processing unit time can be minimised for XML documents with many identical sub-trees.

In the literature, most of the existing approaches support only static query processing, assuming that the structural information will not change over time.13 This assumption is impractical because the data exchanged over the Web is subject to very frequent updates. For this reason, we propose a mapping scheme called ORD-GAP that can support updates dynamically. Update and delete operations are simple because they do not change the existing labeling. Thus, the focus of this paper is on the insert operation, as an insertion generates new labels or modifies existing ones.

Methods

Figure 1 depicts the architecture of our proposed approach, which consists of three main components: the XML Parser, the XML Encoder, and the XML Mapper. The XML document is the input, and the output is stored in RDB. The XML Parser is responsible for validating the XML to ensure it is well-formed before any processing takes place. The XML Encoder annotates the XML tree via a labeling scheme so that the structural relationships among the XML nodes can be identified easily even after transformation into another underlying storage. Subsequently, the XML Mapper maps (transforms) the annotated XML tree into RDB storage. Finally, queries are issued via Structured Query Language (SQL).


Figure 1. Architecture diagram of the proposed approach.

Tree annotation

Tree annotation in the proposed method includes both the labeling and the mapping schemes, which work together to transform the XML tree into RDB storage. The approach adopts the node indexing of range labeling and prefix-based labeling for the initial annotation, and the ORDPath14 labeling scheme for any dynamic update operations. Hence, the proposed approach is named ORD-GAP.

Each label has the format (s-e)l, where s denotes the start of the range, e denotes the end of the range, and l expresses the level of the node. The values of s and e are generated from the gap g, which is calculated with the formula g = Σ(maxfan-out + maxdepth).

Figure 2 illustrates a snippet of the SIGMOD Record dataset15 labelled with the ORD-GAP scheme. This dataset is commonly used for benchmarking purposes. It was chosen because it contains various fan-outs (the number of children each node has) and many levels, which helps to demonstrate how our proposed approach works. First, we need the value of g, which requires maxfan-out and maxdepth. In this dataset, the maximum fan-out is 4 and the maximum level is 6. As such, the gap value calculated by our algorithm (see Figure 3(a)) is 10. The root always starts with s = 1. The value of each following node is obtained by adding the gap value to the previous node's value. In this case, since the gap is 10 and the value of the previous node (the root) is 1, the node "issue" is assigned s = 11, followed by the node "volume" with s = 21. The e value is assigned once s has reached a leaf node. For instance, if a leaf node has s = 31, its e is 41 (the s value plus the gap value, 31 + 10), and the node "volume" then receives e = 51.


Figure 2. The ORD-GAP labeling scheme.
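The gap calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class and the helper names `max_fan_out`, `max_depth` and `get_gap` are hypothetical, and the sketch assumes that g is the plain sum of the maximum fan-out and the maximum number of levels, which matches the worked value 4 + 6 = 10.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical in-memory tree node used only for this sketch.
    name: str
    children: List["Node"] = field(default_factory=list)


def max_fan_out(node: Node) -> int:
    """Largest number of children held by any single node in the tree."""
    return max([len(node.children)] + [max_fan_out(c) for c in node.children])


def max_depth(node: Node) -> int:
    """Number of levels on the longest root-to-leaf path (a lone root counts as 1)."""
    return 1 + max((max_depth(c) for c in node.children), default=0)


def get_gap(root: Node) -> int:
    # Assumption: g is the sum of the two maxima, as in the worked example.
    return max_fan_out(root) + max_depth(root)


# Toy tree with a maximum fan-out of 4 and 6 levels, mirroring the numbers in the text.
chain = Node("n5")
for name in ("n4", "n3", "n2", "n1"):
    chain = Node(name, [chain])
root = Node("root", [chain, Node("x"), Node("y"), Node("z")])

print(get_gap(root))  # 10
```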

Figure 3 shows the pseudocode for ORD-GAP. Figure 3(a) shows the calculation of g, which is formulated as Σ(maxfan-out + maxdepth) of the tree, while Figure 3(b) shows the algorithm that assigns a label. In Function GetGap, the parent node and the next level of the current node are the inputs used to obtain g. The maxfan-out is the maximum number of children of a node, while maxdepth is the deepest level of the tree.


Figure 3(a). Algorithm for Function GetGap.


Figure 3(b). Algorithm for Function AssignLabel
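As a rough illustration of the label assignment, the following Python sketch performs a depth-first traversal in which a running counter advances by the gap each time a node is entered (giving s) and each time it is left (giving e). This is an assumption reconstructed from the worked example and the labels shown in the tables, not the published AssignLabel algorithm; the `Node` class and `assign_labels` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical tree node; start/end/level hold the (s-e)l label.
    name: str
    children: List["Node"] = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def assign_labels(root: Node, gap: int) -> None:
    """Depth-first range labeling with a fixed gap between consecutive values."""
    counter = 1 - gap  # chosen so that the root's start becomes 1

    def visit(node: Node, level: int) -> None:
        nonlocal counter
        counter += gap          # entering the node fixes its start value
        node.start, node.level = counter, level
        for child in node.children:
            visit(child, level + 1)
        counter += gap          # leaving the node fixes its end value
        node.end = counter

    visit(root, 0)


# Small fragment of the SIGMOD Record tree: one issue with a volume and a number.
text_11 = Node("11")
volume = Node("volume", [text_11])
number = Node("number", [Node("1")])
issue = Node("issue", [volume, number])
root = Node("SigmodRecord", [issue])

assign_labels(root, gap=10)
print((volume.start, volume.end, volume.level))     # (21, 51, 2)
print((text_11.start, text_11.end, text_11.level))  # (31, 41, 3)
```

With a gap of 10, the fragment reproduces the labels quoted for "volume" and its text leaf. The end values of "issue" and the root differ from the full dataset because the fragment contains far fewer nodes.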

Structural relationship determination

The mapping scheme of ORD-GAP uses two tables to map the XML data into RDB: an internal table and a text table. The internal table, called iTable, stores the nodes that do not contain a text value. The text table, called tTable, stores the leaf nodes. Both tables consist of the attributes Start, End, Level, PStart and Value: Start keeps the s value of a node, End keeps the e value of a node, and Level keeps the depth of a node from the root. Tables 1 and 2 are partial views of iTable and tTable after the labeling scheme has been applied (see Figure 2).

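Table 1. iTable of Parent Table for initial labeling.

Start | End | Level | PStart | Value
21 | 51 | 2 | 2 | volume
61 | 91 | 2 | 2 | number
121 | 151 | 4 | 6 | title
161 | 191 | 4 | 6 | initPage
201 | 231 | 4 | 6 | endPage
251 | 281 | 5 | 10 | author
291 | 321 | 5 | 10 | author
241 | 331 | 4 | 6 | authors
111 | 341 | 3 | 5 | article
361 | 391 | 4 | 13 | title
401 | 431 | 4 | 13 | initPage
441 | 471 | 4 | 13 | endPage
491 | 521 | 5 | 17 | author
531 | 561 | 5 | 17 | author
571 | 601 | 5 | 17 | author
481 | 611 | 4 | 13 | authors
351 | 621 | 3 | 5 | article
101 | 631 | 2 | 2 | articles
11 | 641 | 1 | 1 | issue
661 | 691 | 2 | 21 | volume
701 | 731 | 2 | 21 | number
751 | 781 | 3 | 24 | article
791 | 821 | 3 | 24 | article
741 | 831 | 2 | 21 | articles
651 | 841 | 1 | 1 | issue
1 | 851 | 0 | 0 | SigmodRecord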

Table 2. tTable of Child Table for initial labeling.

Start | End | Level | PStart | Value
31 | 41 | 3 | 3 | 11
71 | 81 | 3 | 4 | 1
131 | 141 | 5 | 7 | Architecture of Future Data Base Systems.
171 | 181 | 5 | 8 | 30
211 | 221 | 5 | 9 | 44
371 | 381 | 5 | 14 | Errors in 'Process Synchronization in Database Systems'.
411 | 421 | 5 | 15 | 9
451 | 461 | 5 | 16 | 29
671 | 681 | 3 | 22 | 11
711 | 721 | 3 | 23 | 3
761 | 771 | 4 | 25 | science direct
801 | 811 | 4 | 26 | ieee
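To make the two-table mapping concrete, the sketch below creates iTable and tTable in an in-memory SQLite database and loads a few rows taken from Tables 1 and 2. The choice of SQLite, the column types and the helper `text_of` are assumptions for illustration only; the paper does not state which RDB engine or schema definition was used. The column names Start and End are quoted because END is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE iTable ("Start" INTEGER PRIMARY KEY, "End" INTEGER,
                         Level INTEGER, PStart INTEGER, Value TEXT);
    CREATE TABLE tTable ("Start" INTEGER PRIMARY KEY, "End" INTEGER,
                         Level INTEGER, PStart INTEGER, Value TEXT);
    """
)

# Internal (non-text) nodes, copied from Table 1.
conn.executemany(
    "INSERT INTO iTable VALUES (?, ?, ?, ?, ?)",
    [(21, 51, 2, 2, "volume"), (61, 91, 2, 2, "number")],
)
# Leaf text nodes, copied from Table 2.
conn.executemany(
    "INSERT INTO tTable VALUES (?, ?, ?, ?, ?)",
    [(31, 41, 3, 3, "11"), (71, 81, 3, 4, "1")],
)


def text_of(element: str) -> list:
    """Return the text values whose range lies directly inside the named element."""
    rows = conn.execute(
        """
        SELECT t.Value
        FROM iTable AS i
        JOIN tTable AS t
          ON t."Start" > i."Start" AND t."End" < i."End" AND t.Level = i.Level + 1
        WHERE i.Value = ?
        ORDER BY t."Start"
        """,
        (element,),
    ).fetchall()
    return [row[0] for row in rows]


print(text_of("volume"))  # ['11']
print(text_of("number"))  # ['1']
```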

ORD-GAP supports all four structural relationships: level, P-C, A-D and sibling. The A-D relationship is determined based on the following condition:

  • if (A(s) < D(s) < A(e)) and (D (level) – A (level) > 1).

Example: Let node1 be volume (21-51)2 and node2 be SigmodRecord (1-811)0. Since SigmodRecord (1) < volume (21) < SigmodRecord (811) and volume (2) – SigmodRecord (0) > 1, node1 and node2 have an A-D relationship.

The P-C relationship is determined based on the following conditions:

  • if (P(s) < C(s) < P(e)) and (C (level) – P (level) = 1)

  • Pstart for C == Start for P (Mapping Scheme)

The level difference must be equal to one, since the parent is exactly one level higher than the child. The second condition is that the child's PStart value must be equal to the parent's Start value.

Example: Let node1 be article (111-341)3 and node2 be authors (241-331)4. Since article (111) < authors (241) < article (341) and authors (4) – article (3) = 1, node1 and node2 have a P-C relationship.

Lastly for Siblings, if the nodes have the same PStart from the table, they are siblings.

Example: Let node1 be endPage (201-231)4 and node2 be authors (241-331)4. From iTable, both have PStart ‘6’. As such, node1 is a sibling of node2.
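The three checks above can be written as small predicates. The sketch below is an illustration only: the `Label` tuple and the function names are hypothetical, the A-D test follows the condition exactly as stated (a level difference greater than one), and the P-C test uses only the range and level condition.

```python
from typing import NamedTuple


class Label(NamedTuple):
    # Hypothetical container for one row of iTable/tTable.
    start: int
    end: int
    level: int
    pstart: int


def is_ancestor(a: Label, d: Label) -> bool:
    """A-D: d starts inside a's range and lies more than one level below it."""
    return a.start < d.start < a.end and d.level - a.level > 1


def is_parent(p: Label, c: Label) -> bool:
    """P-C: c starts inside p's range and lies exactly one level below it."""
    return p.start < c.start < p.end and c.level - p.level == 1


def is_sibling(x: Label, y: Label) -> bool:
    """Siblings: two distinct nodes that share the same PStart value."""
    return x != y and x.pstart == y.pstart


# Labels taken from the examples in the text and from Table 1.
sigmod_record = Label(1, 811, 0, 0)
volume = Label(21, 51, 2, 2)
article = Label(111, 341, 3, 5)
authors = Label(241, 331, 4, 6)
end_page = Label(201, 231, 4, 6)

print(is_ancestor(sigmod_record, volume))  # True
print(is_parent(article, authors))         # True
print(is_sibling(end_page, authors))       # True
```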

Results

The dynamic update of ORD-GAP was adapted from ORDPath.14 ORDPath encodes the P-C relationship by extending the parent's ORDPath label with a component for the child, and reserves even numbers for further node insertions. Generally, this approach works well because all four relationships can be determined easily. However, we observed that the label size grows uncontrollably as the tree grows, so it may not be scalable for a huge dataset. We also observed that the volume of dynamic insertions is small compared to that of the initial tree labeling. This motivated us to use ORDPath labeling to support insertion updates, while keeping ORD-GAP for the initial tree labeling.

Insertion scenario with ORD-GAP

Insertions are of three kinds: left-most, right-most and in-between. Each insertion includes an additional node, known as the medium node, which represents the insertion point of the dynamic update. This method therefore allows an unlimited number of insertions into the XML tree without node relabeling.

Figure 4 shows dynamic updates for left-most, in-between, and right-most insertion. The nodes represent the left-most insertion (21.1), the in-between insertion (641.1), and the right-most insertion (831.1). Each insertion contains internal nodes and leaf nodes, which are mapped into the iTable (internal nodes) and the tTable (leaf nodes) as depicted in Tables 3 and 4, respectively.


Figure 4. Left-most, in-between and right-most insertion on ORD-GAP.

Table 3. iTable of Parent Table for dynamic updates.

Start | End | Level | PStart | Value | Pvalue | Type of insertion
21.1 | - | 1 | - | date | issue | Left-most
21.1.1 | - | 2 | - | date_issue | date | Left-most
831.1 | - | 1 | - | date | issue | Right-most
831.1.1 | - | 2 | - | date_article | date | Right-most
641.1 | - | 0 | - | addon | SigmodRecord | In-between
641.1.1 | - | 1 | - | page | addon | In-between
641.1.3 | - | 1 | - | sub_author | addon | In-between

Table 4. tTable of Child Table for dynamic updates.

Start | End | Level | PStart | Value | Type of insertion
21.1.1.1 | - | 3 | - | 26 August 2019 | Left-most
831.1.1.1 | - | 3 | - | 26 July 2019 | Right-most
641.1.1.1 | - | 2 | - | 100 page | In-between
641.1.3.1 | - | 2 | - | McDonald | In-between
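The dotted labels in Tables 3 and 4 follow a recognisable pattern: the medium node appends ".1" to an existing ORD-GAP value, and children receive odd components (1, 3, ...), leaving even numbers free as in ORDPath. The Python sketch below reproduces only that pattern. It is not a full ORDPath implementation, the function names are hypothetical, and how the anchor value (21, 641 or 831) is selected is taken as given rather than derived.

```python
from typing import List, Tuple


def parse(label: str) -> Tuple[int, ...]:
    """Split a dotted label such as '641.1.3' into integer components."""
    return tuple(int(part) for part in label.split("."))


def medium_label(anchor: int) -> str:
    """Label of the medium node attached to an existing ORD-GAP value."""
    return f"{anchor}.1"


def child_labels(parent: str, count: int) -> List[str]:
    """Labels for `count` children, using odd components only (1, 3, 5, ...)."""
    return [f"{parent}.{2 * i + 1}" for i in range(count)]


def is_descendant(label: str, ancestor: str) -> bool:
    """True if `ancestor` is a proper prefix of `label`, component by component."""
    l, a = parse(label), parse(ancestor)
    return len(l) > len(a) and l[: len(a)] == a


# In-between insertion from Table 3: medium node 'addon' with two children.
addon = medium_label(641)
page, sub_author = child_labels(addon, 2)
print(addon, page, sub_author)                # 641.1 641.1.1 641.1.3
print(child_labels(sub_author, 1))            # ['641.1.3.1']
print(is_descendant("641.1.3.1", addon))      # True
print(sorted([sub_author, addon, page], key=parse))
# ['641.1', '641.1.1', '641.1.3']
```

Comparing the parsed tuples rather than the raw strings keeps labels in document order even when a component has more than one digit.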

We implemented ORD-GAP using Java Development Kit (JDK) 8.0.510.16 and compiled it with NetBeans IDE 8.0.2. Experimental evaluations were conducted to measure the performance of ORD-GAP against the ORDPath14 and ME Labeling16 approaches. These two existing approaches were chosen for comparison because neither requires node re-labeling.

In the first part of the evaluation, the XML document is stored and transformed into RDB storage. The data insertion time and database storage size are recorded for all three approaches. After the storage is completed, we performed query retrieval to measure the performance of ORD-GAP, ORDPath and ME Labeling.

Lastly, ORD-GAP was evaluated on dynamic update operations. All experiments were performed on an Intel Core i7-3770 processor @ 3.4 GHz with 16 GB of RAM running Windows 7. In the subsequent evaluations, we used the DBLP dataset17 to demonstrate that the approach can support larger datasets.

Data storing evaluation time

In this evaluation, the insertion time was recorded four times. We discarded the first reading to eliminate the buffering effect and keep the execution times consistent. The reported result is the average of the three remaining consecutive readings. Table 5 shows the insertion time of ORD-GAP, ORDPath14 and ME Labeling.16 ORD-GAP is the fastest, followed by ME Labeling and ORDPath.

Table 5. XML data insertion on DBLP dataset.

Dataset | ORD-GAP insertion time (ms) | ORDPath insertion time (ms) | ME Labeling insertion time (ms)
SigmodRecord | 1,926,947 | 6,111,816 | 2,491,407
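The measurement protocol (four readings, first one discarded, remaining three averaged) amounts to the small helper below. The function name and the sample timings are hypothetical and serve only to show the arithmetic; they are not measurements from the experiments.

```python
from statistics import mean
from typing import Sequence


def average_after_warmup(timings_ms: Sequence[float]) -> float:
    """Drop the first (warm-up) reading and average the remaining ones."""
    if len(timings_ms) < 2:
        raise ValueError("need at least one reading after the warm-up run")
    return mean(timings_ms[1:])


# Hypothetical readings in milliseconds: one warm-up run plus three measured runs.
print(average_after_warmup([900, 100, 110, 120]))  # 110
```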

Storage space evaluation

Database storage consumption was evaluated to determine the storage space required by the ORD-GAP, ORDPath and ME Labeling approaches. From our experiments (see Table 6), we observed that ME Labeling requires more storage space than ORD-GAP and ORDPath because its labels grow larger as the depth of the XML tree increases.

Table 6. Database sizes of various approaches on DBLP.xml.

Approach | Table | Rows | Total rows | Database size (KB) | Total database size (MB)
ORD-GAP | iTable | 3332130 | 6337978 | 401736 | 749
ORD-GAP | tTable | 3005848 | | 366088 |
ME Labeling | MeParenttable | 3332130 | 6337978 | 392176 | 797
ME Labeling | MeChildtable | 3005848 | | 424912 |
ORDPath | ParentTablereed | 3332130 | 6337978 | 328264 | 651
ORDPath | ChildTablereed | 3005848 | | 338448 |
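The totals in Table 6 can be cross-checked from the per-table figures. The sketch below assumes that the total in MB is the sum of the two table sizes in KB divided by 1024 and truncated to a whole number; this conversion is an inference from the numbers, not something stated in the text, and `total_mb` is a hypothetical helper.

```python
def total_mb(*sizes_kb: int) -> int:
    """Sum table sizes given in KB and convert to whole MB (1 MB = 1024 KB, truncated)."""
    return sum(sizes_kb) // 1024


# Per-table sizes in KB, copied from Table 6.
print(total_mb(401736, 366088))  # 749  (ORD-GAP)
print(total_mb(392176, 424912))  # 797  (ME Labeling)
print(total_mb(328264, 338448))  # 651  (ORDPath)

# The row counts add up the same way for every approach.
print(3332130 + 3005848)         # 6337978
```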

As depicted, ORD-GAP reserves a gap between nodes, which delays the initial node labeling because some calculation is required before the initial nodes can be labeled. ORDPath uses byte-by-byte, dot-separated components, so each node label is derived from its parent's label down the depth of the XML tree. ME Labeling uses multiplication, which increases the label size, and the multiplication requires more computation time as the label size increases. Thus, both ORDPath and ME Labeling take less time for node labeling.

Query retrieval evaluation

Table 7 displays the query node in tree representation and XPath notation for each query.

Table 7. XPath Notation of DBLP dataset.

Query | Query node | XPath notation
PQ1 | [tree diagram] | /dblp/mastersthesis/author
PQ2 | [tree diagram] | //dblp//title
PQ3 | [tree diagram] | //phdthesis/title
TQ4 | [tree diagram] | /dblp[/article/www]/title
TQ5 | [tree diagram] | //dblp[//title]//editor
TQ6 | [tree diagram] | /dblp[/www]//title
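As an illustration of how a path query could be answered over the mapped tables, the sketch below evaluates PQ3 (//phdthesis/title) with a self-join on iTable that applies the range and level condition for P-C, followed by a join to tTable for the text. The rows are hypothetical labels for a tiny DBLP-like tree generated with a gap of 10, SQLite stands in for the RDB, and the SQL is one possible translation rather than the query used in the experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE iTable ("Start" INTEGER, "End" INTEGER, Level INTEGER, Value TEXT)')
conn.execute('CREATE TABLE tTable ("Start" INTEGER, "End" INTEGER, Level INTEGER, Value TEXT)')

# Hypothetical labels: dblp > phdthesis > title > text, dblp > article > title > text.
conn.executemany(
    "INSERT INTO iTable VALUES (?, ?, ?, ?)",
    [
        (1, 131, 0, "dblp"),
        (11, 61, 1, "phdthesis"),
        (21, 51, 2, "title"),
        (71, 121, 1, "article"),
        (81, 111, 2, "title"),
    ],
)
conn.executemany(
    "INSERT INTO tTable VALUES (?, ?, ?, ?)",
    [(31, 41, 3, "Thesis A"), (91, 101, 3, "Paper B")],
)

# //phdthesis/title: title must be a child of phdthesis (range containment, level + 1).
PQ3 = """
SELECT t.Value
FROM iTable AS p
JOIN iTable AS c
  ON c."Start" > p."Start" AND c."Start" < p."End" AND c.Level = p.Level + 1
JOIN tTable AS t
  ON t."Start" > c."Start" AND t."Start" < c."End" AND t.Level = c.Level + 1
WHERE p.Value = 'phdthesis' AND c.Value = 'title'
"""

pq3_titles = [row[0] for row in conn.execute(PQ3)]
print(pq3_titles)  # ['Thesis A']
```

Only the title under phdthesis is returned; the title under article falls outside the phdthesis range (81 is not below 61) and is filtered out by the join condition.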

Figure 5 shows the query execution performance of the various approaches. ORD-GAP leads, followed by ME Labeling and ORDPath. ORDPath requires more time than ORD-GAP and ME Labeling because of the number of elements per node in DBLP. Although the DBLP tree contains only three levels, each node has many siblings, so the data model grows horizontally. ORDPath is a prefix-based labeling scheme that traverses the tree using breadth-first search, and as the number of sibling nodes increases, so does the label size. Hence, it requires more time to retrieve data from the database.


Figure 5. Query retrieval time on DBLP dataset.

Conclusion

In this paper, we proposed a labeling scheme named ORD-GAP that enables dynamic insertion by adopting ORDPath techniques, which allow unrestricted insertions into large XML trees. We carried out evaluations to compare ORD-GAP with ORDPath and ME Labeling. The performance of ORD-GAP was evaluated in terms of database size, insertion, query retrieval and dynamic updates. We showed that ORD-GAP performs better than ORDPath and ME Labeling. However, we were not able to test ORD-GAP on datasets larger than 1.2 GB due to hardware limitations, namely the processor and the available RAM.

In future work, we will look into XML compression and optimization to further reduce the label size.

How to cite this article
Haw SC, Amin A, Wong CO and Subramaniam S. Improving the support for XML dynamic updates using a hybridization labeling scheme (ORD-GAP) [version 1; peer review: 2 approved]. F1000Research 2021, 10:907 (https://doi.org/10.12688/f1000research.69108.1)

Open Peer Review

Reviewer Report 16 Nov 2021
Amjad Qtaish, College of Computer Science and Engineering, University of Ha’il, Ha’il, Saudi Arabia 
Approved
This paper proposed a new labeling scheme for solving dynamic XML updates. Three cases of updating (insertions) are used, which are: leftmost, rightmost, and between siblings. I prefer to add another case which is the insertion of the leaf node. ...
HOW TO CITE THIS REPORT
Qtaish A. Reviewer Report For: Improving the support for XML dynamic updates using a hybridization labeling scheme (ORD-GAP) [version 1; peer review: 2 approved]. F1000Research 2021, 10:907 (https://doi.org/10.5256/f1000research.72709.r93981)
Reviewer Report 21 Sep 2021
Jiaheng Lu, Department of Computer Science, University of Helsinki, Helsinki, Finland 
Approved
This paper studied the problem of XML update by proposing a new dynamic labeling scheme called ORD-GAP. The methods look effective and the authors perform experiments to verify the update operation and query processing for two datasets: SIGMOD record and ...
HOW TO CITE THIS REPORT
Lu J. Reviewer Report For: Improving the support for XML dynamic updates using a hybridization labeling scheme (ORD-GAP) [version 1; peer review: 2 approved]. F1000Research 2021, 10:907 (https://doi.org/10.5256/f1000research.72709.r93978)