Software Tool Article
Revised

FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data

[version 2; peer review: 2 approved]
PUBLISHED 19 Jun 2019
This article is included in the Python collection.

Abstract

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This puts it in a special position to rule mine clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.

Keywords

Functional dependencies, Data mining, Electronic health records, Relational database, FDTool, Rule discovery

Revised Amendments from Version 1

In response to the reviewers’ comments, in this revision we corrected grammatical and typographical errors.

In the abstract, we clarify that the experimental comparison of several functional dependency algorithms is referenced from previous research.

In the sentence following Definition 6, we substituted the phrase “determines equivalent attributes sets with” with “determines equivalent attribute sets using”, since the FDTool code uses the functional dependencies discovered at each level to generate equivalent attribute sets.

We uploaded publicly available data with the same shape and structure as the 13 CARES datasets. We base all simulation studies on the publicly available data, which can be found in FDTool/data/input/CARES/ as part of the FDTool repository and archived in our Zenodo project.

To read any peer review reports and author responses for this article, follow the "read" links in the Open Peer Review table.

Introduction

Functional dependencies (FDs) are key to understanding how attributes in a database schema relate to one another. An FD defines a rule constraint between two sets of attributes in a relation r(U), where U = {v1, v2, …, vm} is a finite set of attributes (Yao et al., 2002). A combination of attributes over a dataset is called a candidate (Yao et al., 2002). An FD X → Y asserts that the values of candidate X uniquely determine those of candidate Y (Yao et al., 2002). For example, the social security number (SSN) attribute in a dataset of public records functionally determines the first name attribute. Because the FD holds, we write {SSN} → {first_name}.

Definition 1. A functional dependency X → Y, where X, Y ⊆ U, is satisfied by r(U) if, for all pairs of tuples ti, tj ∈ r(U), ti[X] = tj[X] implies ti[Y] = tj[Y] (Yao & Hamilton, 2008).

In this case, X is the left-hand side of the FD and Y is the right-hand side (Yao et al., 2002). If Y is not functionally dependent on any proper subset of X, then X → Y is minimal (Yao et al., 2002). Minimal FDs are our only concern in rule mining, since all other FDs are logically implied. For instance, if we know {SSN} → {first_name}, then we can infer {SSN, last_name} → {first_name}.
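Definition 1 can be checked by brute force in a few lines. The pandas-based helper below is an illustrative sketch of ours, not part of FDTool: an FD holds exactly when every group of rows that agrees on the left-hand side also agrees on the right-hand side.

```python
import pandas as pd

def fd_holds(df, lhs, rhs):
    """Check Definition 1: within every group of rows sharing the same
    lhs values, each rhs attribute must take at most one value."""
    return bool((df.groupby(list(lhs))[list(rhs)].nunique() <= 1).all().all())

records = pd.DataFrame({
    "SSN":        ["111", "111", "222"],
    "first_name": ["Ann", "Ann", "Bob"],
    "last_name":  ["Lee", "Kim", "Kim"],
})

print(fd_holds(records, ["SSN"], ["first_name"]))  # True: {SSN} -> {first_name}
print(fd_holds(records, ["SSN"], ["last_name"]))   # False: SSN 111 maps to two last names
```

Note that this scans the whole table per candidate FD; the partition-based method of Theorem 1 below is what makes discovery practical at scale.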

Power set lattice

The search space for FDs can be represented as a power set lattice of nonempty attribute combinations. Figure 1 gives the nonempty attribute combinations of a relation r(U) with U = {A, B, C, D}. There are 2^n − 1 = 2^4 − 1 = 15 attribute subsets in the power set lattice (Yao & Hamilton, 2008). Each combination X of the attributes in U can be the left-hand side of an FD X → Y satisfied by relation r(U) (Yao & Hamilton, 2008). Since the attribute set U itself trivially determines each of its proper subsets, it can be ignored as a candidate. There remain 2^n − 2 = 2^4 − 2 = 14 nonempty subsets of U to be considered candidates.


Figure 1. Nonempty combinations of attributes A, B, C, and D by k-level.

There are n · 2n–1n = 4 · 24–1 – 4 = 28 edges (or arrows) in the semi-lattice of the complete search space for FDs in relation r(U) (Yao & Hamilton, 2008). The size of the search space for FDs is exponentially related to the number of attributes in U. Hence, the search space for FDs increases quite significantly when there is a greater number of attributes in U. For instance, when there are 12 attributes in a relation, the search space for FDs climbs to 24,564. This gives reason to be cautious of runtime and memory costs when deploying a rule mining algorithm to discover FDs.

Partition

The algorithms used to discover FDs differ in their approach to navigating the complete search space of a relation. Their candidate pruning methods vary and sometimes the methods used to validate FDs do as well. These differences affect runtime and memory behavior when used to process tables of different dimensions.

A common data structure used to validate FDs is the partition. A partition places tuples that have the same values on an attribute into the same group (Yao et al., 2002).

Definition 2. Let X ⊆ U and let t1, …, tn be all the tuples in a relation r(U). The partition over X, denoted ∏X, is a set of groups such that ti and tj, 1 ≤ i, j ≤ n, i ≠ j, are in the same group if and only if ti[X] = tj[X] (Yao et al., 2002).

It follows from Definition 2 that the cardinality card(∏X) is the number of groups in the partition ∏X (Yao & Hamilton, 2008). Partition cardinality offers a quick approach to validating FDs in a dataset.

Theorem 1. An FD X → Y is satisfied by a relation r(U) if and only if card(∏X) = card(∏XY) (Huhtala et al., 1999).

Theorem 1 provides an efficient method to check whether an FD X → Y holds in a relation. Huhtala et al. (1999) proved it to support a fast validation method for relations consisting of a large number of tuples.
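Theorem 1 reduces FD validation to counting distinct value combinations. A sketch in pandas (the helper names are ours, not FDTool's internals):

```python
import pandas as pd

def card(df, attrs):
    """card(∏ attrs): the number of groups in the partition over attrs,
    i.e. the number of distinct value combinations on those columns."""
    return len(df.drop_duplicates(subset=list(attrs)))

def fd_holds_by_partition(df, X, Y):
    """Theorem 1: X -> Y holds iff card(∏X) == card(∏XY)."""
    return card(df, X) == card(df, list(X) + list(Y))

df = pd.DataFrame({"A": [0, 0, 1, 1], "B": [2, 2, 3, 3], "C": [4, 5, 4, 5]})
print(fd_holds_by_partition(df, ["A"], ["B"]))  # True:  adding B creates no new groups
print(fd_holds_by_partition(df, ["A"], ["C"]))  # False: 2 groups on A, 4 on AC
```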

Closure

Efforts in relational database theory have led to more runtime and memory efficient methods to check the complete search space of a relation for FDs. Instead of checking each arrow in the semi-lattice, we can infer the FDs that logically follow from those already discovered. Such FDs follow from Armstrong's Axioms (Maier, 1983) and the inference axioms derivable from them (Ramakrishnan & Gehrke, 2000), which are

  • Reflexivity: Y ⊆ X implies X → Y;

  • Augmentation: X → Y implies XZ → YZ;

  • Transitivity: X → Y and Y → Z imply X → Z;

  • Union: X → Y and X → Z imply X → YZ;

  • Decomposition: X → YZ implies X → Y and X → Z.

These axioms signal the distinction between FDs that can be inferred from already discovered FDs and those that cannot (Maier, 1983). Exploiting what can be derived from Armstrong’s Axioms allows us to avoid having to check many of the candidates in a search space.

Definition 3. Let F be a set of functional dependencies over a dataset D and X be a candidate over D. The closure of candidate X with respect to F, denoted X+, is defined as {Y | X → Y can be deduced from F by Armstrong's Axioms} (Yao & Hamilton, 2008).

The nontrivial closure of candidate X with respect to F is defined as X* = X+ \ X (Yao & Hamilton, 2008). Definition 3 gives room to elegantly define keys. Informally, a key is a set of attributes on which no two distinct tuples of a relation share the same values; keys uniquely identify all tuple records in a dataset.

Definition 4. Let R be a relational schema and X be a candidate of R over a dataset D. If X ∪ X* = R, then X is a key (Yao et al., 2002).

A candidate key X of a relation is a minimal key for that relation. This means that there is no proper subset of X for which Definition 4 holds.
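The closure X+ of Definition 3 can be computed with the standard fixed-point iteration over a set of FDs. The sketch below is our own helper, not FDTool code; it also shows the key test of Definition 4.

```python
def closure(X, fds):
    """Compute X+ : repeatedly apply FDs (lhs, rhs) whose left-hand side
    is already contained in the result, until nothing new is added."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

U = set("ABCD")
fds = [(frozenset("A"), frozenset("B")),   # A -> B
       (frozenset("B"), frozenset("C"))]   # B -> C

print(closure({"A"}, fds))             # {'A', 'B', 'C'}: D is not determined by A
print(closure({"A", "D"}, fds) == U)   # True: AD+ = U, so AD is a key
```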

Rule mining algorithms

Existing functional dependency algorithms are split among three categories: difference- and agree-set algorithms (e.g., Dep-Miner, FastFDs), dependency induction algorithms (e.g., FDEP), and lattice traversal algorithms (e.g., TANE, FUN, FD_Mine, DFD) (Papenbrock et al., 2015).

Difference- and agree-set algorithms model the search space of a relation as the cross product of all tuple records (Papenbrock et al., 2015). They search for sets of attributes agreeing on the values of certain tuple pairs. Attribute sets only functionally determine other attribute sets whose tuple pairs agree, i.e., agree-sets (Asghar & Ghenai, 2015; Papenbrock et al., 2015). Then, agree-sets are used to derive all minimal FDs.

Dependency induction algorithms assume a base set of FDs in which each attribute functionally determines each other attribute (Papenbrock et al., 2015). While iterating through row data, observations are made that require certain FDs to be removed from the base set and others added to it. These observations are made by comparing tuple pairs based on the equality of their projections. After each record in a dataset is compared, the FDs left in the base set are considered valid, minimal and complete (Papenbrock et al., 2015).

Lattice traversal algorithms model the search space of a relation as a power set lattice. Most such algorithms (i.e., TANE, FUN, FD_Mine) use a level-wise, bottom-up approach to traversing the search space of a relation (Papenbrock et al., 2015). They start by checking for FDs with singleton sets on the left-hand side and iteratively transition to candidates of greater cardinality.

Performance

Papenbrock et al. (2015) released an experimental comparison of the aforementioned FD discovery algorithms. The seven algorithms were re-implemented in Java based on their original publications and applied to 17 datasets of various dimensions. They found that none of the algorithms are suited to yield the complete result set of FDs from a dataset consisting of 100 columns and 1 million rows (Papenbrock et al., 2015). Hence, it is a matter of discretion to choose the algorithm best fitting the dimensions of a dataset.

The experimental results show that lattice traversal algorithms are the least memory efficient, since each k-level can be larger than the previous level by a multiplicative factor (Papenbrock et al., 2015). Difference- and agree-set algorithms and dependency induction algorithms perform favorably in memory experiments as a result of their operating directly on data and efficiently storing result sets. Lattice traversal algorithms scale poorly on tables with many columns (≥ 14 columns) due to memory limits (Papenbrock et al., 2015).

Lattice traversal algorithms are the most effective on datasets with many rows, because their validation method operates on attribute sets as opposed to data (Papenbrock et al., 2015). This puts such algorithms in a special position to rule mine clinical and demographic record datasets, which often consist of long and narrow sets of participant records. Difference- and agree-set algorithms and dependency induction algorithms commonly reach time limits when applied to datasets of these dimensions (> 100,000 rows) (Papenbrock et al., 2015).

Lattice traversal algorithms

Lattice traversal algorithms iterate through k-levels represented in a power set lattice. If the lattice is traversed from the bottom-up, we say the algorithm is level-wise.

Definition 5. Let X1, X2, …, Xk, Xk+1 be (k + 1) attributes over a database D. If X1X2…Xk → Xk+1 is an FD with k attributes on its left-hand side, then it is called a k-level FD (Yao et al., 2002).

The search space for FDs is reduced at the end of each iteration using pruning rules. Pruning rules use the FDs already discovered, together with those inferred from Armstrong's Axioms, to eliminate candidates that no longer need to be checked (Yao & Hamilton, 2008). After a search space is pruned, an Apriori_Gen principle generates the k-level candidates from the (k − 1)-level candidates that were not pruned (Yao & Hamilton, 2008).

Apriori_Gen:

  • oneUp: generates all possible candidates in Ck from those in Ck–1.

  • oneDown: generates all possible candidates in Ck–1 from those in Ck.

Level-wise lattice traversal algorithms stop iterating after all candidates in a search space are pruned. In this case, Apriori_Gen generates the null set ∅, raising a flag for the algorithm to terminate. This shortens runtime in proportion to how many FDs are discovered directly and how many are inferred.
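The two Apriori_Gen steps can be sketched as follows. This uses the pair-union formulation (and our own naming); the section on non-minimal FDs below discusses a stricter variant.

```python
from itertools import combinations

def one_up(Ck):
    """oneUp: generate all possible (k+1)-candidates by unioning
    pairs of k-candidates from Ck. An empty Ck yields the empty
    set, which is the algorithm's termination flag."""
    if not Ck:
        return set()
    k = len(next(iter(Ck)))
    return {p | q for p in Ck for q in Ck if len(p | q) == k + 1}

def one_down(S):
    """oneDown: the k-subsets of a (k+1)-candidate S."""
    return {frozenset(c) for c in combinations(sorted(S), len(S) - 1)}

C1 = {frozenset("A"), frozenset("B"), frozenset("C")}
C2 = one_up(C1)
print(sorted("".join(sorted(s)) for s in C2))       # ['AB', 'AC', 'BC']
print(sorted("".join(sorted(s)) for s in one_down(frozenset("ABC"))))  # ['AB', 'AC', 'BC']
print(one_up(set()))                                # set(): terminate
```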

Tane

The level-wise lattice traversal algorithms TANE, FUN, and FD_Mine differ in terms of pruning rules. FUN and FD_Mine expand on the pruning rules of TANE. Released by Huhtala et al. (1999), TANE prunes a search space on the basis that only minimal and non-trivial FDs need be checked. TANE restricts the right-hand side candidates C+ for each attribute combination X to the set

C+(X) = {A ∈ R | ∀B ∈ X : X \ {A, B} → B does not hold},

which contains all the attributes that the set X may still functionally determine (Papenbrock et al., 2015). The set C+ is used in the following pruning rules (Papenbrock et al., 2015).

•   Minimality pruning: If an FD X \ A → A holds, A and all B ∈ R \ X can be removed from C+(X).

•   Right-hand side pruning: If C+ (X) = ∅, the attribute combination X can be pruned from the lattice, as there are no more right-hand side candidates for a minimal FD.

•   Key pruning: If the attribute combination X is a key, it can be pruned from the lattice.

Key pruning implies that all supersets of a key, i.e., super keys, can be removed, since they are by definition non-minimal (Huhtala et al., 1999).

FD_Mine

Like TANE and FUN, FD_Mine is structured around the level-wise lattice traversal approach and the aforementioned pruning rules. Unlike the other two algorithms, FD_Mine, authored by Yao et al. (2002), uses the concept of equivalence as a means to more exhaustively prune the search space of a candidate (Papenbrock et al., 2015). Informally, attribute sets are equivalent if and only if they are functionally dependent on each other (Papenbrock et al., 2015).

The proofs demonstrating that no useful information is lost in pruning candidates from equivalent attribute sets were originally developed by Yao & Hamilton (2008) and are reproduced in this section. The equivalence pruning method can be derived directly from Armstrong's Axioms.

Definition 6. Let X and Y be candidates over a dataset D. If X → Y and Y → X hold, then we say that X and Y are an equivalence, denoted X ↔ Y.

After a k-level is fully validated, i.e., each k-level candidate is checked, FD_Mine determines equivalent attribute sets using the FDs already discovered.

Theorem 2. Let X, Y ⊆ U. If Y ⊆ X+ and X ⊆ Y+, then X ↔ Y (Yao & Hamilton, 2008).

Proof. Since X → X+ and Y ⊆ X+, Decomposition implies that X → Y. By a similar argument, Y → X holds. Because X → Y and Y → X, we have by definition that X ↔ Y holds.
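Theorem 2 turns equivalence detection into two subset tests on stored closures. A hypothetical sketch, assuming closure values have been recorded per candidate as the levels are validated (the helper and the closure table below are illustrative, not FD_Mine's actual data structures):

```python
def equivalent(X, Y, closures):
    """Theorem 2: X <-> Y iff Y ⊆ X+ and X ⊆ Y+."""
    return set(Y) <= closures[frozenset(X)] and set(X) <= closures[frozenset(Y)]

# Closure values as the algorithm might have recorded them after a level
closures = {
    frozenset("A"):  set("ABC"),   # A+  = ABC
    frozenset("BC"): set("ABC"),   # BC+ = ABC
    frozenset("D"):  set("D"),     # D+  = D
}

print(equivalent("A", "BC", closures))  # True:  A <-> BC, so BC can be pruned
print(equivalent("A", "D", closures))   # False
```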

Lemma 3 and Lemma 4 are derived from Armstrong's Axioms under the assumption of the equivalence X ↔ Y.

Lemma 3. Let W, X, Y, Y′, Z ⊆ U and Y ⊆ Y′. If X ↔ Y and XW → Z, then Y′W → Z (Yao & Hamilton, 2008).

Proof. Suppose that X ↔ Y and XW → Z. This implies that Y → X. By Augmentation, YW → XW. By Transitivity, YW → XW and XW → Z give that YW → Z. By Augmentation, Y′ \ Y can be added to both sides of YW → Z to give YW(Y′ \ Y) → Z(Y′ \ Y). Since Y ⊆ Y′, this means Y′W → Z(Y′ \ Y). Then, by Decomposition, Y′W → Z.

Lemma 4. Let W, X, Y, Z ⊆ U. If X ↔ Y and WZ → X, then WZ → Y (Yao & Hamilton, 2008).

Proof. By X ↔ Y, we know that X → Y. By Transitivity, WZ → X and X → Y imply WZ → Y.

Theorem 2 checks attribute sets X and Y for the equivalence X ↔ Y. FD_Mine assumes that the attribute set Y is generated before X. By Lemma 3 and Lemma 4, we know that for an equivalence X ↔ Y, no further attribute sets Z such that Y ⊆ Z need be checked (Yao & Hamilton, 2008). Hence, Y is deleted as a result of the following pruning rule.

•   Equivalence pruning: If X ↔ Y is satisfied by relation r(U), then candidate Y can be deleted (Yao & Hamilton, 2008).

Exploiting the equivalence pruning method leaves FD_Mine in a more aggressive position to prune candidates than TANE. This offers an advantage in terms of runtime and memory behavior (Yao et al., 2002).

Non-minimal FDs

The pseudo-code proposed in the second version of FD_Mine (Yao & Hamilton, 2008) will under certain circumstances output non-minimal FDs (Papenbrock et al., 2015). FD_Mine references an Apriori_Gen method (Agrawal et al., 1996) stating that for each pair of candidates p, q ∈ Ck−1, the set p ∪ q is to be placed in Ck if card(p ∪ q) = k. Example 1 shows that the Apriori_Gen method referenced and utilized by FD_Mine can violate minimality pruning by checking supersets that need not be checked. Figure 2 gives the power set lattice of the relation described in Example 1 pruned by FD_Mine.


Figure 2. A pruned power set lattice.

FD_Mine deletes the candidates ABC, ABE, and ABD (red ovals) as a result of finding the candidate key AB (blue hexagon). It generates supersets of AB (yellow rectangles) at the next level.

Example 1. Let r(U) be a relation such that U = {A,B,C,D,E}. Suppose that AB is a key and that there are no other FDs in r(U). Since AB is a key, we know by definition that AB ∪ AB* = U. Given this and that there are no other FDs in r(U), the candidates ABC, ABD, and ABE are deleted from C3, so C3 = Prune(Apriori_Gen(C2)) = {ACE, BCE, ACD, BCD, ADE, CDE, BDE}. Then, C4 = {ABCD, ABCE, ACDE, ABDE, BCDE}. Because it must be that AB* = {C, D, E}, the algorithm validates the FDs ABCD → E, ABCE → D, and ABDE → C. Since E, for example, is functionally dependent on the proper subset AB ⊂ ABCD, the FD ABCD → E is non-minimal.

The Apriori_Gen principle presented in TANE (Huhtala et al., 1999) more effectively generates candidate level Ck+1 from Ck. It requires that Ck+1 only contain the attribute sets of size k + 1 which have all their subsets of size k in Ck (Huhtala et al., 1999); i.e.,

Ck+1 = {X | card(X) = k + 1 and for all Y with Y ⊂ X and card(Y) = k we have Y ∈ Ck}

In reference to Example 1, this method does not insert the candidate ABCD in C4 because, without loss of generality, ABC ⊂ ABCD but ABC ∉ C3. Thus, the non-minimal FD ABCD → E is not checked.
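The stricter generation rule can be sketched directly (our own helper names). Applied to the pruned C3 of Example 1, it keeps only ACDE and BCDE in C4, excluding the supersets of the key AB:

```python
from itertools import combinations

def apriori_gen(Ck, k):
    """TANE-style generation: keep a (k+1)-set only if ALL of its
    k-subsets survived pruning, i.e. are still present in Ck."""
    Ck = set(Ck)
    candidates = {p | q for p in Ck for q in Ck if len(p | q) == k + 1}
    return {X for X in candidates
            if all(frozenset(s) in Ck for s in combinations(X, k))}

# Example 1: ABC, ABD, ABE were deleted from C3 after AB was found to be a key
C3 = [frozenset(s) for s in ("ACE", "BCE", "ACD", "BCD", "ADE", "CDE", "BDE")]
C4 = apriori_gen(C3, 3)
print(sorted("".join(sorted(X)) for X in C4))  # ['ACDE', 'BCDE']
```

ABCD, ABCE, and ABDE are generated as pair unions but rejected because ABC, and ABD respectively, are no longer in C3.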

Prune(Ck, E, Closure)

01  for each S ∈ Ck:
02      for each X ∈ oneDown[Ck]:
03          if (X ⊂ S) then:
04              if (X ∈ {Z | Y ↔ Z ∈ E}) then:    # Pruning rule 1
05                  delete S from Ck
06              if S ⊆ X+ then:                   # Pruning rule 2
07                  delete S from Ck
08              S+ = S+ ∪ X*                      # Pruning rule 3
09              if U == S+ then:                  # Pruning rule 4
10                  delete S from Ck
11  return Ck, Closure;

FD_Mine will, under the circumstance described in Example 1, set closure values incorrectly. In line 2, FD_Mine iterates through Ck−1, as opposed to oneDown[Ck], which can cause the Prune() function to skip setting the closure values of certain candidates. In Example 1, FD_Mine does not accurately set the closure ABCD* to E, since E is not saved to the closure values of the candidates ACD, BCD ⊂ ABCD at the previous level. Iterating through oneDown[Ck] sets the closure of a candidate to the union of the closure values of its proper subsets, so that the closure values of deleted candidates are not lost among their supersets.

Properly assigned closure values allow the algorithm to avoid checking many non-minimal FDs. This is because the ObtainFDs module, i.e., the validation method, only checks the right-hand side attributes vi for which vi ∈ U \ X+ (Yao & Hamilton, 2008). Hence, provided that Pruning rule 3 asserts the equality ABCD* = E, the FD ABCD → E need not be checked.

Operation

FDTool (Buranosky, 2018) is a command line Python application executed with the following statement: $ fdtool /path/to/file. For Windows users, this is to be run from the working directory of fdtool.exe, which will likely be C:\Python27\Scripts for those installing with pip install fdtool. For other systems, installation automatically inserts the file path to the fdtool command in the PATH variable. /path/to/file is the absolute or relative path to a .txt, .csv, or .pkl file containing a tabular dataset. If the data file has the extension .txt or .csv, FDTool detects the following separators: comma (','), bar ('|'), semicolon (';'), colon (':'), and tilde ('~'). The data is read in as a Pandas data frame.


FDTool provides the user with the minimal FDs, equivalent attribute sets, and candidate keys mined from a dataset. This output is accompanied by the time (s) it takes for the code to terminate (after reading in data), the row and attribute counts of the data, the number of FDs and equivalent attribute sets found, and the number of FDs checked. This is printed to the terminal after the code is executed, as shown in Figure 3. The information is saved to a .FD_Info.txt file.


Figure 3. Printed output of FDTool.exe.

Figure 3 shows the printed output of FDTool.exe applied to the contents of Table 1. The output file Table1.FD_Info.txt is saved to the user's current working directory.

Table 1. Example dataset.

A  B  C  D  E
0  0  0  2  0
0  1  0  2  0
0  2  0  2  2
0  3  1  2  0
4  1  1  1  4
4  3  1  1  2
0  0  1  2  0

Implementation

FDTool is a Python based re-implementation of the FD_Mine algorithm with additional features added to automate typical processes in database architecture. FD_Mine was published in two papers with more detail given to the scientific concepts used in algorithms of its kind (Yao et al., 2002; Yao & Hamilton, 2008). The two versions of FD_Mine were released with different structures but make use of the same theoretical foundation (Papenbrock et al., 2015), which is fully supported in mathematical proofs of the pruning rules used (Yao & Hamilton, 2008). FDTool was coded with special attention given to the pseudo-code presented in the second version of FD_Mine (Yao & Hamilton, 2008).

The Python script dbschema.py in FDTool/fdtool/modules/dbschema is taken from dbschemacmd (https://www.elstel.org/database/dbschemacmd.html.en): a tool for database schema normalization working on functional dependencies (Elmasri & Navathe, 2011). It is used to take sets of FDs and infer candidate keys from them. The operation first assigns the left-hand side attribute combinations of a set of FDs to dictionary keys and their closures to the corresponding values. It then reduces the set of FDs to a minimum coverage. Candidate keys are assembled using the minimum coverage and closure structure by adding attributes to key candidates until each minimal attribute set X for which X+ = U is found. Details on the dbschema operations are described in FDTool/fdtool/modules/dbschema/Docs.
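The key-assembly step can be illustrated with a naive sketch of our own, far simpler than dbschemacmd's actual routine: grow attribute sets level by level, skipping supersets of keys already found, until each minimal X with X+ = U is identified.

```python
from itertools import combinations

def closure(X, fds):
    """X+ under a set of FDs given as (lhs, rhs) frozenset pairs."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(U, fds):
    """All minimal attribute sets X with X+ = U."""
    keys = []
    for k in range(1, len(U) + 1):
        for X in map(set, combinations(sorted(U), k)):
            if any(key <= X for key in keys):
                continue  # proper superset of a key: non-minimal by definition
            if closure(X, fds) == set(U):
                keys.append(X)
    return keys

U = set("ABCDE")
fds = [(frozenset("AB"), frozenset("CDE"))]  # AB -> CDE and nothing else
print(["".join(sorted(X)) for X in candidate_keys(U, fds)])  # ['AB']
```

This brute-force search is exponential in the attribute count; dbschema instead works from a minimum coverage of the FDs, but the minimality condition it enforces is the same.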

Use cases

FDTool was initially created to help decompose datasets of medical records as part of Clinical Archived Records research for Environmental Studies (CARES). CARES currently contains 13 datasets obtained from the medical software firms Epic and Legacy. The attribute count in this database ranges from 4 to 18; the row count ranges from 42,369 to 8,201,636.

Experimental results

To limit the strain on computational resources, FDTool has a built-in time limit of 4 hours. FDTool reaches this preset limit (triggering program termination) when applied to the PatientDemographics dataset (42,369 rows × 18 columns) and the EpicVitals_TobaccoAlcOnly dataset (896,962 rows × 18 columns). The remaining 11 CARES datasets are given in Table 2.

Table 2. Experimental results of FDTool on 11 CARES datasets, which terminate in less than 4 hours (preset limit).

Dataset Name      Attribute Count  Row Count  FDs checked  FDs found  Equivalences  Time (s)
AllDxs            7                8,201,686  112          7          1             577.9
AllLabs           10               1,294,106  818          43         4             999.8
AllVisits         9                2,019,117  346          44         7             804.1
EpicMeds          10               1,281,731  453          26         0             551.9
EpicVitals2015    7                1,246,303  127          2          0             86.5
EpicVitals2016    7                988,327    127          2          0             63.0
FamHx             4                93,725     15           0          0             0.7
LegacyIPMeds      8                647,122    79           14         1             28.2
LegacyOPMeds      7                740,616    33           18         1             7.7
LegacyPayors      4                1,465,233  15           4          1             9.4
LegacyVitals      8                1,453,927  146          7          0             134.4

Experimental summary

The results in Table 2 show that runtime is primarily determined by the number of attributes in a dataset. For instance, the LegacyPayors dataset (1,465,233 rows × 4 columns) has slightly more rows (a 13% increase) but far fewer attributes (a 60% decrease) compared to the AllLabs dataset (1,294,106 rows × 10 columns). The runtime of LegacyPayors (9.4 s) is much less than that of AllLabs (999.8 s) because AllLabs has many more arrows in its power set lattice,

n · 2^(n−1) − n = 10 · 2^9 − 10 = 5110,

than does LegacyPayors (28). Hence, FDTool has more FDs to check when applied to AllLabs. It is clear that the attribute count of a dataset has a much greater effect on the runtime of FDTool than does row count.

Many of the arrows in the power set lattice of a candidate are pruned by FDTool. AllLabs has 5,110 arrows in its power set lattice, yet FDTool only checks 818 FDs, as many are inferred from the 43 FDs found. This follows from the Prune() function, which deletes many of the candidates to be checked, partly as a result of mining 4 equivalent attribute sets. FDTool terminates after 5 k-levels when applied to AllLabs.

Future development

We want to improve FDTool's performance so that it is better equipped to handle datasets of different dimensions. Using the dependency induction algorithm FDEP, the reach of FDTool could be extended to datasets with fewer rows and more than 100 columns (Papenbrock et al., 2015). This might also require upgrading the source code with multicore processing methods, such as a Java API, to reduce runtime and avoid reaching memory limits. A formal proof of the dbschema operations is also desired.

Another goal is to increase the functionality provided by FDTool. This would mean implementing the pen and paper methods typically used to normalize relational schema and decompose tables. Our intent is to incorporate these changes in newer versions of FDTool, released at regular periods, so as to develop it as Python software that could automate much of what is done in the database design process.

Data availability

Zenodo: USEPA/FDTool: FDTool. https://doi.org/10.5281/zenodo.14428423.

While the authors fully support the open dissemination of data for verification and replication purposes, CARES data cannot be released as it contains Protected Health Information. For the purpose of testing the runtime and memory behavior of FDTool, we have produced simulated copies of all 13 datasets in the CARES collection. These datasets are publicly available in FDTool/data/input/CARES as part of the FDTool repository and archived in the above Zenodo project.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

FDTool is available from the Python Package Index: https://pypi.org/project/fdtool/

Latest source code: https://github.com/USEPA/FDTool.git

Source code at time of publication: https://doi.org/10.5281/zenodo.14428423

License: CC0 1.0 Universal. Module FDTool/fdtool/modules/dbschema released under a modified C-FSL license.

How to cite this article: Buranosky M, Stellnberger E, Pfaff E et al. FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data [version 2; peer review: 2 approved]. F1000Research 2019, 7:1667 (https://doi.org/10.12688/f1000research.16483.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 19 Oct 2018
Views
19
Cite
Reviewer Report, 29 Apr 2019
Sayan Mukherjee, Department of Statistical Science, Duke University, Durham, NC, USA
Status: Approved
This paper outlines in detail FDTool: a Python application to mine for functional dependencies. The paper is about developing a database tool that improves FD_Mine, a previous tool for functional dependencies.

The definitions, problem statement and explanations …
How to cite this report:
Mukherjee S. Reviewer Report For: FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data [version 2; peer review: 2 approved]. F1000Research 2019, 7:1667 (https://doi.org/10.5256/f1000research.18017.r46574)
Reviewer Report, 10 Dec 2018
Howard J. Hamilton, Department of Computer Science, University of Regina, Regina, SK, Canada
Shubhashis Shil, University of Regina, Regina, SK, Canada
Status: Approved
The article is well-structured and well-written. It contains nice presentations of appropriate definitions, theorems, algorithms, examples, and use cases.

The article has clear contributions:

In the theory part: it enhances the FD_Mine algorithm by improving …
How to cite this report:
Hamilton HJ and Shil S. Reviewer Report For: FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data [version 2; peer review: 2 approved]. F1000Research 2019, 7:1667 (https://doi.org/10.5256/f1000research.18017.r39685)