Revised
Amendments from Version 1
With our revision we sharpened the focus of our paper. Our main focus is neither on the specific results of the presented use case, nor on the metrics we provide. We are writing an opinion paper (the article has now been reclassified as such), and both the use case and the metrics are illustrations of our opinion. Here we want to make a strong case for the simplicity of data and workflow components. Although it is not our intention to use the case study as proof, our paper is accompanied by many statistical analyses and plots. This may lead the reader to believe that we want to present a research article. However, we think that our plots are very useful for other data managers and scientists in illustrating why it is worthwhile to invest energy into simplifying datasets. This is especially the case for files from the long tail of big data: handcrafted, relatively small datasets resulting from fieldwork rather than from automated sensors. To illustrate the problem of merging these files, which is our day-to-day work as hybrids of data managers and researchers, we chose our case study, as it is representative of our work and that of the fellow data managers we spoke to. We also think it is highly useful for illustrating our difficulties in data reuse. Reworking our text in response to the reviewer’s questions, we scaled down the method descriptions and put a stronger focus on the opinion parts of the paper. To that end we reworked the text in many passages and added a new section to the discussion that better discusses the complexity measures.
See the authors' detailed response to the review by Paolo Missier
Introduction
Interdisciplinary approaches, new tools and technologies, and the increasing availability of online-accessible data have changed the way researchers pose questions and perform analyses1. Workflow systems like Kepler or Pegasus enable access to various sources of scientific data, allow for a visual documentation of scientific analyses, and can help break complex analyses down into smaller components2,3. However, as analyses and the datasets involved become more complex, workflows can easily grow to a degree of complexity that makes them hard to understand and reuse. This is particularly true for data in ecology, which often come as small, highly heterogeneous files resulting not from automated loggers but from scientific experiments, observations, or interviews. The current literature covers tools to create and manipulate workflows4–6, data provenance2, and the integration of semantics into workflows7. Here we argue that there is also a need for quality measures of workflow components, including scripts, as well as of the underlying data sources. Failure to reuse workflows and available research data is not only a waste of time, money, and effort but also a threat to the basic scientific principle of reproducibility. Providing feedback mechanisms on data and workflow component complexity has great potential to increase the readability and reuse of workflows and their components.
In the following we 1) introduce our concepts of workflow component complexity and identity, as well as data complexity. We then 2) use a workflow from the research domain of biodiversity and ecosystem functioning (BEF) to illustrate these concepts. The analysis combines small and heterogeneous datasets from different working groups in the BEF-China experiment (DFG: FOR 891) to quantify the effect of biodiversity and stand age on carbon stocks in a subtropical forest. In the third and last part of the paper we 3) discuss how quantifying the complexity and identity of workflow components and data opens up opportunities for useful features of data sharing platforms that foster scientific reproducibility. In particular, we are convinced that simplicity and a clear focus of research data and scientific workflows are the key to adequate reuse and ultimately to the reproducibility of science.
Complexity and identity
Here we are interested in workflows that begin with the cleaning, aggregation, and imputation of research data. These first steps can make up as much as 70% of a whole workflow8. As data managers and researchers, we want to improve the readability of such workflows, whether they are scripts or graphs. Our concept of complexity should therefore capture the effort and time needed to understand and reuse such workflows. For the complexity of source code, similar initiatives already provide quality measures. The Code Climate service (https://codeclimate.com/?v=b), for example, gives programmers complexity feedback for many different programming languages. Its complexity measures take into account the number of lines of code as well as the repetition of identical code lines.
Quantifying data complexity is not as straightforward as quantifying workflow component complexity. Datasets used for synthesis in research collaborations often consist of “dark” data, lacking sufficient metadata for reuse9–11. In our experience as data managers of research collaborations, many datasets contain a complete representation of a certain study and thus allow more than one question to be answered. This is due to a “space-efficient” use of sheets of paper and Excel spreadsheets during the field period of the study. Thus, single data columns are used for different measurements, or color is used to code for study sites without naming them explicitly in a separate column; taken together, this constitutes poor data management. Later, during write-up, each specific analysis often uses only a subset of the original dataset. Data need to be transformed, imputed, aggregated, or merged with data from other columns to be used in an analysis10. Thus, not only the metadata but also the data columns in datasets differ in their quality and their usage in a workflow.
Here we argue that data complexity can be quantified by looking at the workflow components needed to aggregate and focus the data for analysis. One of the paradigms of data-driven science is that an analysis should be accompanied by its data. We argue that, at the same time, data should be accompanied by workflows that offer cleaning and a meaningful aggregation of the data. Data complexity could then be measured by the complexity of these accompanying workflows.
Garijo et al.8 identify common, recurring tasks in workflows, including data-oriented and workflow-oriented motifs. Identifying these motifs or identities of workflow components may allow for improved sharing of code and workflow components. Many initiatives promote the sharing and reuse of small pieces of code, including the GitHub service Gist and the Stack Overflow question-and-answer portal. Workflows and components are shared via online platforms like “myExperiment”12, which to date has approximately 7500 members presenting about 2500 workflows. Providing quantitative complexity measures together with automated tagging may further increase component and data reuse, and the identification of common tasks may also support the use of semantic technologies that assist in the workflow creation process7,13.
Example workflow
Biodiversity effects on subtropical carbon stocks
Our example workflow is part of an ongoing study that measures biodiversity effects on subtropical carbon stocks and flows. It represents a typical synthesis in collaborative research projects, as it combines eight datasets collected by seven independent research groups collaborating within the BEF-China research platform (www.bef-china.de, DFG: FOR 891). The data is archived, harmonized, and exchanged using the BEFdata web application10. The metadata is exported in Ecological Metadata Language (EML) format, which is used to import the data into the Kepler workflow system2 (Figure 1). The data describes carbon pools from soil, litter, woody debris, herb-layer plants, and trees and shrubs surpassing 3 cm diameter at breast height. The data was collected in 2008 and early 2009 on the observational plots of the research platform. The plots span a gradient from 22 to 116 years of plot age and 15 to 35 tree species14. Our example workflow cleans, imputes, and merges the data and terminates in a linear model relating biomass pools to plot age and plot diversity. It shows that carbon pools increase with stand age; in plots with high species richness, however, this increase is less steep (p-values for stand age: 0.0006, species richness: 0.0568, and their interaction: 0.0236).

Figure 1. The EML 2 dataset component in use for the integration of the wood density dataset in the workflow.
On the right side, the opened metadata window displays all the additional information available for the dataset.
Workflow design
We used the Kepler workflow system (version 2.4) to build our workflow. The components in Kepler fall into two categories: “actors”, which handle all kinds of data-related tasks, and “directors”, which direct the execution of components in the workflow. Workflow components can perform anything from data import to data transformation to the execution of complex scripts containing statistical procedures2. The components in Kepler can “talk” to each other via a port system: output ports of one component hand over their data to input ports of another component for further consumption2.
The Kepler “SDF” director was used for execution, as it handles sequential workflows. The data was imported using the “eml2dataset” actor, which imports data following the conventions of EML15. It reads the information available in the metadata file and uses it to automatically set up output ports reflecting the columns in the data, allowing direct consumption by other components in the workflow. For the data manipulation in Kepler we used the “RExpression” actor. It offers an interface to the R statistics environment16 and thus allows arbitrarily complex R scripts to be embedded into the workflow. We aimed for uniform workflow components, setting a rule-of-thumb limit of five lines of code per component.
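To give a feel for this granularity, a typical “merge data” component under this rule might contain no more than a single R statement. The following is a hypothetical, self-contained illustration (the data frames stand in for what input ports would deliver; none of the names are taken from the actual workflow):

```r
# Hypothetical input-port data, as delivered by upstream components.
tree_sizes <- data.frame(species = c("Castanopsis eyrei", "Schima superba"),
                         dbh_cm  = c(12.3, 8.7))
wood_density <- data.frame(species = c("Castanopsis eyrei", "Schima superba"),
                           density_g_cm3 = c(0.54, 0.61))

# The entire component body stays within the five-line rule of thumb;
# 'merged' would be exposed on an output port for downstream components.
merged <- merge(tree_sizes, wood_density, by = "species")
```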
Quantifying workflow complexity
To quantify the complexity of the components we used the number of code lines (loc), the number of R commands (cc), the number of R packages used (pc), as well as the number of input and output ports (cp) of the components (Equation 1). We further calculated a relative component complexity as the ratio of absolute complexity to total workflow complexity, the latter given by the sum of all component complexities (Equation 2).
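The numbered equations did not survive into this version of the text. A plausible reconstruction from the definitions above, assuming that absolute complexity is the plain sum of the four counts, is:

```latex
% Plausible reconstruction: absolute complexity of component i,
% assuming a plain sum of the four counts.
C_i = loc_i + cc_i + pc_i + cp_i \qquad (1)

% Relative complexity of component i among the n components of the
% workflow (reported as a percentage in the results).
\tilde{C}_i = \frac{C_i}{\sum_{j=1}^{n} C_j} \qquad (2)
```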
As each component in the workflow starts its operation only once all input port variables have arrived, the longest port connection of a component back to a data source defines its absolute position in the workflow sequence (Figure 2). We could thus explore total workflow complexity, individual component complexities, the number of components, and the number of identical tasks (see below) along the sequence of the workflow. For this we used linear models, which were compared using the Akaike Information Criterion (AIC)17 to select the most parsimonious model.
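For illustration, the position assignment amounts to a longest-path traversal over the port connections. This is a minimal R sketch with a hypothetical edge list; none of the component names are taken from the actual workflow:

```r
# Hypothetical port connections: 'from' feeds its output into 'to'.
edges <- data.frame(
  from = c("herb_data", "herb_data", "to_numeric", "merge_herb_plots"),
  to   = c("to_numeric", "merge_herb_plots", "merge_herb_plots",
           "aggregate_biomass"),
  stringsAsFactors = FALSE
)
sources <- "herb_data"  # data sources sit at position 0

# Position of a component: longest path back to any data source.
position <- function(node) {
  if (node %in% sources) return(0)
  parents <- edges$from[edges$to == node]
  1 + max(vapply(parents, position, numeric(1)))
}

vapply(unique(edges$to), position, numeric(1))
# to_numeric = 1, merge_herb_plots = 2, aggregate_biomass = 3
```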

Figure 2. This figure shows assigned component positions using the example of the herb layer dataset of the workflow.
The absolute position of a component in the workflow is defined by its distance back to the data source. The numbers on the components display this distance; position numbering starts at 0.
Quantifying component identity
Based on our analysis we defined 12 tasks, or identities, handled by the components a priori (Table 1). We then used text mining tools to characterize the components in the workflow automatically. For this we used the presence/absence of R commands and libraries as qualitative values, the number of input and output ports, the number of datasets a component is connected to, as well as the count of code lines. This allowed us to match the defined identities against the gathered characteristics of the components. We used non-metric multidimensional scaling (NMDS)18 to find the two main axes of variation in the multidimensional space defined by the characteristics. We then performed linear regressions to identify which of the characteristics and which of the identities could explain the variation along the two NMDS axes. Furthermore, we compared the complexity of identities using a Kruskal-Wallis test and post-hoc Wilcoxon tests, since residuals were not normally distributed (Shapiro-Wilk test).
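We do not reproduce our exact calls here, but a minimal sketch of this step with the vegan package18 could look as follows; the characteristics table is made up for illustration:

```r
library(vegan)  # provides metaMDS() and envfit(); see reference 18

# Hypothetical components-by-characteristics table: presence/absence
# of R commands plus simple counts (code lines, ports).
traits <- data.frame(
  as.numeric = c(1, 0, 0, 1, 0, 0),
  grep       = c(0, 0, 1, 0, 0, 1),
  ddply      = c(0, 1, 0, 0, 1, 0),
  loc        = c(1, 4, 6, 2, 5, 3),
  ports      = c(2, 3, 2, 2, 4, 3)
)

# Two-axis NMDS on the component characteristics.
ord <- metaMDS(traits, k = 2, trace = FALSE)
ord$stress  # the real workflow gave a stress value of 0.17

# Regress each characteristic on the axis scores; this yields the r2
# and permutation-based Pr(>r) values reported in Table 2.
envfit(ord, traits, permutations = 999)
```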
Table 1. Workflow component identities defined a priori and their relation to the data-oriented motifs identified by Garijo et al.8.
Figure 6 plots the a priori defined identities of the workflow components against the characteristics we measured from each component a posteriori. Characteristics include lines of code or specific commands (Table 2).
Identities | Description | Motif |
---|---|---|
data source | Access (remote or local) data | Data retrieval |
data type transformation | Transform the type of a variable (e.g. to numeric) | Data preparation |
merge data | Match and merge data | Data organization |
data aggregation | Aggregation of data | Data organization |
create new vector | Create a vector filled with new data | Data curat./clean |
data imputation | Impute data (e.g. linear regressions on data subsets) | Data curat./clean |
modify a vector | Modify a complete vector by a factor or basic arithmetic operation | Data organization |
create new factor | Create a new factor | Data organization |
data extraction | Extract data values (e.g. from comment strings) | Data curat./clean |
sort data | Sort data | Data organization |
data modeling | All kinds of model comparison related operations (ANOVA, AIC) | Data analysis |
Quantifying quality and usage of data
For our analysis we only used a subset of the data columns available in each data source. We thus quantified the “data usage” of a data source as the ratio of data columns used for the analysis to the total number of data columns in that data source. In contrast, the usage of a data column in relation to the whole workflow is quantified by the total number of workflow components processing the data, before and after the critical component in the workflow that marks where the data preparation of a column ends (Equation 3). Similarly, the quality of a data column in relation to the workflow is quantified by the number of workflow components that deal with its data preparation (Equation 4). Note that a higher value of this quality metric means lower column quality in terms of the effort needed to prepare the column for analysis. Using these metrics we can compare datasets based on the usage and quality of their data; we did this using Kruskal-Wallis and post-hoc Wilcoxon tests.
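Equations 3 and 4 were likewise lost in export. A plausible reconstruction, with the caveat that the exact counting rules are our reading of the worked example in Figure 3, is:

```latex
% Usage of a data source: share of its columns consumed by the analysis.
usage_{source} = \frac{n_{used\ columns}}{n_{total\ columns}}

% Usage of a column: components directly consuming (C) or indirectly
% influenced by (I) the column after the critical component.
usage_{column} = n_{C} + n_{I} \qquad (3)

% Quality of a column: preparation components (P) plus the critical
% component itself (Figure 3: three P components give a quality of 4).
quality_{column} = n_{P} + 1 \qquad (4)
```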
The beginning matters - results from our workflow meta-analysis
The workflow analysing carbon pools along a gradient of biodiversity consisted of 71 components in 16 workflow positions, consuming the data of 8 datasets (Table 3). The data in the workflow was manipulated via 234 lines of R code. The number of code lines per component ranged between 1 (e.g. component plot_2_numeric) and 23 (component impute_missing_tree_heights) with an overall mean of 3.3 (± 3.98 SD). See Figure 2 and Figure 3 for a graphical representation of the workflow.

Figure 3. The usage and quality measure on an example dataset.
Components marked with P represent preparation steps of a variable; here we see three preparation steps, so the quality is 4. Components marked with C and I represent direct consumption and indirect influence, respectively. Together with all subsequently influenced components, these constitute the variable’s usage.
Although we aimed to keep the components streamlined and simple, the absolute and relative component complexities varied markedly. Absolute complexity ranged between 4 and 41 with an overall mean of 9.25 (± 6.77 SD) (summary: Min. 4.0, 1st Qu. 4.0, Median 8.0, Mean 9.2, 3rd Qu. 12.0, Max. 41.0). Relative component complexity ranged between 0.69% (e.g. component calculate_carbon_mass_from_biomass) and 7.03% (e.g. component add_missing_height_broken_trees) with an overall mean of 1.59% (± 1.16 SD) (summary: Min. 0.69, 1st Qu. 0.69, Median 1.37, Mean 1.59, 3rd Qu. 2.05, Max. 7.03).
Total workflow complexity decreased exponentially from the beginning to the end of the workflow (Figure 4). That is, the decrease in complexity was steeper at the beginning of the workflow than at the end, where complexities differed far less between positions. Of the three models relating the sum of relative component complexities to workflow position, the one including position as a logarithm (AIC = 64.69) was preferred over the one including a linear and a quadratic term for position (AIC = 65.67, delta AIC = 0.71) and the one including position as a linear term only (AIC = 72.26, delta AIC = 7.3).
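In R, such a comparison boils down to a few lines. The sketch below uses made-up per-position complexities (the real values sit behind Figure 4); since positions start at 0, we assume a shift of one inside the logarithm:

```r
# Made-up per-position sums of relative component complexity.
wf <- data.frame(position   = 0:13,
                 complexity = c(31, 20, 14, 10, 8, 7, 6, 5, 5, 4, 4, 3, 3, 3))

# Positions start at 0, so shift by one before taking the logarithm.
m_log  <- lm(complexity ~ log(position + 1), data = wf)
m_quad <- lm(complexity ~ position + I(position^2), data = wf)
m_lin  <- lm(complexity ~ position, data = wf)

AIC(m_log, m_quad, m_lin)  # lower is better; the log model won in our case
```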

Figure 4. Relative workflow complexity along workflow positions could be best described by an exponential model including position as a logarithm (R-squared = 0.9, F-statistic: 29.16 on 3 and 10 DF, p-value < 0.001 ***).
This figure shows the model back transformed to the original workflow positions. The gray shading displays the standard error.
At the same time, relative complexity increased over the course of the analysis (Figure 5), since our model with an intercept and a linear term for the workflow position (AIC = 325.85) was preferred. However, as we argue later, this increase was mainly due to a group of workflow components of extreme simplicity at the very beginning of the workflow, visible in the bottom left of Figure 5. These “data type transformation” components convert textual columns into numeric columns. This was necessary because some columns intermixed categorical and numerical values. As we outline later, we took this as an opportunity to program a feature for our BEFdata data management portal that splits columns mixing text and numbers into separate numeric and categorical columns for the EML output.
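Such a split can be illustrated in a few lines of R. This is a sketch of the idea, not the actual BEFdata implementation; the example values are made up:

```r
# A column mixing numbers with textual annotations, as typical for
# hand-crafted field data.
raw <- c("12.4", "3.1", "below detection limit", "7.9", "sample lost")

# Numeric part: textual entries become NA (coercion warning suppressed).
values <- suppressWarnings(as.numeric(raw))

# Categorical part: keep the text only where no number could be read.
categories <- ifelse(is.na(values), raw, NA)

data.frame(values, categories)
```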

Figure 5. Relative component complexities along the workflow of the carbon analysis.
The points are slightly jittered to handle overplotting. Although workflow component complexity slightly increases towards the end of the workflow, this increase is largely due to the many components of low complexity at the beginning of the workflow (lower left corner). R-squared: 0.09, F-statistic: 6.57 on 1 and 61 DF, p-value: 0.01285.
We could group workflow components according to their assigned identities using text mining. The non-metric multidimensional scaling had a stress value of 0.17 using two main axes of variation. Several of the parameters, including specific R commands, were correlated with the axis scores (Table 2). Our defined identities could be significantly separated in the parameter space (r2 0.58, p-value 0.001): the first axis spans between the tasks “data aggregation” and “modify a vector”, while the second spans between the tasks “data extraction” and “data type transformation” (Figure 6).
Table 2. Characteristics of workflow components used to assess variation or similarity between components by means of non metric multidimensional scaling (NMDS).
Characteristics include lines of code, use of packages, as well as specific commands (see text for further detail). Figure 6 plots the first two axes of the NMDS. r2, Pr(>r), and sig. describe the R-squared, probability, and significance level of a regression with the characteristic as dependent and the NMDS scores of both axes (NMDS1, NMDS2) as independent variables. For example, “count of codelines” separates workflow components in the NMDS plot such that components with more code lines are plotted in the upper left quadrant of Figure 6. Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1. P-values are based on 999 permutations.
Characteristics | NMDS1 | NMDS2 | r2 | Pr(>r) | sig. |
---|---|---|---|---|---|
abline | -0.595118 | 0.803639 | 0.0879 | 0.024 | * |
as.numeric | 0.229379 | -0.973337 | 0.3536 | 0.001 | *** |
attach | -0.485526 | 0.874222 | 0.0535 | 0.113 | |
data.frame | -0.902072 | -0.431586 | 0.8074 | 0.001 | *** |
ddply | -0.802096 | -0.597195 | 0.3456 | 0.001 | *** |
detach | -0.485526 | 0.874222 | 0.0535 | 0.113 | |
grep | -0.211367 | 0.977407 | 0.2885 | 0.001 | *** |
ifelse | -0.663695 | 0.748004 | 0.3222 | 0.001 | *** |
is.na | -0.759568 | 0.650428 | 0.2800 | 0.001 | *** |
length | -0.601172 | 0.799120 | 0.0526 | 0.182 | |
lm | -0.199313 | 0.979936 | 0.0578 | 0.157 | |
match | 0.445922 | 0.895072 | 0.0057 | 0.849 | |
mean | -0.994320 | 0.106435 | 0.1346 | 0.016 | * |
none | 0.977510 | 0.210891 | 0.5095 | 0.001 | *** |
plot | -0.689676 | 0.724118 | 0.0478 | 0.254 | |
predict | -0.595118 | 0.803639 | 0.0879 | 0.024 | * |
sort | -0.417914 | -0.908487 | 0.0071 | 0.954 | |
strsplit | 0.016877 | 0.999858 | 0.0633 | 0.059 | . |
subset | -0.788938 | -0.614472 | 0.0212 | 0.821 | |
sum | -0.658793 | -0.752324 | 0.1278 | 0.004 | ** |
summary | 0.022488 | -0.999747 | 0.0043 | 0.969 | |
unique | -0.485526 | 0.874222 | 0.0535 | 0.113 | |
unlist | 0.016877 | 0.999858 | 0.0633 | 0.059 | . |
vector | -0.269756 | 0.962929 | 0.4583 | 0.001 | *** |
which | -0.421341 | 0.906902 | 0.4582 | 0.001 | *** |
write.csv | 0.022488 | -0.999747 | 0.0043 | 0.969 | |
count of R functions | -0.796351 | 0.604834 | 0.4893 | 0.001 | *** |
count of codelines | -0.530470 | 0.847704 | 0.5392 | 0.001 | *** |
domain count | 0.920778 | -0.390087 | 0.0142 | 0.704 | |
count packages per component | -0.781994 | -0.623286 | 0.4004 | 0.001 | *** |
Table 3. The workflow positions listed along with the unique component tasks they contain and the count of components per position.
Position | Tasks | Component count |
---|---|---|
0 | data source | 8 |
1 | data type transformation, data extraction, create new vector | 20 |
2 | merge data, data imputation, modify a vector, create new vector, data aggregation, sort data | 10 |
3 | create new factor, merge data, data aggregation | 7 |
4 | merge data, create new vector, data imputation, modify a vector | 7 |
5 | create new vector, merge data, modify a vector | 5 |
6 | merge data, data imputation, modify a vector | 3 |
7 | data imputation, create new vector, data aggregation | 3 |
8 | create new vector | 2 |
9 | merge data | 1 |
10 | modify a vector | 1 |
11 | data aggregation | 1 |
12 | modify a vector | 1 |
13 | create new vector | 1 |
14 | data modeling | 1 |

Figure 6. Workflow components (points) in reduced component characteristics space (Table 2).
We used non-metric multidimensional scaling (NMDS, see text for further detail) to reduce the parameter space to two axes. Table 2 lists the regression results of the axis scores on the component characteristics, which are plotted in smaller text here. Table 1 lists the a priori tasks, which are plotted as large labels here. Points are jittered by a factor of 0.2 horizontally and vertically to handle overplotting.
Our workflow identities had similar complexity, with one exception: the task “data type transformation” was less complex (Kruskal-Wallis chi-squared = 41.97, df = 9, p < 0.001) than the tasks “create new vector”, “data aggregation”, “data imputation” and “merge data” (Figure 7). Again, “data type transformation” was only used at the beginning of the workflow to transform columns mixing numbers and text into numeric columns.
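In R, this comparison amounts to a global rank-based test followed by pairwise post-hoc tests. A minimal sketch with made-up complexities for three of the identities:

```r
set.seed(1)  # made-up relative complexities, for illustration only
comps <- data.frame(
  task = rep(c("data type transformation", "merge data", "data imputation"),
             each = 8),
  complexity = c(runif(8, 0.6, 1.0), runif(8, 1.0, 3.0), runif(8, 1.5, 7.0))
)

# Global test for complexity differences between identities ...
kruskal.test(complexity ~ task, data = comps)

# ... followed by pairwise post-hoc Wilcoxon tests between identities.
pairwise.wilcox.test(comps$complexity, comps$task)
```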

Figure 7. The median, 25% and 75% quantiles of the relative component complexities for the component tasks.
Letters refer to: a = create new factor, b = create new vector, c = data aggregation, d = data extraction, e = data imputation, f = data modeling, g = data type transformation, h = merge data, i = modify a vector, j = sort data. The small dots are the relative complexities, the diamonds the means. The whiskers extend to the 25% quantile - 1.5 * IQR and the 75% quantile + 1.5 * IQR; big black circles are outliers. Significance: * = 0.05, ** = 0.001.
Data usage in relation to data sources was higher in smaller data sources. “Wide” data sources, those consisting of many columns, contributed less to the analysis than “smaller” data sources with fewer columns. While on average 37.4% of the columns in the data sources were used, a linear regression showed that the number of columns not used increased with the total number of columns available per data source (Figure 8).

Figure 8. The linear regression shows the dependency between the total column count of the datasets and the count of unused columns.
The gray shaded area represents the standard error. Linear model with total columns as predictor for unused columns: R-squared: 0.925, F-statistic: 74.02 on 1 and 6 DF, p-value: 0.0001.
At the same time, data usage in relation to the workflow was similar for all data sources. Data column usage within the workflow ranged between a minimum of 1 and a maximum of 16 with an overall mean of 6.38 (± 4.25 SD). Although usage differed between datasets (Kruskal-Wallis, chi-squared = 18.05, df = 7, p-value = 0.012), a post-hoc group-wise comparison (Wilcoxon test) could not locate the differences. The data column quality, the number of processing steps needed to transform data for the analysis, was also similar for all data sources. It ranged between a minimum of 1 and a maximum of 10 with an overall mean of 3.54 (± 2.44 SD). There were no differences in data column quality between data sources (Kruskal-Wallis, chi-squared = 10.9, df = 7, p-value = 0.14).
Discussion
Our paper aims to draw attention to the fact that simplifying datasets alone goes a long way in reducing workflow complexity. We provide illustrations of the amount of effort needed to merge datasets at the beginning of workflows and argue for feedback mechanisms that inform data providers about the complexity of their datasets. We further show that specific workflow tasks can be identified using text mining, which could be used for social sharing mechanisms in workflow or scriptlet generation.
The example workflow that we use to illustrate our points shows that components dealing with the cleaning, imputation, aggregation, and merging of data contribute the most to the complexity of the workflow. Similarly, Garijo et al.8 found that these steps can make up as much as 70% of a whole scientific workflow. In our example the complexity decreases exponentially along the positions in the workflow, and thus with the ongoing preparation of data towards the actual analysis (Figure 4). A simplification of the underlying data could significantly reduce the number of steps needed to prepare the data for analysis and thus the overall complexity of the workflow. Here we argue that the simplicity of data could be fostered by feedback mechanisms that inform data providers about the usage and quality of their datasets19. While such a mechanism could help to improve data that is already available, the information could also be used to develop guidelines for good-quality datasets in terms of the structure and constitution of columns. Additionally, the information could guide the development of data management tools that assist researchers or data curators in creating good-quality data. Furthermore, the information could be employed in tracking down the ownership of data products, which remains an unsolved problem19 and a major concern in sharing data20,21.
In our workflow, “wide” datasets consisting of many columns contribute less to the analysis than smaller datasets with fewer columns (Figure 8). The more data columns a dataset has, the more difficult it is to understand and describe. The high number of columns in datasets, however, results from the effort to provide comprehensive information about a study in one single file, often covering different experimental designs and methodologies. These datasets typically result from copying field notes that relate to the same research objects but combine information from different experiments. For example, a field campaign estimating the amount of woody debris on a study site might count the number and size of the branches found. At the same time, as one is already in the field, other branches might be measured to derive general rules for branch allometries. Thus, the same sheet of paper is used for two different purposes. While this approach is efficient in terms of time and field work effort, it leads to highly complicated datasets. Separating such a dataset into two, one for the dead matter and one for the branch allometries, would decrease the number of columns and increase the focus, and thus the value, of the datasets for our specific analysis.
We show that workflow tasks can be identified using text mining techniques (Figure 6). The identification of common and recurring tasks in workflows may serve several purposes. First, the mechanism could be employed to automatically detect and tag workflow components, helping progress towards a semantically enhanced workflow environment. Second, it could identify bottlenecks in a workflow, i.e. components that need improvement in terms of simplification. Furthermore, the mechanism could guide the development of a semantic framework that assists researchers in creating workflows by steering the exploration of useful and compatible components for a given analysis. In our case, we could identify one task of very low complexity (data type transformation), which constituted the second axis of our NMDS analysis (Figure 6). This task converts text vectors into numeric vectors. Text vectors that could actually be interpreted as numbers stem from a “weakness” of the EML standard15 in describing the previously mentioned, highly complicated field data: it does not allow a per-column definition of categorical values for numeric columns. However, it is very common that scientists annotate missing values, or values below or above a measurement uncertainty threshold, with text.
In a larger context, we are witnessing a growing loss of data22, mostly related to illegible and highly complicated datasets and to missing metadata. This is especially true for the small and heterogeneous data constituting the long tail of big data9. The concern of losing valuable data has led to the development of several data management tools like DataUP or BEFdata10, which help annotate data in Excel and use Excel sheets as the exchange format with the database. This is an important step, since many researchers are not well trained in data management and mainly use Excel for data management and storage. At the same time, portals are emerging that allow the publication of datasets, with the Ecological Archives (http://www.hindawi.com/dpis/ecology/), Figshare (http://figshare.com/), and F1000Research being only a few of the alternatives. Data journals are emerging (e.g. Scientific Data, http://www.nature.com/sdata/, and Dataset Papers in Science, http://www.hindawi.com/journals/dpis/) that try to provide impact for data, giving researchers credit for all of their work and not only for publications23. Workflows are shared via online platforms like “myExperiment”12, which to date has approximately 7500 members presenting about 2500 workflows. However, these platforms offer only a simple rating mechanism, which does not capture the complexity or quality of a workflow and its components.
Here we exemplify how to quantify the complexity as well as the quality and usage of data in scientific workflows, using simple qualitative and quantitative measures. Our measures are not meant to be exhaustive; rather, they could serve as a starting point for a discussion towards more sophisticated complexity feedback mechanisms for data providers and workflow creators. Our example workflow strongly relies on the Kepler interface component connecting to the R statistical environment for data manipulation and analysis, so the measures we provide are adapted to that specific workflow situation. However, adapting them to other components that interface with other programming languages should be straightforward. Further complexity attributes could include the variable types of workflow components or a ratio capturing the enrichment or reduction of the data consumed by a component. Providing complexity measures at the level of workflow components might help in adapting workflows towards better readability and thus improve their value for reuse. Additionally, such measures can guide the restructuring and simplification of data for better use in workflows, understandability, and reuse.
Summary
Providing feedback to researchers about the complexity of their data may be an important step towards improving the quality of scientific data. We show that simple text-based measures can be helpful in quantifying the complexity of data and workflows. Offering complexity measures can help to identify complicated components in workflows that need improvement. Measures of data usage can also help to better propagate the ownership of derived data products, as they allow the contribution of each dataset to be tracked. Identifying common and recurring tasks in workflows could facilitate building up libraries of standard components and semantic frameworks that guide scientists in the process of workflow creation.
Data availability
figshare: Data used to quantify the complexity of the workflow on biodiversity-ecosystem functioning, http://dx.doi.org/10.6084/m9.figshare.1008319
Author contributions
C.T.P., K.N., S.R., C.W. and H.B. substantially contributed to the work including the conceptualization, the acquisition and analysis of data as well as the critical revision of the draft towards the final manuscript.
Competing interests
No competing interests were disclosed.
Grant information
The data was collected by 7 independent projects of the biodiversity - ecosystem functioning - China (BEF-China) research group funded by the German Research Foundation (DFG, FOR 891).
Acknowledgements
Thanks to all the data owners from the BEF-China experiment who contributed their data to make this analysis possible. Not all of the research data is publicly available yet, but it will be in the future. The datasets are linked, and the ones publicly available are marked accordingly and can be downloaded using the following links. By dataset these are: Wood density of tree species in the Comparative Study Plots (CSPs): David Eichenberg, Martin Böhnke, Helge Bruelheide. Tree size in the CSPs in 2008 and 2009: Bernhard Schmid, Martin Baruffol. Biomass of herb layer plants in the CSPs, separated into functional groups (public): Alexandra Erfmeier, Sabine Both. Gravimetric water content of the mineral soil in the CSPs: Stefan Trogisch, Michael Scherer-Lorenzen. Coarse woody debris (CWD): collection of data on dead wood with special regard to snow break (public): Goddert von Oheimb, Karin Nadrowski, Christian Wirth. CSP information to be shared with all BEF-China scientists: Helge Bruelheide, Karin Nadrowski. CNS and pH analyses of soil depth increments of 27 Comparative Study Plots: Peter Kühn, Thomas Scholten, Christian Geißler.
References
- 1. Michener WK, Jones MB: Ecoinformatics: supporting ecology as a data-intensive science. Trends Ecol Evol. 2012; 27(2): 85–93.
- 2. Altintas I, Berkley C, Jaeger E, et al.: Kepler: an extensible system for design and execution of scientific workflows. Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004; 423–424.
- 3. Deelman E, Singh G, Su MH, et al.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming. 2005; 13(3): 219–237.
- 4. Gries C, Porter JH: Moving from custom scripts with extensive instructions to a workflow system: use of the Kepler workflow engine in environmental information management. In: Jones MB, Gries C (eds), Environmental Information Management Conference 2011. Santa Barbara, CA: University of California. 2011; 70–75.
- 5. Deelman E, Singh G, Su MH, et al.: Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming. 2005; 13(3): 219–237.
- 6. Oinn T, Greenwood M, Addis M, et al.: Taverna: lessons in creating a workflow environment for the life sciences. Concurr Comput. 2006; 18(10): 1067–1100.
- 7. Bowers S, Ludäscher B: Towards automatic generation of semantic types in scientific workflows. Web Information Systems Engineering WISE 2005 Workshops Proceedings. 2005; 3807: 207–216.
- 8. Garijo D, Alper P, Belhajjame K, et al.: Common motifs in scientific workflows: an empirical analysis. IEEE 8th International Conference on E-Science. 2012; 1–8.
- 9. Heidorn PB: Shedding light on the dark data in the long tail of science. Library Trends. 2008; 57(2): 280–299.
- 10. Nadrowski K, Ratcliffe S, Bönisch G, et al.: Harmonizing, annotating and sharing data in biodiversity-ecosystem functioning research. Methods Ecol Evol. 2013; 4(2): 201–205.
- 11. Parsons MA, Godoy O, LeDrew E, et al.: A conceptual framework for managing very diverse data for complex, interdisciplinary science. J Info Sci. 2011; 37(6): 555–569.
- 12. De Roure D, Goble C, Bhagat J, et al.: myExperiment: defining the social virtual research environment. IEEE Fourth International Conference on eScience (eScience '08). 2008; 182–189.
- 13. Gil Y, González-Calero PA, Kim J, et al.: A semantic framework for automatic generation of computational workflows using distributed data and component catalogues. J Exp Theor Artif Intell. 2011; 23(4): 389–467.
- 14. Bruelheide H: The role of tree and shrub diversity for production, erosion control, element cycling, and species conservation in Chinese subtropical forest ecosystems. 2010.
- 15. Fegraus EH, Andelman S, Jones MB, et al.: Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bull Ecol Soc Am. 2005; 86(3): 158–168.
- 16. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. 2008. ISBN 3-900051-07-0.
- 17. Burnham KP, Anderson DR: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer. 2002; 172.
- 18. Dixon P: VEGAN, a package of R functions for community ecology. J Veg Sci. 2003; 14(6): 927–930.
- 19. Ingwersen P, Chavan V: Indicators for the Data Usage Index (DUI): an incentive for publishing primary biodiversity data through global information infrastructure. BMC Bioinformatics. 2011; 12(Suppl 15): S3.
- 20. Cragin MH, Palmer CL, Carlson JR, et al.: Data sharing, small science and institutional repositories. Philos Trans A Math Phys Eng Sci. 2010; 368(1926): 4023–4038.
- 21. Huang X, Hawkins BA, Lei F, et al.: Willing or unwilling to share primary biodiversity data: results and implications of an international survey. Conserv Lett. 2012; 5(5): 399–406.
- 22. Nelson B: Data sharing: empty archives. Nature. 2009; 461(7261): 160–163.
- 23. Piwowar H: Altmetrics: value all research products. Nature. 2013; 493(7431): 159.