DataUp: A tool to help researchers describe and share tabular data

Scientific datasets have immeasurable value, but they lose their value over time without proper documentation, long-term storage, and easy discovery and access. Across disciplines as diverse as astronomy, demography, archeology, and ecology, large numbers of small heterogeneous datasets (i.e., the long tail of data) are especially at risk unless they are properly documented, saved, and shared. One unifying factor for many of these at-risk datasets is that they reside in spreadsheets. In response to this need, the California Digital Library (CDL) partnered with Microsoft Research Connections and the Gordon and Betty Moore Foundation to create the DataUp data management tool for Microsoft Excel. Many researchers creating these small, heterogeneous datasets use Excel at some point in their data collection and analysis workflow, so we were interested in developing a data management tool that fits easily into those work flows and minimizes the learning curve for researchers. The DataUp project began in August 2011. We first formally assessed the needs of researchers by conducting surveys and interviews of our target research groups: earth, environmental, and ecological scientists. We found that, on average, researchers had very poor data management practices, were not aware of data centers or metadata standards, and did not understand the benefits of data management or sharing. Based on our survey results, we composed a list of desirable components and requirements and solicited feedback from the community to prioritize potential features of the DataUp tool. These requirements were then relayed to the software developers, and DataUp was successfully launched in October 2012.


Introduction
The move towards digital data is ubiquitous across all domains in academic research and scholarship [1][2][3][4][5], and these data can be made available more easily and distributed more quickly than ever before. This is often called the data deluge, and is a phenomenon that has been examined in the traditional academic literature [2,4,6], as well as in several major media outlets [7-9].
Among the most pressing problems associated with the data deluge is good data management: how does one handle the huge volume of available information effectively and efficiently to solve important problems? Knowledge of good data management techniques and software development lags behind the progression of the data deluge. Consequently, although researchers of all fields are faced with huge volumes of data from increasingly diverse sources, they do not have the skills to handle their data sets. This challenge is amplified by the fact that research data are seldom shared, re-used, or preserved [10-12]. There is a growing awareness among practitioners and funders that this situation represents inefficient use of research dollars, missed opportunities to exploit prior investment, and a general loss for the scholarly community [13]. Michener et al. [14] described the loss of valuable data and insight about those datasets as "information entropy". This loss of information is becoming increasingly worrisome as data management practices improve very slowly, while the volume of data grows exponentially.
Recognizing that most earth, environmental, and ecological scientists use spreadsheets at some point in their data life cycle, the California Digital Library (CDL) partnered with Microsoft Research Connections and the Gordon and Betty Moore Foundation to create a tool that would encourage and enable good data stewardship practices for datasets created in Microsoft Excel. Our vision was to promote publishing, archiving, and sharing of tabular data among earth, environmental, oceanographic, and ecological scientists by creating a tool that will easily integrate into their current workflows and assist them in data management and preservation. This will, in turn, enable faster and more efficient research, thereby increasing the pace of scientific advancement.
Others have worked towards creating tools to help researchers conform to best practices and archive their data. The OpenRefine (formerly Google Refine) project is one such example (http://openrefine.org). This tool seeks to help researchers work with "messy" tabular data, and is free and open to anyone. However, it does not link directly to repositories and therefore addresses only some of the features we planned for DataUp. Another related tool for working with spreadsheets is RightField, an open-source tool for adding ontology term selection to Excel spreadsheets (http://www.rightfield.org.uk). RightField allows researchers to access controlled vocabularies, which results in better quality metadata. Like OpenRefine, however, RightField does not have capabilities for archiving research data.
To optimize the tool, we first identified the needs of the community via surveys of researchers. We found that, on average, researchers had poor data management practices, were not aware of data centers or metadata standards, and did not understand the benefits of data management or sharing. We used the survey results to compose a list of desirable components and solicited feedback from the community to prioritize potential features.
The resulting DataUp tool facilitates documenting, managing, archiving, and sharing tabular scientific data. It comes in two forms, both open-source: an add-in for Excel and a web-based application. The add-in operates within the well-known program Microsoft Excel; the web application allows users to upload tabular data to the web-based tool in either Excel (.xlsx) or comma-separated value (.csv) format. Both the add-in and the web application provide users with the ability to (1) perform a "best practices check" to ensure the data are CSV-compatible; (2) create standardized metadata, or a description of the data, using a wizard-style template; (3) retrieve a unique identifier for their dataset from their chosen data repository; and (4) post their datasets and associated metadata to the repository.

Amendments from Version 1
We have added a paragraph to the introduction describing spreadsheet tools that are similar to DataUp in some of their functionality, and have added a note outlining recent developments with DataUp. We have also corrected two minor typographical errors that reviewers noticed.

Methods and results
The extent to which researchers use Microsoft Excel is not fully documented; however, based on strong anecdotal evidence, we assumed that it is a standard tool for most scientists. Given this assumption, we determined that an add-in for Excel would have the greatest potential impact on how scientists work with data. An add-in (also called a plug-in) is a small piece of software that one installs on a local computer. Once installed, it extends the capabilities of an existing program: in this case, Excel. The add-in's functionality is available from within the program, and in the case of Excel, appears as a "ribbon" of functions and features within the standard user interface. In this way, we assumed that researchers would be more likely to use the tool since it is fully integrated with a program they are already using.
Our target audience for creating the tool was scientists and researchers actively working with earth, environmental, oceanographic, and ecological data. These researcher groups were chosen based on their relatively low participation in data sharing [15] and their presumed high levels of Excel use. To capture their data management needs, we surveyed and interviewed more than 130 researchers over the course of five months (August to December, 2011). We also collected suggestions for requirements from academic libraries, data centers, data managers, and other data professionals, although this collection was less structured and more anecdotal. Most of these exchanges took place through the DataONE project community [16]; a full list of partners affiliated with the DataONE project is available on their website (http://dataone.org).

Researcher surveys and interviews
We used several methods to communicate with our potential stakeholder community in developing the tool. These included the DCXL blog (now the Data Pub Blog, located at datapub.cdlib.org), two Twitter accounts (@dataupcdl and @carlystrasser), and interviews and conversations at conferences, webinars, and professional meetings.
Our goal in surveying and interviewing researchers was to determine how they were currently handling data management, especially as it related to Excel data, and how best the tool we were developing might help improve researcher practices surrounding data. The questions we asked underwent revision to improve the survey instrument, and to that end we used four similar versions of the survey over the course of data collection. The number of respondents for each survey version was 43, 12, 47, and 10 respectively, for a total of 112 respondents. The four versions of the survey can be viewed in the associated datasets. Interview questions were less structured and varied depending on the interviewee.
We attended four professional meetings and surveyed researchers of various statuses (i.e., from student to senior researcher) and from many different institutions and organizations (Table 1). We also conducted surveys and in-depth interviews with researchers at four campuses in the University

Survey results
Demographically, the survey pool was composed of researchers and scientists ranging from undergraduate-level to PhD-level (Figure 1).
We asked researchers about their choice of operating system because of the potential implications for development of the tool. Of those surveyed, the large majority (74%) used a Windows-based operating system, while 23% used a Mac-based system and 2% used Linux (Figure 2).
We asked a series of questions related to how the researchers were using Excel for their day-to-day work. We found that 80% of those surveyed answered that they used Excel "every day" or "almost every day" (Figure 3). When asked what data-related tasks they were undertaking when using Excel, we found that most were at least using Excel to organize their data (96%). Excel was also used by the majority of participants for visualizing data (61%), performing minor calculations (75%), and for sharing data with colleagues in Excel format (81%) (Figure 4).
To better understand the content of researchers' spreadsheets, we asked whether the following Excel features were used in their datasets (Figure 5).
• multiple tables on a single spreadsheet
• cell shading to indicate information about the data (i.e., ad-hoc metadata)
Most researchers created header rows (97%), used embedded formulas (83%), and used cell shading as a form of ad-hoc metadata (74%). Of those we surveyed, the majority (74%) reported that they had a "better than average" knowledge of the Excel software, while 24% reported an average knowledge (n = 105).
We asked researchers to identify other software programs that they use alongside Excel for their data analysis and organization. Note that these results are likely heavily influenced by the venues used to interview researchers, since software programs tend to be used by many researchers in a given discipline (Figure 6). The open-source statistical software R was most often cited (53%).

[Figure 3. Frequency of Excel use among survey respondents: rarely or not often, 12%; often, 8%; every day or almost every day, 80%.]

Other information gathered via the survey included areas of work (i.e., field versus lab; area of focus; discipline), attitudes about data sharing, and knowledge of data repositories. Because most of these questions were not asked formally in the survey, we cannot report the results with confidence.

Requirements
Although the practices reported by researchers are common and accepted uses of Excel, they are not necessarily well suited for long-term preservation of high-quality data. This has been previously reported in the literature [17][18][19][20]. In addition, the European Spreadsheet Risk Interest Group has created a curated list of stories detailing instances where spreadsheets are implicated in erroneous reporting (http://eusprig.org). In general, issues associated with using Excel for generating curation-ready datasets are (1) poor data table construction (e.g. multiple data tables on a single spreadsheet); (2) a lack of metadata or poorly standardized metadata (e.g. using comments, notes, color-coding, and shading to document important details about the dataset); (3) embedded figures, charts, and comments that make the spreadsheet less usable in programs outside Excel; and (4) poor provenance of how data are produced via calculations, statistics, and formulas.
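To make issue (1) concrete, the following is a minimal sketch of how multiple tables on a single sheet might be detected. It assumes the sheet has already been read into a list of rows and treats a fully blank row as a separator between tables. The function names and the sample data are hypothetical, and this is not a description of how DataUp itself performs its checks.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[str]


def is_blank(row: Sequence[Cell]) -> bool:
    """A row is blank when every cell is None or whitespace-only."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def find_table_blocks(rows: Sequence[Sequence[Cell]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) index pairs for each run of non-blank rows."""
    blocks: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, row in enumerate(rows):
        if is_blank(row):
            # A blank row closes the block that is currently open, if any.
            if start is not None:
                blocks.append((start, i - 1))
                start = None
        elif start is None:
            # First non-blank row after a separator opens a new block.
            start = i
    if start is not None:
        blocks.append((start, len(rows) - 1))
    return blocks


# Hypothetical sheet: two unrelated tables separated by one blank row.
sheet = [
    ["site", "date", "temp_c"],
    ["A", "2011-08-01", "14.2"],
    ["B", "2011-08-01", "15.0"],
    [None, None, None],
    ["species", "count", None],
    ["Mytilus", "12", None],
]

blocks = find_table_blocks(sheet)
if len(blocks) > 1:
    print(f"Warning: {len(blocks)} separate tables found at rows {blocks}")
```

A check of this kind only flags tables stacked vertically; tables placed side by side would need the same scan applied to columns.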
Based on the information collected from researchers and other stakeholders, we created the following high-level requirements for the tool:
1. Check data file for .csv compatibility and create a .csv version of the data file. The user can generate and download a customized report detailing elements in their dataset that might cause problems for data archiving and/or export of the data file as a .csv version.
2. Generate metadata that is linked to the data file. Using the DataUp tool, machine- and human-readable metadata is generated, embedded in the data file, and can be exported as a separate file. The metadata is displayed in a new tab on the spreadsheet, can be saved separately, and relies on Ecological Metadata Language (EML) and the DataONE metadata schema (http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html). Both file-level and parameter-level metadata are created by the tool.
• File-level metadata is information about the entire dataset, such as the creator, temporal and spatial details of the data collection, and the funders of the project. The tool is able to pre-populate some fields based on user information provided by Excel. Keywords can be selected from standard lists.
• Parameter metadata describes individual elements of the data file, and most commonly corresponds to the header row of a tabular dataset. The user can identify a header row to begin the process of creating parameter metadata.
3. Generate a citation for the data file. Using the tool, the user can generate a complete data citation for their tabular dataset. This includes all the metadata necessary for citing the dataset, is in a standard format, and becomes part of the metadata. The citation can be downloaded in standard formats (e.g. .ris, .bib, .xml).
4. Repository authentication set-up. The user can authenticate with their chosen repository from within the tool, assuming they have pre-existing login information for that repository. This will then allow them to deposit their dataset in the repository via the tool.

5. Link an identifier to the data file. The tool allows the user to retrieve and save a persistent identifier (such as a DOI) for their dataset from their chosen repository.
6. Ensure that the data file is ready for deposition into a repository. The tool determines whether the data file is ready for deposit into the designated archive by checking for the following:
• Determine whether a compatibility check has been completed.
• Determine whether metadata is complete (i.e., all required metadata are present).
• Determine whether a citation has been generated.
The tool then generates the technical metadata needed by the designated repository.
7. Submit the data file for deposition into the designated repository.
8. Ensure compatibility for Excel users without the add-in: users without the add-in locally installed are able to open the data file and access the metadata.
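The readiness check in requirement 6 can be sketched as three independent tests on the state of a data file. The sketch below is illustrative only: the `DataFile` class, its field names, and the `REQUIRED_METADATA` list are hypothetical, and the actual set of required metadata would be defined by the tool and the designated repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical required fields, chosen only for illustration.
REQUIRED_METADATA = ("title", "creator", "abstract")


@dataclass
class DataFile:
    name: str
    compatibility_checked: bool = False
    metadata: Dict[str, str] = field(default_factory=dict)
    citation: str = ""


def readiness_problems(data_file: DataFile) -> List[str]:
    """Return a list of reasons the file is not ready; empty means ready."""
    problems: List[str] = []
    # Check 1: has the compatibility check been completed?
    if not data_file.compatibility_checked:
        problems.append("compatibility check not completed")
    # Check 2: are all required metadata fields present and non-empty?
    missing = [
        key for key in REQUIRED_METADATA
        if not data_file.metadata.get(key, "").strip()
    ]
    if missing:
        problems.append("missing required metadata: " + ", ".join(missing))
    # Check 3: has a citation been generated?
    if not data_file.citation.strip():
        problems.append("citation not generated")
    return problems


draft = DataFile(
    name="tidepool_counts.csv",
    compatibility_checked=True,
    metadata={"title": "Tidepool counts", "creator": "A. Researcher"},
)
for problem in readiness_problems(draft):
    print("Not ready:", problem)
```

Returning every outstanding problem at once, rather than stopping at the first, lets a user fix all of them before attempting a deposit.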
These requirements were posted on the DataUp blog, with requests for feedback from the community. We then passed on the document to the Microsoft Research team, who generated a second version of the requirements based on their knowledge of Excel and their protocols for software development. These requirements were then relayed to the developers (contractors for Microsoft Research).

Add-in versus web-based application
In the course of development, questions arose from the project team as to whether an Excel add-in was the most appropriate choice for delivering the tool to researchers; the alternative discussed was a web-based application. Concerns were that an add-in had compatibility issues that required updates on the developer's part and downloads on the user's part. In addition, the project timeline dictated that the add-in could be built only for Windows platforms; Macintosh systems would not be able to use the tool. This is not true for a web-based application. See Table 2 for a summary of the differences between the two potential versions of the tool: an add-in and a web application.
In early 2012, we launched a campaign to determine which of the two versions of the tool should be created. Input was received from attendees of the Ocean Sciences 2012 Meeting in Salt Lake City, Utah. We also asked researchers and others via online surveys and blog posts which they would prefer, and what barriers they perceived to each version of the tool. We collected results from approximately 200 individuals. Most (95%) were willing to download an add-in, and most (83%) indicated that they would prefer an add-in to a web application (assuming the add-in were available for Mac as well). However, 72% reported that there were barriers to their downloading and/or installing an add-in for Excel. Barriers mentioned included version compatibility issues, security concerns (e.g., viruses), lack of Mac compatibility, and a lack of administrative controls over computers, preventing downloads. The full set of survey responses is available in the associated datasets.
Given these contradictory results, we determined that there was a need for both versions of the tool. We therefore proceeded with the development of both an add-in for Excel and a web-based application. The requirements were the same for both versions; only the delivery of the functionality differed between the two.
Of those surveyed, 75% used a Windows operating system, compared to 22% using a Mac, and 3% using some other system (e.g. Linux). These results paralleled those from our general researcher survey (Figure 2).

The DataUp tool
The tool created based on our requirements and user feedback is called DataUp. DataUp is free and open source, and has two forms: a web-based application (web app http://dataup.org) and a downloadable Excel add-in. Both versions of the tool provide users with the ability to (1) perform a "best practices check" to ensure that data are well formatted and organized; (2) create standardized metadata (i.e., a scientifically-meaningful description of the data), using a wizard-style template; (3) retrieve a unique identifier for their dataset from their chosen data repository; and (4) upload datasets and associated metadata to a public data repository.
Best practices check. The tool determines whether the data file has any of 11 potential issues that do not comply with data management best practices, such as embedded charts, comments, and color-coded cells. These issues were chosen based on interviews with researchers, as well as data managers who often "clean up" spreadsheets submitted by researchers for archiving. In addition to identifying the locations of these problems, DataUp informs the user why they are potentially problematic, and offers suggested alternatives or the ability to remove them in bulk. The information provided by the DataUp tool for each of these potential issues is below:   (Table 3). We choose to support only a subset of EML in order to provide the lowest barrier to entry for researchers interested in documenting their datasets.
Obtain an identifier. Valuing and incentivizing the time and effort required to manage data well is an important factor in fostering data sharing and reuse. One way to allow data producers to get credit for this is through data citation. The DataUp tool connects to the user's chosen repository to retrieve a unique identifier for the researcher's dataset. For its first iteration, DataUp connects to the EZID service (http://n2t.net/ezid), based at CDL, which is used by the public DataUp ONEShare repository. The identifier generated is an ARK (Archival Resource Key, https://confluence.ucop.edu/display/Curation/ARK). ARKs provide stable, opaque, versatile, and transcription-safe identifiers. This identifier is saved in the data file's metadata.
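As a rough illustration of what such an identifier looks like once it is stored in metadata, the sketch below splits an ARK string into its Name Assigning Authority Number (NAAN) and assigned name. It assumes the common `ark:/NAAN/name` form with a digits-only NAAN; the pattern is a simplification rather than a full implementation of the ARK specification, and the identifier used in the example is made up.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: "ark:" prefix, optional slash, a NAAN of five or more
# digits, then a non-empty name. This is an assumption for illustration and
# does not cover every form the ARK specification allows.
ARK_PATTERN = re.compile(
    r"^ark:/?(?P<naan>\d{5,})/(?P<name>[^\s/?#][^\s?#]*)$"
)


class Ark(NamedTuple):
    naan: str  # Name Assigning Authority Number
    name: str  # name assigned by that authority, with any qualifiers


def parse_ark(identifier: str) -> Optional[Ark]:
    """Return the parts of an ARK, or None if the string is not ARK-like."""
    match = ARK_PATTERN.match(identifier.strip())
    if match is None:
        return None
    return Ark(match.group("naan"), match.group("name"))


# Made-up identifier used only to exercise the parser.
example = parse_ark("ark:/12345/x9example")
if example is not None:
    print(f"NAAN={example.naan} name={example.name}")
```

Validating the shape of the identifier before writing it into the metadata guards against saving a truncated or mistyped value.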

Share and archive. Once metadata is created, the user can connect directly to a repository via DataUp and upload their data for archiving. Currently, DataUp is connected to ONEShare, which is a dedicated public DataUp repository to which anyone may deposit tabular data (more information below).

Architecture
DataUp's codebase is written in C# using the .NET application framework. The web app is deployed on Microsoft's Windows Azure cloud platform. DataUp's architecture (Figure 7) consists of two clients communicating via an intermediating web service to one or more repositories. The add-in client is an Excel extension that runs directly on a researcher's Windows-based computer.
The web app client runs as an online application hosted in Azure. Client/web service communication uses the OData protocol [21]. Both clients support standard EML metadata and draw functionality from a common web service, also hosted in Azure. That web service is managed by a separate administrative service.
DataUp was designed not only for standalone metadata checks, but also for contacting a variety of repositories to obtain persistent identifiers and to archive data. Currently, the only repository supported is ONEShare, an instance of the CDL Merritt repository that is also a DataONE Member Node (more information below). With the front-end running at CDL and a storage node back-end running at the University of New Mexico, content can be browsed either by logging in directly to Merritt as a guest or using the DataONE ONEMercury interface (http://dataone.org/onemercury).

Creation of the ONEShare repository
Although there are hundreds of data repositories available to researchers for data archiving, the majority of scientists are not aware of their existence or how to access them. One of the major outcomes of the DataUp project is the ONEShare repository, created specifically for the DataUp tool. ONEShare is a special instance of CDL's Merritt repository, which serves as a digital archive and access system for the University of California campuses (http://www.cdlib.org/uc3/merritt). Users can deposit their tabular data and metadata directly into the ONEShare repository from within the tool, allowing for seamless data archiving within the researcher's current workflow. The DataUp web service performs the repository submission using the Merritt API, hiding all details of the transfer protocol from the DataUp user. An added advantage of ONEShare is its connection to the DataONE network of repositories. DataONE links together existing data centers and enables its users to search for data across all participating repositories using a single search interface. Since Merritt is a member node on the DataONE network, all data deposited into ONEShare will be indexed and made discoverable by any DataONE user, facilitating collaboration and enabling data re-use.
The ONEShare repository is collaboratively supported by the CDL and the University of New Mexico Library. CDL's Merritt repository relies on a highly decentralized micro-services architecture [22].
In the case of ONEShare, a Merritt storage node was established on a University of New Mexico (UNM) virtual server managing a local file system. All DataUp submissions to Merritt are routed automatically to the UNM storage node, but the data are still subject to all Merritt preservation and access services such as ongoing fixity audits, metadata search and browse, and pro-active preservation analysis and planning. Merritt is also integrated as a member node on the DataONE network and the full set of descriptive metadata for all DataUp-submitted data is automatically harvested by the DataONE coordinating nodes for inclusion in the federated ONEMercury search interface, increasing the public visibility of DataUp datasets.

Beta testing/feedback
The first versions of the add-in and web application underwent beta testing by researchers, librarians, software engineers, and other stakeholders. Testers included professional contacts of the DataUp team, researchers who participated in the requirements-gathering survey and consented to future contact, and individuals responding to a blog post requesting subjects for beta testing. We received feedback from 23 testers via an online survey. We received additional comments via email and conversations with researchers. Information gathered from the beta testers was relayed to the developers who addressed those issues that were reasonable within the given time frame for software release. Data from beta testing is available from the associated datasets.

Formation of a community
One of the major goals of the DataUp project was to create an open-source tool that could be adopted and used by the larger community. To that end, we partnered with the non-profit Outercurve Foundation, whose goal is to enable code exchange and understand- [...] more than 10 countries (84% of visits from the US). These numbers do not, however, adequately represent the tool's popularity and potential. The CDL has received inquiries about DataUp from many repositories, organizations, and publishers interested in configuring the tool for their needs. The inquiries represent a range of stakeholders that are crucial to data sharing, including a large citizen science project, a major social science data archive, some high-profile data publication services, and others. They are excited about the possibilities that DataUp represents for linking researchers' workflows directly to repositories, with capabilities for generating metadata and performing best practices checks.

Future plans
The DataUp team received a one-year grant from the US National Science Foundation, supplemental to the DataONE project. Using these funds, the DataUp web application will undergo another iteration that will result in easier repository connections, better features, and a more streamlined workflow. The code resulting from this project will be open-source, and community ownership will be encouraged. The text of the NSF proposal is available from the University of California's eScholarship repository [23].
CDL envisions that the future of DataUp will be directed by the community of stakeholders. Interested developers can expand upon and increase the tool's functionality to meet the needs of a broad array of researchers. Code for both the add-in and web application is open source and participation in its improvement is strongly encouraged. Although the target audience for the tools that result from the DataUp project will be earth, environmental, oceanographic, and ecological scientists, we envisage that any tools developed will be easily implemented in other research communities, such as the social sciences.

Data and software availability

I am impressed with the simplicity of this tool, which attempts to solve issues in data description for a single type of data. This is much better than the 'workbench' approaches that try to do too much and end up failing.
The issue of errors in data conversion between formats is critical and is a known issue in the data archival world. This tool addresses some of these common issues that arise in both spreadsheet data description and conversion.
The paper presents some very useful information gathered from surveys and pilot work, but this is rather US-centric. I don't imagine the use cases of these type of scientists' behaviour in other countries are that different, but I would expect some pointers to this wider context.
The referenced literature is good and covers many of the key sources I would refer to. However my own organisation in the UK has been advising on data documentation, including use of Excel and conversion issues, for some years, so it would be good to cite some examples of other efforts to address these issues on the non-ecology field and offer examples of non US resources that provide extensive data management advice (http://ukdataservice.ac.uk/manage-data.aspx).
On page 6 the checklist of issues is very clear and useful and great to alert researchers to these issues upfront.
In terms of platforms for the tool, I think a Mac version will be important. In my experience, many data creators prefer to have the convenience of local tools to document data, rather than relying on web-based tools, that can suffer from browser issues and loss of data through poor connection.
I do believe that data preparation tools are best built into researchers' existing data handling software, as this brings the activities a step closer to data analysis and away from the burden of completing data deposit forms.
I love the idea that the source code has been made available and that, on the whole, the project has been carried out in the spirit of openness, despite using a Microsoft base for the tool.
I am also terribly impressed with the work done to convince Microsoft of the importance of this tool, and to secure codevelopment to enable it to be a plug-in. On this front, I have had some negative experience in lobbying software suppliers of qualitative analysis packages to implement a data exchange standard to enable export of within-system documentation; conversion between different market leaders' softwares is currently difficult, if not impossible. They should possibly take a leaf out of Microsoft's book and also listen to what data archivists/publishers are saying! The tool looks like it has had some user testing and feedback.
Overall I believe this tool could have much wider value than the purposes for which the team have developed it. By simply replacing the metadata standard in use it could easily be applied to other disciplines, e.g. social science data. I would be very keen to pilot it and offer feedback on our own tabular data collection in the social sciences domain. The social sciences use the Data Documentation Initiative (DDI) which has fields that map pretty close to the schema used in the tool and discussed in this paper.
I would advocate engagement with more data centres, possibly through forums like the Research Data Alliance.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

Carly Strasser
Reviewer text is italicized; our responses are below the reviewer text.
The paper presents some very useful information gathered from surveys and pilot work, but this is rather US-centric. I don't imagine the use cases of these type of scientists' behaviour in other countries are that different, but I would expect some pointers to this wider context.
In retrospect, it would have been valuable to collect use cases and feedback from non-US researchers. We were limited somewhat by our 12 month timeline and focused primarily on the communities that we could easily interview given our spatial and temporal constraints. Based on conversations with our international colleagues, it is our experience that researcher behaviors, concerns, and use cases are, across the board, fairly similar to what we found.

However my own organisation in the UK has been advising on data documentation, including use of Excel and conversion issues, for some years, so it would be good to cite some examples of other efforts to address these issues on the non-ecology field and offer examples of non US resources that provide extensive data management advice (http://ukdataservice.ac.uk/manage-data.aspx).
We recognize that many groups work in the area of best practices for data management. We did not, however, choose to focus on this aspect of community involvement since it has been largely covered by others. That said, we have added a paragraph referencing other projects that have similar goals to DataUp, one of which is based in the UK.
In terms of platforms for the tool, I think a Mac version will be important. In my experience, many data creators prefer to have the convenience of local tools to document data, rather than relying on web-based tools, that can suffer from browser issues and loss of data through poor connection.
We concur that a Mac version of the tool would be quite valuable. However, given our limited budget and abbreviated schedule, it was not possible for the first iteration of DataUp. Also, the fundamentally different architectures of the Mac and Windows versions of Excel meant that a Mac plug-in would have had to be created from scratch, with no opportunity for code re-use from the Windows version. We hope that interested members of the open source community will take on this task in the near future. We would be happy to support any such effort to the fullest extent possible.
Overall I believe this tool could have much wider value than the purposes for which the team have developed it. By simply replacing the metadata standard in use it could easily be applied to other disciplines, e.g. social science data. I would be very keen to pilot it and offer feedback on our own tabular data collection in the social sciences domain. The social sciences use the Data Documentation Initiative (DDI) which has fields that map pretty close to the schema used in the tool and discussed in this paper.
It's true that the value of the tool goes well beyond just our target audience for requirements development and feedback. Social scientists in particular have expressed interest in the DataUp tool as a potentially valuable addition to their toolkit. One of the advantages of the DataUp/Dash merger is that the Dash platform has a more open architectural design that will greatly facilitate the process of supporting alternative metadata schemas.
I would advocate engagement with more data centres, possibly through forums like the Research Data Alliance.
We also agree that engaging more data centres would be ideal. The RDA was not yet formed when this work was conducted, but moving forward we hope to take advantage of such coalitions to get more uptake of a tool like DataUp.

The paper describes the DataUp Excel-based metadata and data capturing tool developed as part of the DataONE project. I like the paper and I like the tool.
The paper is well written.
The need for spreadsheet-based data management tools is critical, and DataUp makes a valuable contribution.
The tool works, the software is available, is being used, and is useful.
The survey results and the requirements are a very useful guide for other workers using spreadsheets as a prime mechanism for data upload. The survey is well conducted given the constraints of such things, and it is refreshing to see that the people who do the data management (postdocs, postgrads) were targeted. This is a very useful contribution to the field.
There are, however, some improvements that I would like to see in the final article:
The metadata model seems to be very high level (Table 3). Is there a richer metadata model for specific data types? Is it possible to upload and index/search on more domain specific metadata captured in the DataUp model? To what extent does the metadata model work for all the data types you mention? As your user base is wide one would expect heterogeneity to be a big problem.
How are controlled vocabularies and or specific domain metadata models incorporated? The architecture figure 7 is a very general figure and can be replaced by something that showed the protocol of how DataUp is used in practice. Are Excel templates prepared by the DataUp team or through the Plug-in?
There is sparse information on uptake or the impact of uptake. The number of downloads are listed but not how many datasets were uploaded to the repository using DataUp.
Despite the excellent requirements survey and user engagement, there is no evaluation of DataUp's use. What is the difference in uptake between the Excel and web-based version?
There is no related work section. The only related work is RightField (reference 17) but what RightField does and how it relates to DataUp is not mentioned. Similar tools to DataUp such as ISAtool Suite and Ontomaton are not mentioned.
There are two typos: "sheet,s" should be "sheets" and "highprofile" should be "high profile".
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
No competing interests were disclosed.

Carly Strasser
Reviewer text is italicized; our responses are below the reviewer text.
The metadata model seems to be very high level (Table 3). Is there a richer metadata model for specific data types? Is it possible to upload and index/search on more domain specific metadata captured in the DataUp model? To what extent does the metadata model work for all the data types you mention? As your user base is wide one would expect heterogeneity to be a big problem.
Currently there is not a richer metadata model. Future development plans include the ability to expand metadata options and allow for generating metadata files that comply with different standards for various disciplines. We chose some elements of EML as our baseline for three reasons: (1) the metadata fields are fairly generic, so they can apply to many different disciplines; (2) EML is a flexible metadata language that, although originally constructed for Ecology, has the ability to describe a wide variety of datasets; and (3) EML is one of the metadata standards accepted by the DataONE network.
How are controlled vocabularies and or specific domain metadata models incorporated? The architecture figure 7 is a very general figure and can be replaced by something that showed the protocol of how DataUp is used in practice. Are Excel templates prepared by the DataUp team or through the Plug-in?
There are no controlled vocabularies incorporated into the tool, although this has been on our "wish list" for future development. Currently a user can specify a controlled vocabulary in the metadata; however, we have no connection or integration with vocabulary systems (and no current mechanism for that connection). Similarly, the metadata must be hard-coded and therefore we have no ability to "switch out" the metadata depending on domain-specific interests. There are no templates in use by the DataUp tool. Instead, there is a set of "rules" that the tool consults while parsing a user's spreadsheet. The tool checks for best practices and reports back; this is not in any way set by the DataUp team.
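To make the rule-based approach concrete, here is a minimal Python sketch of a checker that runs a list of rules over a table and reports the issues it finds. It is an illustration only: DataUp itself is written in C#/.NET, and the three rules, the function names (`check_empty_headers`, `check_duplicate_headers`, `check_blank_cells`, `run_checks`), and the message formats are all hypothetical, chosen as plausible spreadsheet best-practice checks rather than taken from the tool.

```python
from typing import Callable, List

# A table is a list of rows; the first row is the header and every cell is a string.
Table = List[List[str]]


def check_empty_headers(table: Table) -> List[str]:
    """Hypothetical rule: every column should have a non-blank header."""
    return [
        f"column {i + 1}: header is empty"
        for i, name in enumerate(table[0])
        if not name.strip()
    ]


def check_duplicate_headers(table: Table) -> List[str]:
    """Hypothetical rule: header names should be unique (case-insensitive)."""
    seen = set()
    issues = []
    for i, name in enumerate(table[0]):
        key = name.strip().lower()
        if key and key in seen:
            issues.append(f"column {i + 1}: duplicate header '{name}'")
        seen.add(key)
    return issues


def check_blank_cells(table: Table) -> List[str]:
    """Hypothetical rule: data cells should not be left blank."""
    return [
        f"row {r + 1}, column {c + 1}: blank cell"
        # r starts at 1 so that r + 1 matches the spreadsheet row number
        # (the header occupies row 1).
        for r, row in enumerate(table[1:], start=1)
        for c, cell in enumerate(row)
        if not cell.strip()
    ]


# The rule set is just a list of functions, so rules can be added or removed
# without touching the code that runs them.
RULES: List[Callable[[Table], List[str]]] = [
    check_empty_headers,
    check_duplicate_headers,
    check_blank_cells,
]


def run_checks(table: Table, rules: List[Callable[[Table], List[str]]] = RULES) -> List[str]:
    """Apply each rule in order and collect the reported issues."""
    issues: List[str] = []
    for rule in rules:
        issues.extend(rule(table))
    return issues
```

Calling `run_checks` on a table whose first row is the header returns a list of human-readable messages, which is the "reports back" step in this sketch. Because the rules are independent functions rather than a template, the user's spreadsheet layout is not constrained in advance.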
There is sparse information on uptake or the impact of uptake. The number of downloads are listed but not how many datasets were uploaded to the repository using DataUp. Despite the excellent requirements survey and user engagement, there is no evaluation of DataUp's use. What is the difference in uptake between the Excel and web-based version?
Unfortunately, we don't have access to this information, nor do we believe it is being collected by the service. We are limited in our ability to modify the code base to obtain these metrics since our technical team is not familiar with the Microsoft Azure service or the C#/.NET code base. We can cite the number of downloads of the add-in and the number of datasets uploaded to ONEShare; however, this does not give us metrics that allow comparison of the add-in versus the web application. More broadly, we are not able to extensively evaluate the use of DataUp because of our limited ability to access user information. Based on the number of submissions to ONEShare, uptake of the tool by researchers has been minimal but steady.
There is no related work section. The only related work is RightField (reference 17) but what RightField does and how it relates to DataUp is not mentioned. Similar tools to DataUp such as ISAtool Suite and Ontomaton are not mentioned.
We have added a paragraph to the introduction on existing work in this area.
The survey appears USA-centric. EZID is only available in the USA. DataUp currently works with DataONEShare. Can DataUp be adapted for use in other repositories, and if so, how? Can it reuse infrastructure that is not US based?
The code is openly available for anyone to use, and the CDL encourages other organizations to take the code and deploy their own instances of DataUp. The identifier provided for a dataset is via the repository; ONEShare is a US repository with connections to the EZID identifier service via the CDL. If other repositories deployed instances of DataUp, the identifier schema would be specific to their existing system.
No competing interests were disclosed.