BiocPkgTools: Toolkit for mining the Bioconductor package ecosystem

Motivation: The Bioconductor project, a large collection of open source software for the comprehension of large-scale biological data, continues to grow with new packages added each week, motivating the development of software tools focused on exposing package metadata to developers and users. The resulting BiocPkgTools package facilitates access to extensive metadata in computable form covering the Bioconductor package ecosystem, facilitating downstream applications such as custom reporting, data and text mining of Bioconductor package text descriptions, graph analytics over package dependencies, and custom search approaches.
Results: The BiocPkgTools package has been incorporated into the Bioconductor project, installs using standard procedures, and runs on any system supporting R. It provides functions to load detailed package metadata, longitudinal package download statistics, package dependencies, and Bioconductor build reports, all in "tidy data" form. BiocPkgTools can convert from tidy data structures to graph structures, enabling graph-based analytics and visualization. An end-user-friendly graphical package explorer aids in task-centric package discovery. Full documentation and example use cases are included.
Availability: The BiocPkgTools software and complete documentation are available from Bioconductor (https://bioconductor.org/packages/BiocPkgTools).


Introduction
Bioconductor is an open source software project (comprising 1741 individual analysis packages) and community for the analysis and comprehension of large-scale biological data. Newly submitted software packages undergo a technical review to ensure that best practices and Bioconductor coding conventions are followed. The project maintains an automated build system that ensures that packages in the Bioconductor project are compiled and built successfully and pass basic checks. Package downloads are tracked and aggregated by package and month, longitudinally. Finally, package details such as title, description, version, author, and dependencies on other R packages are compiled based on package metadata.
The current size and growth of the Bioconductor project suggests that there is merit in exposing computable forms of the metadata describing the Bioconductor package ecosystem. To that end, we developed a small suite of tools, BiocPkgTools, to provide easy access to project details such as download statistics, bulk package metadata, and package build status. Developers, project leaders, open source software researchers, and Bioconductor end users can build on the availability of these data for applications such as custom reporting, dependency graph analytics, package filtering, and text mining.

Features and usage
The core functionality of BiocPkgTools is to expose Bioconductor project and package metadata as tidy data [1] objects (Figure 1). The data presented by the package are accessed directly from online resources available from Bioconductor. As such, the package relies on web connectivity and collects the most recent data. Installation instructions are detailed on the package website.
Package functionality can be roughly divided into data access, data presentation, and graph/network functionality. See Table 1 for an overview.
After installing BiocPkgTools, the biocDownloadStats function can generate a tidy data structure summarizing monthly download statistics (both total downloads and unique IP addresses) for all Bioconductor packages. BiocPkgTools can access web-accessible resources, including package metadata, download statistics, dependencies between packages, and up-to-date Bioconductor build report status, and transform them into "tidy data" reports that can be manipulated using standard R tools. Interactive package exploration is also available.
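As a minimal sketch of how such a table might be manipulated with standard R tools, the snippet below ranks packages by total downloads. It assumes the table has columns named Package and Nb_of_downloads and contains only monthly rows (no yearly summary rows); check names() on the real result, since these names are assumptions here. The helper top_downloads is hypothetical, and the small data frame is made up so that the example runs without web access.

```r
# Rank packages by total downloads in a tidy download-statistics table.
# Assumption: columns `Package` and `Nb_of_downloads`, one row per
# package and month; verify against the real biocDownloadStats() output.
top_downloads <- function(stats, n = 10) {
  totals <- aggregate(Nb_of_downloads ~ Package, data = stats, FUN = sum)
  totals <- totals[order(-totals$Nb_of_downloads), , drop = FALSE]
  rownames(totals) <- NULL
  head(totals, n)
}

# Made-up numbers for illustration only (not real download counts).
toy_stats <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC"),
  Month = c("Jan", "Feb", "Jan", "Jan"),
  Nb_of_downloads = c(10, 15, 40, 5)
)
toy_top <- top_downloads(toy_stats, n = 2)
print(toy_top)

# With web access, the same helper could be applied to live data:
# stats <- BiocPkgTools::biocDownloadStats()
# top_downloads(stats)
```

Keeping the summary in a plain function over a data frame means it works equally on the live table and on a cached copy.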
The biocBuildReport function gathers information from the Bioconductor build report site and can be used, for example, to summarize the "build status" for all Bioconductor packages.
buildrep = biocBuildReport(version = "3.9")
These data are useful to developers to track the health of their software either programmatically or via a searchable, sortable table from the problemPage function.
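One way to track build health programmatically is to cross-tabulate results by build stage. The sketch below assumes the build report has columns named pkg, stage, and result, with values such as "OK", "WARNINGS", and "ERROR"; these names are assumptions and should be checked against the real buildrep object. The helper summarise_build is hypothetical, and the toy report is invented so that the example runs offline.

```r
# Cross-tabulate build results by build stage.
# Assumption: columns `stage` and `result`; verify against the real
# biocBuildReport() output before relying on these names.
summarise_build <- function(buildrep) {
  table(stage = buildrep$stage, result = buildrep$result)
}

# Invented rows for illustration only (not a real build report).
toy_report <- data.frame(
  pkg = c("pkgA", "pkgA", "pkgB", "pkgB"),
  stage = c("buildsrc", "checksrc", "buildsrc", "checksrc"),
  result = c("OK", "WARNINGS", "OK", "ERROR")
)
status <- summarise_build(toy_report)
print(status)

# Packages with at least one stage reporting an error.
failing <- unique(toy_report$pkg[toy_report$result == "ERROR"])
print(failing)

# With web access, the same summary could be applied to live data:
# buildrep <- BiocPkgTools::biocBuildReport(version = "3.9")
# summarise_build(buildrep)
```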
As an alternative to basic web browser search and the Bioconductor online software list, the biocExplore function offers an interactive and graphical approach to package browsing (see Figure 2). The biocExplore widget allows browsing packages under different biocViews, Bioconductor's software category tags. This interactively visualises the relative number of downloads each package has under different biocViews, allowing users to quickly determine which packages are most commonly used for different analysis tasks.
The Bioconductor package ecosystem is, by design, highly interconnected via package dependencies. Several functions in the BiocPkgTools package provide examples of package dependency graph creation and visualization. Figure 3 displays packages within one degree of dependency relationship of the GEOquery package.
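To illustrate what "within one degree" means, the sketch below selects the direct dependencies and direct dependents of a focal package from an edge list using only base R. It assumes an edge list with columns Package (the dependent) and dependency (what it depends on); these column names, the helper one_degree, and the toy edges (including the placeholder packages pkgX and pkgY) are assumptions for illustration, not the package's own implementation.

```r
# Packages within one dependency step of a focal package.
# Assumption: an edge list with columns `Package` (the dependent) and
# `dependency` (what it depends on); real column names may differ.
one_degree <- function(edges, focal) {
  upstream   <- edges$dependency[edges$Package == focal]
  downstream <- edges$Package[edges$dependency == focal]
  sort(unique(c(focal, upstream, downstream)))
}

# Invented edges for illustration only; pkgX and pkgY are placeholders.
toy_edges <- data.frame(
  Package    = c("GEOquery", "GEOquery", "pkgX", "pkgY"),
  dependency = c("Biobase", "httr", "GEOquery", "Biobase")
)
nbrs <- one_degree(toy_edges, "GEOquery")
print(nbrs)

# With web access, BiocPkgTools offers graph-based equivalents:
# dep_df <- BiocPkgTools::buildPkgDependencyDataFrame()
# g <- BiocPkgTools::buildPkgDependencyIgraph(dep_df)
# sub <- BiocPkgTools::subgraphByDegree(g, "GEOquery", degree = 1)
```

Note that pkgY is excluded: it shares a dependency with GEOquery but is two steps away, which is why higher degrees grow quickly in a highly interconnected ecosystem.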

Implementation
BiocPkgTools is implemented as a standard R package and hosted in the Bioconductor repository. All functions are documented and include examples. An included tutorial (vignette) demonstrates features and capabilities.

Discussion
The BiocPkgTools package comprises a set of functions for accessing software metadata from the growing collection of Bioconductor packages. For software developers, this metadata can be useful for tracking package build status and the health of package dependencies. Easy access to descriptive package metadata for all Bioconductor software resources can empower researchers or users interested in text mining, custom package search, or analysis of the existing software ecosystem. BiocPkgTools can provide easy access to metrics of Bioconductor software usage that are increasingly being incorporated into funding and promotion decisions.

Data availability
All data accessed and used by the BiocPkgTools package are publicly available and are updated regularly at the Bioconductor project.

This article presents the BiocPkgTools package, which provides an R API to the various package metadata that is available mostly in human-readable formats on the website (https://bioconductor.org/). By providing an R API for accessing package metadata, the authors argue that "data mining and value-added functionality such as package searching, text mining, and analytics on packages" may follow.
I believe that this package will add value to the Bioconductor community and beyond. It is likely that the package will lower the threshold for doing "data mining" on Bioconductor packages. The reports that are based on these metadata and produced by this package are likely to inspire others to produce other types of reports and interactive tools.
Having said that, I do think there is room for some immediate improvements to the article, which might also spill over to the package itself.

Major:
The role of the package: It is not clear whether this package is to be considered a Bioconductor core package or a user-contributed package. That can only be inferred/guessed from the author list and possibly from the package name.
If it is an official core Bioconductor package (similar to BiocManager) supported and maintained by the Bioconductor Team, then I think it would be of value to make that explicit. This would change the expectations of the package and its long-term support, e.g. will it break or not when the Bioconductor website changes, what should be documented as part of the package and what should be documented on the Bioconductor website, etc. The questions in the following section illustrate this.

Description on the metadata:
There is no explicit reference to the source data, e.g. is this package using a public or a private Bioconductor Online API to gather the data? Are this API and its URLs official and stable, or is this package meant to play that role?
Is there another reference from where one can learn more about how the data is collected? Are data from other Bioconductor mirrors included in the download stats?
The download stats do not contain information on the package version, which is mentioned in the package vignette. However, it's not clear whether it is only downloads from the current release branch that are counted or not. For instance, if I download a package for a legacy Bioconductor release version or the current developer version, will that add to the data?

Minor/trivial:

API:
Given the package title 'Collection of simple tools for learning about Bioc Packages' and description, the `biocBuildEmail()` function seems to be an odd-one-out.
Table 1: biocbuildReport -> biocBuildReport.
Grammar in Section 'Introduction': Should "suggest" be used instead of "suggests" in "The current size and growth of the Bioconductor project suggests"?
In addition, `spelling::spell_check_package()` on the package itself reveals several mistakes.
Nomenclature: Bioconductor is sometimes referred to as 'bioconductor' (lower case) or just 'bioc' and 'Bioc' (the latter is even used in the package title).

Mike Smith
Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany

In this paper Su et al. present BiocPkgTools, an R package that provides programmatic access to metadata about software in the Bioconductor project. The package is available from Bioconductor and the source code can be easily viewed on GitHub. The metadata it can access includes download numbers, package 'health' status, topic categories, and software dependencies.
Since the software forms at least as large a part of this publication as the article, I have tried to review both.

Paper:
The paper is succinct and to the point. In some instances there is an implied understanding of R package development and the Bioconductor build system. For example, in the code block presented to demonstrate the biocBuildReport function, the row names e.g. buildbin, checksrc etc are probably fairly meaningless to someone unfamiliar with Bioconductor's continuous integration platform. It's fair to say most interested readers will already be Bioconductor package developers, but a caption linking this back to the build system discussion in the introduction would add clarity.
Similarly, the biocViews concept is not fully described, but forms a key part of one of the package's main functions: biocExplore(). Expanding a little on how these terms are assigned to packages (i.e. mostly but not always by the package authors) might make it clearer to users why some values return unexpected results, e.g. 'Bioinformatics' only shows me 15 packages, presumably because most package authors see this as redundant, although 'Software' feels similarly obvious yet returns a huge number of packages as this is assigned by Bioconductor itself.
The motivation behind providing programmatic access to build reports and download statistics is presented in the text. However, it would be nice for the authors to expand upon what they feel the use cases for the package dependency graphs may be. They look cool, but based on the paper content it's not immediately obvious to me where they might be used. The discussion highlights the fact that download metrics etc. are gaining traction as a measure factored into funding decisions, and one thing that springs to mind is that a similar point can be made visually for software infrastructure that has many downstream dependents. My efforts to create a `subgraphByDegree()` with degree greater than one (to view a larger software stack) create a graph too large to render, so an example of how to visualise this (i.e. show all downstream dependsOnMe packages in a tree) would be a great addition to either the paper or the package vignette.

Minor points:
I recommend providing a link to the BiocPkgTools landing page after the sentence "Installation instructions are detailed on the package website."
Indicate that the code to produce Figure 3 can be found in the package vignette (or include it as supplementary material here). Since it is quite a bit more than a one-liner to produce Figure 3, I think it would be beneficial to point readers to the code.