Keywords
open access, publication profiles, R shiny
Given the increasing number of researchers, publications and publishing modes,1,2 it has become a challenge to identify and consolidate all publications from a single author. The main issues include non-unique names, name variants (e.g. with or without middle initial) and affiliations that change over time. As a solution to this problem, unique identifiers were created that enable robust linkage of publications to authors, assuming researchers and their collaborators use them consistently. The de facto standard identifier in many fields is the Open Researcher and Contributor ID (ORCID),3 although other identifiers such as the Google Scholar ID4 or ResearcherID (Publons)5 are also widely used. Having multiple identifiers on multiple platforms is not unusual, and automatic publication detection and syncing between accounts is possible to some degree. However, automatic synchronization of accounts across identifiers can be hindered by the fact that different document identifiers are used, such as the DOI (Digital Object Identifier) or the independent identifier used by Google Scholar.
Because of this lack of a single standardized identifier for both authors and documents, it is often necessary to synchronize publication records on different platforms manually to obtain complete records. For instance, there is no simple one-click solution to synchronize publications between ORCID and Google Scholar. In Google Scholar, publications need to be searched and added manually (if they are not detected automatically), while in ORCID it is possible to import a citation file. A typical workflow to update ORCID based on Google Scholar would therefore be to first search (one by one) in Google Scholar for all publications that are listed in ORCID and then add the missing ones. But since publications listed in Google Scholar may not be in ORCID, the reverse also needs to be done to ensure both accounts are up to date. If more accounts need to be synced (e.g. Publons), the complexity and time needed increase accordingly. Although it is possible, and probably advisable, to link accounts for automatic updates (e.g. linking Publons with ORCID), this cannot be done under all circumstances and missing publications remain possible.
While some (commercial) services (such as Dimensions6 or Web of Science7) provide extensive data mining to retrieve publication data, they often also rely on unique identifiers (such as ORCID in the case of Dimensions) for correct assignment. Furthermore, on many platforms that combine different sources (e.g. Dimensions), it is not easy to determine where the data originated (e.g. is a publication listed in ORCID or in Publons, or both?), meaning no information about the “completeness” of those sources is given. In addition, data exploration and visualization is often restricted to citations over time (except for costly commercial services, such as Dimensions). With the growing awareness of, interest in and mandates for Open Science, the open access (OA) status of articles can also be of interest. The same is true for preprints, which are often not taken into account despite becoming increasingly important in many research fields.8,9
Another inconvenience is the existence of duplicated publications, which can stem either from the association of a preprint with its peer-reviewed publication or from revisions and different versions. In many cases, it is sensible to treat these closely linked records as a single publication rather than several. Often duplicated publications cannot be detected automatically and manual intervention is needed.
To our knowledge, no free tool exists that allows researchers to interactively explore their publication metadata across multiple platforms, together with the open access status of each publication. Commercial tools exist, such as Elements (from the company Symplectic10) or Dimensions, but they are intended for institutional use. In our case, we took inspiration from the Swiss National Science Foundation’s Open Access Check,11 which allows Swiss researchers to reflect on their publishing practices and encourages various forms of OA, including green OA; importantly, such resources rely on the source databases being up to date in the first place.
Furthermore, many of the available tools are not aimed at individual authors but rather operate at the department, institution or even country level. Notable examples are the open science monitoring of the European Commission12 (country level), the German Open Access Monitor (institution level) and the dashboards provided by OpenAIRE (Open Access Infrastructure for Research in Europe; country or institution level).
To facilitate the overview and synchronization of publication records, we provide a web-based application that retrieves publications for an author from different sources, combines the entries, checks for duplicates and downloads citations to easily update records across platforms. Furthermore, the open access status of each publication is provided, which can help to select publications that could be “greened” (i.e., deposited in an institutional repository). Taken together, this allows researchers to organize their public publication profiles and to interactively explore the accuracy of records across the various entry points.
The workflow is as follows: the user first specifies the unique identifiers of the researcher of interest for at least one of ORCID, Google Scholar or Publons. Additionally, a search query for PubMed can be generated. Furthermore, the option to retrieve bibliometrics from the NIH Open Citation Collection using iCite13 can be selected. After confirmation, publications are retrieved from the specified sources and combined into a table based on the DOI (see Figure 1) or, in the case of publications from Google Scholar, based on (fuzzy) matching of titles and/or metadata retrieval from Zotero (Zotero translator, i.e. web scraping)14 or Crossref (i.e. querying the available metadata to obtain a DOI).15 After joining the publication lists, the open access status of each publication with a DOI is retrieved from Unpaywall,16 which provides a publicly accessible database of open access information for publications. The definitions of the different open access statuses used by Unpaywall are provided in Table 1. Additionally, preprints are defined as publications having OA status “green” in Unpaywall with the attribute “version” equal to “submittedVersion”. A database snapshot of Unpaywall can be downloaded from https://unpaywall.org/products/snapshot.
Figure 1. The identifiers given by the user are used to obtain the data from each platform independently. The data are then merged and the open access status (column OA) is obtained using the Digital Object Identifier (DOI). Furthermore, duplicates are detected by comparing the titles of the publications.
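To illustrate the joining step, the following is a minimal sketch (not the app’s actual code) of how per-source publication tables could be merged by DOI using dplyr; the data frames and column names are purely illustrative.

```r
# Hypothetical per-source tables; in the app these come from ORCID,
# Google Scholar, Publons, etc.
library(dplyr)

orcid_pubs   <- tibble(doi = c("10.1000/a", "10.1000/b"),
                       title = c("Paper A", "Paper B"),
                       in_orcid = TRUE)
scholar_pubs <- tibble(doi = c("10.1000/b", "10.1000/c"),
                       title = c("Paper B", "Paper C"),
                       in_scholar = TRUE)

# A full join by DOI keeps publications that appear in only one source
merged <- full_join(orcid_pubs, scholar_pubs, by = "doi",
                    suffix = c("_orcid", "_scholar")) %>%
  mutate(in_orcid   = coalesce(in_orcid, FALSE),
         in_scholar = coalesce(in_scholar, FALSE),
         title      = coalesce(title_orcid, title_scholar)) %>%
  select(doi, title, in_orcid, in_scholar)

merged
```

Per-source indicator columns of this kind are what is later summarized per identifier in the UpSet plot (see below).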
Table 1. Open access (OA) statuses as defined by Unpaywall.

| OA status | Openly accessible | Description |
|---|---|---|
| Gold | Yes | Published in an open-access journal |
| Green | Yes | Free copy available in a repository |
| Hybrid | Yes | Published in a subscription journal under an open licence |
| Bronze | Yes | Free to read on the publisher's site, without an open licence |
| Closed | No | Not freely accessible |
After this step, interactive exploration of the publications is possible. Various options to filter the data according to OA status, year and source (ORCID, Google Scholar, etc.) are available, with the possibility to remove or show only duplicates (detected using fuzzy matching of titles). Several metrics, tables and plots are available for exploration of the data. Examples include an UpSet plot that shows how many publications are associated with each identifier, a histogram of the number of publications per year colored by open access status, and a table listing the individual publications. After exploration, specific subsets can be generated using the filtering options, which are then applied to the visualizations and tables presented. In all cases, relevant snapshots of the citation information can be obtained in the form of a downloadable file.
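As a rough illustration of how duplicates can be flagged by fuzzy title matching, the sketch below uses base R edit distances; the normalization and the 10% threshold are assumptions for illustration, not the app’s exact rules.

```r
titles <- c(
  "Publication profiles and open access status",
  "Publication Profiles and Open-Access Status",  # e.g. preprint vs. journal version
  "A completely different article"
)

# Normalize: lower-case and drop punctuation before comparing
norm <- tolower(gsub("[^[:alnum:] ]", "", titles))

# Pairwise Levenshtein distances, scaled by the longer title of each pair
d <- adist(norm) / outer(nchar(norm), nchar(norm), pmax)

# Pairs with a small relative distance are flagged as potential duplicates
which(d < 0.1 & upper.tri(d), arr.ind = TRUE)
```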
Another possible application is the integration of local databases, such as university repositories. For example, the Zurich Open Research Archive (ZORA),17 developed and maintained by the Main Library of the University of Zurich, has been integrated in an alternative version of the app, allowing local entries to be compared with public profiles and publication profiles to be synchronized with the local repository.
The application is written in R (Version 4.1.0)18 and shiny (Version 1.6.0),19 see Software availability. As a back-end database, PostgreSQL is used to store a local copy of Unpaywall (and ZORA). Such a local database for Unpaywall is not strictly needed, but it speeds up retrieval of the open access status considerably compared to access over the Unpaywall API. Furthermore, since only a fraction of the data from Unpaywall is used (only the DOI, the open access status and two additional columns for preprint identification), the resulting table containing the open access status is comparably small, at about 6 GB (compared to more than 165 GB for the complete version). Unpaywall provides daily updates that can be downloaded and are used to keep the local database in sync with the online version. The DOIs of publications listed in Google Scholar are obtained either from matches to publications from other sources, from metadata retrieval using the Zotero translator service or from a Crossref query.
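For reference, the same open access information can also be obtained directly from the public Unpaywall REST API (the slower route mentioned above). The following is a minimal sketch; the helper name and the placeholder email address are ours, not part of the app.

```r
library(httr)
library(jsonlite)

# Query the Unpaywall v2 API for a single DOI (Unpaywall asks for a contact
# email as a query parameter; replace the placeholder with your own).
get_oa_status <- function(doi, email = "you@example.org") {
  url <- paste0("https://api.unpaywall.org/v2/", doi, "?email=", email)
  res <- fromJSON(content(GET(url), as = "text", encoding = "UTF-8"))

  version <- res$best_oa_location$version
  list(
    oa_status = res$oa_status,  # "gold", "green", "hybrid", "bronze" or "closed"
    # Preprint rule described in the text: green OA with a submitted version
    preprint  = identical(res$oa_status, "green") &&
                identical(version, "submittedVersion")
  )
}

# get_oa_status("10.xxxx/some-doi")  # replace with a real DOI
```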
Various R packages that facilitate retrieval of publications from a specific resource, such as https://docs.ropensci.org/rorcid (ORCID),20 https://github.com/jkeirstead/scholar (Google Scholar)21 and https://docs.ropensci.org/rentrez (Pubmed),22 have been included.
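As a brief illustration of these packages (the app’s internal calls may differ), the identifiers from the use case shown later (Figure 2) could be queried as follows; note that rorcid requires an ORCID API token (see ?rorcid::orcid_auth) and the PubMed query shown is only an example.

```r
library(rorcid)
library(scholar)
library(rentrez)

orcid_works  <- rorcid::works("0000-0002-3048-5518")       # works listed in ORCID
scholar_pubs <- scholar::get_publications("XPfrRQEAAAAJ")  # Google Scholar profile
pubmed_hits  <- rentrez::entrez_search(db = "pubmed",
                                       term = "Robinson MD[au]",  # example author query
                                       retmax = 100)
```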
The app is containerized using Docker (Version 19.03.13; dockerfiles and a docker-compose file are provided, see Software availability). Multiple interacting containers are deployed using docker-compose; the two most important run the R shiny application and PostgreSQL, respectively. Furthermore, the Zotero translator service runs in a separate container. As already stated, the PostgreSQL service is not strictly needed, but it substantially increases the retrieval speed of the OA status.
Figure 2 shows a use case for an author whose ORCID (0000-0002-3048-5518) and Google Scholar ID (XPfrRQEAAAAJ) were given as input (collapsed panel in Figure 2A). Panel B provides a summary of the publication list and options to filter by dataset, by year and by OA status. Additionally, duplicates can be removed or shown exclusively. The other panels contain visualizations including an UpSet plot23 (C), a histogram (D) and a table (E). The table can be further filtered by selecting rows, allowing specific citation lists to be created from the selected entries. The contents of the table can be copied to the clipboard or downloaded in CSV format.
Our method relies on the DOI to retrieve the OA status, which is a limitation in domains where DOIs are not used. The DOI is also used to unambiguously match publications; if no DOI is present, the titles of the publications are used for matching, which can lead to ambiguity. Even when a publication has an assigned DOI, if the DOI is missing from the retrieved data it can be difficult or time-consuming to recover it with services such as the Zotero translator or Crossref.
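One way to recover a missing DOI is to query Crossref with the title, as mentioned above. A minimal sketch using the rcrossref package is shown below; whether the app uses this particular package for that step is an assumption, and the query string is purely illustrative.

```r
library(rcrossref)

# Search Crossref for works matching a (possibly imprecise) title
candidates <- cr_works(query = "interactive exploration of publication profiles",
                       limit = 3)$data

# Inspect the top hits before trusting a match; ambiguous titles may need
# manual confirmation
candidates[, c("doi", "title")]
```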
Because of the non-commercial nature of this application, some additional limitations present themselves. Most notably, our application requires freely available APIs to retrieve the open publication data from the respective sources. While no restrictions have been noticed so far for the two main sources considered (ORCID and Google Scholar), the APIs of Dimensions and Mendeley are closed, and for others (e.g. Publons) the rate limits on the number of requests are quite restrictive.
Software available from: https://pubassistant.ch/
Source code available from: https://github.com/markrobinsonuzh/os_monitor
Archived source code at time of publication: https://doi.org/10.5281/zenodo.5509626
License: MIT
We thank Izaskun Mallona for help with hosting the application and various helpful suggestions. We thank various members of the Statistical Bioinformatics Group at the University of Zurich for feedback.