Software Tool Article

pubassistant.ch: consolidating publication profiles of researchers

[version 1; peer review: 1 not approved]
PUBLISHED 30 Sep 2021

This article is included in the RPackage gateway.

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

Online accounts to keep track of scientific publications, such as Open Researcher and Contributor ID (ORCID) or Google Scholar, can be time-consuming to maintain and synchronize. Furthermore, the open access status of publications is often not easily accessible, which hinders the potential opening of closed publications. To lessen the burden of managing personal profiles, we developed an R Shiny app that allows publication lists from multiple platforms to be retrieved and consolidated, and publication profiles to be explored and compared interactively. A live version can be found at pubassistant.ch.

Keywords

open access, publication profiles, R shiny

Introduction

Given the increasing number of both researchers and publications as well as publishing modes,1,2 it has become a challenge to identify and consolidate all publications from a single author. A few of the main issues are the non-uniqueness of names, differently written names (e.g. with or without middle initial) and affiliations that change over time. As a solution to this problem, unique identifiers were created that enable robust linkage of publications to authors, assuming researchers and their collaborators use them consistently. The de facto standard identifier in many fields is the Open Researcher and Contributor ID (ORCID),3 although other identifiers such as Google Scholar ID4 or ResearcherID (Publons)5 are also widely used. Having multiple identifiers on multiple platforms is not unusual, and automatic publication detection and syncing between accounts is possible to some degree. However, automatic synchronization of accounts for different identifiers can be hindered by the fact that different document identifiers are used, such as the DOI (Digital Object Identifier) or the independent identifier used by Google Scholar.

Because of this lack of standardized identifiers for both authors and documents, it is often necessary to synchronize publication records on different platforms manually to obtain complete records. For instance, there is no simple one-click solution to synchronize publications between ORCID and Google Scholar. In Google Scholar, publications need to be searched and added manually (if they are not detected automatically), while in ORCID it is possible to input a citation file. A typical workflow to update ORCID based on Google Scholar would therefore be to first search Google Scholar, one by one, for all publications that are listed in ORCID and then add the missing ones. But since it is possible that publications listed in Google Scholar are not in ORCID, the reverse needs to be done to be sure the accounts are up to date. If more accounts need to be synced (e.g. Publons), the complexity and time needed increase accordingly. Although it is possible, and probably advisable, to link accounts for automatic updates (e.g. linking Publons with ORCID), this cannot be done under all circumstances and missing publications are still possible.

While some (commercial) services (such as Dimensions6 or Web of Science7) provide extensive data mining to retrieve publication data, they often also rely on unique identifiers (such as ORCID in the case of Dimensions) for correct assignment. Furthermore, on many platforms that combine different sources (e.g. Dimensions), it is not easy to determine where the data originated (e.g. is a publication listed in ORCID, in Publons, or in both?), meaning no information about the “completeness” of those sources is given. In addition, data exploration and visualization are often restricted to citations over time (except in costly commercial services, such as Dimensions). With the growing awareness of, interest in and mandates towards Open Science, the open access (OA) status of articles can also be of interest. The same is true for preprints, which are often not taken into account despite becoming increasingly important in many research fields.8,9

Another inconvenience can be the existence of duplicated publications, which can stem either from the association of a preprint with its peer-reviewed publication or from revisions or different versions. In many cases, it is sensible to treat those closely linked publications as just one publication instead of several. Often it is not possible to detect duplicated publications automatically and manual intervention is needed.

To our knowledge, no free tool exists that allows researchers to interactively explore their publication metadata across multiple platforms, together with the open access status of each publication. Commercial tools exist, such as Elements (from the company Symplectic10) or Dimensions, but they are intended for institutional use. In our case, we took inspiration from the Swiss National Science Foundation’s Open Access Check,11 which allows Swiss researchers to reflect on their publishing practices and encourages various forms of OA, including green OA; importantly, such resources rely on the source databases being up to date in the first place.

Furthermore, many of the available tools are not made for individual authors but rather operate at the department, institution or even country level. A few important tools are the open science monitoring of the European Commission12 (country-level), the German open access monitor (institution-level) and the dashboards provided by OpenAIRE (Open Access Infrastructure for Research in Europe; country- or institution-level).

To facilitate overview and synchronization of publication records, we provide a web-based application that retrieves the publications of an author from different sources, combines entries, checks for duplicates and allows citations to be downloaded so that records can easily be updated across platforms. Furthermore, the open access status of each publication is provided, which can help to select publications that could be “greened” (i.e., deposited in institutional repositories). Taken together, this allows researchers to organize their public publication profiles and to interactively explore the accuracy of records across the various entry points.

Methods

The workflow is as follows. The user first specifies the unique identifiers of the researcher of interest for at least one of ORCID, Google Scholar and Publons. Additionally, a search query for PubMed can be generated. Furthermore, the option to search for bibliometrics, obtained from the NIH Open Citation Collection using iCite,13 can be selected. After confirmation, publications are retrieved from the specified sources and combined into a table based on the DOI (see Figure 1) or, in the case of publications from Google Scholar, based on (fuzzy) matching of titles and/or metadata retrieval from Zotero (Zotero translator, i.e. web scraping)14 or Crossref (i.e. querying the available metadata to obtain a DOI).15 After the publication lists are joined, the open access status of each publication with a DOI is retrieved using Unpaywall,16 which provides a publicly accessible database containing open access information for publications. The definitions of the different open access statuses that Unpaywall uses are provided in Table 1. Additionally, preprints are defined as having OA status “green” in Unpaywall with the attribute “version” equal to “submittedVersion”. A database snapshot of Unpaywall can be downloaded from https://unpaywall.org/products/snapshot.
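The DOI-based join described above can be illustrated with a minimal Python sketch. It is not the app's R implementation: the record layout (dicts with `title` and an optional `doi`) and the function names `normalize_doi` and `merge_by_doi` are assumptions made for illustration only.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common URL/prefix forms so equal DOIs compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_by_doi(sources):
    """Combine per-platform publication lists into one table keyed by DOI.

    `sources` maps a platform name to a list of records (hypothetical layout).
    Returns (merged, unmatched): `merged` maps each DOI to a row that records
    which platforms list it; `unmatched` holds records without a DOI, which
    would need title matching or a metadata lookup instead.
    """
    merged = {}
    unmatched = []
    for platform, records in sources.items():
        for rec in records:
            doi = normalize_doi(rec.get("doi"))
            if doi is None:
                unmatched.append({"title": rec["title"], "source": platform})
                continue
            row = merged.setdefault(
                doi, {"doi": doi, "title": rec["title"], "sources": set()}
            )
            row["sources"].add(platform)
    return merged, unmatched


# Invented example data: the same paper listed on two platforms with
# differently written DOIs, plus one record that has no DOI at all.
example_sources = {
    "orcid": [{"title": "Paper A", "doi": "10.1000/abc"}],
    "scholar": [
        {"title": "Paper A", "doi": "https://doi.org/10.1000/ABC"},
        {"title": "Paper B"},
    ],
}
example_merged, example_unmatched = merge_by_doi(example_sources)
```

Keeping the set of sources per DOI, rather than a single flag, is what later allows the per-platform overlap to be displayed.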


Figure 1. Overview of the data processing.

The identifiers given by the user are used to obtain the data from each platform independently. The data are then merged and the open access status (column OA) is obtained using the Digital Object Identifier (DOI). Furthermore, duplicates are detected by comparing the titles of the publications.

Table 1. Open access (OA) definition used by Unpaywall.

OA status    Open accessible    Description
Gold         Yes                Published in open-access journal
Green        Yes                Publication in free repository
Hybrid       Yes                Open licence
Bronze       Yes                No open licence
Closed       No
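The status categories in Table 1, together with the preprint rule stated in the text (OA status “green” with “version” equal to “submittedVersion”), can be sketched as a small classifier. This is an illustrative Python sketch rather than the app's code; the flat record layout with `oa_status` and `version` keys and the function names are assumptions.

```python
# Statuses that Table 1 marks as openly accessible.
OPEN_STATUSES = {"gold", "green", "hybrid", "bronze"}


def classify(record):
    """Return a display label for one Unpaywall-style record.

    A record with OA status "green" whose version is "submittedVersion"
    is labelled a preprint, following the definition in the text.
    Records without a status are treated as closed (an assumption).
    """
    status = (record.get("oa_status") or "closed").lower()
    if status == "green" and record.get("version") == "submittedVersion":
        return "preprint"
    return status


def is_openly_accessible(record):
    """True if the record's OA status is one of the open categories in Table 1."""
    return (record.get("oa_status") or "closed").lower() in OPEN_STATUSES
```

Note that a preprint is still openly accessible under this scheme; the label only separates it from other green OA entries for display and filtering.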

After this step, interactive exploration of the publications is possible. Various options to filter the data according to OA status, year and source (ORCID, Google Scholar, etc.) are available, with the possibility to remove or show duplicates (detected using fuzzy matching of titles). Several metrics, tables and plots are available for exploration of the data. Examples include an UpSet plot that shows how many publications are associated with each identifier, a histogram of the number of publications per year colored by open access status, and a table listing the individual publications. Specific subsets can be generated using the filtering options, which are then applied to the visualizations and tables presented. In all cases, relevant snapshots of the citation information can be obtained in the form of a downloadable file.
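Duplicate detection by fuzzy title matching can be sketched as follows. The article does not state which similarity measure or cut-off the app uses, so the choice of `difflib.SequenceMatcher` from the Python standard library, the normalization step and the threshold of 0.9 are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicates(titles, threshold=0.9):
    """Return index pairs (i, j), i < j, whose normalized titles are near-identical.

    Similarity is the SequenceMatcher ratio in [0, 1]; the default threshold
    is an arbitrary illustrative value, not one taken from the article.
    """
    norm = [normalize_title(t) for t in titles]
    pairs = []
    for i, j in combinations(range(len(norm)), 2):
        if SequenceMatcher(None, norm[i], norm[j]).ratio() >= threshold:
            pairs.append((i, j))
    return pairs


example_titles = [
    "pubassistant.ch: consolidating publication profiles of researchers",
    "Pubassistant.ch - Consolidating Publication Profiles of Researchers",
    "An unrelated study of protein folding",
]
example_pairs = find_duplicates(example_titles)
```

The pairwise comparison is quadratic in the number of titles, which is acceptable for the publication list of a single author. Flagged pairs are candidates only, since a preprint and its published version may legitimately differ in title.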

Another possible application is the integration of local databases, such as university repositories. For example, the Zurich Open Research Archive (ZORA),17 developed and maintained by the Main Library at the University of Zurich, has been integrated in an alternative version of the app that allows local entries to be compared with public profiles, allowing synchronization of publication profiles with local repositories.

Implementation

The application is written in R (Version 4.1.0)18 and shiny (Version 1.6.0),19 see Software availability. As a back-end database, PostgreSQL is used to store a local copy of Unpaywall (and ZORA). Such a local database for Unpaywall is not strictly needed, but it makes retrieval of the open access status much faster than access over the Unpaywall API. Furthermore, since only a fraction of the data from Unpaywall is used (only the DOI, the open access status and two additional columns for preprint identification), the actual table containing the open access status is comparatively small, with a size of about 6 GB (compared to more than 165 GB for the complete version). Unpaywall provides daily updates that can be downloaded and are used to keep the local database in sync with the online version. The DOIs for publications listed in Google Scholar are obtained either by matching to publications from other sources, by metadata retrieval using the Zotero translator service or by a Crossref query.
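The idea of keeping only a narrow slice of the Unpaywall data in a local table can be sketched as below. Several things here are assumptions and not the app's setup: SQLite from the Python standard library stands in for PostgreSQL, the input is assumed to be JSON Lines records with `doi`, `oa_status` and a `best_oa_location` object, and the two extra columns (`version`, `host_type`) are a guess, since the article does not name the columns used for preprint identification.

```python
import json
import sqlite3


def build_oa_table(conn, snapshot_lines):
    """Load a reduced OA table (four columns) from JSON Lines records.

    Calling it again with a later update file overwrites rows for DOIs
    that are already present, which mimics applying incremental updates.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS oa ("
        "doi TEXT PRIMARY KEY, oa_status TEXT, version TEXT, host_type TEXT)"
    )
    rows = []
    for line in snapshot_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        best = rec.get("best_oa_location") or {}  # may be null for closed records
        rows.append((rec["doi"].lower(), rec.get("oa_status"),
                     best.get("version"), best.get("host_type")))
    conn.executemany("INSERT OR REPLACE INTO oa VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def lookup_oa(conn, dois):
    """Return {doi: oa_status} for those DOIs that exist in the local table."""
    dois = [d.lower() for d in dois]
    if not dois:
        return {}
    placeholders = ",".join("?" for _ in dois)
    cur = conn.execute(
        f"SELECT doi, oa_status FROM oa WHERE doi IN ({placeholders})", dois
    )
    return dict(cur.fetchall())
```

With the DOI as primary key, a batch lookup for one author's publications is a single indexed query, which is where the speedup over per-DOI API requests would come from.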

Various R packages that facilitate retrieval of publications from a specific resource have been included, such as rorcid (https://docs.ropensci.org/rorcid; ORCID),20 scholar (https://github.com/jkeirstead/scholar; Google Scholar)21 and rentrez (https://docs.ropensci.org/rentrez; PubMed).22

Operation

The app is containerized using Docker (Version 19.03.13; dockerfiles and a docker-compose file are provided in Software availability). Multiple interacting containers are deployed using docker-compose. The two most important are a container running the R shiny application and another running PostgreSQL. Furthermore, the Zotero translator service is run in a separate container. As already stated, the PostgreSQL service is not strictly needed, but it substantially increases the retrieval speed of the OA status.

Use case

Figure 2 shows a use case for an author where the ORCID (0000-0002-3048-5518) and Google Scholar ID (XPfrRQEAAAAJ) were given as input (collapsed panel in Figure 2A). Panel B provides a summary of the publication list and options to filter by dataset, by year and by OA status. Additionally, duplicates can be removed or shown exclusively. The other panels contain visualizations including an UpSet plot23 (C), a histogram (D) and a table (E). The table can be further filtered by selecting rows, which allows specific citation lists to be created from the selected rows. The contents of the table can be copied to the clipboard or downloaded in CSV format.
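The quantity displayed in an UpSet plot is the number of publications for each exact combination of sources. A minimal Python sketch of that counting step follows; the input layout (a mapping from a publication key to the set of platforms listing it) and the function name `upset_counts` are assumptions, and the plotting itself is left out.

```python
from collections import Counter


def upset_counts(pub_sources):
    """Count publications per exact combination of sources.

    `pub_sources` maps a publication key (e.g. a DOI) to the set of
    platforms that list it. Returns (combination, count) pairs, largest
    count first, with ties broken by the sorted platform names.
    """
    counts = Counter(frozenset(s) for s in pub_sources.values())
    return sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))


# Invented example: two papers on both platforms, one on ORCID only.
example_pubs = {
    "10.1000/a": {"orcid", "scholar"},
    "10.1000/b": {"orcid"},
    "10.1000/c": {"orcid", "scholar"},
}
example_counts = upset_counts(example_pubs)
```

Combinations that contain a single platform correspond to publications missing from the other profiles, which are the entries a researcher would want to add elsewhere.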


Figure 2. pubassistant.ch panel overview.

After identifiers are entered in panel A and the publications are successfully retrieved and merged, panels B-E appear. Panel B is the main panel for filtering. Visualizations are in panels C (UpSet plot), D (histogram) and E (table).

Discussion

Our method relies on the DOI to retrieve the OA status, which is a limitation in domains where DOIs are not used. The DOI is also used to unambiguously match publications. If no DOI is present, the titles of the publications are used for matching, which can lead to ambiguity. If a publication has an assigned DOI that is missing in the data, retrieving it with services such as the Zotero translator or Crossref can be difficult or time-consuming.

Because of the non-commercial nature of this application, some additional limits present themselves. Most notably, our application requires freely available APIs for retrieving the open publication data from their respective sources. While no restrictions have been noticed so far for the two main sources considered (ORCID and Google Scholar), the APIs of Dimensions and Mendeley are closed, and for others the rate limits on the number of requests are quite restrictive (e.g. for Publons).

Data availability

No data are associated with this article.

Software availability

Software available from: https://pubassistant.ch/

Source code available from: https://github.com/markrobinsonuzh/os_monitor

Archived source code at time of publication: https://doi.org/10.5281/zenodo.5509626

License: MIT

How to cite this article
Gerber R and Robinson MD. pubassistant.ch: consolidating publication profiles of researchers [version 1; peer review: 1 not approved]. F1000Research 2021, 10:989 (https://doi.org/10.12688/f1000research.73493.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 26 Oct 2021
Daniel W Hook, Digital Science, London, UK 
Not Approved
The authors have created a free, open source piece of software to bring researcher and publication records together from different data sources. They detail their motivations and methodology in this paper.
  • The authors begin their article ...
HOW TO CITE THIS REPORT
Hook DW. Reviewer Report For: pubassistant.ch: consolidating publication profiles of researchers [version 1; peer review: 1 not approved]. F1000Research 2021, 10:989 (https://doi.org/10.5256/f1000research.77148.r96152)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 20 Dec 2021
    Reto Gerber, Department of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
    • However, the authors choose to take a peculiarly western-centric view of this issue. The greatest challenges of name disambiguation are typically found when authors who might not natively ...