Software Tool Article

pubassistant.ch: consolidating publication profiles of researchers

[version 3; peer review: 3 approved]
PUBLISHED 13 Apr 2022

This article is included in the Research on Research, Policy & Culture gateway.

This article is included in the RPackage gateway.

Abstract

Online accounts for keeping track of scientific publications, such as Open Researcher and Contributor ID (ORCID) or Google Scholar, can be time-consuming to maintain and synchronize. Furthermore, the open access status of publications is often not easily accessible, which hinders the potential opening of closed publications. To lessen the burden of managing personal profiles, we developed an R shiny app that retrieves and consolidates publication lists from multiple platforms and supports interactive exploration and comparison of publication profiles. A live version can be found at pubassistant.ch.

Keywords

open access, publication profiles, R shiny

Amendments from Version 2

Several points in the introduction were made more specific and the description of the rationale of this work has been extended. A clearer description of the use case and the target user has been added. Furthermore, a small evaluation of the accuracy of matching publications was conducted.

See the authors' detailed response to the review by Paul Albert and Curtis L. Cole
See the authors' detailed response to the review by Griffin M Weber
See the authors' detailed response to the review by Daniel W Hook

Introduction

Given the increasing number of researchers, publications and publishing modes,1,2 identifying and consolidating all publications of a single author has become a challenge. The main issues include the transliteration of names into the Roman alphabet, the non-uniqueness of names, variant spellings of the same name (e.g., with or without a middle initial) and changing affiliations over time. Broadly speaking, there are two approaches to resolving this ambiguity: “unattended” and “attended”. The “unattended” approach tries to resolve ambiguity automatically using additional existing metadata. The “attended” approach relies on human intervention in the form of unique identifiers that enable robust linkage of publications to authors, assuming researchers and their collaborators use them consistently. The most widely used identifier, and the de facto standard in many fields, is the Open Researcher and Contributor ID (ORCID).3 Other identifiers, such as the Google Scholar ID4 or ResearcherID (Publons),5 are less broadly used than ORCID, and their persistence is not always guaranteed. Having multiple identifiers on multiple platforms is not unusual, and automatic publication detection and syncing between accounts is possible to some degree. However, automatic synchronization can be hindered by the fact that not all systems use the standardized Digital Object Identifier (DOI) to identify and match publications.

Although the two main standardized identifiers for authors (ORCID) and documents (DOI) are widely adopted, other identifiers are still in use, so publication records on different platforms often have to be synchronized manually to obtain complete records. For instance, there is no simple one-click solution to synchronize publications between ORCID and Google Scholar. In Google Scholar, publications that are not detected automatically need to be searched for and added manually, while in ORCID it is possible to import a citation file. A typical workflow to update ORCID based on Google Scholar would therefore be to search Google Scholar, one by one, for all publications listed in ORCID and then add the missing ones. Since publications listed in Google Scholar may also be missing from ORCID, the reverse check is needed as well to be sure that both accounts are up to date. If more accounts need to be synced (e.g., Publons), the complexity and time needed increase accordingly. Although it is possible, and probably advisable, to link accounts for automatic updates (e.g., linking Publons with ORCID), this cannot be done under all circumstances and missing publications are still possible. Data in ORCID can be updated in a variety of ways, for example through current research information systems (CRIS), auto-updates between Crossref and ORCID, or linking to Dimensions.

While some (commercial) services (such as Dimensions6 or Web of Science7) provide extensive data mining to retrieve publication data, they often also rely on unique identifiers (such as ORCID in the case of Dimensions) for correct assignment. Furthermore, on many platforms that combine different sources, it is not easy to determine where the data originated (e.g., is a publication listed in ORCID, in Publons, or in both?). In addition, data exploration and visualization are often restricted to citations over time (except in costly commercial services). With the growing awareness of, interest in and mandates for Open Science, the open access (OA) status of articles can also be of interest. The same is true for preprints, which are not always taken into account despite becoming increasingly important in many research fields.8,9

Another inconvenience can be the existence of duplicated publications, which can stem either from the association of a preprint with its peer-reviewed publication or from revisions or different versions. In many cases, it is sensible to treat such closely linked publications as a single publication rather than as several. If the information required to link publications is missing, automatic detection is not always possible and manual intervention is needed.

Many tools, both commercial and free, exist that combine bibliographies and bibliometrics for a wide variety of use cases such as evaluation, compliance (e.g., OA), grant writing, literature review and keeping professional web profiles updated. However, many of the available tools are not made for individual authors but rather operate at the department, institution or even country level. Although those tools allow research institutions to curate research profiles to meet some of the aforementioned use cases, they are usually not designed for individuals to curate their own profiles outside an institutional context. Existing commercial tools include Elements (from the company Symplectic10) and Dimensions, both of which are mainly intended for institutional use, although Dimensions in particular also offers functionality for authors to explore their bibliographies.

Commercial as well as institutional tools provide valuable improvements to the quality of bibliographies, especially in non-STEM subjects where accurate representation of scholars is often more difficult.

Some of the free and/or open tools include VIVO11 (institutional level, creates ontologies for representing scholarship), Profiles Research Networking12 (institutional level, helps to discover collaborators), ReCiter13 (institutional level, finds publications of authors in PubMed) and ImpactStory14 (author level, reports impact and open access status of publications from ORCID).

Furthermore, tools mainly intended for monitoring open science include the open science monitoring of the European Commission15 (country level), the German open access monitor (institution level) and OpenAIRE (Open Access Infrastructure for Research in Europe), which provides dashboards (country or institution level).

In our case, we took inspiration from the Swiss National Science Foundation’s Open Access Check,16 which allows Swiss researchers to reflect on their publishing practices and encourages various forms of OA, including green OA; importantly, such resources rely on the source databases being up to date in the first place. It is worth noting that our tool is not meant in any way for evaluation of researchers and that initiatives such as DORA17 and the Leiden Manifesto18 represent important considerations toward responsible research evaluation.

To facilitate overview and synchronization of publication records, we provide a web-based application that retrieves the publications of an author from different sources, combines the entries, checks for duplicates and lets the user download citations to easily update records across platforms. Furthermore, the open access status of each publication is provided, which can help to select publications that could be “greened” (i.e., deposited in institutional repositories). Taken together, this allows researchers to organize their public publication profiles and to interactively explore the accuracy of records across the various entry points. In other words, pubassistant.ch is intended for researchers who want to clean up their publication profiles across multiple platforms and are interested in the open access status of their publications. One specific use case is to find publications in which the ORCID was not included, which are therefore missing from the online ORCID profile but are, for example, found and listed by Google Scholar.

Methods

The workflow is as follows. The user first specifies the unique identifier of the researcher of interest for at least one of ORCID, Google Scholar and Publons. Additionally, a search query for PubMed can be generated, and the option to retrieve bibliometrics from the NIH Open Citation Collection using iCite19 can be selected. After confirmation, publications are retrieved from the specified sources and combined into a table based on the DOI (see Figure 1) or, in the case of publications from Google Scholar, based on (fuzzy) matching of titles and/or metadata retrieval from Zotero (Zotero translator, i.e., web scraping)20 or Crossref (i.e., querying the available metadata to obtain a DOI).21

Since the set of considered publications stems from the same author, matching of publications is based solely on the title: the pairwise relative Levenshtein distances (Levenshtein distance divided by the maximum possible Levenshtein distance, i.e., number of characters) between titles are calculated, and publications with a distance below a threshold of 0.1 are assumed to be the same. The accuracy of detecting duplicates using a small test dataset (n=6929) of research articles with an associated preprint on bioRxiv22 was 0.72. No other formal validation of this approach was done, but manual checking of a large number of cases showed good matching in most cases.

After the publication lists are joined, the open access status of each publication with a DOI is retrieved using Unpaywall,23 which provides a publicly accessible database containing open access information for publications. The definitions of the different open access statuses that Unpaywall uses are provided in Table 1. Additionally, preprints are defined as having OA status “green” in Unpaywall with the attribute “version” equal to “submittedVersion”. A database snapshot of Unpaywall can be downloaded from https://unpaywall.org/products/snapshot.
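The title-matching rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the app's R code: the function names are hypothetical, the lower-casing and whitespace normalization are assumptions, and "number of characters" is interpreted here as the length of the longer title, which is the maximum possible Levenshtein distance between two strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_distance(title_a: str, title_b: str) -> float:
    """Levenshtein distance divided by the length of the longer title."""
    # Assumption: titles are lower-cased and whitespace is collapsed first.
    a = " ".join(title_a.lower().split())
    b = " ".join(title_b.lower().split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def same_publication(title_a: str, title_b: str, threshold: float = 0.1) -> bool:
    """Treat two titles as the same publication below the 0.1 threshold."""
    return relative_distance(title_a, title_b) < threshold
```

With a threshold of 0.1, two titles of about 60 characters may differ by up to five single-character edits and still be matched, which tolerates punctuation and capitalization differences but not a reworded title.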

Figure 1. Overview of the data processing.

The identifiers given by the user are used to obtain the data from each platform independently. The data are then merged and the open access status (column OA) is obtained using the Digital Object Identifier (DOI). Furthermore, duplicates are detected by comparing the titles of the publications.
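The merge step in Figure 1 amounts to a full outer join of the per-platform lists on the DOI. The following Python sketch illustrates the idea under stated assumptions: the record fields (`doi`, `title`) and the function name are hypothetical, and records without a DOI are simply kept as separate rows for later title matching.

```python
def merge_by_doi(sources):
    """Combine per-platform publication lists into one row per DOI.

    `sources` maps a platform name to a list of records, each a dict
    with a 'title' and an optional 'doi' (hypothetical field names).
    """
    merged = {}
    without_doi = []
    for source, records in sources.items():
        for record in records:
            # DOIs are case-insensitive, so compare them in lower case.
            doi = (record.get("doi") or "").strip().lower()
            if not doi:
                without_doi.append(
                    {"doi": None, "title": record["title"], "sources": {source}}
                )
                continue
            row = merged.setdefault(
                doi, {"doi": doi, "title": record["title"], "sources": set()}
            )
            row["sources"].add(source)
    # Rows without a DOI are left for title-based matching.
    return list(merged.values()) + without_doi
```

Each resulting row records the set of platforms that list the publication, which is the information needed to see what is missing from a given profile.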

Table 1. Open access (OA) definition used by Unpaywall.

| OA status | Openly accessible | Description |
|---|---|---|
| Gold | Yes | Published in open-access journal |
| Green | Yes | Publication in free repository |
| Hybrid | Yes | Open licence |
| Bronze | Yes | No open licence |
| Closed | No | |
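The classification in Table 1, together with the preprint rule stated in the Methods (OA status “green” and version “submittedVersion”), can be expressed as two small predicates. This is a minimal Python sketch with hypothetical function names; it assumes the status and version have already been read from an Unpaywall record.

```python
# Statuses that Table 1 marks as openly accessible.
OPEN_STATUSES = {"gold", "green", "hybrid", "bronze"}


def is_openly_accessible(oa_status: str) -> bool:
    """True for every OA status in Table 1 except 'closed'."""
    return oa_status.lower() in OPEN_STATUSES


def is_preprint(oa_status: str, version) -> bool:
    """Preprint rule from the Methods: green OA with a submitted version."""
    return oa_status.lower() == "green" and version == "submittedVersion"
```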

After this step, interactive exploration of the publications is possible. Various options are available to filter the data by OA status, year and source (ORCID, Google Scholar, etc.), with the possibility to remove or show duplicates (detected using fuzzy matching of titles, similar to the matching of publications). Several metrics, tables and plots are available for exploring the data. Examples include an UpSet plot that shows how many publications are associated with each identifier, a histogram of the number of publications per year colored by open access status, and a table listing the individual publications. Specific subsets can be generated using the filtering options, which are then applied to the visualizations and tables presented. In all cases, the citation information for the current selection can be obtained as a downloadable file.
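The filters and the counts behind the plots can be illustrated with plain Python. This is a sketch under assumptions, not the app's implementation: the row fields (`year`, `oa`, `sources`) and function names are hypothetical. The source-combination counts are the numbers an UpSet plot displays, and the (year, status) counts are the bars of the histogram.

```python
from collections import Counter


def filter_publications(rows, oa_statuses=None, years=None, sources=None):
    """Keep rows matching every filter that is set (None means no filter).

    `years` is an inclusive (first, last) pair; `sources` keeps rows
    listed on at least one of the given platforms.
    """
    kept = []
    for row in rows:
        if oa_statuses is not None and row["oa"] not in oa_statuses:
            continue
        if years is not None and not (years[0] <= row["year"] <= years[1]):
            continue
        if sources is not None and not (row["sources"] & set(sources)):
            continue
        kept.append(row)
    return kept


def count_by_source_combination(rows):
    """Number of publications per exact combination of platforms."""
    return Counter(frozenset(row["sources"]) for row in rows)


def count_by_year_and_status(rows):
    """Number of publications per (year, OA status) pair."""
    return Counter((row["year"], row["oa"]) for row in rows)
```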

Another possible application is the integration of local databases, such as university repositories. For example, the Zurich Open Research Archive (ZORA),24 developed and maintained by the Main Library at the University of Zurich, has been integrated in an alternative version of the app that allows local entries to be compared with public profiles, allowing synchronization of publication profiles with local repositories.

Implementation

The application is written in R (version 4.1.0)25 with shiny (version 1.6.0);26 see Software availability. As a back-end database, PostgreSQL is used to store a local copy of Unpaywall (and ZORA). Such a local database for Unpaywall is not strictly needed, but it retrieves the open access status much faster than access over the Unpaywall API. Furthermore, since only a fraction of the data from Unpaywall is used (only the DOI, the open access status and two additional columns for preprint identification), the table containing the open access status is comparatively small, at about 6 GB (compared to more than 165 GB for the complete version). Unpaywall publishes daily updates, which are downloaded and used to keep the local database in sync with the online version. The DOIs for publications listed in Google Scholar are obtained either by matching to publications from other sources, by metadata retrieval using the Zotero translator service, or by a Crossref query.
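The idea of a slim local lookup table can be illustrated as follows. This Python sketch uses an in-process SQLite database as a stand-in for PostgreSQL so that it runs without a server; the table name, column names and function names are hypothetical, and only one of the two preprint-related columns is shown because the text does not name them.

```python
import sqlite3


def create_oa_table(conn):
    """Create a slim table holding only the columns the app needs."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS unpaywall_oa (
               doi       TEXT PRIMARY KEY,
               oa_status TEXT NOT NULL,
               version   TEXT
           )"""
    )


def apply_update(conn, records):
    """Insert new DOIs and overwrite changed ones (e.g. from a daily update)."""
    conn.executemany(
        "INSERT OR REPLACE INTO unpaywall_oa (doi, oa_status, version) "
        "VALUES (?, ?, ?)",
        [(r["doi"].lower(), r["oa_status"], r.get("version")) for r in records],
    )
    conn.commit()


def lookup_oa_status(conn, dois):
    """Return {doi: oa_status} for all requested DOIs found locally."""
    dois = [d.lower() for d in dois]
    if not dois:
        return {}
    placeholders = ",".join("?" for _ in dois)
    cursor = conn.execute(
        f"SELECT doi, oa_status FROM unpaywall_oa WHERE doi IN ({placeholders})",
        dois,
    )
    return dict(cursor.fetchall())
```

Looking up all DOIs of an author in one indexed query avoids one API round trip per publication, which is where the speedup over the remote API would come from.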

Various R packages that facilitate retrieval of publications from specific resources have been included, such as rorcid (ORCID; https://docs.ropensci.org/rorcid),27 scholar (Google Scholar; https://github.com/jkeirstead/scholar)28 and rentrez (PubMed; https://docs.ropensci.org/rentrez).29

Operation

The app is containerized using Docker (version 19.03.13; Dockerfiles and a docker-compose file are provided in Software availability). Multiple interacting containers are deployed using docker-compose; the two most important are a container running the R shiny application and another running PostgreSQL. Furthermore, the Zotero translator service runs in a separate container. As already stated, the PostgreSQL service is not strictly needed, but it substantially increases the retrieval speed of the OA status.

Use case

As mentioned above, a key use case is the comparison of publications listed in different public profiles and the detection of possible duplicated entries, so that the respective profiles can be updated. Figure 2 shows a use case for an author where the ORCID (0000-0002-3048-5518) and Google Scholar ID (XPfrRQEAAAAJ) were given as input (collapsed panel in Figure 2A). Panel B provides a summary of the publication list and options to filter by dataset, by year and by OA status. Additionally, duplicates can be removed or shown exclusively. The other panels contain visualizations including an UpSet plot30 (C), a histogram (D) and a table (E). The table can be further filtered by selecting rows, which allows specific citation lists to be created from the selected rows. The contents of the table can be copied to the clipboard or downloaded in CSV format.

Figure 2. pubassistant.ch panel overview.

After identifiers are entered in panel A and the publications are successfully retrieved and merged, panels B-E appear. Panel B is the main panel for filtering. Visualizations are in panels C (UpSet plot), D (histogram) and E (table).

Discussion

Our method relies on the DOI to retrieve the OA status, which is a limitation in domains where DOIs are not used. The DOI is also used to unambiguously match publications. If no DOI is present, the titles of the publications are used for matching, which can lead to ambiguity. If a publication has an assigned DOI that is missing from the data, retrieving it with services such as the Zotero translator or Crossref can be difficult or time-consuming.
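Recovering a missing DOI from a title can be sketched against the public Crossref REST API. The text does not describe the app's actual query, so everything here is an assumption: the use of the `query.bibliographic` parameter of the `/works` endpoint, the function names, and the acceptance rule, in which `difflib` similarity from the Python standard library stands in for the Levenshtein-based title comparison.

```python
import json
from difflib import SequenceMatcher
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 3) -> str:
    """URL asking Crossref for the best bibliographic matches to a title."""
    return CROSSREF_WORKS + "?" + urlencode(
        {"query.bibliographic": title, "rows": rows}
    )


def pick_doi(response: dict, title: str, min_similarity: float = 0.9):
    """Return the DOI of the first hit whose title is close enough, else None."""
    wanted = title.lower().strip()
    for item in response.get("message", {}).get("items", []):
        # Crossref returns each work's title as a list of strings.
        for candidate in item.get("title", []):
            ratio = SequenceMatcher(None, wanted, candidate.lower().strip()).ratio()
            if ratio >= min_similarity:
                return item.get("DOI")
    return None


def find_doi(title: str):
    """Query Crossref over the network and pick a matching DOI, if any."""
    with urlopen(build_query_url(title), timeout=30) as reply:
        return pick_doi(json.load(reply), title)
```

Checking the returned title before accepting a DOI matters because a search service returns its best guess even when the publication is not indexed, which is one source of the ambiguity noted above.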

Because of the non-commercial nature of this application, some additional limitations arise. Most notably, our application depends on freely available APIs, or in the case of Google Scholar on web scraping (which contravenes the terms of use of Google Scholar), for retrieving the open publication data from their respective sources. For the two main sources considered so far (ORCID and Google Scholar), no restrictions have been noticed, while for others the rate limits on the number of requests are quite restrictive (e.g., for Publons). Other APIs not currently included in our application (e.g., from Dimensions or Mendeley) could be added in the future. A useful addition could be the integration of the OpenAlex API for citation information and publication metadata.31

Data availability

No data are associated with this article.

Software availability

Software available from: https://pubassistant.ch/

Source code available from: https://github.com/markrobinsonuzh/os_monitor

Archived source code at time of publication: https://doi.org/10.5281/zenodo.5509626

License: MIT

how to cite this article
Gerber R and Robinson MD. pubassistant.ch: consolidating publication profiles of researchers [version 3; peer review: 3 approved]. F1000Research 2022, 10:989 (https://doi.org/10.12688/f1000research.73493.3)