Data Note

RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales

[version 1; peer review: 1 not approved]
PUBLISHED 04 May 2026

Abstract

Background

Drug–drug interactions (DDIs) remain a major source of preventable harm, yet many computational DDI resources are hard to reproduce, difficult to audit, or constrained by redistribution licenses. We created RxPairEvid-50 K to provide a small, license-clean, model-ready matrix of canonical drug pairs with conservative pharmacovigilance signal summaries and a rationale pointer that supports human interpretation.

Methods

RxPairEvid is derived from the FDA Adverse Event Reporting System (FAERS) by resolving drug mentions to a stable 14-character InChIKey stem (IK14), enumerating co-medication pairs per case, joining outcomes at the MedDRA Preferred Term (PT) code level, and computing disproportionality statistics (PRR, ROR, and the continuity-corrected lower 95% confidence bound of ROR) for each pair–PT. Signals are rolled up to per-pair features under strict count floors (a_raw≥3, pair≥10, PT ≥ 10). RxPairEvid-50 K is a deterministic stratified sample from the strict matrix.

Conclusions

RxPairEvid-50 K contains 50,000 drug–drug pairs with stable identifiers, strict-regime FAERS signal features, PT-code rationale pointers, and audit artifacts. It is intended to support benchmarking, label construction, and exploratory modeling of interaction risk with transparent, reproducible processing steps.

Keywords

drug–drug interaction; pharmacovigilance; FAERS; MedDRA; disproportionality analysis; PRR; ROR; dataset

Introduction

Spontaneous reporting systems such as the FDA Adverse Event Reporting System (FAERS) can surface DDI-related safety signals at scale, but their reuse in machine learning is often limited by unstable drug identifiers and opaque, non-auditable label construction.1 RxPairEvid-50 K addresses these gaps by combining stable canonical drug identifiers (IK14), conservative rollups of disproportionality statistics, and a per-pair PT-code rationale pointer that links each pair to its single most strongly supported adverse outcome, so that a human reviewer can inspect it quickly. Outcomes are represented at the Preferred Term (PT) level of the MedDRA terminology.4 A compact pipeline overview is shown in Figure 1, with further implementation detail in Figures 2–3.


Figure 1. Graphical overview of the RxPairEvid-50 K pipeline: FAERS (2018 onward) → ETL and IK14 harmonization → pair–PT contingency tables → continuity-corrected PRR/ROR and ROR95_LCL → strict per-pair roll-ups → stratified deterministic sampling → release bundle (CSV + schema + audits + codebook).


Figure 2. End-to-end processing workflow, comprising identifier mapping, canonical pair ordering, MedDRA PT joins, per pair–PT signal computation, loose/strict roll-ups, and release export.

Elements shown for TWOSIDES and DrugBank are optional layers in the authors’ internal build and are not redistributed in the public RxPairEvid-50 K release.


Figure 3. Internal PostgreSQL integration view and feature-layer attachment points (stg/core/features/labels/ref).

The public release packages ddi_pairs_50k.csv with schema.sql, audits, and a codebook; the other sources in the diagram are attachment points for users who obtain licensed datasets from their original providers.

The public release is deliberately license-clean. It redistributes only derived FAERS signal features and does not redistribute third-party resources whose terms may restrict redistribution (e.g., DrugBank, KEGG, PDBbind). The accompanying PostgreSQL schema DDL documents attachment points for readers who obtain those resources from original providers.5–12

Methods

RxPairEvid-50 K is a curated subset of a larger internal PostgreSQL database (ddi) that integrates multiple evidence layers for DDI modeling, including chemical structure (DrugBank, ChEMBL, PubChem), pathway and pharmacology (KEGG), targets and network biology (STRING, STITCH), protein–ligand bioactivity (BindingDB, PDB/PDBbind), transcriptomics (LINCS), curated adverse reactions (SIDER), and TWOSIDES signals, alongside FAERS pharmacovigilance.5–17 For public dissemination we release a compact, information-rich 50,000-pair matrix focused on FAERS-derived signal features and rationale pointers, together with provenance and audit artifacts.

Data sources

Redistributed source and derived outputs:

  • FAERS quarterly files (DRUG/REAC/DEMO) from 2018 onward (time window specified in provenance.md).1

Referenced as optional attachment points (not redistributed in RxPairEvid-50 K):

  • MedDRA (PT text not redistributed; PT codes only).4

  • DrugBank interaction knowledge and drug properties.5

  • KEGG pathway annotations.6

  • PDB and PDBbind structural/binding evidence.7,8

  • STRING and STITCH network context.9,10

  • SIDER adverse-effect associations.11

  • LINCS L1000 transcriptomic profiles.12

  • ChEMBL, PubChem, BindingDB for chemistry/bioactivity crosswalks.13–15

  • RxNorm/ATC for normalization/stratification where available.16

  • TWOSIDES as an external signal set for ablations in the internal build.17

Processing environment and reproducible schema

The verified internal build ran on PostgreSQL 14.19 (Ubuntu) with a psql 18.0 client. The exported schema report records the database layout (schemas stg, core, features, labels, ml, ref) and provides row estimates and column types for reproducibility. This build contains 11,521 canonical drugs in core.drug, 1,073,256 canonical drug pairs in features.pair_features_all, and 50,340 pairs satisfying the strict floors (a_raw≥3, pair≥10, PT ≥ 10) in the strict FAERS rollup table.

Canonical identifiers

Every drug has a standardized identifier (IK14, the first 14 characters of the InChIKey), which stabilizes joins across sources and enables leakage-aware evaluation splits. Pairs are stored under unordered keys (A < B) to eliminate duplicates and ambiguity in ordering.
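The identifier and pair-key conventions can be illustrated with a minimal Python sketch. The function names ik14 and pair_id are hypothetical helpers introduced here, not part of the release; the key format follows the LEAST/GREATEST rule given in Table 1, and plain string comparison is assumed to match the database ordering for uppercase ASCII stems.

```python
def ik14(inchikey: str) -> str:
    """Return the first 14 characters of an InChIKey (hypothetical helper)."""
    key = inchikey.strip().upper()
    if len(key) < 14:
        raise ValueError(f"too short to be an InChIKey: {inchikey!r}")
    return key[:14]


def pair_id(ik14_a: str, ik14_b: str) -> str:
    """Build an unordered pair key: smaller stem, '::', larger stem."""
    lo, hi = sorted((ik14_a, ik14_b))
    return f"{lo}::{hi}"


# Example InChIKeys for aspirin and caffeine.
stem_1 = ik14("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
stem_2 = ik14("RYYVLZVUVIJVGH-UHFFFAOYSA-N")

# The key is the same regardless of argument order.
key_forward = pair_id(stem_1, stem_2)
key_reverse = pair_id(stem_2, stem_1)
```

Because both argument orders produce the same key, a pair can only appear once in a table keyed on pair_id.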

FAERS ingestion, normalization, and mapping quality control

FAERS DRUG and REAC quarterly records are parsed into normalized staging tables, with robust handling of common encoding problems. Drug mentions are normalized (case folding, punctuation removal, removal of dosage/form/route tokens) and matched to IK14 using token-based dictionaries and database indexes. In the verified build, the normalized FAERS name table holds 322,635 distinct normalized names, and the final FAERS name-to-drug mapping table holds 59,507 mapped names, providing an auditable interface for mapping quality checks (e.g., frequency-ranked inspection of unmapped names).
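A rough Python sketch of this kind of normalization and of a frequency-ranked view of unmapped names follows. The token list DROP_TOKENS, the function names, and the example mentions are all illustrative assumptions; the actual dictionaries and rules used in the build are not given in the text.

```python
import re
from collections import Counter

# Hypothetical dosage/form/route tokens; the real lists are not published here.
DROP_TOKENS = {
    "mg", "mcg", "ml", "g", "tablet", "tablets", "capsule", "capsules",
    "injection", "solution", "oral", "iv",
}


def normalize_name(raw: str) -> str:
    """Case-fold, strip punctuation, and drop dose/form/route tokens."""
    text = re.sub(r"[^\w\s]", " ", raw.casefold())  # punctuation -> space
    kept = [
        tok for tok in text.split()
        if tok not in DROP_TOKENS and not re.fullmatch(r"\d+[a-z]*", tok)
    ]
    return " ".join(kept)


def unmapped_by_frequency(mentions, name_to_ik14):
    """Return (normalized name, count) for unmapped names, most frequent first."""
    counts = Counter(normalize_name(m) for m in mentions)
    return [(n, c) for n, c in counts.most_common() if n not in name_to_ik14]


# Illustrative inputs only; "foozumab" is a made-up name.
mentions = ["ASPIRIN 81 MG", "Aspirin (81mg), oral", "Foozumab inj.", "FOOZUMAB INJ"]
mapping = {"aspirin": "BSYNRYMUTXBXSQ"}
todo = unmapped_by_frequency(mentions, mapping)
```

Sorting unmapped names by frequency puts the names that would recover the most reports at the top of the review queue.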

Disproportionality statistics and strict vs loose regimes

For each (drug A, drug B, PT) triple we construct a 2 × 2 contingency table and calculate disproportionality measures. PRR is widely used for signal generation from spontaneous-report data.2 ROR is also widely used, and applying its lower confidence bound helps reduce false positives arising from small counts.3 We apply a Haldane–Anscombe (+0.5) continuity correction before computing log-scale standard errors:

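\[ \mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)} \]
\[ \mathrm{ROR} = \frac{a/b}{c/d} \]
\[ \mathrm{SE}(\log \mathrm{ROR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]
\[ \mathrm{ROR95\_LCL} = \exp\bigl(\log(\mathrm{ROR}) - 1.96 \times \mathrm{SE}(\log \mathrm{ROR})\bigr) \]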

There are two evidence regimes: a coverage-oriented loose regime and a conservative strict regime. The strict regime imposes minimum-count floors (a_raw≥3, pair≥10, PT ≥ 10) to remove small-count instability and improve the interpretability of retained signals.
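The statistics and the strict floors can be sketched in Python as below. Several points are assumptions rather than statements from the text: the cell layout (a = reports with the pair and the PT, b = pair without the PT, c = other reports with the PT, d = other reports without it), the choice to add 0.5 to all four cells, and the reading of the floors as thresholds on the raw cell a, the pair's report count, and the PT's report count. The names Signal, disproportionality, and passes_strict are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    prr: float
    ror: float
    ror95_lcl: float


def disproportionality(a: int, b: int, c: int, d: int,
                       correction: float = 0.5) -> Signal:
    """PRR, ROR and lower 95% bound of ROR on continuity-corrected cells."""
    # Assumption: the +0.5 correction is added to every cell.
    a_, b_, c_, d_ = (x + correction for x in (a, b, c, d))
    prr = (a_ / (a_ + b_)) / (c_ / (c_ + d_))
    ror = (a_ / b_) / (c_ / d_)
    se_log_ror = math.sqrt(1 / a_ + 1 / b_ + 1 / c_ + 1 / d_)
    lcl = math.exp(math.log(ror) - 1.96 * se_log_ror)
    return Signal(prr=prr, ror=ror, ror95_lcl=lcl)


def passes_strict(a_raw: int, n_pair: int, n_pt: int) -> bool:
    """Strict floors as read from the text: a_raw>=3, pair>=10, PT>=10."""
    return a_raw >= 3 and n_pair >= 10 and n_pt >= 10


# Illustrative counts only, not taken from the dataset.
example = disproportionality(a=12, b=88, c=100, d=9800)
```

The lower bound is always below the point estimate, so ranking on ROR95_LCL penalizes pair–PT rows whose estimate rests on few reports.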

Per-pair rollups and rationale pointer

Pair-level features are produced by rolling up eligible PT rows and retaining maxima and coverage counts:

  • n_faers_reports: co-report count for the pair.

  • faers_prr_max_strict: maximum PRR across PTs under strict floors.

  • faers_ror95_lcl_max_strict: maximum lower 95% bound of ROR across PTs under strict floors.

  • faers_pt_covered_strict: number of distinct PTs meeting strict floors.

  • faers_best_pt_code_strict: PT code corresponding to the maximum strict ROR95_LCL signal.

The PT-code pointer enables audit-friendly interpretation without redistributing MedDRA PT text.
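The rollup can be sketched as a simple group-by in Python. This assumes the input rows have already passed the strict floors, that ties on ROR95_LCL are broken by input order (the text does not specify a tie-break), and that PT codes are integers; the PT codes shown are placeholders, and roll_up is a hypothetical name. The co-report count n_faers_reports is not derived from PT rows and is omitted.

```python
from collections import defaultdict


def roll_up(strict_rows):
    """Aggregate strict pair-PT rows into per-pair features."""
    by_pair = defaultdict(list)
    for row in strict_rows:
        by_pair[row["pair_id"]].append(row)

    features = {}
    for pid, rows in by_pair.items():
        # Row with the largest lower bound; first one wins on ties.
        best = max(rows, key=lambda r: r["ror95_lcl"])
        features[pid] = {
            "faers_prr_max_strict": max(r["prr"] for r in rows),
            "faers_ror95_lcl_max_strict": best["ror95_lcl"],
            "faers_pt_covered_strict": len({r["pt_code"] for r in rows}),
            "faers_best_pt_code_strict": best["pt_code"],
        }
    return features


# Placeholder rows; values and PT codes are illustrative only.
rows = [
    {"pair_id": "A::B", "pt_code": 10000001, "prr": 4.0, "ror95_lcl": 1.5},
    {"pair_id": "A::B", "pt_code": 10000002, "prr": 2.5, "ror95_lcl": 2.1},
]
rolled = roll_up(rows)
```

In this example the maximum PRR and the best PT code come from different PT rows, which is worth keeping in mind when reading the two maxima side by side.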

Public subset construction (50,000 rows)

RxPairEvid-50 K is a deterministic stratified sample of the strict FAERS rollup matrix. Strata are defined by (i) ROR95_LCL bins, (ii) PT coverage bins, and (iii) a coarse ATC grouping where available. Within each stratum, pairs are ordered deterministically by md5(pair_id) and a predefined quota is sampled, yielding 50,000 rows in total. The selection can therefore be reproduced in rebuilds that use the same strict matrix and strata definitions.
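One way to realize hash-ordered stratified sampling is sketched below in Python. The stratum labels, the quotas, and the function names are hypothetical; the actual bin edges and quotas are recorded in the release audits rather than in the text, and taking the first rows in md5 order is an assumption about how the quota is filled.

```python
import hashlib
from collections import defaultdict


def md5_hex(pair_id: str) -> str:
    """Hex digest used as a stable, data-independent sort key."""
    return hashlib.md5(pair_id.encode("utf-8")).hexdigest()


def sample_stratified(pairs, quotas):
    """pairs: iterable of (pair_id, stratum); quotas: stratum -> row count."""
    by_stratum = defaultdict(list)
    for pid, stratum in pairs:
        by_stratum[stratum].append(pid)

    chosen = []
    for stratum in sorted(by_stratum):
        ordered = sorted(by_stratum[stratum], key=md5_hex)
        chosen.extend(ordered[: quotas.get(stratum, 0)])
    return chosen


# Toy input with two hypothetical strata.
toy = [(f"D{i:02d}::E{i:02d}", "hi" if i % 2 == 0 else "lo") for i in range(20)]
toy_quotas = {"hi": 4, "lo": 3}
first = sample_stratified(toy, toy_quotas)
second = sample_stratified(list(reversed(toy)), toy_quotas)
```

Because the order depends only on the pair_id, the same rows are selected no matter how the input happens to be sorted.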

Data records

The dataset is deposited on Mendeley Data18 as flat files. Table 1 lists the main fields of the primary CSV (ddi_pairs_50k.csv), and Table 2 lists the files included in the Mendeley record. Figure 1 gives a high-level overview of RxPairEvid-50 K generation and the release bundle, Figure 2 details the FAERS processing and strict-regime roll-up steps that precede sampling, and Figure 3 summarizes the internal database integration and the optional attachment points for licensed resources.

Table 1. Main fields in ddi_pairs_50k.csv (RxPairEvid-50 K).

| Field | Description |
|---|---|
| drug_a_ik14, drug_b_ik14 | Canonical drug identifiers using 14-character InChIKey stems (IK14). |
| a_name, b_name | Preferred drug names for display. |
| pair_id | Stable unordered pair key = LEAST(IK14_A, IK14_B) + ‘::’ + GREATEST(IK14_A, IK14_B). |
| n_faers_reports | FAERS co-report count for the (A,B) pair. |
| faers_prr_max_strict | Maximum PRR across PTs under strict floors. |
| faers_ror95_lcl_max_strict | Maximum lower 95% CI bound of ROR across PTs under strict floors. |
| faers_pt_covered_strict | Number of distinct PT codes meeting strict floors for the pair. |
| faers_best_pt_code_strict | PT code corresponding to the strongest strict lower-bound signal. |
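A short Python sketch of row-level consistency checks against these field definitions is given below. It assumes that the CSV uses the column names of Table 1 as headers and that drug_a_ik14 sorts before drug_b_ik14 in every row; check_row is a hypothetical helper and the sample rows are illustrative, not taken from the dataset.

```python
import csv
import io


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    a, b = row["drug_a_ik14"], row["drug_b_ik14"]
    for field, value in (("drug_a_ik14", a), ("drug_b_ik14", b)):
        if len(value) != 14:
            problems.append(f"{field} is not 14 characters")
    if not a < b:
        problems.append("pair is not stored in A < B order")
    if row["pair_id"] != f"{min(a, b)}::{max(a, b)}":
        problems.append("pair_id does not match LEAST::GREATEST")
    return problems


# Illustrative two-row sample; the second row has its drugs swapped.
sample = (
    "drug_a_ik14,drug_b_ik14,pair_id\n"
    "BSYNRYMUTXBXSQ,RYYVLZVUVIJVGH,BSYNRYMUTXBXSQ::RYYVLZVUVIJVGH\n"
    "RYYVLZVUVIJVGH,BSYNRYMUTXBXSQ,BSYNRYMUTXBXSQ::RYYVLZVUVIJVGH\n"
)
sample_rows = list(csv.DictReader(io.StringIO(sample)))
report = [check_row(r) for r in sample_rows]
```

For the real file, the same function could be applied to rows read from ddi_pairs_50k.csv with csv.DictReader.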

Table 2. Files deposited in the RxPairEvid Mendeley Data record.

| File | Purpose |
|---|---|
| ddi_pairs_50k.csv | Primary dataset: 50,000 drug–drug pairs with FAERS-derived strict rollups and PT-code rationale pointer. |
| schema.sql | PostgreSQL DDL (structure-only) to recreate tables and load the CSV consistently. |
| codebook.md | Field-level definitions and data types. |
| provenance.md | Pipeline notes including FAERS window (2018 onward), mapping rules, and audit guidance. |
| audit_subset_signal_quantiles.csv | Quantile summaries of key signal fields for validation and reporting. |
| audit_subset_strata_counts.csv | Counts per sampling stratum used to form the deterministic 50 K subset. |
| checksums.txt | SHA-256 checksums for integrity verification. |

Data validation

Validation focuses on auditability and integrity rather than predictive performance. The release includes sampling strata counts, signal quantile summaries, and SHA-256 checksums to support integrity verification and quick sanity checks. In addition, the strict floors and confidence-bound screening reduce small-denominator instability, and the rationale pointers support manual inspection of high-signal pairs.
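Integrity verification could look like the Python sketch below. It assumes that checksums.txt follows the common sha256sum layout (hex digest, whitespace, file name, one entry per line) and that the listed files sit next to it; the actual layout of the deposited file is not described in the text, and verify_checksums is a hypothetical name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {file name: True/False} for each entry in the checksum file."""
    checksum_path = Path(checksum_file)
    results = {}
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        results[name] = sha256_of(checksum_path.parent / name) == expected.lower()
    return results
```

Calling verify_checksums("checksums.txt") in the unpacked release directory would then flag any file whose digest does not match.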

Reuse notes and limitations

FAERS-derived signals are associative and subject to reporting bias, confounding, and stimulated reporting; they should not be interpreted as causal evidence of a DDI. We recommend RxPairEvid-50 K as (i) a benchmarking dataset for signal-based modeling, (ii) an evidence layer for label construction in multi-evidence pipelines, and (iii) a transparent baseline for ablation analyses. For leakage-aware evaluation, split at the drug level (IK14-disjoint) before forming pair splits, and use PR-AUC as the primary metric under class imbalance.
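A drug-disjoint split can be sketched in Python as follows. Assigning drugs to folds by hashing the IK14 stem, the 20% test fraction, and the decision to drop pairs that straddle the two folds are all assumptions made for illustration; the text does not prescribe a particular procedure, and the drug identifiers below are synthetic.

```python
import hashlib


def drug_fold(ik14: str, test_fraction: float = 0.2) -> str:
    """Assign a drug to 'train' or 'test' from a hash of its identifier."""
    bucket = int(hashlib.md5(ik14.encode("utf-8")).hexdigest(), 16) % 10_000
    return "test" if bucket / 10_000 < test_fraction else "train"


def split_pairs(pairs, test_fraction: float = 0.2):
    """Keep only pairs whose two drugs fall in the same fold."""
    train, test, dropped = [], [], []
    for drug_a, drug_b in pairs:
        fold_a = drug_fold(drug_a, test_fraction)
        fold_b = drug_fold(drug_b, test_fraction)
        if fold_a == fold_b == "train":
            train.append((drug_a, drug_b))
        elif fold_a == fold_b == "test":
            test.append((drug_a, drug_b))
        else:
            dropped.append((drug_a, drug_b))  # mixed pair would leak a drug
    return train, test, dropped


# Synthetic identifiers for illustration.
drugs = [f"DRUG{i:010d}" for i in range(12)]
all_pairs = [(a, b) for i, a in enumerate(drugs) for b in drugs[i + 1:]]
train, test, dropped = split_pairs(all_pairs)
```

Dropping mixed pairs costs data but guarantees that no test drug is seen during training; keeping them in a separate "one drug unseen" evaluation set is a common alternative.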

Ethics and consent

Not applicable. RxPairEvid is derived from publicly available, de-identified secondary data sources and does not involve direct collection of human participant data.

Hashir Q, Asfand-E-Yar M, Ali Shah A and Shoukat S. RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales [version 1; peer review: 1 not approved]. F1000Research 2026, 15:662 (https://doi.org/10.12688/f1000research.178856.1)
Open Peer Review

Reviewer Report 04 Jun 2026
Munir Pirmohamed, University of Liverpool, Liverpool, England, UK 
Matthew Bright, University of Liverpool Faculty of Science and Engineering, Liverpool, England, UK 
Not Approved
The authors have produced a tool (RxPairEvid) to assess DDIs using data from FAERS.  Comments are provided below. 
The paper is very hard to read and so it is difficult to see what they have done beyond some drug ...
Pirmohamed M and Bright M. Reviewer Report For: RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales [version 1; peer review: 1 not approved]. F1000Research 2026, 15:662 (https://doi.org/10.5256/f1000research.197294.r482736)