F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.178856.1

Data Note

Articles

RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales

[version 1; peer review: 1 not approved]

Qadeer Hashir ¹ (Conceptualization, Data Curation, Formal Analysis, Methodology, Writing – Original Draft Preparation)

Muhammad Asfand-E-Yar ¹ (Methodology, Project Administration, Resources, Supervision, Validation, Writing – Original Draft Preparation, Writing – Review & Editing)

Asghar Ali Shah ², a (Methodology, Resources, Validation, Writing – Review & Editing; https://orcid.org/0000-0002-0325-7579)

Shabana Shoukat ³ (Formal Analysis, Methodology, Software, Writing – Original Draft Preparation)

¹ Center of Excellence in Artificial Intelligence (CoE-AI), Department of Computer Science, Bahria University, Islamabad, 44000, Pakistan

² Department of Computer Science, Kateb University, Kabul, 1007, Afghanistan

³ Medical ICU, Holy Family Hospital, Rawalpindi Medical University, Rawalpindi, Punjab, 46000, Pakistan

a asghar.ali.shah@kateb.edu.af

No competing interests were disclosed.

4 5 2026

2026

662

14 4 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Drug–drug interactions (DDIs) remain a major source of preventable harm, yet many computational DDI resources are hard to reproduce, difficult to audit, or constrained by redistribution licenses. We created RxPairEvid-50 K to provide a small, license-clean, model-ready matrix of canonical drug pairs with conservative pharmacovigilance signal summaries and a rationale pointer that supports human interpretation.

Methods

RxPairEvid is derived from the FDA Adverse Event Reporting System (FAERS) by resolving drug mentions to a stable 14-character InChIKey stem (IK14), enumerating co-medication pairs per case, joining outcomes at the MedDRA Preferred Term (PT) code level, and computing disproportionality statistics (PRR, ROR, and the continuity-corrected lower 95% confidence bound of ROR) for each pair–PT. Signals are rolled up to per-pair features under strict count floors (a_raw≥3, pair≥10, PT ≥ 10). RxPairEvid-50 K is a deterministic stratified sample from the strict matrix.

Conclusions

RxPairEvid-50 K contains 50,000 drug–drug pairs with stable identifiers, strict-regime FAERS signal features, PT-code rationale pointers, and audit artifacts. It is intended to support benchmarking, label construction, and exploratory modeling of interaction risk with transparent, reproducible processing steps.

drug–drug interaction; pharmacovigilance; FAERS; MedDRA; disproportionality analysis; PRR; ROR; dataset

The author(s) declared that no grants were involved in supporting this work.

Introduction

Spontaneous reporting systems such as the FDA Adverse Event Reporting System (FAERS) can surface DDI-related safety signals at scale, but their reuse in machine learning is often limited by unstable drug identifiers and opaque, non-auditable label construction. ¹ RxPairEvid-50 K addresses these gaps by combining stable canonical drug identifiers (IK14), conservative disproportionality rollups, and a per-pair PT-code rationale pointer that links each pair to its single most strongly supported adverse outcome, so that a human reviewer can inspect it quickly. Outcomes are represented at the MedDRA Preferred Term (PT) level. ⁴ A compact pipeline overview is shown in Figure 1, with further implementation detail in Figures 2–3.

Figure 1. Graphical overview of the RxPairEvid-50 K pipeline: FAERS (2018 onward) → ETL and IK14 harmonization → pair–PT contingency tables → continuity-corrected PRR/ROR and ROR95_LCL → strict per-pair roll-ups → stratified deterministic sampling → release bundle (CSV + schema + audits + codebook).

Figure 2. End-to-end processing workflow, comprising identifier mapping, canonical pair ordering, MedDRA PT joins, per pair–PT signal computation, loose/strict roll-ups, and release export.

Elements shown for TWOSIDES and DrugBank are optional layers in the authors’ internal build and are not redistributed in the public RxPairEvid-50 K release.

Figure 3. Internal PostgreSQL integration view and feature-layer attachment points (schemas stg/core/features/labels/ref).

The public release packages ddi_pairs_50k.csv with schema.sql, audits, and a codebook; the other sources in the diagram are attachment points for users who obtain licensed data sets from their original providers.

The public release is deliberately license-clean. It redistributes only derived FAERS signal features and does not redistribute third-party resources whose terms may restrict redistribution (e.g., DrugBank, KEGG, PDBbind). The accompanying PostgreSQL schema DDL documents attachment points for readers who obtain those resources from original providers. ^{5–12}

Methods

RxPairEvid-50 K is a curated subset of a larger internal PostgreSQL database (ddi) that integrates multiple evidence layers for DDI modeling, including chemical structure (DrugBank, ChEMBL, PubChem), pathway and pharmacology (KEGG), targets and network biology (STRING, STITCH), protein–ligand bioactivity (BindingDB, PDB/PDBbind), transcriptomics (LINCS), curated adverse reactions (SIDER), and TWOSIDES signals, alongside FAERS pharmacovigilance. ^{5–17} For public dissemination we release a compact, information-rich 50,000-pair matrix focused on FAERS-derived signal features and rationale pointers, together with provenance and audit artifacts.

Data sources

Redistributed source and derived outputs:

• FAERS quarterly files (DRUG/REAC/DEMO) from 2018 onward (time window specified in provenance.md). ¹

Referenced as optional attachment points (not redistributed in RxPairEvid-50 K):

• MedDRA (PT text not redistributed; PT codes only). ⁴
• DrugBank interaction knowledge and drug properties. ⁵
• KEGG pathway annotations. ⁶
• PDB and PDBbind structural/binding evidence. ^{7,8}
• STRING and STITCH network context. ^{9,10}
• SIDER adverse-effect associations. ¹¹
• LINCS L1000 transcriptomic profiles. ¹²
• ChEMBL, PubChem, BindingDB for chemistry/bioactivity crosswalks. ^{13–15}
• RxNorm/ATC for normalization/stratification where available. ¹⁶
• TWOSIDES as an external signal set for ablations in the internal build. ¹⁷

Processing environment and reproducible schema

The verified internal build ran on PostgreSQL 14.19 (Ubuntu) with a psql 18.0 client. The exported schema report records the database layout (schemas stg, core, features, labels, ml, ref) and provides row estimates and column types for reproducibility. The build contains 11,521 canonical drugs (core.drug), 1,073,256 canonical drug pairs (features.pair_features_all), and 50,340 pairs that satisfy the strict floors (a_raw≥3, pair≥10, PT ≥ 10) in the strict FAERS rollup table.

Canonical identifiers

Every drug carries a standardized identifier, IK14 (the first 14 characters of the InChIKey), which stabilizes joins across sources and enables leakage-aware evaluation splits. Pairs are stored under unordered keys (A < B) to eliminate duplicates and ordering ambiguity.
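The key construction can be sketched in a few lines of Python. This is a minimal illustration, not the release code: the function names are hypothetical, and the '::' separator and LEAST/GREATEST ordering are taken from the pair_id definition in Table 1. Plain string comparison is assumed to agree with the database ordering, which holds for upper-case ASCII InChIKey stems under a byte-wise collation.

```python
def ik14(inchikey: str) -> str:
    """Return the 14-character InChIKey stem (IK14) of a full InChIKey."""
    key = inchikey.strip().upper()
    if len(key) < 14:
        raise ValueError(f"InChIKey too short: {inchikey!r}")
    return key[:14]


def pair_id(drug_a: str, drug_b: str) -> str:
    """Build an order-independent pair key: smaller IK14, '::', larger IK14."""
    first, second = sorted((ik14(drug_a), ik14(drug_b)))
    return f"{first}::{second}"


# Both argument orders give the same key, so (A, B) and (B, A) collapse to one row.
key_ab = pair_id("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "RZVAJINKPMORJF-UHFFFAOYSA-N")
key_ba = pair_id("RZVAJINKPMORJF-UHFFFAOYSA-N", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
```

Because the key is symmetric, deduplicating on pair_id is enough to remove mirrored pairs before any counting step.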

FAERS ingestion, normalization, and mapping quality control

FAERS DRUG quarterly and REAC quarterly records are broken out into normalized staging tables, with powerful management of typical encoding problems. Drug mentions are normalized (case folding, punctuation removal, and removal of dosage/form/route tokens) and matched to IK14 using token-based dictionaries and database indexes. In the verified build, the normalized FAERS name table contains 322,635 distinct normalized names, and the final FAERS name-to-drug mapping table contains 59,507 mapped names, which provides an auditable interface for mapping quality checks (e.g., frequency-ranked inspection of unmapped names).
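A rough sketch of the normalization and unmapped-name audit described above follows. The stop-token set, the dose pattern, and the name-to-IK14 dictionary are hypothetical placeholders; the dictionaries and rules used in the actual build are not reproduced here.

```python
import re
import string
from collections import Counter

# Hypothetical form/route tokens; the real build uses its own dictionaries.
FORM_ROUTE_TOKENS = {"tablet", "tablets", "capsule", "capsules",
                     "injection", "solution", "oral", "iv"}
# Hypothetical dose pattern: a number followed by a common unit.
DOSE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|iu)\b")
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_drug_name(raw: str) -> str:
    """Case-fold, drop dose expressions and punctuation, remove form/route tokens."""
    text = DOSE_PATTERN.sub(" ", raw.casefold())
    text = text.translate(PUNCT_TO_SPACE)
    return " ".join(t for t in text.split() if t not in FORM_ROUTE_TOKENS)


def unmapped_by_frequency(mentions, name_to_ik14):
    """Rank normalized names that have no IK14 mapping by how often they occur."""
    counts = Counter(normalize_drug_name(m) for m in mentions)
    return [(name, n) for name, n in counts.most_common() if name not in name_to_ik14]


# Illustrative data only; "foozumab" is a made-up name.
mapping = {"aspirin": "BSYNRYMUTXBXSQ"}
mentions = ["ASPIRIN 81 MG TABLET (ORAL)", "Foozumab 10 mg", "foozumab", "Aspirin"]
audit = unmapped_by_frequency(mentions, mapping)
```

Sorting unmapped names by frequency puts the names that affect the most reports at the top of the review queue.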

Disproportionality statistics and strict vs loose regimes

We construct a 2 x 2 contingency table and calculate measures of disproportionality of each (drug A, drug B, PT). PRR is widely employed in signal generation of spontaneous-report data. ² ROR is as well a popular tool and application of the lower confidence bound will assist in minimizing the false positives based on the limited number of counts. ³ We apply a Haldane–Anscombe (+0.5) continuity correction before computing log-scale standard errors:

$$\mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)}, \qquad \mathrm{ROR} = \frac{a/b}{c/d},$$

$$\mathrm{SE}(\log \mathrm{ROR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}},$$

$$\mathrm{ROR95\_LCL} = \exp\bigl(\log(\mathrm{ROR}) - 1.96 \times \mathrm{SE}(\log \mathrm{ROR})\bigr).$$
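The formulas above can be sketched in Python as follows. The cell semantics are an assumption, since the text does not spell them out: a counts reports with the pair and the PT, b the pair without the PT, c the PT without the pair, and d neither. The function name is illustrative, and the +0.5 correction is applied to all four cells before any ratio is formed.

```python
import math


def disproportionality(a_raw, b_raw, c_raw, d_raw, correction=0.5):
    """PRR, ROR and the lower 95% bound of ROR from continuity-corrected cells.

    Assumed layout: a = pair & PT, b = pair & not PT,
                    c = not pair & PT, d = not pair & not PT.
    """
    a, b, c, d = (x + correction for x in (a_raw, b_raw, c_raw, d_raw))
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a / b) / (c / d)
    se_log_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ror95_lcl = math.exp(math.log(ror) - 1.96 * se_log_ror)
    return {"prr": prr, "ror": ror, "ror95_lcl": ror95_lcl}


# Illustrative counts only, not taken from the dataset.
signal = disproportionality(20, 80, 100, 9800)
```

The correction keeps every cell strictly positive, so the ratios and the log-scale standard error stay finite even when a raw cell is zero.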

The pipeline defines two evidence regimes: a coverage-oriented loose regime and a conservative strict regime. The strict regime imposes minimum-count floors (a_raw≥3, pair≥10, PT ≥ 10) to remove small-count instability and to make the retained signals easier to interpret.

Per-pair rollups and rationale pointer

Pair-level features are produced by rolling up eligible PT rows and retaining maxima and coverage counts:

• n_faers_reports: co-report count for the pair.
• faers_prr_max_strict: maximum PRR across PTs under strict floors.
• faers_ror95_lcl_max_strict: maximum lower 95% bound of ROR across PTs under strict floors.
• faers_pt_covered_strict: number of distinct PTs meeting strict floors.
• faers_best_pt_code_strict: PT code corresponding to the maximum strict ROR95_LCL signal.

The PT-code pointer enables audit-friendly interpretation without redistributing MedDRA PT text.
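A minimal sketch of the strict rollup is given below. The input row layout and the column names a_raw, pair_n and pt_n are assumptions made for illustration, as is the reading of the "pair" and "PT" floors as total report counts for the pair and for the PT. The PT codes in the example are made-up integers, not real MedDRA codes.

```python
from collections import defaultdict

# Strict floors from the text: a_raw >= 3, pair >= 10, PT >= 10.
STRICT_FLOORS = {"a_raw": 3, "pair_n": 10, "pt_n": 10}


def rollup_strict(rows):
    """Roll pair-PT rows up to one feature record per pair under strict floors."""
    eligible = defaultdict(list)
    for row in rows:
        if all(row[col] >= floor for col, floor in STRICT_FLOORS.items()):
            eligible[row["pair_id"]].append(row)

    features = {}
    for pid, kept in eligible.items():
        # The rationale pointer follows the strongest lower-bound signal.
        best = max(kept, key=lambda r: r["ror95_lcl"])
        features[pid] = {
            "n_faers_reports": kept[0]["pair_n"],
            "faers_prr_max_strict": max(r["prr"] for r in kept),
            "faers_ror95_lcl_max_strict": best["ror95_lcl"],
            "faers_pt_covered_strict": len({r["pt_code"] for r in kept}),
            "faers_best_pt_code_strict": best["pt_code"],
        }
    return features


# Illustrative rows only.
rows = [
    {"pair_id": "P1", "pt_code": 10000001, "a_raw": 5, "pair_n": 40, "pt_n": 200, "prr": 4.0, "ror95_lcl": 2.1},
    {"pair_id": "P1", "pt_code": 10000002, "a_raw": 3, "pair_n": 40, "pt_n": 15, "prr": 9.0, "ror95_lcl": 1.4},
    {"pair_id": "P1", "pt_code": 10000003, "a_raw": 2, "pair_n": 40, "pt_n": 500, "prr": 50.0, "ror95_lcl": 7.0},
    {"pair_id": "P2", "pt_code": 10000001, "a_raw": 8, "pair_n": 9, "pt_n": 200, "prr": 6.0, "ror95_lcl": 3.0},
]
features = rollup_strict(rows)
```

Note that the maximum PRR and the best PT code can come from different PT rows, because the pointer is tied to the ROR95_LCL maximum rather than to the PRR maximum.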

Public subset construction (50,000 rows)

RxPairEvid-50 K is deterministic stratified sample of strict FAERS rollup matrix. Strata are defined by (i) ROR95_LCL bins, (ii) PT coverage bins, and (iii) a coarse ATC grouping where one is available. Within each stratum, pairs are ordered deterministically by md5(pair_id), and a predefined quota is sampled to obtain 50,000 rows. This makes the selection reproducible in rebuilds that use the same strict matrix and strata definitions.
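One way to realize this kind of selection is sketched below. It assumes that the quota for a stratum is filled by taking the first rows in md5(pair_id) order; the text does not state the exact rule, and the stratum function, bin edge and quotas shown here are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_hex(pair_id: str) -> str:
    """Hex digest used as a stable, data-independent ordering key."""
    return hashlib.md5(pair_id.encode("utf-8")).hexdigest()


def sample_stratified(pairs, stratum_of, quotas):
    """Take up to quotas[stratum] pairs per stratum, in md5(pair_id) order."""
    by_stratum = defaultdict(list)
    for pair in pairs:
        by_stratum[stratum_of(pair)].append(pair)

    selected = []
    for stratum in sorted(by_stratum):
        ordered = sorted(by_stratum[stratum], key=lambda p: md5_hex(p["pair_id"]))
        selected.extend(ordered[: quotas.get(stratum, 0)])
    return selected


# Illustrative data: ten pairs, two strata split at a hypothetical bin edge.
pairs = [{"pair_id": f"A{i:02d}::B{i:02d}", "lcl": float(i)} for i in range(10)]
stratum_of = lambda p: "lcl_high" if p["lcl"] >= 5 else "lcl_low"
quotas = {"lcl_high": 2, "lcl_low": 3}

subset = sample_stratified(pairs, stratum_of, quotas)
subset_reordered = sample_stratified(list(reversed(pairs)), stratum_of, quotas)
```

Because the ordering key depends only on pair_id, the same rows are selected regardless of input order, as long as the strict matrix, the strata and the quotas are unchanged.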

Data records

The dataset is deposited on Mendeley Data ¹⁸ as flat files. Table 1 lists the main fields of the primary CSV (ddi_pairs_50k.csv), and Table 2 lists the files included in the Mendeley record. Figure 1 gives a high-level overview of RxPairEvid-50 K generation and the release bundle, Figure 2 details the FAERS processing and strict-regime roll-up steps that precede sampling, and Figure 3 summarizes the internal database integration and the optional attachment points for licensed resources.

Table 1. Main fields in ddi_pairs_50k.csv (RxPairEvid-50 K).

Field	Description
drug_a_ik14, drug_b_ik14	Canonical drug identifiers using 14-character InChIKey stems (IK14).
a_name, b_name	Preferred drug names for display.
pair_id	Stable unordered pair key = LEAST (IK14_A, IK14_B) + ‘::’ + GREATEST (IK14_A, IK14_B).
n_faers_reports	FAERS co-report count for the (A,B) pair.
faers_prr_max_strict	Maximum PRR across PTs under strict floors.
faers_ror95_lcl_max_strict	Maximum lower 95% CI bound of ROR across PTs under strict floors.
faers_pt_covered_strict	Number of distinct PT codes meeting strict floors for the pair.
faers_best_pt_code_strict	PT code corresponding to the strongest strict lower-bound signal.

Table 2. Files deposited in the RxPairEvid Mendeley Data record.

File	Purpose
ddi_pairs_50k.csv	Primary dataset: 50,000 drug–drug pairs with FAERS-derived strict rollups and PT-code rationale pointer.
schema.sql	PostgreSQL DDL (structure-only) to recreate tables and load the CSV consistently.
codebook.md	Field-level definitions and data types.
provenance.md	Pipeline notes including FAERS window (2018 onward), mapping rules, and audit guidance.
audit_subset_signal_quantiles.csv	Quantile summaries of key signal fields for validation and reporting.
audit_subset_strata_counts.csv	Counts per sampling stratum used to form the deterministic 50 K subset.
checksums.txt	SHA-256 checksums for integrity verification.

Data validation

Validation is more about auditability and integrity, and not about predictive performance. The release includes the sampling strata counts, signal quantile summaries, and SHA-256 checksums, which support both integrity verification and rapid sanity checks. In addition, the strict count floors and confidence-bound screening reduce small-denominator instability, and the rationale pointers support manual inspection of high-signal pairs.
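An integrity check against the deposited checksums could look like the sketch below. It assumes that checksums.txt uses the common sha256sum layout of one "hex digest, whitespace, file name" entry per line; the actual layout of the file in the record has not been confirmed here, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {file name: True/False} for every entry in a sha256sum-style file.

    Assumed line format: '<hex digest>  <file name>' (an optional leading '*'
    on the file name, as written by sha256sum in binary mode, is ignored).
    """
    checksum_path = Path(checksum_file)
    results = {}
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")
        results[name] = sha256_of(checksum_path.parent / name) == expected.lower()
    return results
```

A call such as verify_checksums("checksums.txt"), run from the directory holding the downloaded files, would then flag any file whose digest does not match.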

Reuse notes and limitations

Signals derived by FAERS are associative and prone to reporting bias, confounding, and stimulated reporting and should not be used as causal data of a DDI. We suggest RxPairEvid-50 K since it can serve as (i) a benchmarking dataset for signal-based modeling, (ii) an evidence layer for label construction in multi-evidence pipelines, and (iii) a transparent baseline for ablation analyses. For leakage-aware evaluation, split at the drug level (IK14-disjoint) before forming pair splits, and use PR-AUC as the primary metric under class imbalance.
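A drug-disjoint split can be sketched as follows. The hash-based fold assignment is one possible choice, not a procedure described in this note, and the function names and test fraction are illustrative. Pairs whose two drugs fall in different folds are set aside rather than assigned, which is what keeps the folds IK14-disjoint.

```python
import hashlib


def drug_fold(ik14: str, test_fraction: float = 0.2) -> str:
    """Assign a drug to 'train' or 'test' from a stable hash of its IK14."""
    bucket = int(hashlib.md5(ik14.encode("utf-8")).hexdigest(), 16) % 10_000
    return "test" if bucket / 10_000 < test_fraction else "train"


def split_pairs(pairs, test_fraction: float = 0.2):
    """Split (ik14_a, ik14_b) pairs so that no drug appears in both folds.

    Returns (train, test, dropped); dropped holds pairs that straddle folds.
    """
    train, test, dropped = [], [], []
    for a, b in pairs:
        folds = {drug_fold(a, test_fraction), drug_fold(b, test_fraction)}
        if folds == {"train"}:
            train.append((a, b))
        elif folds == {"test"}:
            test.append((a, b))
        else:
            dropped.append((a, b))
    return train, test, dropped


# Illustrative identifiers only (14 characters, not real InChIKey stems).
drugs = [f"DRUG{i:010d}" for i in range(30)]
pairs = [(a, b) for i, a in enumerate(drugs) for b in drugs[i + 1:]]
train, test, dropped = split_pairs(pairs)
```

The cost of this design is that straddling pairs are lost from both folds, so the effective test set is smaller than the nominal fraction suggests.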

Ethics and consent

Not applicable. RxPairEvid is derived from publicly available, de-identified secondary data sources and does not involve direct collection of human participant data.

Data availability

Underlying data

Mendeley Data: RxPairEvid. DOI: https://doi.org/10.17632/zrvzpfmzcz.1. ¹⁸

This project contains the following underlying data:

• audit_subset_signal_quantiles.csv
• audit_subset_strata_counts.csv
• checksums.txt
• codebook.md
• ddi_pairs_50k.csv
• provenance.md
• README.md
• schema.sql

Data is available under the terms of the Creative Commons Attribution 4.0 International licence.

Acknowledgements

We thank the clinicians and domain experts who advised on label design for internal tasks and the auditability of evidence fields.

References

1. FDA: FDA Adverse Event Reporting System (FAERS). Reference Source

2. Evans SJW, Waller, Davis: Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports. Pharmacoepidemiol Drug Saf. 2001;10(6):483–486. doi: 10.1002/pds.677

3. van Puijenbroek, Bate, Leufkens HGM: A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiol Drug Saf. 2002;11(1):3–10. PMID: 11998548. doi: 10.1002/pds.668

4. Brown, Wood: The medical dictionary for regulatory activities (MedDRA). Drug Saf. 1999;20(2):109–117. doi: 10.2165/00002018-199920020-00002

5. Knox, Wilson, Klinger: DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024;52(D1):D1265–D1275. PMID: 37953279. doi: 10.1093/nar/gkad976. PMCID: PMC10767804

6. Kanehisa, Furumichi, Sato: KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023;51(D1):D587–D592. PMID: 36300620. doi: 10.1093/nar/gkac963. PMCID: PMC9825424

7. wwPDB consortium: Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019;47(D1):D520–D528. doi: 10.1093/nar/gky949

8. Wang, Fang: The PDBbind database: methodologies and updates. J Med Chem. 2005;48(12):4111–4119. doi: 10.1021/jm048957q

9. Szklarczyk, Gable, Lyon: The STRING database in 2023: protein–protein association networks and functional enrichment analyses. Nucleic Acids Res. 2023;51(D1):D638–D646. doi: 10.1093/nar/gkac1000

10. Szklarczyk, Santos, von Mering: STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res. 2016;44(D1):D380–D384. PMID: 26590256. doi: 10.1093/nar/gkv1277. PMCID: PMC4702904

11. Kuhn, Letunic, Jensen: The SIDER database of drugs and side effects. Nucleic Acids Res. 2016;44(D1):D1075–D1079. PMID: 26481350. doi: 10.1093/nar/gkv1075. PMCID: PMC4702794

12. Subramanian, Narayan, Corsello: A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171(6):1437–1452.e17. PMID: 29195078. doi: 10.1016/j.cell.2017.10.049. PMCID: PMC5990023

13. Zdrazil, Felix, Hunter: The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024;52(D1):D1180–D1192. PMID: 37933841. doi: 10.1093/nar/gkad1004. PMCID: PMC10767899

14. Kim, Chen, Cheng: PubChem 2023 update. Nucleic Acids Res. 2023;51(D1):D1373–D1380. PMID: 36305812. doi: 10.1093/nar/gkac956. PMCID: PMC9825602

15. Liu, Lin, Wen: BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 2007;35:D198–D201. PMID: 17145705. doi: 10.1093/nar/gkl999. PMCID: PMC1751547

16. Nelson, Zeng, Kilbourne: Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform Assoc. 2011;18(4):441–448. PMID: 21515544. doi: 10.1136/amiajnl-2011-000116. PMCID: PMC3128404

17. Tatonetti, Daneshjou: Data-driven prediction of drug effects and interactions. Sci Transl Med. 2012;4(125). doi: 10.1126/scitranslmed.3003377

18. Hashir, Asfand-e-Yar, Algarni: RxPairEvid. Mendeley Data. doi: 10.17632/zrvzpfmzcz.1

10.5256/f1000research.197294.r482736

Reviewer response for version 1

Munir Pirmohamed ¹ (Referee; https://orcid.org/0000-0002-7534-7266)

Matthew Bright ¹ (Co-referee)

¹ University of Liverpool, Liverpool, England, UK

Competing interests: No competing interests were disclosed.

4 6 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

reject

The authors have produced a tool (RxPairEvid) to assess DDIs using data from FAERS. Comments are provided below.

The paper is very hard to read and so it is difficult to see what they have done beyond some drug name standardisation and fairly standard data cleansing. They mention some sort of stratification scheme but this is very hazily described.

Page 3: ‘…but their reuse in machine learning is often limited by unstable drug identifiers and opaque, non-auditable label construction.’ It’s unclear what this means, and the reference given is just to the FAERS database so it’s unlikely to support the assertion.

Page 3: ‘For public dissemination we release a compact, information-rich 50,000-pair matrix focused on FAERS derived..’: Various data sources are mentioned but it is unclear how they are used to produce a valid DDI data set. It is unclear what ‘signal features’ are used in ‘FAERS signal features and rationale pointers, together with provenance and audit artifacts’.

Page 3: ‘FAERS DRUG quarterly and REAC quarterly records are broken out into normalized staging tables, with powerful management of typical encoding problems’: It’s not clear what this means – how were the problems ‘managed’. What is a ‘normalised staging table’?

Page 4: ‘We construct a 2 x 2 contingency table and calculate measures of disproportionality of each (drug A, drug B, PT). PRR is widely employed in signal generation of spontaneous-report data.2 ROR is as well a popular tool and application of the lower confidence bound will assist in minimizing the false positives based on the limited number of counts.’: PRR with just drug pair and PT count is inappropriate in the DDI setting – it will show a signal if one of the drugs causes the effect alone, or if both do but there is no actual interaction. It would be better to use specifically designed statistics such as the Omega statistic [Norén, G.N, et al, Statist. Med., 27: 3057-3070. (2008)] or INTSS [Almenoff, June S., et al, Pharmacoepidemiology and drug safety 12.6 (2003): 517-521]

Page 5: ‘RxPairEvid-50 K is deterministic stratified sample of strict FAERS rollup matrix.’: More detail is needed on how the ‘rollups’ were done. Why is the maximum PRR/ROR figure selected from groupings?

Page 5: ‘a predefined quota is sampled to obtain 50,000 rows’: How is this sampling done?

Page 6: ‘Validation is more about auditability and integrity, and not about predictive performance.’: If the claim is that this data is a genuine database of DDIs, then this statement needs further justification – they would need to show that they function as a good set of controls.

Page 7: ‘Signals derived by FAERS are associative and prone to reporting bias, confounding, and stimulated reporting and should not be used as causal data of a DDI. We suggest RxPairEvid-50 K since…’: It is certainly the case that PRR/ROR signals would be unreliable (see above). It is hard to see how the dataset could be useful in any of the applications discussed if it is not, or at least unlikely to be, a list of actual DDIs.

Are sufficient details of methods and materials provided to allow replication by others?

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Are the protocols appropriate and is the work technically sound?

Reviewer Expertise:

Pharmacovigilance, clinical pharmacology

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.