RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales

Qadeer Hashir; Muhammad Asfand-E-Yar; Asghar Ali Shah; Shabana Shoukat

doi:10.12688/f1000research.178856.1

Data Note

[version 1; peer review: 1 not approved]

Qadeer Hashir¹, Muhammad Asfand-E-Yar¹, Asghar Ali Shah², Shabana Shoukat³

PUBLISHED 04 May 2026

¹ Center of Excellence in Artificial Intelligence (CoE-AI), Department of Computer Science, Bahria University, Islamabad, 44000, Pakistan
² Department of Computer Science, Kateb University, Kabul, 1007, Afghanistan
³ Medical ICU, Holy Family Hospital, Rawalpindi Medical University, Rawalpindi, Punjab, 46000, Pakistan

Qadeer Hashir
Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Writing – Original Draft Preparation

Muhammad Asfand-E-Yar
Roles: Methodology, Project Administration, Resources, Supervision, Validation, Writing – Original Draft Preparation, Writing – Review & Editing

Asghar Ali Shah
Roles: Methodology, Resources, Validation, Writing – Review & Editing

Shabana Shoukat
Roles: Formal Analysis, Methodology, Software, Writing – Original Draft Preparation

Abstract

Background

Drug–drug interactions (DDIs) remain a major source of preventable harm, yet many computational DDI resources are hard to reproduce, difficult to audit, or constrained by redistribution licenses. We created RxPairEvid-50 K to provide a small, license-clean, model-ready matrix of canonical drug pairs with conservative pharmacovigilance signal summaries and a rationale pointer that supports human interpretation.

Methods

RxPairEvid is derived from the FDA Adverse Event Reporting System (FAERS) by resolving drug mentions to a stable 14-character InChIKey stem (IK14), enumerating co-medication pairs per case, joining outcomes at the MedDRA Preferred Term (PT) code level, and computing disproportionality statistics (PRR, ROR, and the continuity-corrected lower 95% confidence bound of ROR) for each pair–PT. Signals are rolled up to per-pair features under strict count floors (a_raw≥3, pair≥10, PT ≥ 10). RxPairEvid-50 K is a deterministic stratified sample from the strict matrix.

Conclusions

RxPairEvid-50 K contains 50,000 drug–drug pairs with stable identifiers, strict-regime FAERS signal features, PT-code rationale pointers, and audit artifacts. It is intended to support benchmarking, label construction, and exploratory modeling of interaction risk with transparent, reproducible processing steps.

Keywords

drug–drug interaction; pharmacovigilance; FAERS; MedDRA; disproportionality analysis; PRR; ROR; dataset

Corresponding author: Asghar Ali Shah

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Hashir Q et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Hashir Q, Asfand-E-Yar M, Ali Shah A and Shoukat S. RxPairEvid: an auditable, machine-learning-ready dataset of drug–drug pairs with pharmacovigilance signal features and MedDRA PT-code rationales [version 1; peer review: 1 not approved]. F1000Research 2026, 15:662 (https://doi.org/10.12688/f1000research.178856.1) First published: 04 May 2026, 15:662 (https://doi.org/10.12688/f1000research.178856.1) Latest published: 04 May 2026, 15:662 (https://doi.org/10.12688/f1000research.178856.1)

Introduction

Spontaneous reporting systems such as the FDA Adverse Event Reporting System (FAERS) can surface DDI-related safety signals at scale, but their reuse in machine learning is often limited by unstable drug identifiers and opaque, non-auditable label construction.¹ RxPairEvid-50 K addresses these gaps by combining stable canonical drug identifiers (IK14), conservative rollups of disproportionality statistics, and a per-pair PT-code rationale pointer that links each pair to its single most strongly supported adverse outcome, so that a human reviewer can inspect it quickly. Outcomes are represented at the Preferred Term (PT) level of the MedDRA terminology.⁴ A compact pipeline overview is shown in Figure 1, with further implementation detail in Figures 2–3.

Figure 1. Graphical overview of the RxPairEvid-50 K pipeline: FAERS (2018 onward) → ETL and IK14 harmonization → pair–PT contingency tables → continuity-corrected PRR/ROR and ROR95_LCL → strict per-pair roll-ups → stratified deterministic sampling → release bundle (CSV + schema + audits + codebook).

Figure 2. End-to-end processing workflow, comprising identifier mapping, canonical pair ordering, MedDRA PT joins, per pair–PT signal computation, loose/strict roll-ups, and release export.

Elements shown for TWOSIDES and DrugBank are optional layers in the authors’ internal build and are not redistributed in the public RxPairEvid-50 K release.

Figure 3. Internal PostgreSQL integration view and feature-layer attachment points (stg/core/features/labels/ref).

The public release packages ddi_pairs_50k.csv with schema.sql, audits, and a codebook; the other sources in the diagram are attachment points for users who obtain licensed datasets from their original providers.

The public release is deliberately license-clean. It redistributes only derived FAERS signal features and does not redistribute third-party resources whose terms may restrict redistribution (e.g., DrugBank, KEGG, PDBbind). The accompanying PostgreSQL schema DDL documents attachment points for readers who obtain those resources from original providers.^5–12

Methods

RxPairEvid-50 K is a curated subset of a larger internal PostgreSQL database (ddi) that integrates multiple evidence layers for DDI modeling, including chemical structure (DrugBank, ChEMBL, PubChem), pathway and pharmacology (KEGG), targets and network biology (STRING, STITCH), protein–ligand bioactivity (BindingDB, PDB/PDBbind), transcriptomics (LINCS), curated adverse reactions (SIDER), and TWOSIDES signals, alongside FAERS pharmacovigilance.^5–17 For public dissemination we release a compact, information-rich 50,000-pair matrix focused on FAERS-derived signal features and rationale pointers, together with provenance and audit artifacts.

Data sources

Redistributed source and derived outputs:

• FAERS quarterly files (DRUG/REAC/DEMO) from 2018 onward (time window specified in provenance.md).¹

Referenced as optional attachment points (not redistributed in RxPairEvid-50 K):

• MedDRA (PT text not redistributed; PT codes only).⁴
• DrugBank interaction knowledge and drug properties.⁵
• KEGG pathway annotations.⁶
• PDB and PDBbind structural/binding evidence.^7,8
• STRING and STITCH network context.^9,10
• SIDER adverse-effect associations.¹¹
• LINCS L1000 transcriptomic profiles.¹²
• ChEMBL, PubChem, BindingDB for chemistry/bioactivity crosswalks.^13–15
• RxNorm/ATC for normalization/stratification where available.¹⁶
• TWOSIDES as an external signal set for ablations in the internal build.¹⁷

Processing environment and reproducible schema

The verified internal build ran on PostgreSQL 14.19 (Ubuntu) with a psql 18.0 client. The exported schema report records the database layout (schemas stg, core, features, labels, ml, ref) and provides row estimates and column types for reproducibility. This build contains 11,521 canonical drugs (core.drug), 1,073,256 canonical drug pairs (features.pair_features_all), and 50,340 pairs that satisfy the strict floors (a_raw ≥ 3, pair ≥ 10, PT ≥ 10) in the strict FAERS rollup table.

Canonical identifiers

Every drug has a standardized identifier (IK14, the first 14 characters of the InChIKey), which stabilizes joins across sources and enables leakage-aware evaluation splits. Pairs are stored under unordered keys (A < B) to eliminate duplicates and ambiguity in ordering.
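The identifier and pair-key rules above can be illustrated with a minimal Python sketch. The function names `ik14` and `pair_id` are hypothetical, and the key layout follows the `pair_id` definition given in Table 1 (smaller stem, `::`, larger stem). Rejecting a pair of identical stems is an assumption implied by A < B, not a rule stated by the authors.

```python
def ik14(inchikey: str) -> str:
    """Return the 14-character InChIKey stem (hypothetical helper)."""
    stem = inchikey.strip().upper()[:14]
    # The first InChIKey block consists of 14 uppercase letters.
    if len(stem) != 14 or not stem.isalpha():
        raise ValueError(f"not a valid InChIKey stem: {inchikey!r}")
    return stem


def pair_id(inchikey_a: str, inchikey_b: str) -> str:
    """Build an unordered pair key: smaller stem + '::' + larger stem."""
    a, b = ik14(inchikey_a), ik14(inchikey_b)
    if a == b:
        # Assumption: a pair must contain two distinct canonical drugs.
        raise ValueError("a pair needs two distinct drugs")
    lo, hi = sorted((a, b))
    return f"{lo}::{hi}"


# Two example InChIKeys; argument order does not change the key.
key_1 = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
key_2 = "HEFNNWSXXWATRW-UHFFFAOYSA-N"
print(pair_id(key_1, key_2))
print(pair_id(key_2, key_1))
```

Because InChIKey stems contain only uppercase ASCII letters, Python's string ordering should agree with a byte-wise `LEAST`/`GREATEST` comparison; a database collation with different rules could order some keys differently.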

FAERS ingestion, normalization, and mapping quality control

FAERS DRUG and REAC quarterly records are loaded into normalized staging tables, with handling of common encoding problems. Drug mentions are normalized (case folding, punctuation removal, and removal of dosage/form/route tokens) and mapped to IK14 using token-based dictionaries and database indexes. In the verified build, the normalized FAERS name table holds 322,635 distinct normalized names, and the final FAERS name-to-drug mapping table holds 59,507 mapped names. This mapping table provides an auditable basis for mapping quality checks (e.g., inspecting unmapped names ranked by frequency).
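A rough sketch of the normalization and unmapped-name audit described above is given below. It is illustrative only: the token list `NOISE_TOKENS`, the function names, and the toy dictionary are assumptions, and the authors' actual dictionaries and rules are not reproduced here.

```python
import re
import string
from collections import Counter

# Hypothetical dosage/form/route tokens; the real dictionaries are not published here.
NOISE_TOKENS = {"mg", "mcg", "ml", "tablet", "tablets", "capsule", "capsules",
                "injection", "oral", "iv", "solution"}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_name(raw: str) -> str:
    """Case-fold, strip punctuation, and drop dose/form/route tokens."""
    text = raw.casefold().translate(_PUNCT_TO_SPACE)
    tokens = [t for t in text.split()
              if t not in NOISE_TOKENS and not re.fullmatch(r"\d+[a-z]*", t)]
    return " ".join(tokens)


def unmapped_by_frequency(mentions, name_to_ik14):
    """Return (normalized name, count) for unmapped names, most frequent first."""
    counts = Counter(normalize_name(m) for m in mentions)
    return [(name, n) for name, n in counts.most_common()
            if name and name not in name_to_ik14]


# Toy example with a one-entry dictionary.
mentions = ["ASPIRIN 81 MG Tablet (oral)", "aspirin", "Mystery Drug 10mg",
            "mystery drug", "MYSTERY-DRUG", "Otherthing"]
mapping = {"aspirin": "BSYNRYMUTXBXSQ"}
print(unmapped_by_frequency(mentions, mapping))
```

Ranking unmapped names by frequency puts the names that affect the most reports at the top of the review queue, which is the kind of inspection the mapping table is meant to support.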

Disproportionality statistics and strict vs loose regimes

For each (drug A, drug B, PT) triple we construct a 2 × 2 contingency table and compute disproportionality measures. PRR is widely used for signal generation from spontaneous-report data.² ROR is also widely used, and applying its lower confidence bound helps reduce false positives that arise from small counts.³ We apply a Haldane–Anscombe (+0.5) continuity correction before computing log-scale standard errors:

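$$\mathrm{PRR} = \frac{a / (a + b)}{c / (c + d)}$$

$$\mathrm{ROR} = \frac{a / b}{c / d}$$

$$\mathrm{SE}(\log \mathrm{ROR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

$$\mathrm{ROR95\_LCL} = \exp\left(\log(\mathrm{ROR}) - 1.96 \times \mathrm{SE}(\log \mathrm{ROR})\right)$$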
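The formulas above can be checked numerically with a short Python sketch. It rests on assumptions that the text does not spell out: the cells are taken in the conventional layout (a: reports with the pair and the PT, b: pair without the PT, c: other reports with the PT, d: other reports without the PT), and the +0.5 correction is added to all four cells before every statistic. The names `Signal` and `disproportionality` are hypothetical.

```python
import math
from typing import NamedTuple


class Signal(NamedTuple):
    prr: float
    ror: float
    ror95_lcl: float


def disproportionality(a_raw: float, b_raw: float, c_raw: float, d_raw: float,
                       correction: float = 0.5) -> Signal:
    """PRR, ROR and the lower 95% bound of ROR from a 2 x 2 table."""
    # Assumption: the Haldane-Anscombe correction is added to every cell.
    a, b, c, d = (x + correction for x in (a_raw, b_raw, c_raw, d_raw))
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a / b) / (c / d)
    se_log_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ror95_lcl = math.exp(math.log(ror) - 1.96 * se_log_ror)
    return Signal(prr, ror, ror95_lcl)


# Illustrative counts only, not taken from the dataset.
signal = disproportionality(10, 90, 100, 9800)
print(signal)

# A zero cell stays finite because of the continuity correction.
print(disproportionality(3, 0, 5, 100))
```

The lower bound is always below the point estimate of ROR, so ranking by `ror95_lcl` penalizes triples with small counts more heavily than ranking by ROR itself.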

The pipeline defines two evidence regimes: a coverage-oriented loose regime and a conservative strict regime. The strict regime imposes minimum-count floors (a_raw ≥ 3, pair ≥ 10, PT ≥ 10) to reduce small-count instability and improve the interpretability of the retained signals.

Per-pair rollups and rationale pointer

Pair-level features are produced by rolling up eligible PT rows and retaining maxima and coverage counts:

• n_faers_reports: co-report count for the pair.
• faers_prr_max_strict: maximum PRR across PTs under strict floors.
• faers_ror95_lcl_max_strict: maximum lower 95% bound of ROR across PTs under strict floors.
• faers_pt_covered_strict: number of distinct PTs meeting strict floors.
• faers_best_pt_code_strict: PT code corresponding to the maximum strict ROR95_LCL signal.

The PT-code pointer enables audit-friendly interpretation without redistributing MedDRA PT text.
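The roll-up listed above can be sketched in plain Python. The input layout is an assumption: each row is a dict for one pair–PT combination with hypothetical keys `a_raw`, `pair_n` (reports for the pair), and `pt_n` (reports for the PT), alongside precomputed `prr` and `ror95_lcl`. Breaking ties on the smallest PT code is also an assumption, added only to keep the pointer deterministic.

```python
from collections import defaultdict

# Strict floors from the text: a_raw >= 3, pair >= 10, PT >= 10.
STRICT_FLOORS = {"a_raw": 3, "pair_n": 10, "pt_n": 10}


def rollup_strict(rows):
    """Roll pair-PT rows up to one feature dict per pair under strict floors."""
    by_pair = defaultdict(list)
    for row in rows:
        if all(row[key] >= floor for key, floor in STRICT_FLOORS.items()):
            by_pair[row["pair_id"]].append(row)

    features = {}
    for pid, kept in by_pair.items():
        # Highest lower bound wins; ties go to the smallest PT code (assumption).
        best = max(kept, key=lambda r: (r["ror95_lcl"], -r["pt_code"]))
        features[pid] = {
            "faers_prr_max_strict": max(r["prr"] for r in kept),
            "faers_ror95_lcl_max_strict": best["ror95_lcl"],
            "faers_pt_covered_strict": len({r["pt_code"] for r in kept}),
            "faers_best_pt_code_strict": best["pt_code"],
        }
    return features


# Illustrative rows only; PT codes and counts are made up.
rows = [
    {"pair_id": "A::B", "pt_code": 10001, "a_raw": 5, "pair_n": 40, "pt_n": 200,
     "prr": 2.0, "ror95_lcl": 1.2},
    {"pair_id": "A::B", "pt_code": 10002, "a_raw": 8, "pair_n": 40, "pt_n": 150,
     "prr": 3.5, "ror95_lcl": 1.1},
    {"pair_id": "A::B", "pt_code": 10003, "a_raw": 2, "pair_n": 40, "pt_n": 500,
     "prr": 9.0, "ror95_lcl": 4.0},   # dropped: a_raw below floor
    {"pair_id": "C::D", "pt_code": 10001, "a_raw": 4, "pair_n": 9, "pt_n": 200,
     "prr": 5.0, "ror95_lcl": 2.0},   # dropped: pair count below floor
]
print(rollup_strict(rows))
```

Note that the maximum PRR and the best PT code can come from different PTs, as in the example: the pointer follows the lower-bound signal, not the PRR.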

Public subset construction (50,000 rows)

RxPairEvid-50 K is a deterministic stratified sample of the strict FAERS rollup matrix. Strata are defined by (i) ROR95_LCL bins, (ii) PT coverage bins, and (iii) a coarse ATC grouping where one is available. Within each stratum, pairs are ordered deterministically by md5(pair_id), and a predefined quota is sampled so that the subset totals 50,000 rows. The selection can therefore be reproduced in rebuilds that use the same strict matrix and strata definitions.
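One way to read the sampling step is sketched below in Python. The bin edges, quotas, and the rule of taking the first rows of each stratum in md5 order are assumptions; the text states only that ordering uses md5(pair_id) and that quotas are predefined. The names `md5_hex` and `stratified_sample` are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_hex(pair_id: str) -> str:
    """Hex digest used as a stable, input-order-independent sort key."""
    return hashlib.md5(pair_id.encode("utf-8")).hexdigest()


def stratified_sample(rows, stratum_of, quotas):
    """Take the first quotas[stratum] rows of each stratum in md5 order."""
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)

    picked = []
    for key in sorted(strata):
        ordered = sorted(strata[key], key=lambda r: md5_hex(r["pair_id"]))
        # Assumption: a stratum without a quota contributes no rows.
        picked.extend(ordered[: quotas.get(key, 0)])
    return picked


# Toy matrix with a single made-up stratum variable.
rows = [{"pair_id": f"AAAAAAAAAAAAA{c}::BBBBBBBBBBBBB{c}",
         "lcl_bin": "hi" if i % 2 else "lo"}
        for i, c in enumerate("ABCDEFGH")]
quotas = {"hi": 2, "lo": 3}

sample = stratified_sample(rows, lambda r: r["lcl_bin"], quotas)
print([r["pair_id"] for r in sample])
```

Because the order depends only on `pair_id`, shuffling the input rows does not change which pairs are selected, which is the property that makes a rebuild reproducible.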

Data records

The dataset is deposited on Mendeley Data¹⁸ as flat files. Table 1 lists the main fields of the primary CSV (ddi_pairs_50k.csv), and Table 2 lists the files included in the Mendeley record. Figure 1 gives a high-level overview of RxPairEvid-50 K generation and the release bundle, Figure 2 details the FAERS processing and strict-regime roll-up steps that precede sampling, and Figure 3 summarizes the internal database integration and the optional attachment points for licensed resources.

Table 1. Main fields in ddi_pairs_50k.csv (RxPairEvid-50 K).

Field	Description
drug_a_ik14, drug_b_ik14	Canonical drug identifiers using 14-character InChIKey stems (IK14).
a_name, b_name	Preferred drug names for display.
pair_id	Stable unordered pair key = LEAST (IK14_A, IK14_B) + ‘::’ + GREATEST (IK14_A, IK14_B).
n_faers_reports	FAERS co-report count for the (A,B) pair.
faers_prr_max_strict	Maximum PRR across PTs under strict floors.
faers_ror95_lcl_max_strict	Maximum lower 95% CI bound of ROR across PTs under strict floors.
faers_pt_covered_strict	Number of distinct PT codes meeting strict floors for the pair.
faers_best_pt_code_strict	PT code corresponding to the strongest strict lower-bound signal.

Table 2. Files deposited in the RxPairEvid Mendeley Data record.

File	Purpose
ddi_pairs_50k.csv	Primary dataset: 50,000 drug–drug pairs with FAERS-derived strict rollups and PT-code rationale pointer.
schema.sql	PostgreSQL DDL (structure-only) to recreate tables and load the CSV consistently.
codebook.md	Field-level definitions and data types.
provenance.md	Pipeline notes including FAERS window (2018 onward), mapping rules, and audit guidance.
audit_subset_signal_quantiles.csv	Quantile summaries of key signal fields for validation and reporting.
audit_subset_strata_counts.csv	Counts per sampling stratum used to form the deterministic 50 K subset.
checksums.txt	SHA-256 checksums for integrity verification.

Data validation

Validation focuses on auditability and integrity rather than predictive performance. The release includes sampling strata counts, signal quantile summaries, and SHA-256 checksums to support integrity verification and rapid sanity checks. In addition, the strict count floors and confidence-bound screening reduce small-denominator instability, and the rationale pointers support manual inspection of high-signal pairs.
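An integrity check against the deposited checksums could look like the following Python sketch. The layout of checksums.txt is an assumption: each line is taken to hold a hex digest followed by a file name, as produced by the common `sha256sum` tool. The function names and the directory name in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(bundle_dir, manifest: str = "checksums.txt"):
    """Return the names of files whose digest differs from the manifest."""
    bundle = Path(bundle_dir)
    mismatches = []
    for line in (bundle / manifest).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed line layout: "<hex digest>  <file name>".
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")
        if sha256_of(bundle / name) != expected.lower():
            mismatches.append(name)
    return mismatches


# Usage, assuming the bundle was downloaded to ./rxpairevid:
# print(verify_checksums("rxpairevid"))  # an empty list means all files match
```

If the manifest uses a different layout, only the line-parsing step needs to change.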

Reuse notes and limitations

FAERS-derived signals are associative and subject to reporting bias, confounding, and stimulated reporting; they should not be interpreted as causal evidence of a DDI. We suggest using RxPairEvid-50 K as (i) a benchmarking dataset for signal-based modeling, (ii) an evidence layer for label construction in multi-evidence pipelines, and (iii) a transparent baseline for ablation analyses. For leakage-aware evaluation, split at the drug level (IK14-disjoint) before forming pair splits, and use PR-AUC as the primary metric under class imbalance.
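A drug-disjoint split of the kind recommended above can be sketched as follows. This is one possible design, not the authors' procedure: drugs are assigned to a partition by hashing their IK14, and pairs that straddle the two partitions are set aside rather than placed in either split. The function names and the 20% test fraction are assumptions.

```python
import hashlib


def drug_partition(ik14: str, test_fraction: float = 0.2) -> str:
    """Assign a drug to 'train' or 'test' from a hash of its identifier."""
    bucket = int(hashlib.sha256(ik14.encode("utf-8")).hexdigest(), 16) % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"


def split_pairs(pairs, test_fraction: float = 0.2):
    """Split (drug_a, drug_b) pairs so train and test share no drug."""
    train, test, dropped = [], [], []
    for a, b in pairs:
        part_a = drug_partition(a, test_fraction)
        part_b = drug_partition(b, test_fraction)
        if part_a == part_b == "train":
            train.append((a, b))
        elif part_a == part_b == "test":
            test.append((a, b))
        else:
            # Pairs mixing a train drug and a test drug would leak information.
            dropped.append((a, b))
    return train, test, dropped


# Toy identifiers standing in for IK14 stems.
toy_pairs = [("DRUGAAAAAAAAAA", "DRUGBBBBBBBBBB"),
             ("DRUGAAAAAAAAAA", "DRUGCCCCCCCCCC"),
             ("DRUGBBBBBBBBBB", "DRUGCCCCCCCCCC")]
train, test, dropped = split_pairs(toy_pairs)
print(len(train), len(test), len(dropped))
```

Dropping mixed pairs shrinks the usable data, so the test fraction may need tuning; the benefit is that no drug seen in training appears in the test pairs.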

Ethics and consent

Not applicable. RxPairEvid is derived from publicly available, de-identified secondary data sources and does not involve direct collection of human participant data.

Data availability

Underlying data

Mendeley Data: RxPairEvid. https://doi.org/10.17632/zrvzpfmzcz.1.¹⁸

This project contains the following underlying data:

• audit_subset_signal_quantiles.csv
• audit_subset_strata_counts.csv
• checksums.txt
• codebook.md
• ddi_pairs_50k.csv
• provenance.md
• README.md
• schema.sql

Data is available under the terms of the Creative Commons Attribution 4.0 International licence.

Acknowledgements

We thank the clinicians and domain experts who advised on label design for internal tasks and the auditability of evidence fields.

References

1. FDA: FDA Adverse Event Reporting System (FAERS).
2. Evans SJW, Waller PC, Davis S: Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports. Pharmacoepidemiol Drug Saf. 2001; 10(6): 483–486.
3. van Puijenbroek EP, Bate A, Leufkens HGM, et al.: A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiol Drug Saf. 2002; 11(1): 3–10.
4. Brown EG, Wood L, Wood S: The medical dictionary for regulatory activities (MedDRA). Drug Saf. 1999; 20(2): 109–117.
5. Knox C, Wilson M, Klinger CM, et al.: DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024; 52(D1): D1265–D1275.
6. Kanehisa M, Furumichi M, Sato Y, et al.: KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023; 51(D1): D587–D592.
7. wwPDB consortium: Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019; 47(D1): D520–D528.
8. Wang R, Fang X, Lu Y, et al.: The PDBbind database: methodologies and updates. J Med Chem. 2005; 48(12): 4111–4119.
9. Szklarczyk D, Gable AL, Lyon D, et al.: The STRING database in 2023: protein–protein association networks and functional enrichment analyses. Nucleic Acids Res. 2023; 51(D1): D638–D646.
10. Szklarczyk D, Santos A, von Mering C, et al.: STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res. 2016; 44(D1): D380–D384.
11. Kuhn M, Letunic I, Jensen LJ, et al.: The SIDER database of drugs and side effects. Nucleic Acids Res. 2016; 44(D1): D1075–D1079.
12. Subramanian A, Narayan R, Corsello SM, et al.: A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017; 171(6): 1437–1452.e17.
13. Zdrazil B, Felix E, Hunter F, et al.: The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 2024; 52(D1): D1180–D1192.
14. Kim S, Chen J, Cheng T, et al.: PubChem 2023 update. Nucleic Acids Res. 2023; 51(D1): D1373–D1380.
15. Liu T, Lin Y, Wen X, et al.: BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 2007; 35: D198–D201.
16. Nelson SJ, Zeng K, Kilbourne J, et al.: Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform Assoc. 2011; 18(4): 441–448.
17. Tatonetti NP, Ye PP, Daneshjou R, et al.: Data-driven prediction of drug effects and interactions. Sci Transl Med. 2012; 4(125).
18. Hashir Q, Asfand-e-Yar M, Algarni F, et al.: RxPairEvid. Mendeley Data.

Open Peer Review

Reviewer Report 04 Jun 2026

Munir Pirmohamed, University of Liverpool, Liverpool, England, UK

Matthew Bright, University of Liverpool Faculty of Science and Engineering, Liverpool, England, UK

Not Approved

https://doi.org/10.5256/f1000research.197294.r482736

The authors have produced a tool (RxPairEvid) to assess DDIs using data from FAERS. Comments are provided below.
The paper is very hard to read and so it is difficult to see what they have done beyond some drug name standardisation and fairly standard data cleansing. They mention some sort of stratification scheme but this is very hazily described.

Page 3: ‘…but their reuse in machine learning is often limited by unstable drug identifiers and opaque, non-auditable label construction.’ It’s unclear what this means, and the reference given is just to the FAERS database so it’s unlikely to support the assertion.

Page 3: ‘For public dissemination we release a compact, information-rich 50,000-pair matrix focused on FAERS derived..”: Various data sources are mentioned but it is unclear how they are used to produce a valid DDI data set. It’s what ‘signal features’ are used in FAERS signal features and rationale pointers, together with provenance and audit artifacts

Page 3: ‘FAERS DRUG quarterly and REAC quarterly records are broken out into normalized staging tables, with powerful management of typical encoding problems’: It’s not clear what this means – how were the problems ‘managed’. What is a ‘normalised staging table’?

Page 4: ‘We construct a 2 x 2 contingency table and calculate measures of disproportionality of each (drug A, drug B, PT). PRR is widely employed in signal generation of spontaneous-report data.2 ROR is as well a popular tool and application of the lower confidence bound will assist in minimizing the false positives based on the limited number of counts.’: PRR with just drug pair and PT count is inappropriate in the DDI setting – it will show a signal if one of the drugs causes the effect alone, or if both do but there is no actual interaction. It would be better to use specifically designed statistics such as the Omega statistic [Norén, G.N, et al, Statist. Med., 27: 3057-3070. (2008)] or INTSS [Almenoff, June S., et al, Pharmacoepidemiology and drug safety 12.6 (2003): 517-521]

Page 5: ‘RxPairEvid-50 K is deterministic stratified sample of strict FAERS rollup matrix.: More detail is needed on how the ‘rollups’ were done. Why is the maximum PRR/ROR figure selected from groupings?

Page 5: ‘a predefined quota is sampled to obtain 50,000 rows’: How is this sampling done?

Page 6: ‘Validation is more about auditability and integrity, and not about predictive performance.’: If the claim is that this data is a genuine database of DDIs, then this statement needs further justification – they would need to show that they function as a good set of controls.

Page 7: ‘Signals derived by FAERS are associative and prone to reporting bias, confounding, and stimulated reporting and should not be used as causal data of a DDI. We suggest RxPairEvid-50 K since…’: It is certainly the case that PRR/ROR signals would be unreliable (see above). It is hard to see how the dataset could be useful in any of the applications discussed if it is not, or at least unlikely to be, a list of actual DDIs.

Is the rationale for creating the dataset(s) clearly described? Yes

Are the protocols appropriate and is the work technically sound? No

Are sufficient details of methods and materials provided to allow replication by others? No

Are the datasets clearly presented in a useable and accessible format? No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Pharmacovigilance, clinical pharmacology

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.
