Software Tool Article

R package “QRISK3”: an unofficial research purposed implementation of ClinRisk’s QRISK3 algorithm into R

[version 1; peer review: 1 not approved]
PUBLISHED 23 Dec 2019

This article is included in the RPackage gateway.

Abstract

Cardiovascular disease has been the leading cause of death for decades. Risk prediction models are used to identify high-risk patients; the most commonly used model in the UK is ClinRisk’s QRISK3. In this paper we describe the implementation of the QRISK3 algorithm as an R package. The package was validated against the open-source QRISK3 algorithm and a QRISK3 SAS program. We provide detailed examples of the use of the package, including assigning QRISK3 scores to a large cohort of patients. This R package could help the research community to better understand risk prediction scores and improve future risk prediction models. The package is available from CRAN: https://cran.r-project.org/web/packages/QRISK3/index.html.

Keywords

CVD, risk prediction model, R, QRISK3

Introduction

Cardiovascular disease (CVD) was responsible for 17.9 million deaths in 2016, which represents 31% of all global deaths, and over 75% of these deaths occurred in low/middle-income countries [1]. People who are at high risk of CVD need to be identified and treated early [1]. Risk prediction models, which use risk factors to calculate the probability that a patient will develop a disease, are often used to identify high-risk patients [2]. QRISK3 is the most popular risk prediction model for CVD developed in the UK. It calculates a patient's risk of developing CVD in the next 10 years and has been incorporated into electronic health record (EHR) systems in the UK to detect patients at high risk of CVD and to help clinicians make treatment decisions [3,4]. NICE guidelines recommend that clinicians consider prescribing statins to patients whose QRISK3 risk is over 10% [5]. QRISK3 was developed from historical patients’ EHR data using a Cox proportional hazards model [6] and has been well validated at population level in terms of discrimination and calibration [3,4,7].

Implementing QRISK3 in R would help researchers to improve future risk prediction and would also let them use QRISK3 scores to identify patients at particular risk levels, e.g. for clinical trial recruitment. There is also scope to improve these risk predictions: QRISK3 has been found to carry uncertainty in individual risk predictions [7,8] because of unmeasured heterogeneity between practices, which the model does not capture. A follow-up study suggests that QRISK3 may need to include additional causal risk factors, as this uncertainty in individual risk prediction was not related to data quality or to variation in the association between disease and outcome [9]. QRISK3 can currently be accessed only through an online web calculator or specialised commercial software [10], and its original algorithm was written in C, a low-level programming language better suited to software engineering than to data science [11]. R is among the most popular statistical programming languages in data science because it is free and open source, computes quickly and has a well-supported community [12]. This paper explains how the QRISK3 algorithm was incorporated into R to ease research involving QRISK3, and how the package was developed and validated. The package aims to help researchers improve risk prediction models and better detect patients at high risk of CVD.

Methods

Extraction of the QRISK3 algorithm

The original QRISK3 algorithm was written in C by ClinRisk under a GNU Lesser General Public License [11]. The previously published QRISK3 paper was used to understand the original algorithm and how the variables used in the algorithm correspond to the QRISK3 risk factors [3].
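As a rough illustration of the kind of calculation such an algorithm performs, a Cox-type risk score combines a baseline survival value at the chosen time horizon with a patient-specific linear predictor. The sketch below is not the QRISK3 code: the function name `cox_risk_percent`, the baseline survival value and the coefficients are made-up placeholders, used only to show the shape of the calculation.

```r
# Hypothetical sketch of a Cox-type risk calculation.
# None of the numbers below are QRISK3 coefficients.
cox_risk_percent <- function(linear_predictor, baseline_survival) {
  # Predicted risk (%) = 100 * (1 - S0(t) ^ exp(linear predictor))
  100 * (1 - baseline_survival ^ exp(linear_predictor))
}

# Made-up coefficients for two centred risk factors
beta <- c(age = 0.05, smoker = 0.60)

# Made-up patient: 10 years above the reference age, current smoker
x <- c(age = 10, smoker = 1)

lp <- sum(beta * x)  # linear predictor
risk <- cox_risk_percent(lp, baseline_survival = 0.95)
print(round(risk, 1))
```

A patient whose linear predictor is zero receives exactly the baseline risk, which is a quick sanity check for any implementation of this form.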

Development and validation of the QRISK3 R package

The QRISK3 algorithm was written independently in both R (3.4.2) and SAS (9.4) [13] to mimic double programming, with the plan of using the SAS implementation to validate the R package. An additional C program, which calls the original QRISK3 algorithm directly to calculate risk, was written for validation. Two validation datasets (QRISK3_2017_test and QRISK3_2019_test) were then created and included in the R package. QRISK3_2017_test was created by manually recording the QRISK3 risk score calculated by the original QRISK3 algorithm for a group of simulated patients. The simulated patients were generated by changing one risk factor at a time until all QRISK3 risk factors had been covered. For example, patient 1 in QRISK3_2017_test has no positive CVD risk factors; patient 2 is the same as patient 1 except that he has atrial fibrillation; patient 3 is the same as patient 2 except that he is on atypical antipsychotic medication instead of having atrial fibrillation; and so on until every CVD predictor has been changed. Each patient therefore differs from the previous patient in only one CVD predictor. QRISK3_2019_test was recorded in the same way from the original QRISK3 algorithm, using different values for each risk factor. For validation, risk scores for the same simulated patients (QRISK3_2017_test and QRISK3_2019_test) were compared across the QRISK3 R package, the QRISK3 SAS program and the QRISK3 C function. The R package was created using the R CMD tool [14] with the help of several online tutorials [15-18].
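The one-factor-at-a-time design described above can be sketched in a few lines of base R. This is an illustrative reconstruction rather than the authors' script: only four binary factors are shown, the reference values are arbitrary, and the column names are borrowed from the original algorithm's variable names.

```r
# Hypothetical sketch: build a simulated cohort in which each patient differs
# from a reference patient by exactly one binary risk factor.
binary_factors <- c("b_AF", "b_atypicalantipsy", "b_corticosteroids", "b_migraine")

# Reference patient: no positive binary risk factors (values are illustrative)
baseline <- data.frame(age = 64, sbp = 180, b_AF = 0, b_atypicalantipsy = 0,
                       b_corticosteroids = 0, b_migraine = 0)

# One copy of the reference patient per risk factor, with that factor switched on
variants <- lapply(binary_factors, function(f) {
  patient <- baseline
  patient[[f]] <- 1
  patient
})

cohort <- do.call(rbind, c(list(baseline), variants))
cohort$ID <- seq_len(nrow(cohort))
print(cohort)
```

Because every row differs from the reference in a single column, any disagreement between two implementations can be traced to the handling of one specific predictor.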

Implementation

The QRISK3 package can be installed directly from CRAN [19] using “install.packages("QRISK3")” or from the GitHub repository [20] using “devtools::install_github("YanLiUK/QRISK3")”. The package contains one function (QRISK3_2017), which calculates a patient's risk of developing CVD in the next 10 years using the QRISK3 algorithm [11], and the two datasets for testing.

The variables used by the QRISK3 package are summarised and compared with those of the original algorithm in Table 1. All variables have the same definition as in the QRISK3 paper [3], and most are coded as numeric variables in the same way as in the original algorithm. The coding of ethnicity and smoking differs from the original algorithm (written in C) because indexing starts from 0 in C but from 1 in R.

Table 1. Description of QRISK3 variables.

Parameters in QRISK3 R package | Meaning of variables | Variables in original algorithm
age | Age of the patient in years (e.g. 64 years old) | age
atrial_fibrillation | Atrial fibrillation? (0: No, 1: Yes) | b_AF
atypical_antipsy | On atypical antipsychotic medication? (0: No, 1: Yes) | b_atypicalantipsy
regular_steroid_tablets | On regular steroid tablets? (0: No, 1: Yes) | b_corticosteroids
erectile_disfunction | A diagnosis of or treatment for erectile disfunction? (0: No, 1: Yes) | b_impotence2 (only for men)
migraine | Do patients have migraines? (0: No, 1: Yes) | b_migraine
rheumatoid_arthritis | Rheumatoid arthritis? (0: No, 1: Yes) | b_ra
chronic_kidney_disease | Chronic kidney disease (stage 3, 4 or 5)? (0: No, 1: Yes) | b_renal
severe_mental_illness | Severe mental illness? (0: No, 1: Yes) | b_semi
systemic_lupus_erythematosis | Systemic lupus erythematosis (SLE)? (0: No, 1: Yes) | b_sle
blood_pressure_treatment | On blood pressure treatment? (0: No, 1: Yes) | b_treatedhyp
diabetes1 | Diabetes status: type 1? (0: No, 1: Yes) | b_type1
diabetes2 | Diabetes status: type 2? (0: No, 1: Yes) | b_type2
Weight (kg) | Weight | Not available
Height (cm) | Height | Not available
Weight (kg) / (Height (cm) / 100)^2 | Body mass index (BMI) | bmi
ethnicity | 1 White or not stated; 2 Indian; 3 Pakistani; 4 Bangladeshi; 5 Other Asian; 6 Black Caribbean; 7 Black African; 8 Chinese; 9 Other ethnic group | ethrisk: 0 not stated; 1 White; 2 Indian; 3 Pakistani; 4 Bangladeshi; 5 Other Asian; 6 Black Caribbean; 7 Black African; 8 Chinese; 9 Other ethnic group
heart_attack_relative | Angina or heart attack in a 1st degree relative < 60? (0: No, 1: Yes) | fh_cvd
cholesterol_HDL_ratio | Cholesterol/HDL ratio? (range from 1 to 11, e.g. 4) | rati
systolic_blood_pressure | Systolic blood pressure (mmHg, e.g. 180 mmHg) | sbp
std_systolic_blood_pressure | Standard deviation of at least two most recent systolic blood pressure readings (mmHg) | sbps5
smoke | 1 non-smoker; 2 ex-smoker; 3 light smoker (less than 10); 4 moderate smoker (10 to 19); 5 heavy smoker (20 or over) | smoke_cat: 0 non-smoker; 1 ex-smoker; 2 light smoker (less than 10); 3 moderate smoker (10 to 19); 4 heavy smoker (20 or over)
townsend | Townsend deprivation scores | town
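If a dataset happens to be coded with the original algorithm's 0-based categories, the mappings in Table 1 imply a simple recoding before the data are passed to the package. The helpers below are hypothetical (they are not part of the QRISK3 package) and only restate the table: smoking shifts by one, ethnicity merges "not stated" with "White", and BMI is weight in kg divided by squared height in metres.

```r
# Hypothetical helper functions; the mappings follow Table 1.
recode_smoke <- function(smoke_cat) {
  # Original algorithm: 0..4; R package: 1..5
  stopifnot(all(smoke_cat %in% 0:4))
  smoke_cat + 1
}

recode_ethnicity <- function(ethrisk) {
  # Original algorithm: 0 = not stated, 1 = White, 2..9 = other groups
  # R package: 1 = White or not stated, 2..9 unchanged
  stopifnot(all(ethrisk %in% 0:9))
  pmax(ethrisk, 1)
}

bmi_from_weight_height <- function(weight_kg, height_cm) {
  # BMI = weight (kg) / (height (cm) / 100)^2
  weight_kg / (height_cm / 100)^2
}

print(recode_smoke(c(0, 2, 4)))
print(recode_ethnicity(c(0, 1, 2, 9)))
print(round(bmi_from_weight_height(70, 180), 1))
```

The `stopifnot()` guards make the helpers fail loudly on values outside the documented ranges instead of silently producing an invalid category.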

Validation

The two datasets QRISK3_2017_test and QRISK3_2019_test were used for validation. The risk scores calculated by this QRISK3 package, the original algorithm and the SAS version for the same group of patients were identical. External validation of this QRISK3 package in a large CPRD cohort of 3.6 million patients, reported in a previous study [7], showed good discrimination (C statistic: 0.85) and calibration, similar to those reported in the original QRISK3 paper [3].
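A comparison of this kind can be sketched as follows, assuming the reference scores were recorded to one decimal place. The vectors are illustrative values, not the validation datasets, and the variable names are hypothetical.

```r
# Hypothetical scores for the same simulated patients from two implementations
scores_r <- c(17.22985, 17.89260, 36.02081)
scores_reference <- c(17.2, 17.9, 36.0)  # recorded to one decimal place

# Compare at the precision of the recorded reference, with a small tolerance
# to avoid spurious floating-point mismatches
agree <- abs(round(scores_r, 1) - scores_reference) < 1e-8

mismatched_patients <- which(!agree)
max_abs_difference <- max(abs(scores_r - scores_reference))

print(all(agree))
print(mismatched_patients)
print(max_abs_difference)
```

Reporting the indices of mismatched patients, rather than a single pass/fail flag, makes it easier to see which risk factor is handled differently when the cohort varies one factor at a time.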

Usage and features

The user should first extract a patient cohort with anonymous patient identifiers and CVD risk factors, coded in the same way as in QRISK3. Missing values in the dataset should be handled (e.g. by multiple imputation) before using this package. The column names of the CVD risk factors (e.g. “age”) should then be passed to the corresponding arguments of the QRISK3_2017 function. The function returns the calculated risk scores as a dataset with three columns: the patient identifier, the calculated QRISK3 score, and the calculated QRISK3 score rounded to one decimal place. It also reminds users to double check that their variables are defined in the same way as in QRISK3. The package automatically checks that all variables are coded as numeric, that no variable has missing values, and that all patients are aged between 25 and 84; if not, an error message is returned (explained in Table 2).

Table 2. Description of error messages in the QRISK3 R package.

Error message | Conditions | Explanation
“Variables including XXX, XXX must be coded as numeric (0/1) variable.” | When at least one variable in the dataset is not numeric | The QRISK3 algorithm needs numeric variables (0/1) to calculate risk
“Age of patients must be between 25 and 84.” | When at least one patient in the dataset is aged below 25 or above 84 | The QRISK3 algorithm was developed from a population aged between 25 and 84
“Variables including XXX, XXX has missing values.” | When at least one variable in the dataset has a missing value | Missing values must be handled before using this QRISK3 algorithm
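Users who want to catch these conditions before calling the package could run a pre-flight check along the following lines. This is a hypothetical helper written for illustration, not the package's internal code; the function name `check_qrisk3_input` and its messages are assumptions, and only the three conditions themselves come from Table 2.

```r
# Hypothetical pre-flight check mirroring the three conditions in Table 2.
check_qrisk3_input <- function(data, age_col = "age") {
  problems <- character(0)

  # Condition 1: every column must be numeric
  not_numeric <- names(data)[!vapply(data, is.numeric, logical(1))]
  if (length(not_numeric) > 0) {
    problems <- c(problems,
                  paste("Not numeric:", paste(not_numeric, collapse = ", ")))
  }

  # Condition 2: no column may contain missing values
  has_missing <- names(data)[vapply(data, anyNA, logical(1))]
  if (length(has_missing) > 0) {
    problems <- c(problems,
                  paste("Missing values in:", paste(has_missing, collapse = ", ")))
  }

  # Condition 3: age must lie between 25 and 84
  age <- data[[age_col]]
  if (is.numeric(age) && any(age < 25 | age > 84, na.rm = TRUE)) {
    problems <- c(problems, "Age outside 25 to 84")
  }

  problems
}

good <- data.frame(age = c(30, 64), b_AF = c(0, 1))
bad <- data.frame(age = c(20, NA), b_AF = c("no", "yes"), stringsAsFactors = FALSE)

print(check_qrisk3_input(good))
print(check_qrisk3_input(bad))
```

Collecting all problems into one character vector, instead of stopping at the first, lets a user fix the whole dataset in a single pass.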

Workflow

1. Set path and read data from CSV file

dataPath <- "yourPath"                          
dataName <- "yourDataName.csv"                  
                                                
setwd(dataPath)                                 
myData <- read.csv(dataName, check.names=FALSE) 

2. See the data structure and other information

#See data structure 
str(myData)         

## 'data.frame':    48 obs. of  26 variables:                                               
##  $ QRISK_C_algorithm_score  : num  17.2 36 21.6 24.1 17.2 19.1 20.9 22.3 19.3 23.5 ...   
##  $ age              : int  64 64 64 64 64 64 64 64 64 64 ...                             
##  $ gender           : num  1 1 1 1 1 1 1 1 1 1 ...                                       
##  $ b_AF             : int  0 1 0 0 0 0 0 0 0 0 ...                                       
##  $ b_atypicalantipsy: int  0 0 1 0 0 0 0 0 0 0 ...                                       
##  $ b_corticosteroids: int  0 0 0 1 0 0 0 0 0 0 ...                                       
##  $ b_impotence2     : int  0 0 0 0 1 0 0 0 0 0 ...                                       
##  $ b_migraine       : int  0 0 0 0 0 1 0 0 0 0 ...                                       
##  $ b_ra             : int  0 0 0 0 0 0 1 0 0 0 ...                                       
##  $ b_renal          : int  0 0 0 0 0 0 0 1 0 0 ...                                       
##  $ b_semi           : int  0 0 0 0 0 0 0 0 1 0 ...                                       
##  $ b_sle            : int  0 0 0 0 0 0 0 0 0 1 ...                                       
##  $ b_treatedhyp     : int  0 0 0 0 0 0 0 0 0 0 ...                                       
##  $ b_type1          : int  0 0 0 0 0 0 0 0 0 0 ...                                       
##  $ b_type2          : int  0 0 0 0 0 0 0 0 0 0 ...                                       
##  $ weight           : int  70 70 70 70 70 70 70 70 70 70 ...                             
##  $ height           : int  180 180 180 180 180 180 180 180 180 180 ...                   
##  $ ethrisk          : int  2 2 2 2 2 2 2 2 2 2 ...                                       
##  $ fh_cvd           : int  0 0 0 0 0 0 0 0 0 0 ...                                       
##  $ rati             : int  4 4 4 4 4 4 4 4 4 4 ...                                       
##  $ sbp              : int  180 180 180 180 180 180 180 180 180 180 ...                   
##  $ sbps5            : int  20 20 20 20 20 20 20 20 20 20 ...                             
##  $ smoke_cat        : int  1 1 1 1 1 1 1 1 1 1 ...                                       
##  $ surv             : int  10 10 10 10 10 10 10 10 10 10 ...                             
##  $ town             : int  0 0 0 0 0 0 0 0 0 0 ...                                       
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...                                      
                                                                                            
#See missing value                                                                          
# summary(myData)                                                                           
                                                                                            
#If there is any missing value                                                              
#please use methods (e.g. multiple imputation) to impute missing value                      
                                                                                            
#Once there is no missing value                                                             
#Get all variable names in your data                                                        
# colnames(myData)                                                                          
                                                                                            
#Use help of this package to map your variable to QRISK3 variables                          
# ?QRISK3_2017                                                                              

3. Call the QRISK3 function to calculate risk score

library(QRISK3)  # load the package so that QRISK3_2017() is available
test_all_rst <-  QRISK3_2017(data= myData, patid="ID", gender="gender",age="age",
atrial_fibrillation="b_AF", atypical_antipsy="b_atypicalantipsy",                                     
regular_steroid_tablets="b_corticosteroids", erectile_disfunction="b_impotence2",                     
migraine="b_migraine", rheumatoid_arthritis="b_ra",                                                   
chronic_kidney_disease="b_renal", severe_mental_illness="b_semi",                                     
systemic_lupus_erythematosis="b_sle",                                                                 
blood_pressure_treatment="b_treatedhyp", diabetes1="b_type1",                                         
diabetes2="b_type2", weight="weight", height="height",                                                
ethiniciy="ethrisk", heart_attack_relative="fh_cvd",                                                   
cholesterol_HDL_ratio="rati", systolic_blood_pressure="sbp",                                          
std_systolic_blood_pressure="sbps5", smoke="smoke_cat", townsend="town")                              
##                                                                                                    
## This R package was based on open-sourced original QRISK3-2017 algorithm.                           
## <https://qrisk.org/three/src.php> Copyright 2017 ClinRisk Ltd.                                     
##                                                                                                    
## The risk score calculated from this R package can only be used for  research purpose.              
##                                                                                                    
## Please refer to QRISK3 website for more information                                                
## <https://qrisk.org/three/index.php>                                                                
##                                                                                                    
## Important: Please double check whether your variables are coded the same as the QRISK3 calculator  
##                                                                                                    
## Height should have unit as (cm)                                                                    
## Weight should have unit as (kg)                                                                    
##                                                                                                    
## Ethnicity should be coded as:                                                                      
##    Ethnicity_category Ethnicity                                                                    
## 1 White or not stated         1                                                                   
## 2              Indian         2                                                                   
## 3           Pakistani         3                                                                   
## 4         Bangladeshi         4                                                                   
## 5         Other Asian         5                                                                   
## 6     Black Caribbean         6                                                                   
##                                                                                                    
## Smoke should be coded as:                                                                          
##                Smoke_category Smoke                                                                
## 1                  non-smoker     1                                                                
## 2                   ex-smoker     2                                                                
## 3 light smoker (less than 10)     3                                                                
## 4  moderate smoker (10 to 19)     4                                                                
## 5   heavy smoker (20 or over)     5                                                                
##                                                                                                    
## The head of result in all patients is:                                                             
##   ID QRISK3_2017 QRISK3_2017_1digit                                                                
## 1  1    17.22985               17.2                                                                
## 2  2    17.89260               17.9                                                                
## 3  3    36.02081               36.0                                                                
## 4  4    21.60346               21.6                                                                
## 5  5    24.06195               24.1                                                                
## 6  6    17.22985               17.2                                                                

Use case

Users first need to create a statistical analysis dataset similar to the provided test datasets (e.g. QRISK3_2019_test), containing a patient identifier and the QRISK3 risk factors and mimicking QRISK3’s training cohort [3]. The dataset should be structured so that each row (observation) represents one patient and each column represents one QRISK3 predictor. The exact definitions of all QRISK3 predictors are given in Box 1 of the original QRISK3 paper [3]. Variables used by QRISK3 can be extracted from EHR databases such as CPRD [21] or QResearch [22]. Code lists (Read codes) for the outcome variable (CVD) can be obtained from the supplementary materials of the QRISK3 paper [3]. Code lists for variables included in QRISK2 can be taken from a previous study [23]. Code lists for the other variables, including anxiety, alcohol abuse, atypical antipsychotic medication, erectile dysfunction, HIV/AIDS, left ventricular hypertrophy, migraine and systemic lupus erythematosus, can be found from CPRD [24] or the clinical codes website [25].

All CVD risk factors should be coded as numeric: binary variables as 0 or 1, and categorical variables such as smoking status with the same codes as this package. Any differences between users’ variables and the QRISK3 predictors (e.g. different criteria for defining smoking status) should be mentioned in the users’ final report. Once the analysis dataset has been extracted, it is recommended to compare its distribution with that of the QResearch cohort using the published baseline tables [3,26]. Missing values should be imputed with multiple imputation [27]. Finally, users should follow the workflow above and carefully match their variable names to the pre-defined QRISK3 predictors to calculate the risk score. The function returns a dataset with the patient identifier, the calculated score and the calculated score rounded to one decimal place.
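The comparison against a published baseline table can start from a simple summary of the analysis dataset. The sketch below uses a made-up five-patient cohort with one row per patient and one column per predictor; the column names follow the workflow above, while the values and the structure of `baseline_summary` are assumptions for illustration.

```r
# Hypothetical analysis dataset: one row per patient, one column per predictor
cohort <- data.frame(
  ID = 1:5,
  age = c(45, 52, 64, 71, 38),
  b_type2 = c(0, 1, 0, 1, 0),
  smoke = c(1, 2, 5, 1, 3)  # package coding: 1 non-smoker ... 5 heavy smoker
)

# Summary statistics of the kind usually reported in a baseline table
baseline_summary <- list(
  n = nrow(cohort),
  age_mean = mean(cohort$age),
  age_sd = sd(cohort$age),
  type2_diabetes_percent = 100 * mean(cohort$b_type2),
  smoke_percent = round(100 * prop.table(table(factor(cohort$smoke, levels = 1:5))), 1)
)

str(baseline_summary)
```

Fixing the factor levels to 1:5 keeps empty smoking categories visible in the summary, which makes miscoded or missing categories easier to spot.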

Discussion

This R package implements the QRISK3 algorithm in R, allowing researchers to calculate patients' risk of CVD in the next 10 years. The R package was validated against the original algorithm and a SAS version. It is also the first R implementation of the QRISK3 algorithm at the time of writing.

Although QRISK3 has already been published and released on its website, calculating QRISK3 risk scores is time consuming for researchers: the online calculator cannot be used as a service to obtain QRISK3 scores for a large cohort, and the original algorithm is written in C rather than in a well-established data science language such as R. This package bridges this gap. It allows researchers to obtain QRISK3 scores for large cohorts, which could help to improve the accuracy of QRISK3 and support applied tasks that require knowing CVD risk at patient level.

Although it is easy to use this R function to calculate a risk score, researchers should carefully check that their variables are coded in the same way as in the original QRISK3 cohort; otherwise the calculated score might not be the correct risk for the patient. For example, coding a smoker as “1” in the smoking variable would conflict with the definition used by this R package, in which a smoking value of 1 means non-smoker. Because QRISK is updated every spring, researchers who are interested in the latest version should refer to the QRISK website [10].

In conclusion, we developed this R package to allow researchers to obtain QRISK3 scores for large cohorts. It allows the research community to better understand and apply a currently used risk prediction model for CVD risk.

Data availability

Underlying data

Original QRISK3 algorithm: https://qrisk.org/three/src.php

Software availability

Package available from CRAN: https://cran.r-project.org/web/packages/QRISK3/index.html

Source code available from: https://github.com/YanLiUK/QRISK3

Archived source code as at time of publication: https://doi.org/10.5281/zenodo.3570682 [28]

License: GPL-3

C source code, SAS version and QRISK3_2017_test and QRISK3_2019_test datasets used for validation available from: https://github.com/YanLiUK/QRISK3_valid

Archived C code, SAS version and test datasets as at time of publication: https://doi.org/10.5281/zenodo.3571304 [29]

License: GPL-3

Li Y, Sperrin M and van Staa T. R package “QRISK3”: an unofficial research purposed implementation of ClinRisk’s QRISK3 algorithm into R [version 1; peer review: 1 not approved]. F1000Research 2019, 8:2139 (https://doi.org/10.12688/f1000research.21679.1)
Reviewer Report 12 Feb 2020
Mohieddin Jafari, University of Helsinki, Helsinki, Finland 
Ali Amiryousefi, University of Helsinki, Helsinki, Finland 
Not Approved
The authors provided an R package for the QRISK3 algorithm to predict the risk of cardiovascular disease. My main comment is related to the size of this package, which has only one simple function with two simple datasets. I am ... Continue reading
Jafari M and Amiryousefi A. Reviewer Report For: R package “QRISK3”: an unofficial research purposed implementation of ClinRisk’s QRISK3 algorithm into R [version 1; peer review: 1 not approved]. F1000Research 2019, 8:2139 (https://doi.org/10.5256/f1000research.23899.r59681)
  • Author Response 28 Feb 2020
    Yan Li, Health e-Research Centre, School of Health Sciences, Faculty of Biology, Medicine and Health, the University of Manchester, Manchester, UK
    28 Feb 2020
    Author Response
    We appreciate reviewers’ time and effort in review this package paper, we would like to make following reply to these comments.
     
    Reviewer’s comments:
    The authors provided an R package ... Continue reading