Keywords
CVD, risk prediction model, R, QRISK3
This article is included in the RPackage gateway.
We appreciate the time and effort that reviewers spent on this paper. We edited the paper according to the comments and provided more references to clarify technical details.
Cardiovascular disease (CVD) was responsible for 17.9 million deaths in 2016, representing 31% of global deaths, and over 75% of these deaths occurred in low- and middle-income countries1. People at high risk of CVD need to be identified and treated early1. Risk prediction models, which use risk factors to calculate the probability that a patient will develop a disease, are often used to identify high-risk patients2. QRISK3 is the most popular risk prediction model for CVD developed in the UK. It calculates the risk of a patient developing CVD in the next 10 years and has been incorporated into electronic health record (EHR) systems in the UK to detect patients at high risk of CVD and help clinicians make treatment decisions3,4. NICE guidelines5 recommend that clinicians consider prescribing statins to patients with a QRISK3 risk over 10%. QRISK3 was developed from historical patients’ EHR data using a Cox proportional hazards model6 and has been well validated at the population level in terms of discrimination and calibration3,4,7.
Implementing QRISK3 in R would not only help researchers improve future risk prediction but also enable them to use QRISK3 scores to identify patients at certain risk levels, e.g. for clinical trial recruitment. There is also scope to improve these risk predictions: QRISK3 has been found to have uncertainty in individual risk prediction7,8 due to unmeasured heterogeneity between practices that the model does not capture. A follow-up study suggests that QRISK3 may need to include additional causal risk factors, as this uncertainty in individual risk prediction was not related to data quality or to variation in the association between disease and outcome9. QRISK3 can currently only be accessed through an online web calculator or specialised commercial software10, and its original algorithm was written in C, a low-level programming language more suited to software engineering than to data science11. R is one of the most popular statistical programming languages in data science because it is free and open source, computes quickly and has a well-supported community12. This paper explains how the QRISK3 algorithm was incorporated into R to ease research involving QRISK3, and how the package was developed and validated. The package aims to help researchers improve risk prediction models and better detect patients at high risk of CVD.
The original QRISK3 algorithm was written in C by ClinRisk under a GNU Lesser General Public License11. The published QRISK3 paper3 was used to understand the original algorithm and how the variables it uses correspond to the QRISK3 risk factors.
The QRISK3 algorithm was written independently in both R (3.4.2) and SAS (9.4)13 to mimic double programming, with the SAS implementation intended to validate the R package. An additional C program, which calls the original QRISK3 algorithm directly to calculate risk, was also written for validation. Two validation datasets (QRISK3_2017_test and QRISK3_2019_test) were then created and included in the R package. QRISK3_2017_test was created by manually recording the risk scores calculated by the original QRISK3 algorithm for a group of simulated patients. The simulated patients were generated by changing each risk factor in turn until changes in all QRISK3 risk factors were covered. For example, patient 1 in QRISK3_2017_test does not have any positive CVD risk factors, patient 2 is similar to patient 1 except that he has atrial fibrillation, patient 3 is similar to patient 2 except that he is on atypical antipsychotic medication rather than having atrial fibrillation, and so on until changes in all CVD predictors are covered. Each patient is therefore similar to the previous patient except for a change in one CVD predictor. QRISK3_2019_test was recorded from the original QRISK3 algorithm in the same way, using different value changes for each risk factor. Risk scores for the same simulated patients (QRISK3_2017_test and QRISK3_2019_test) were compared across the different versions of QRISK3 (the R package, the SAS program and the C function) for validation. The R package was built using the R CMD tool14 with the help of several online tutorials15–18.
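The one-factor-at-a-time idea behind the simulated patients can be sketched in R as follows. This is only an illustration under stated assumptions: the column names are borrowed from the test dataset shown in this article, the values and the subset of risk factors are arbitrary, and the code is not the procedure actually used to build QRISK3_2017_test.

```r
# Baseline patient with no positive binary risk factors (illustrative values).
baseline <- data.frame(
  age = 64, gender = 1,
  b_AF = 0, b_atypicalantipsy = 0, b_corticosteroids = 0,
  weight = 70, height = 180, sbp = 180
)

# Binary risk factors to switch on, one per simulated patient.
binary_factors <- c("b_AF", "b_atypicalantipsy", "b_corticosteroids")

# One row per factor: a copy of the baseline with that single factor set to 1.
one_at_a_time <- lapply(binary_factors, function(f) {
  patient <- baseline
  patient[[f]] <- 1
  patient
})

# Stack the baseline and the modified patients, then add a patient identifier.
simulated <- do.call(rbind, c(list(baseline), one_at_a_time))
rownames(simulated) <- NULL
simulated$ID <- seq_len(nrow(simulated))

print(simulated)
```

Each row after the first differs from the baseline in exactly one binary risk factor, so a change in the calculated score between rows can be attributed to that factor.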
The QRISK3 package can be installed directly from CRAN19 using install.packages("QRISK3") or from the GitHub repository20 using install_github("YanLiUK/QRISK3") (install_github is provided by the devtools and remotes packages). The package contains one function (QRISK3_2017), which calculates the risk of patients developing CVD in the next 10 years using the QRISK3 algorithm11, and the two datasets for testing.
The variables used by the QRISK3 package are summarised and compared with those of the original algorithm in Table 1. All variables have the same definitions as in the QRISK3 paper3, and most variables are coded as numeric variables in the same way as in the original algorithm. The coding of ethnicity and smoking differs from the original algorithm (written in C) because indexing starts from 0 in C but from 1 in R.
The two datasets QRISK3_2017_test and QRISK3_2019_test were used for validation. The risk scores calculated by this QRISK3 package, the original algorithm and the SAS version for the same group of patients were exactly the same. Applying the package to a large CPRD cohort21 of 3.6 million patients7 showed good discrimination (C statistic: 0.85) and calibration22, similar to the original QRISK33.
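A comparison of this kind can be sketched with a small base R helper. The function name compare_scores and its tolerance argument are hypothetical and are not part of the package; the example values are the first six scores printed in the worked example in this article, with the recomputed scores rounded to one decimal place to match the recorded ones.

```r
# Hypothetical helper: compare risk scores from two implementations
# calculated for the same patients, in the same order.
compare_scores <- function(reference, candidate, tolerance = 1e-6) {
  stopifnot(length(reference) == length(candidate))
  abs_diff <- abs(reference - candidate)
  data.frame(
    n = length(reference),
    n_mismatch = sum(abs_diff > tolerance),
    max_abs_diff = max(abs_diff)
  )
}

# Scores recorded to one decimal place, and recomputed scores rounded to match.
recorded   <- c(17.2, 17.9, 36.0, 21.6, 24.1, 17.2)
recomputed <- round(c(17.22985, 17.89260, 36.02081,
                      21.60346, 24.06195, 17.22985), 1)

comparison <- compare_scores(recorded, recomputed)
print(comparison)
```

A small tolerance is used rather than exact equality because scores that agree to the printed precision can still differ in their floating-point representation.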
The user should first extract a patient cohort with anonymous patient identifiers and CVD risk factors, coded in the same way as in QRISK3. Missing values in the dataset should be handled (e.g. by multiple imputation) before using this package23. The column names of the CVD risk factors (e.g. “age”) should then be passed correctly to the QRISK3_2017 function. The function returns the calculated risk scores as a dataset with three columns: the patient identifier, the calculated QRISK3 score and the calculated QRISK3 score rounded to one decimal place. It also reminds users to double check that their variable definitions match those of QRISK3. The package automatically checks that all variables are coded as numeric and that patient age is between 25 and 84; if not, an error message is returned (explained in Table 2).
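The two automatic checks can also be run by the user before calling the package. The helper below is a hypothetical sketch in base R, not the package's own implementation; the function name, its arguments and the example data are assumptions made for illustration.

```r
# Hypothetical pre-check mirroring the two conditions described above:
# every risk factor column is numeric, and age lies between 25 and 84.
check_qrisk3_input <- function(data, vars, age_col = "age") {
  problems <- character(0)

  # Collect the names of columns that are not numeric.
  non_numeric <- vars[!vapply(data[vars], is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    problems <- c(problems,
                  paste("Not numeric:", paste(non_numeric, collapse = ", ")))
  }

  # Flag ages outside the supported range.
  age <- data[[age_col]]
  if (is.numeric(age) && any(age < 25 | age > 84, na.rm = TRUE)) {
    problems <- c(problems, "Age outside 25-84")
  }

  problems
}

# Illustrative data: smoking status stored as text and one patient aged 90.
example <- data.frame(age = c(30, 90), smoke_cat = c("1", "2"), sbp = c(120, 140))
problems <- check_qrisk3_input(example, c("age", "smoke_cat", "sbp"))
print(problems)
```

Returning a character vector of problems, rather than stopping at the first one, lets the user fix all coding issues in a single pass.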
1. Set path and read data from CSV file
dataPath <- "yourPath"
dataName <- "yourDataName.csv"
setwd(dataPath)
myData <- read.csv(dataName, check.names=FALSE)
2. See the data structure and other information
#See data structure
str(myData)
## 'data.frame': 48 obs. of 26 variables:
## $ QRISK_C_algorithm_score : num 17.2 36 21.6 24.1 17.2 19.1 20.9 22.3 19.3 23.5 ...
## $ age : int 64 64 64 64 64 64 64 64 64 64 ...
## $ gender : num 1 1 1 1 1 1 1 1 1 1 ...
## $ b_AF : int 0 1 0 0 0 0 0 0 0 0 ...
## $ b_atypicalantipsy: int 0 0 1 0 0 0 0 0 0 0 ...
## $ b_corticosteroids: int 0 0 0 1 0 0 0 0 0 0 ...
## $ b_impotence2 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ b_migraine : int 0 0 0 0 0 1 0 0 0 0 ...
## $ b_ra : int 0 0 0 0 0 0 1 0 0 0 ...
## $ b_renal : int 0 0 0 0 0 0 0 1 0 0 ...
## $ b_semi : int 0 0 0 0 0 0 0 0 1 0 ...
## $ b_sle : int 0 0 0 0 0 0 0 0 0 1 ...
## $ b_treatedhyp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ b_type1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ b_type2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weight : int 70 70 70 70 70 70 70 70 70 70 ...
## $ height : int 180 180 180 180 180 180 180 180 180 180 ...
## $ ethrisk : int 2 2 2 2 2 2 2 2 2 2 ...
## $ fh_cvd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rati : int 4 4 4 4 4 4 4 4 4 4 ...
## $ sbp : int 180 180 180 180 180 180 180 180 180 180 ...
## $ sbps5 : int 20 20 20 20 20 20 20 20 20 20 ...
## $ smoke_cat : int 1 1 1 1 1 1 1 1 1 1 ...
## $ surv : int 10 10 10 10 10 10 10 10 10 10 ...
## $ town : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
#See missing value
# summary(myData)
#If there is any missing value
#please use methods (e.g. multiple imputation) to impute missing value
#Once there is no missing value
#Get all variable names in your data
# colnames(myData)
#Use help of this package to map your variable to QRISK3 variables
# ?QRISK3_2017
3. Call the QRISK3 function to calculate risk score
test_all_rst <- QRISK3_2017(data= myData, patid="ID", gender="gender", age="age",
                            atrial_fibrillation="b_AF",
                            atypical_antipsy="b_atypicalantipsy",
                            regular_steroid_tablets="b_corticosteroids",
                            erectile_disfunction="b_impotence2",
                            migraine="b_migraine",
                            rheumatoid_arthritis="b_ra",
                            chronic_kidney_disease="b_renal",
                            severe_mental_illness="b_semi",
                            systemic_lupus_erythematosis="b_sle",
                            blood_pressure_treatment="b_treatedhyp",
                            diabetes1="b_type1", diabetes2="b_type2",
                            weight="weight", height="height",
                            ethiniciy="ethrisk",
                            heart_attack_relative="fh_cvd",
                            cholesterol_HDL_ratio="rati",
                            systolic_blood_pressure="sbp",
                            std_systolic_blood_pressure="sbps5",
                            smoke="smoke_cat", townsend="town")
##
## This R package was based on open-sourced original QRISK3-2017 algorithm.
## <https://qrisk.org/three/src.php> Copyright 2017 ClinRisk Ltd.
##
## The risk score calculated from this R package can only be used for research purpose.
##
## Please refer to QRISK3 website for more information
## <https://qrisk.org/three/index.php>
##
## Important: Please double check whether your variables are coded the same as the QRISK3 calculator
##
## Height should have unit as (cm)
## Weight should have unit as (kg)
##
## Ethnicity should be coded as:
##    Ethnicity_category Ethnicity
## 1 White or not stated         1
## 2              Indian         2
## 3           Pakistani         3
## 4         Bangladeshi         4
## 5         Other Asian         5
## 6     Black Caribbean         6
##
## Smoke should be coded as:
##                Smoke_category Smoke
## 1                  non-smoker     1
## 2                   ex-smoker     2
## 3 light smoker (less than 10)     3
## 4  moderate smoker (10 to 19)     4
## 5   heavy smoker (20 or over)     5
##
## The head of result in all patients is:
##   ID QRISK3_2017 QRISK3_2017_1digit
## 1  1    17.22985               17.2
## 2  2    17.89260               17.9
## 3  3    36.02081               36.0
## 4  4    21.60346               21.6
## 5  5    24.06195               24.1
## 6  6    17.22985               17.2
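Once scores have been calculated, patients in a given risk range can be selected from the returned dataset, for example when screening for a study. The sketch below assumes a result with the column names shown in the output above and rebuilds its first six rows by hand so that it runs without the package; select_by_risk and the 20 to 30 range are hypothetical choices made for illustration.

```r
# First six rows of the result shown above, rebuilt by hand for illustration.
result <- data.frame(
  ID = 1:6,
  QRISK3_2017 = c(17.22985, 17.89260, 36.02081, 21.60346, 24.06195, 17.22985)
)

# Hypothetical helper: keep patients whose score is in [lower, upper).
select_by_risk <- function(result, lower, upper = Inf, score_col = "QRISK3_2017") {
  score <- result[[score_col]]
  result[score >= lower & score < upper, , drop = FALSE]
}

# Patients with a 10-year risk of at least 20% and below 30%.
selected <- select_by_risk(result, lower = 20, upper = 30)
print(selected)
```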
Users first need to structure their data file in the same way as the provided test datasets (e.g. QRISK3_2019_test), which contain patient identifiers and QRISK3 risk factors and mimic QRISK3’s training cohort3. The structured data file (statistical analysis dataset) should be set out so that each row (observation) represents one patient and each column represents one QRISK3 predictor. The exact definitions of all QRISK3 predictors can be found in Box 1 of the original QRISK3 paper3. Variables used by QRISK3 can be extracted from EHR databases such as CPRD24 or QResearch25. Code lists (Read codes) for the outcome variable (CVD) can be obtained from the supplementary materials of the QRISK3 paper3. Code lists for variables included in QRISK2 can be extracted from a previous study26. Code lists for the other variables, including anxiety, alcohol abuse, atypical antipsychotic medication, erectile dysfunction, HIV/AIDS, left ventricular hypertrophy, migraine and systemic lupus erythematosus, can be found from CPRD27 or the clinical codes website28. All CVD risk factors should be coded as numeric: binary variables should be coded as 0 or 1, and categorical variables such as smoking status should be coded in the same way as in this package. Any differences between users’ variables and the QRISK3 predictors (e.g. different criteria for defining smoking status) should be mentioned in users’ final reports. Once the analysis dataset has been extracted, it is recommended to compare its distribution with that of the QResearch cohort using their baseline table3,8. Missing values should be imputed with multiple imputation29. Finally, users should follow the above workflow and carefully match their variable names to the pre-defined QRISK3 predictors to calculate the risk score. The function will return a dataset with the patient identifier, the calculated score and the calculated score rounded to one decimal place.
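Before imputation, it helps to see which columns of the analysis dataset contain missing values. The helper below is a minimal base R sketch; the function name count_missing and the example data are hypothetical, and the imputation itself (e.g. multiple imputation) is left to the user's preferred method.

```r
# Hypothetical helper: number of missing values per column,
# reported only for columns that have at least one.
count_missing <- function(data) {
  n_missing <- colSums(is.na(data))
  n_missing[n_missing > 0]
}

# Illustrative data with gaps in two risk factors.
cohort <- data.frame(
  ID = 1:4,
  age = c(64, 50, 71, 45),
  sbp = c(180, NA, 150, NA),
  rati = c(4, 3.5, NA, 5)
)

missing_summary <- count_missing(cohort)
print(missing_summary)
```

An empty result indicates that the dataset has no missing values and can be passed to the scoring function as it is.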
This R package implements the QRISK3 algorithm in R, allowing researchers to calculate patients’ risk of CVD in the next 10 years. The R package was validated against the original algorithm and a SAS version. To our knowledge, it was also the first R implementation of the QRISK3 algorithm at the time of writing.
Although QRISK3 has already been published and released through its website, calculating QRISK3 risk scores is time consuming for researchers because the online calculator cannot be used as a service to obtain scores for a large cohort. This package bridges that gap by helping researchers apply the QRISK3 model to their own cohorts.
Although it is easy to use this R function to calculate a risk score, researchers should carefully check that their variables are coded in the same way as in the original QRISK3 cohort; otherwise, the calculated score might not be the correct risk for the patients in the cohort. For example, coding a patient who smokes with the smoking variable set to “1” would conflict with the definition used by the QRISK3 algorithm (in this R package, a smoking value of 1 means non-smoker). Since QRISK is updated annually, each spring, researchers who are interested in the latest version should refer to the QRISK website10.
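One way to avoid this kind of coding conflict is to recode smoking status explicitly from descriptive labels into the numeric codes that the package prints in its reminder message. The sketch below is hypothetical: the label strings and the function recode_smoking are assumptions made for illustration, and users would substitute the labels used in their own source data.

```r
# Numeric smoking codes as listed in the package's reminder message.
smoke_codes <- c(
  "non-smoker" = 1,
  "ex-smoker" = 2,
  "light smoker" = 3,     # less than 10
  "moderate smoker" = 4,  # 10 to 19
  "heavy smoker" = 5      # 20 or over
)

# Hypothetical helper: map text labels to codes and fail loudly on unknown labels.
recode_smoking <- function(labels) {
  codes <- unname(smoke_codes[labels])
  if (anyNA(codes)) {
    stop("Unrecognised smoking label: ",
         paste(unique(labels[is.na(codes)]), collapse = ", "))
  }
  codes
}

smoke_cat <- recode_smoking(c("heavy smoker", "non-smoker", "ex-smoker"))
print(smoke_cat)
```

Stopping on an unrecognised label, instead of silently returning a missing value, makes a mismatch between the user's definitions and the package's coding visible before any score is calculated.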
In conclusion, we developed this R package to allow researchers to obtain QRISK3 scores for large cohorts. This tool could help researchers to improve risk prediction modelling based on a currently used risk prediction model.
Package available from CRAN: https://cran.r-project.org/web/packages/QRISK3/index.html
Source code available from: https://github.com/YanLiUK/QRISK3
Archived source code as at time of publication: https://doi.org/10.5281/zenodo.3570682 (reference 27)
License: GPL-3
C source code, SAS version and QRISK3_2017_test and QRISK3_2019_test datasets used for validation available from: https://github.com/YanLiUK/QRISK3_valid
Archived C code, SAS version and test datasets as at time of publication: https://doi.org/10.5281/zenodo.3571304 (reference 28)
License: GPL-3
Version history: Version 1, 23 Dec 19; Version 2 (revision), 28 Feb 20; Version 3 (revision), 22 May 20.