R package “QRISK3”: an unofficial research purposed implementation of ClinRisk’s QRISK3 algorithm into R

Cardiovascular disease has been the leading cause of death for decades. Risk prediction models are used to identify high risk patients; the most common model used in the UK is ClinRisk’s QRISK3. In this paper we describe the implementation of the QRISK3 algorithm into an R package. The package was successfully validated against the open sourced QRISK3 algorithm and a QRISK3 SAS program. We provide detailed examples of the use of the package, including assigning QRISK3 scores for a large cohort of patients. This R package could help the research community to improve future risk prediction models based on a currently used risk prediction model. The package is available from CRAN: https://cran.r-project.org/web/packages/QRISK3/index.html.


Introduction
Cardiovascular disease (CVD) was responsible for 17.9 million deaths in 2016, which represents 31% of overall global deaths, and over 75% of these deaths happened in low/middle-income countries 1 . People who are at high risk of CVD need to be identified and treated early 1 . Risk prediction models that use risk factors to calculate the probability of patients developing diseases are often used to identify high risk patients 2 . QRISK3 is the most popular risk prediction model for CVD developed in the UK. It calculates the risk of patients developing CVD in the next 10 years and has been incorporated into the electronic health records (EHRs) system in the UK in order to detect high risk CVD patients and help clinicians make treatment decisions 3,4 . NICE guidelines recommend that clinicians consider prescribing statins to patients with a QRISK3 risk over 10% 5 . QRISK3 was developed from historical patients' EHR data using a Cox proportional hazards model 6 and has been well validated at population level in terms of discrimination and calibration 3,4,7 .
The implementation of QRISK3 into R would not only help researchers to improve future risk prediction but also enable them to use QRISK3 scores to identify patients at certain risk levels, e.g. for clinical trial recruitment. There is also scope to improve these risk predictions; it has been found that QRISK3 has uncertainty on individual risk prediction 7,8 due to unmeasured heterogeneity between practices, which was not captured. A follow-up study suggests that QRISK3 may need to include additional causal risk factors, as this uncertainty on individual risk prediction was not related to data quality or to variation of association between disease and outcome 9 . The current QRISK3 can only be accessed through an online web calculator or specialised commercial software 10 , and its original algorithm was written in C, which is a low level programming language appealing to software engineering rather than data science 11 . R is the most popular statistical programming language in the data science field because it is free and open-source, with fast computing and a well-supported community 12 . This paper explains the incorporation of the QRISK3 algorithm into R for ease of research concerning QRISK3 and how the package was developed and validated. The package aims to help researchers to improve risk prediction models and better detect high risk CVD patients.

Methods
Extraction of the QRISK3 algorithm
The original QRISK3 algorithm was written in C by ClinRisk under a GNU Lesser General Public License 11 . Their previously published QRISK3 paper was used to understand the original algorithm and the associations between variables used in the original algorithm and risk factors of QRISK3 3 .

Development and validation of the QRISK3 R package
The QRISK3 algorithm was written in both R (3.4.2) and SAS (9.4) 13 independently, in order to mimic double programming, with a plan to use the SAS implementation to validate the R package. An additional C program, which could directly call the original QRISK3 algorithm to calculate risk, was written for validation. Two validation datasets (QRISK3_2017_test and QRISK3_2019_test) were then created and included in the R package. Dataset QRISK3_2017_test was created by manually recording the QRISK3 risk score calculated by the original QRISK3 algorithm for a group of simulated patients. The simulated patient groups were generated by changing each risk factor sequentially, covering the changes of all QRISK3 risk factors. For example, patient 1 in QRISK3_2017_test does not have any positive CVD risk factors, patient 2 is similar to patient 1 except that he has atrial fibrillation, patient 3 is similar to patient 2 except that he is on atypical antipsychotic medication rather than having atrial fibrillation, and so on until changes of all CVD predictors are covered. Therefore, each patient is similar to the previous patient except for the change of one CVD predictor. QRISK3_2019_test was the version recorded using the original QRISK3 algorithm with different value changes for each risk factor. Risk scores of the same simulated patient groups (QRISK3_2017_test and QRISK3_2019_test) were compared among different versions of QRISK3, including the QRISK3 R package, the QRISK3 SAS program and the QRISK3 C function, for validation. The R package was created using the R CMD tool 14 with several useful online tutorials 15-18 .
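The one-factor-at-a-time design described above can be sketched in base R. This is only an illustration under stated assumptions: the three column names and the helper `one_at_a_time` are hypothetical and are not taken from the package, and the real test datasets also vary continuous and categorical predictors.

```r
# Hypothetical binary risk factors, used only for illustration.
risk_factors <- c("atrial_fibrillation", "atypical_antipsy", "migraine")

# Baseline patient: no positive risk factors (all coded 0).
baseline <- as.data.frame(
  as.list(setNames(rep(0, length(risk_factors)), risk_factors))
)

# Build a cohort with the baseline patient first, followed by one patient
# per risk factor in which only that factor is switched on.
one_at_a_time <- function(baseline, factors) {
  rows <- lapply(factors, function(f) {
    patient <- baseline
    patient[[f]] <- 1
    patient
  })
  cohort <- do.call(rbind, c(list(baseline), rows))
  cohort$patid <- seq_len(nrow(cohort))
  rownames(cohort) <- NULL
  cohort
}

test_cohort <- one_at_a_time(baseline, risk_factors)
print(test_cohort)
```

Because each simulated patient differs from the baseline in a single predictor, a disagreement between two implementations on one row points directly at the predictor that was changed in that row.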

Implementation
The QRISK3 package can be installed directly from CRAN 19 using "install.packages("QRISK3")" or from the GitHub repository 20 with "install_github("YanLiUK/QRISK3")" (a function provided by the devtools package). The package contains one function (QRISK3_2017) to calculate the risk of patients developing CVD in the next 10 years using the QRISK3 algorithm 11 and the two datasets for testing.

Amendments from Version 2
We appreciate the time and effort that reviewers spent on this paper. We edited the paper according to the comments and provided more references to clarify technical details.

Variables used by the QRISK3 package were summarised and compared to the original algorithm in Table 1. All variables have the same definition as in the QRISK3 paper 3 , and most variables were coded as numeric variables similar to the original algorithm. The coding of ethnicity and smoking differs from the original algorithm (written in C), as indexing in C starts from 0 whereas indexing in R starts from 1.
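A minimal sketch of this index offset, assuming a five-level smoking variable: only the statement that code 1 means non-smoker in the R package comes from this paper; the remaining labels and the helper `c_to_r_code` are assumptions for illustration and should be checked against Table 1 and the package documentation.

```r
# Assumed category labels in R's 1-based order (only "non-smoker" = 1 is
# confirmed by the paper; the rest are illustrative).
smoke_labels <- c("non-smoker", "ex-smoker", "light smoker",
                  "moderate smoker", "heavy smoker")

# Convert a 0-based category code (C convention) to the matching
# 1-based code (R convention).
c_to_r_code <- function(code_c) code_c + 1L

smoke_c <- c(0L, 2L, 4L)          # codes as they would appear in C
smoke_r <- c_to_r_code(smoke_c)   # codes as expected in R
print(smoke_labels[smoke_r])
```

The same shift by one would apply to any categorical predictor that is used as an index into a vector of coefficients.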

Validation
The two datasets QRISK3_2017_test and QRISK3_2019_test were used for validation. Risk scores calculated by this QRISK3 package, the original algorithm and the SAS version on the same group of patients were exactly the same. Applying this QRISK3 package to a large CPRD cohort 21 with 3.6 million patients 7 showed good discrimination (C statistic: 0.85) and calibration 22 , similar to the original QRISK3 3 .
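The comparison between implementations can be sketched as follows. The scores below are made up for illustration and are not real QRISK3 output; the column names and the helper `mismatches` are hypothetical.

```r
# Made-up scores for four simulated patients (illustration only).
scores <- data.frame(
  patid = 1:4,
  r_pkg = c(1.2, 3.4, 2.8, 5.1),  # score from the R implementation
  sas   = c(1.2, 3.4, 2.8, 5.1),  # score from the SAS implementation
  c_ref = c(1.2, 3.4, 2.8, 5.1)   # score from the original C algorithm
)

# Return the identifiers of patients whose scores differ between two
# implementations by more than a small numerical tolerance.
mismatches <- function(df, a, b, tol = 1e-8) {
  df$patid[abs(df[[a]] - df[[b]]) > tol]
}

print(mismatches(scores, "r_pkg", "c_ref"))
print(mismatches(scores, "r_pkg", "sas"))
```

An empty result for every pair of implementations corresponds to the agreement reported above; a non-empty result lists the patients to inspect.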

Usage and features
A patient cohort with anonymous patient identifiers and CVD risk factors should first be extracted and coded similarly to QRISK3 by the user. Missing values in the dataset should be handled (e.g. multiple imputation) before using this package 23 . Column names of CVD risk factors (e.g. "age") should then be specified correctly to the QRISK3_2017 function. The function returns calculated risk scores in a dataset with three columns: the patient identifier, the calculated QRISK3 score and the calculated QRISK3 score with one digit. It also reminds users to double check whether the definition of their variables is the same as the definition used by QRISK3. The package also automatically checks whether all variables are coded as numeric and whether the age of patients is between 25 and 84; if not, an error message is returned (explained in Table 2).
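Users may want to run similar checks on their own data before calling the package. The following is a minimal sketch of such pre-checks, not the package's internal code; the function name `check_cohort` and the example columns are hypothetical.

```r
# Stop with an error if any column is non-numeric or if any age falls
# outside the 25 to 84 range; return TRUE invisibly otherwise.
check_cohort <- function(data, age_col = "age") {
  non_numeric <- names(data)[!vapply(data, is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    stop("Non-numeric columns: ", paste(non_numeric, collapse = ", "))
  }
  if (any(data[[age_col]] < 25 | data[[age_col]] > 84)) {
    stop("Age must be between 25 and 84")
  }
  invisible(TRUE)
}

cohort <- data.frame(patid = 1:3, age = c(30, 55, 84), smoke = c(1, 2, 5))
check_cohort(cohort)  # passes silently
```

A cohort containing a patient aged 90, or a smoking variable stored as text, would stop with an error under this sketch.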

Use case
Users first need to structure their data file similarly to the provided test dataset (e.g. QRISK3_2019_test), which contains patients' identifiers and QRISK3 risk factors and mimics QRISK3's training cohort 3 . The structured data file (statistical analysis dataset) should be set out so that each row (observation) represents one individual patient and each column represents one QRISK3 predictor. The exact definition of all QRISK3 predictors can be found in Box 1 of the original QRISK3 paper 3 . Variables used by QRISK3 can be extracted from EHR databases, such as CPRD 24 or QResearch 25 . Code lists (Read codes) for the outcome variable (CVD) can be obtained from the supplementary materials of the QRISK3 paper 3 . Code lists for variables included in QRISK2 can be extracted from a previous study 26 . Code lists for other variables, including anxiety, alcohol abuse, atypical anti-psychotic medication, erectile dysfunction, HIV/AIDS, left ventricular hypertrophy, migraine and systemic lupus erythematosus, can be found from CPRD 27 or the clinical codes website 28 . All CVD risk factors should be coded as numeric: binary variables should be coded as 0 or 1, and categorical variables such as smoking status should be coded in the same way as in this package. Any differences between users' variables and QRISK3 predictors (e.g. different criteria to define smoking status) should be mentioned in users' final report. Once the analysis dataset has been extracted, it is recommended to compare its distribution to that of QResearch's cohort using their baseline table 3,8 . Missing values should be imputed with multiple imputation 29 . Finally, users should follow the above workflow and carefully match their variable names to the pre-defined QRISK3 predictors to calculate the risk score. The function will return a dataset with the patient identifier, the calculated score and the calculated score with one digit.

Discussion
This R package successfully implements the QRISK3 algorithm in R, which allows researchers to calculate the CVD risk of patients in the next 10 years. The R package was validated against the original algorithm and a SAS version. This is also the first R implementation of the QRISK3 algorithm at the date of writing.
Though QRISK3 has already been published and released through the online website, it is time consuming for researchers to calculate QRISK3 risk scores, as the online calculator cannot be used as a service to obtain QRISK3 scores for a large cohort. This package bridges this gap by helping researchers to apply the QRISK3 model to their own cohort.
Although it is easy to use this R function to calculate a risk score, researchers should carefully check whether their variables are coded the same as in the original QRISK3 cohort; otherwise the calculated score might not be the correct risk of the patient in the cohort. For example, a patient who is a smoker and has the smoking variable coded as "1" would conflict with the definition of the QRISK3 algorithm ("smoking" equals 1 in this R package means non-smoker). Since QRISK is updated annually every spring, researchers who are interested in the latest work should refer to the QRISK website 10 .
In conclusion, we developed this R package to allow researchers to obtain QRISK3 scores for large cohorts. This tool could help researchers to improve risk prediction modelling based on a currently used risk prediction model.

Fadratul Hafinaz Hassan
School of Computer Sciences, Universiti Sains Malaysia, Pulau Pinang, Malaysia
The QRISK3 algorithm is openly available and already embedded in SAS. However, the authors didn't clearly state the rationale behind embedding the QRISK3 algorithm in another statistical software such as R.
The conclusion doesn't strongly show the significant performance of the QRISK3 algorithm in R. Also, there is no clear performance comparison between QRISK3 in SAS and QRISK3 in R.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Optimization and Machine Learning
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 12 Jan 2021
Yan Li, the University of Manchester, Manchester, UK
Many thanks for the reviewer's interest in our work. We would like to address the comments below.
To our knowledge, QRISK3 was not already embedded or available in SAS before this R package (if this is not the case, we would like to ask for a reference for any pre-released QRISK3-SAS version). We did provide a QRISK3 SAS version ( https://github.com/YanLiUK/QRISK3_valid/blob/master/SAS/QRISK3_valid.sas ) along with this R package, as a procedure to mimic double programming and verify the correctness of the R version of QRISK3 (the SAS program can also be used to calculate QRISK3 scores, after minor correct mapping of variables).
QRISK3 was initially provided only as C code on the official website, without a clear definition of variables and outputs. Researchers who do not know the C language (which, as mentioned in the paper, is a low-level programming language not specialised for data analysis compared to R) and who are unfamiliar with risk prediction models would find it hard to use QRISK3 in their study. R is currently one of the top languages, with well-established support for data structures and useful statistical packages for data analysis; this is the reason we provide this R version of QRISK3 to bridge this gap.
Since we provided both an R version and a SAS version of the QRISK3 algorithm, it is up to users to decide which one better suits their project. The QRISK3 algorithm itself is a rather simple calculation of a score from a polynomial function, so we find it of little interest to compare the performance of the SAS version and the R version, especially when they return the same results very quickly. The general difference between SAS and R is that R loads the data into Random Access Memory (RAM) for calculation, which means that for small or medium sized data (as long as the data fit into the RAM that R can access), R calculates the scores in one run at very high speed. SAS works with data on the hard disk drive (HDD), so it can cope with a large data size in one run (i.e. when the sample size is too large for RAM). However, when the sample size is larger than the accessible RAM, R users can cope with this issue by calculating the risk scores for the overall dataset in portions (e.g. run 10 times, each time for 1/10 of the patients). The performance also depends on which data structures users work with in SAS or R and how they program. Overall, we think that comparing the performance of the R version and the SAS version is beyond the scope of this paper, as users with knowledge of both languages could easily calculate both very quickly and would end up with the same calculated scores.
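The idea of scoring a cohort in portions can be sketched in base R as follows. This is an illustration under assumptions: `score_chunk` is a hypothetical stand-in for the call to the package's scoring function, and the example cohort is held in memory, whereas in practice each portion would be read from disk in turn.

```r
# Hypothetical stand-in for the real scoring call (illustration only).
score_chunk <- function(chunk) {
  data.frame(patid = chunk$patid, score = chunk$age / 10)
}

# Split the rows into n_chunks consecutive portions, score each portion
# separately and stack the results in the original row order.
score_in_chunks <- function(data, n_chunks = 10) {
  chunk_id <- cut(seq_len(nrow(data)), breaks = n_chunks, labels = FALSE)
  pieces <- lapply(split(data, chunk_id), score_chunk)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

cohort <- data.frame(patid = 1:100, age = rep(25:74, 2))
result <- score_in_chunks(cohort)
print(head(result))
```

Since each patient's score depends only on that patient's own predictors, scoring in portions should give the same result as scoring the whole cohort at once.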
We again are thankful for the time and effort that the reviewer has spent on this paper.
Although the authors addressed our previous comments we believe there are still some outstanding issues that were not adequately addressed.
The reply to the comment concerning development and validation of the QRISK3 R package states that "There are no sequential addition of predictors in two datasets." However, the use of a sequential design is stated in the paper: "The simulated patient groups were generated by changing each risk factor sequentially covering the changes of all QRISK3 risk factors." The readers would still benefit from an explanation of the reasoning behind the use of the created dataset.
QRISK for large cohort databases. The revised sentence is confusing. The original comment simply asked to remove "to better understand" as the package only calculates a number and does not provide an understanding of methodology or underlying pathology.
We still believe a larger package with more than one function is needed for a published R package. For this package, at this level the vignette associated with the package should suffice rather than a publication.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly

Ashwin Karanam
University of Minnesota, Minneapolis, MN, USA
This paper describes an R package for calculating QRISK3 scores in large datasets. The introduction reads well and provides justification for the paper although the package only describes one function. The methods section needs more detail and adjustments. Some of the items can be fixed by improving grammar and readability although others need clarification or details added. Please find our particular comments below.
1. Development and validation of the QRISK3 R package: The description of the simulation used to create the QRISK3_2017_test and QRISK3_2019_test datasets is unclear. It would be beneficial to explain the reasoning behind creating a dataset with sequential addition of risk factors. Clinically, it would be very unlikely to encounter such a dataset. A more translatable dataset would have been patients with randomly sampled risk factors rather than a sequential addition. Additionally, it could be beneficial for the authors to show a summary of the test datasets (number of individuals, demographic information etc.). The last sentence "with several useful online tutorials 15-18" should be deleted as these are not relevant to the manuscript.
2. Table 1: The ratio concerning cholesterol should read "Total cholesterol/HDL ration?" in order to clarify the input data for users.
3. For the Validation section CPRD needs to be defined. A description of the CPRD dataset along with the methodology and statistical terminology (e.g. discrimination, calibration, C-statistic etc.) should be clearly described in the methods. Additionally, the study that includes 3.6 million patients that was used is not described and there is no assurance of what the data were, how it was formatted, or how it performed. The authors emphasize the importance of how fields are entered and presented into the R package, therefore, a description of the validation set showing any issues and to elaborate on the results with the R package are needed in order to determine how well the package is performing. The last sentence needs to be adjusted as it is confusing and needs to read better in order to determine what the authors are trying to convey.
4. Usage and features: The authors state that missing values should be handled, but do not elaborate as to how they should be handled. What if a value is truly missing? If data handling is an issue with QRISK3 then that should be presented and how it will affect the QRISK3 score being calculated should be explained. A reference could be added to indicate how these should be handled.
5. Use case: The instructions for creation of a dataset are unclear. Does the user just need to structure their data file similar to the test dataset as opposed to creating a statistical analysis dataset?
6. Discussion, 2nd paragraph: Please rewrite the last half of the second paragraph. The reason for creation of this package was presented in the Introduction and the description in this section is not useful. The last portion of the second sentence needs to be deleted or rewritten.
7. Discussion, 3rd paragraph: The second to last sentence is confusing. It could be rewritten from "…who is a smoker is coded as "1" in the variable….." to "…who is a smoker and has the smoking variable coded as "1" would be in conflict……".
8. Discussion: Please delete the words "to better understand" as this package only calculates a number and does not enable exploration of understanding.

Overall observation: Although the package implements the QRISK3 algorithm in an R package, the ultimate goal as stated by the authors is for investigators to readily use the algorithm for large cohorts. Therefore, one would require working knowledge of R to use the package. One way to increase the ease of use would be to create a Shiny App using the algorithm written in R to make the program more accessible to clinicians and investigators. Additionally, there are no functions to create data/results visualizations. A package should be somewhat standalone in the sense that it provides all functions from data exploration to modeling. A single function usually does not merit a research article.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Yan Li, the University of Manchester, Manchester, UK

Reviewer comments:
This paper describes an R package for calculating QRISK3 scores in large datasets. The introduction reads well and provides justification for the paper although the package only describes one function. The methods section needs more detail and adjustments. Some of the items can be fixed by improving grammar and readability although others need clarification or details added. Please find our particular comments below.
Reply: We appreciate the time and effort that reviewers spent on this paper. We are grateful for these helpful comments and have improved the paper accordingly.

Reviewer comments:
Development and validation of the QRISK3 R package: The description of the simulation used to create the QRISK3_2017_test and QRISK3_2019_test datasets are unclear. It would be beneficial to explain the reasoning behind creating a dataset with sequential addition of risk factors. Clinically, it would be very unlikely to encounter such a dataset. A more translatable dataset would have been patients with randomly sampled risk factors rather than a sequential addition. Additionally, it could be beneficial for the authors to show a summary of the test datasets (number of individuals, demographic information etc.). The last sentence "with several useful online tutorials 15-18" should be deleted as these are not relevant to the manuscript.
Reply: There is no sequential addition of predictors in the two datasets. Each patient has only one positive CVD predictor in the test datasets, and together the patients cover the change of every predictor. For package testing purposes, a dataset with randomly sampled risk factors is equivalent to a dataset that covers the change of every predictor, as each observation with randomly sampled risk factors is a combination of individual predictors: if the package calculates the same score as the original algorithm for each individual predictor, it would also calculate the same score for patients with a combination of those predictors. Also, because a higher risk is expected the more positive predictors a patient has, it would be more difficult to verify whether the implementation is correct when a patient with many predictors has a very high risk, such as 99.9% in the R version and 99.8% in the original version of QRISK3. Data with randomly sampled risk factors also cannot directly show which predictors were implemented incorrectly (e.g. if patients with 10 predictors receive different scores, which of these 10 predictors contribute to the difference?). Therefore, comparing low risk patients (i.e. with only one predictor each, covering the change of all predictors) is preferred. Furthermore, we and ClinRisk (in the terms of agreement) encourage users who have further doubts to verify this R package with their own simulated datasets, using the provided original QRISK3 algorithm and R package, as part of their data quality control process.
The demographic information of the testing data was not shown, as the data were mainly used to test whether the implemented QRISK3 calculates the same score as the original algorithm, and so serve only for package testing and as an example. The structure of the data shows that it has 48 patient records and covers changes of all QRISK3 predictors.
We did refer to the contents of those online tutorials to create this package: three of them guided us on how to publish the package and one on how to organize the package paper. As references aim to acknowledge the contribution of other writers and researchers to our work 1 , these are relevant to this paper.
Reviewer comments: Table 1: The ratio concerning cholesterol should read "Total cholesterol/HDL ration?" in order to clarify the input data for users.
Reply: Thanks. Corrected as suggested.

Reviewer comments:
For the Validation section CPRD needs to be defined. A description of the CPRD dataset along with the methodology and statistical terminology (e.g. discrimination, calibration, Cstatistic etc.) should be clearly described in the methods. Additionally, the study that includes 3.6 million patients that was used is not described and there is no assurance of what the data were, how it was formatted, or how it performed. The authors emphasize the importance of how fields are entered and presented into the R package, therefore, a description of the validation set showing any issues and to elaborate on the results with the R package are needed in order to determine how well the package is performing. The last sentence needs to be adjusted as it is confusing and needs to read better in order to determine what the authors are trying to convey.

Reply:
We added references to clarify these. A description of CPRD can be found here 2 . Model performance measurements, including discrimination and calibration, are explained in detail here 3 . The study that includes 3.6 million patients can be found here 4 . The main validation of this R package is that it calculates the same risk score as the original algorithm, and the test group, which covers the change of all predictors, has shown this. Applying this R package to a larger cohort is an additional verification, as an incorrect implementation would result in very poor model performance.
We have rephrased the last sentence according to the suggestion.

In manuscript "Applying this QRISK3 package to a big CPRD cohorts 2 with 3.6 million patients 4 showed a good discrimination (C statistic: 0.85) and calibration 3 similar to the original QRISK3 5 ."

Reviewer comments:
Usage and features: The authors state that missing values should be handled, but do not elaborate as to how they should be handled. What if a value is truly missing? If data handling is an issue with QRISK3 then that should be presented and how it will affect the QRISK3 score being calculated should be explained. A reference could be added to indicate how these should be handled.
Reply: There are no truly missing risk factors for the GP, as they can enter additional data when reviewing QRISK; for researchers, methods for dealing with missing values, such as multiple imputation, can be found in chapter 7 of the reference 6 .
We have added this reference in manuscript.

In manuscript
"Missing values in the dataset should be handled (e.g. multiple imputation) before using this package 6 . "

Reviewer comments:
Use case: The instructions for creation of a dataset is unclear. Does the user just need to structure their data file similar to the test dataset as opposed to creating a statistical analysis dataset?
Reply: Yes, users need to structure their data file as a data frame similar to the test dataset. We rephrased sentences in the main manuscript to clarify this.

In manuscript "Users first need to structure their data file similar to the provided test dataset (e.g. QRISK3_2019_test) which contains information of patients' identifier and QRISK3 risk factors and mimic QRISK3's training cohort 5 . The structured data file (statistical analysis dataset) would be each row (observation) represents one individual patient and each column represents one of QRISK3 predictors."

Reviewer comments:
Discussion, 2nd paragraph: Please rewrite the last half of the second paragraph. The reason for creation of this package was presented in the Introduction and the description in this section is not useful. The last portion of the second sentence needs to be deleted or rewritten.
Reply: Thanks. We have rewritten this paragraph.

In manuscript "Though QRISK3 was already published and released from the online website, it is time consuming for researchers to calculate QRISK3 risk score, as the online calculator cannot be used as a service to obtain QRISK3 scores for a large cohort, this package bridges this gap by helping researchers to apply QRISK3 model to their own cohort."

Reviewer comments:
Discussion, 3rd paragraph: The second to last sentence is confusing. It could be rewritten from "…who is a smoker is coded as "1" in the variable….." to "…who is a smoker and has the smoking variable coded as "1" would be in conflict……".
Reply: Thanks. We have rephrased this sentence.

In manuscript "For example, a patient who is a smoker and has the smoking variable coded as "1" would conflict with the definition of the QRISK3 algorithm ("smoking" equals 1 in this R package means non-smoker)."

Reviewer comments:
Discussion: Please delete the words "to better understand" as this package only calculates a number and does not enable exploration of understanding.
Reply: Okay, we deleted and rephrased this.

In manuscript "In conclusion, we developed this R package to allow researchers to obtain QRISK3 scores for large cohorts. This tool could help researchers to improve risk prediction modelling based on a currently used risk prediction model."

Reviewer comments:
Overall observation: Although the package implements the QRISK3 algorithm in an R package, the ultimate goal as stated by the authors is for investigators to readily use the algorithm for large cohorts. Therefore, one would require working knowledge of R to use the package. One way to increase the ease of use would be to create a Shiny App using the algorithm written in R to make the program more accessible to clinicians and investigators.
Additionally, there are no functions to create data/results visualizations. A package should be somewhat standalone in the sense that it provides all functions from data exploration to modelling. A single function usually does not merit a research article.

Reply:
A web app that works for clinicians and investigators has already been made by the QRISK3 developers and can be found at https://qrisk.org/three/; it has also been integrated with electronic health record systems such as EMIS 7 .
The main function returns a dataset with the calculated QRISK score. As suggested before, the visualisation depends purely on the research question being asked. Also, the web version of QRISK3 does not provide visualisation either 8 . For a continuous variable like the calculated QRISK risk score, the summary function in R (i.e. "summary(test_all_rst$QRISK3_2017)" ) provides basic statistics such as the mean, median and quartiles, the histogram function (i.e. "hist(test_all_rst$QRISK3_2017)" ) plots a histogram of the calculated risk score, and "plot(density(test_all_rst$QRISK3_2017))" plots the distribution of the calculated risk score.
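These calls can be tried on simulated numbers. In the sketch below, the data frame test_all_rst and its column QRISK3_2017 follow the names used above, but the values are randomly generated for illustration and are not real QRISK3 scores.

```r
# Simulated stand-in for the dataset returned by the scoring function;
# the scores are random numbers, not real QRISK3 output.
set.seed(1)
test_all_rst <- data.frame(
  patid = 1:200,
  QRISK3_2017 = round(runif(200, min = 0.1, max = 40), 1)
)

# Minimum, quartiles, median, mean and maximum of the scores.
score_summary <- summary(test_all_rst$QRISK3_2017)
print(score_summary)

# The standard deviation is not part of summary() and is computed separately.
score_sd <- sd(test_all_rst$QRISK3_2017)
print(score_sd)

# Histogram and kernel density plot of the scores.
hist(test_all_rst$QRISK3_2017, main = "Simulated scores", xlab = "Score (%)")
plot(density(test_all_rst$QRISK3_2017), main = "Simulated score density")
```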
We understand the desire to collect all functions in one place, but no single R package on CRAN 9 can provide all such functions. This is because either these functions are already provided by other packages, or a massive number of functions in one package could mask the truly helpful ones. However, CRAN Task Views may help, as each integrates all sorts of R packages within a relevant topic. "CRAN Task View: Missing Data 10 " describes packages that deal with missing data. "CRAN Task View: Survival Analysis 11 " lists R packages that support model development in survival analysis.
Other generic R programming skill could be found in this free online tutorial 12 .
Overall, we believe the merit of this work is to provide a tool along with references to save researchers from re-implementing and re-validating QRISK3 into R. This tool could help data scientists to compare model performance and predicted risk of their own risk prediction model to a currently used risk prediction model, help other researchers to apply QRISK3 to a large cohort and help clinical trials which requires identify patients at certain risk levels.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

explain the meaning of these variables in the original algorithm. Users also need to validate that their implemented version matches the original version of QRISK3, which requires programming knowledge of both R and C, while R is the dominant language in data science. Therefore, researchers do not have table 1 of this paper if they start to implement QRISK3 from the original algorithm. This creates challenges: it is hard to derive a statistical model formula similar to the one presented for Framingham, and it is hard to identify which variable represents which risk factor, especially as they are written in C. For example, how would one know that variables such as "b_impotence2", "ethrisk" and "fh_cvd" represent "erectile dysfunction", "ethnicity" and "relative of CVD"? QRISK is used to predict 10-year risk, so why are there 11 values in the vector that represents baseline risk in the original algorithm, and which variable represents this baseline risk? There is a long distance between the input values of the predictors and a correct predicted risk as output. This package addresses all the challenges mentioned above and bridges these gaps.
This R package acts as a bridge that enables researchers to easily access and use the QRISK3 model in R, as they would the Framingham model.
Without this R package, and this paper describing how it was developed and validated, researchers might spend months re-implementing QRISK3 in R, and they would also need to spend considerable effort validating that their implementation behaves the same as the original algorithm. Overall, our package and this paper address these problems and save time and energy for researchers who only require a calculated QRISK3 score as part of their new research.

Reviewer's comments:
I suggest enriching this package with more visualization and data manipulation functions and datasets. Besides, I do not think it needs to highlight validation of R performance with SAS or C. This is part of your quality control, not an outcome. Also, I have some minor comments:

Reply: We appreciate the reviewer's suggestion, but this package is a simple implementation of the QRISK3 algorithm. We decided to keep the package as simple as possible because this makes it easier for users to quickly obtain a QRISK3 score: rather than searching through a list of redundant functions, users can test and use the package immediately. In addition:

1. There are mature R packages that support the functions suggested by the reviewer. Because this R package is written only in base R, without dependent packages, it is compatible with other R packages. For visualization, users can easily use an R package such as "ggplot2" with this package. For data manipulation, an R package such as "dplyr" is compatible with our package.

2. The main function of this R package is to provide the calculated QRISK3 risk score. The visualizations and datasets that users might require depend entirely on their research question. In the current version, we provide a dataset with a patient identifier and the calculated QRISK3 score, which researchers can easily use to draw their own visualizations.

3. Though there are only two datasets in this package, they contain all the key information about what the datasets should look like and which variables are needed. They contain enough information for users to understand and use this package easily.

Overall, simplicity is part of the design of this R package, and there are well established R packages that can be used alongside it. We do not see a compelling reason to add more functions to this R package, as its aim is to help users obtain QRISK3 scores from a large dataset quickly, and the current version fulfils this purpose. We are happy to add more functions or tools if users have more specific requirements in future.
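Because the output is an ordinary data frame keyed by a patient identifier, it can be joined back to a cohort and filtered with base R alone. The sketch below is illustrative only: the column names ("patid", "practice", "QRISK3_2017") and all values are assumptions made for this example, and the 10% cut-off follows the NICE threshold cited in the introduction, with the choice of ">=" also being an assumption.

```r
# Hypothetical cohort data and hypothetical score output (invented values)
cohort <- data.frame(
  patid = 1:6,
  practice = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)
scores <- data.frame(
  patid = 1:6,
  QRISK3_2017 = c(3.2, 12.5, 8.9, 25.1, 10.4, 0.7)
)

# Join the scores back onto the cohort by patient identifier
cohort_scored <- merge(cohort, scores, by = "patid")

# Flag patients at or above a 10% ten-year risk (assumed cut-off convention)
cohort_scored$high_risk <- cohort_scored$QRISK3_2017 >= 10

# Count flagged patients per practice
high_risk_by_practice <- tapply(cohort_scored$high_risk,
                                cohort_scored$practice, sum)
print(high_risk_by_practice)
```

The same data frame could equally be passed to packages such as "dplyr" or "ggplot2"; base R is used here only to keep the sketch free of dependencies.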
We also believe it is essential to show users how this R package was validated against the original algorithm, as the package aims to provide a correct QRISK3 predicted score. Users who decide to use our package may wish to replicate our validation process.

Reviewer's comments:
Why didn't you use the R 3.6.2 instead of 3.4.2?
Reply: R 3.4.2 was used to develop the prototype of this R package in 2017. However, all functions in this package use base R only, so the package can be used with any R version, including 3.6.2 (tested).

Reviewer's comments:
Why you did not make the categorical variables as ordered factor R object instead of a simple numeric object? (in this case, you do not need to think about the conflict of the meaning of "1" and "0").
Reply: This is due to how the QRISK3 score is calculated. It is calculated from a mathematical formula in which we need to compute a linear predictor, and this requires numeric values of the predictors. We coded all variables as numeric, rather than some as numeric and some as factors, for the ease of users (following the principle of simplicity). In this way, users know that all variables need to be numeric and do not have to remember which variables should be factors, especially when there are over 20 predictors in the model. We also provide an automated function to help users check which variables are not coded as numeric. However, we realize that we can never prevent all unforeseen mistakes, so we highlight this in the descriptions of both the paper and the package.
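The reasoning above can be sketched in a few lines of R. This is not the package's code: "find_non_numeric" is a hypothetical helper, the variable names are invented, and the coefficients and baseline survival are made-up numbers rather than QRISK3 parameters. The last step is the generic Cox-type calculation, in which the risk is one minus the baseline survival raised to the exponential of the linear predictor.

```r
# Report the names of columns that are not stored as numeric.
# `find_non_numeric` is a hypothetical helper, not a function of the package.
find_non_numeric <- function(data, vars) {
  vars[!vapply(data[vars], is.numeric, logical(1))]
}

# Invented example data: `smoke_cat` was accidentally stored as a factor
patients <- data.frame(
  age = c(64, 52),
  diabetes = c(1, 0),
  smoke_cat = factor(c("1", "3"))
)
print(find_non_numeric(patients, c("age", "diabetes", "smoke_cat")))

# Convert via character so that the labels, not the internal
# level codes of the factor, become the numeric values
patients$smoke_cat <- as.numeric(as.character(patients$smoke_cat))

# Generic Cox-type risk: 1 - S0(t)^exp(linear predictor).
# Coefficients and baseline survival are made up for illustration.
beta <- c(age = 0.03, diabetes = 0.60, smoke_cat = 0.10)
baseline_survival_10y <- 0.98
x <- as.matrix(patients[names(beta)])
linear_predictor <- as.vector(x %*% beta)
risk_percent <- 100 * (1 - baseline_survival_10y^exp(linear_predictor))
print(risk_percent)
```

A factor cannot enter the matrix product directly, which is why a single numeric coding for every predictor keeps the calculation simple.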

Reviewer's comments:
The range of patient ages is not consistent between text and table.
Reply: Thanks. This has been corrected in the next version.

Reviewer's comments:
"Height", "Weight" and "Weight/Height" in table 1 should be lowercase the same as the package.

Other reviewer's comments:
Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes