Simple and adaptable R implementation of WHO / ISH cardiovascular risk charts for all epidemiological subregions of the world

The World Health Organisation and International Society of Hypertension (WHO/ISH) cardiovascular disease (CVD) risk assessment charts have been implemented in many low- and middle-income countries as part of the WHO Package of Essential Non-Communicable Disease (PEN) Interventions for Primary Health Care in Low-Resource settings. Evaluation of the WHO/ISH cardiovascular risk charts and their use is a key priority and, since they only exist in paper or PDF formats, we developed a simple R implementation of the charts for all epidemiological subregions of the world. The main strengths of this implementation are that it is built in a free, open-source coding language with simple syntax, can be modified by the user, and can be used with a standard computer.


Introduction
Cardiovascular disease (CVD) is the leading cause of death worldwide, including in many low- and middle-income countries (LMIC) 1,2. Preventing CVD is therefore a worldwide priority and the World Health Organisation (WHO) is coordinating a global strategy for LMIC to systematically prevent CVD in primary care 3.
In 2007 the WHO and the International Society of Hypertension (ISH) published the WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world 4. These charts are to be used as part of the WHO's Package of Essential NCD (PEN) Interventions for Primary Health Care in Low-Resource Settings in jurisdictions that do not have their own population-derived risk assessment algorithms. While these charts are a good resource for many health systems, little is known about their validity 5. Therefore, it is important that jurisdictions that implement these charts conduct operational research and attempt to validate and optimise them for their setting.
Two paper-based versions of WHO/ISH charts are available for each subregion: one that requires measured total cholesterol and one that does not. The latter was made available for use in settings with limited access to laboratory testing or where the cost of cholesterol testing is prohibitive. Both charts require information on age, gender, diabetes status, smoking status, and systolic blood pressure to stratify people into one of five risk categories of 10-year risk of a fatal or non-fatal CVD event. Further instructions for their use have been published 3.
Through our experience collaborating with LMIC on the implementation of WHO PEN, we identified a common need for an open-source tool to facilitate the implementation of WHO/ISH risk charts and operational research of WHO PEN at a population level. We therefore developed an open-source tool in R (https://www.r-project.org/), which we describe here and make available to researchers in LMIC.

Extraction of WHO/ISH cardiovascular risk charts
We extracted all versions of the paper-based WHO/ISH CVD risk charts by hand into a standardized Microsoft Excel template, independently and in duplicate. We used RStudio (version 0.99.489) to compare the duplicate extractions and to calculate Cohen's kappa coefficient for inter-rater reliability, using the irr package (version 0.84). Discrepancies were reviewed by the same two extractors and resolved by referring to the original paper chart.
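The agreement statistic described above can be illustrated with a minimal base-R sketch of unweighted Cohen's kappa. The helper cohen_kappa() and the vectors extractor_a and extractor_b are hypothetical names with made-up values, not the study data, and the analysis itself used the irr package rather than this code.

```r
# Unweighted Cohen's kappa for two raters, using only base R.
# Hypothetical helper for illustration; this is not the irr implementation.
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  lv <- union(a, b)                        # shared set of categories
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p <- tab / sum(tab)                      # joint proportions
  p_obs <- sum(diag(p))                    # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))    # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Made-up extractions of four chart cells by two extractors
extractor_a <- c("<10%", "<10%", "10-20%", "10-20%")
extractor_b <- c("<10%", "<10%", "10-20%", "<10%")
kappa_ab <- cohen_kappa(extractor_a, extractor_b)
```

In this toy example the extractors agree on three of four cells, and half of that agreement is expected by chance, so kappa is 0.5.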

Development of the WHO/ISH risk function
One author wrote the initial code for the WHO/ISH risk function in R (DC). This was reviewed and adapted by a second author experienced in the R language (CK). Two additional authors (JL, NB), new to the R language, reviewed the code to ensure the syntax was simple and comprehensible.

Validation
A MatLab implementation of WHO/ISH risk charts for epidemiological subregion SEAR D had been previously reported 6. We used Octave (www.gnu.org/software/octave/) version 8.3.2 to calculate the SEAR D WHO/ISH risk score for every possible combination of risk factors using the previously reported MatLab implementation, and calculated the percent agreement with the risk scores generated by our R implementation.
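The idea of an exhaustive comparison can be sketched in base R as follows. The factor levels, their codes, and the two stand-in functions impl_a() and impl_b() are assumptions made purely for illustration; they are not the WHO/ISH rules, the R function, or the MatLab code.

```r
# Enumerate every combination of (hypothetical) risk factor levels.
grid <- expand.grid(
  age = c(40, 50, 60, 70),
  gdr = c("m", "f"),
  smk = c(0, 1),
  dm  = c(0, 1),
  stringsAsFactors = FALSE
)

# Two stand-in implementations of the same made-up rule.
impl_a <- function(d) ifelse(d$age >= 60 & d$smk == 1, "high", "low")
impl_b <- function(d) {
  out <- rep("low", nrow(d))
  out[d$age >= 60 & d$smk == 1] <- "high"
  out
}

# Percent agreement across all combinations
pct_agree <- 100 * mean(impl_a(grid) == impl_b(grid))
```

Because expand.grid() produces one row per combination, any disagreement between two implementations would show up as a value below 100.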

Results
Extraction of WHO/ISH cardiovascular risk charts
All WHO/ISH risk charts were extracted by hand into a single comma delimited file (Dataset 1). Our function is dependent on this file. Cohen's kappa for initial agreement between the independent extractors was 0.97, indicating excellent agreement. All remaining discrepancies were resolved by consensus.

Development of the WHO/ISH risk function
We developed a simple function, named WHO_ISH_Risk(), that, when loaded in the R workspace, will calculate the WHO/ISH risk score for any epidemiological subregion (Dataset 2) (Figure 1). We intentionally used simple syntax such that users with a beginner's level of experience with R can adapt the code as needed. The WHO_ISH_Risk function requires seven parameters: age, gender, smoking status, diabetes status, systolic blood pressure, total cholesterol, and the appropriate WHO epidemiological subregion. These parameters and their codes are summarised in Table 1. The function format in the workspace is: WHO_ISH_Risk(age, gdr, smk, sbp, dm, chl, subregion). No default values are specified for any parameter.
The WHO_ISH_Risk() function uses the base package in R and requires no package dependencies. Once the function is loaded in the workspace, it requires access to the comma delimited file named "WHO_ISH_Scores.csv" (Dataset 1). The user needs to ensure that this file is accessible in the working directory of R before running the WHO_ISH_Risk() function. We have included a worked example of how to use the function in Dataset 2.
A unique identification code is generated corresponding to the combinations of risk factors for each individual. This code is matched to a reference code from the "WHO_ISH_Scores.csv" file, which the function automatically calls into the workspace (Dataset 1). Internally, the function stores the risk scores in a data frame that includes the risk factors, and ultimately returns a vector containing the risk scores.
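The matching step described above might look roughly like the following sketch. The toy table, its column names (including SUBREGION_X and ref_code), and the underscore-separated code format are assumptions for illustration only; the real "WHO_ISH_Scores.csv" file has six risk factor columns and its own coding.

```r
# Toy reference table standing in for "WHO_ISH_Scores.csv" (values are made up).
ref <- data.frame(
  age = c(40, 40, 50, 50),
  smk = c(0, 1, 0, 1),
  SUBREGION_X = c("<10%", "<10%", "10-20%", "20-30%"),
  stringsAsFactors = FALSE
)
ref$ref_code <- paste(ref$age, ref$smk, sep = "_")   # one reference code per row

# Two hypothetical patients, already discretised to chart categories
patients <- data.frame(age = c(50, 40), smk = c(1, 0))
lookup_code <- paste(patients$age, patients$smk, sep = "_")

# Match each patient's code to the reference code and pull the risk category
risk <- ref$SUBREGION_X[match(lookup_code, ref$ref_code)]
```

A patient whose code has no counterpart in the reference table would receive NA from match(), which is one way missing or out-of-range inputs can surface in the output.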

Validation
Comparison with the published MatLab implementation of the SEAR D risk charts 6 showed 100% agreement with our R implementation, for all possible combinations of risk factors.

Discussion
To our knowledge, this is the first publicly available R implementation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world. Our implementation may be used for analysis of cardiovascular risk when electronic patient data is available. The code will automatically apply WHO/ISH risk scores to patients based on age, gender, systolic blood pressure, smoking status, diabetes status, total cholesterol, and epidemiological subregion. This code could be used, for example, during a pilot implementation of WHO PEN to audit the accuracy of risk assessment by comparing documented risk scores to actual risk scores calculated using this tool. We have provided a complete worked example in the data files. While more sophisticated implementations are possible, we intentionally sought to use simple syntax in the base package to allow for easy interpretation and use by novice R users on standard computers.
Although we modelled the function based on WHO PEN guidance for risk assessment, we recognise that some users may wish to change the boundaries of certain risk factor parameters. While WHO PEN guidance specifies the range of systolic blood pressure values for each systolic blood pressure category, it provides no such guidance for categorising total cholesterol. Based on our opinion and clinical experience, and on a previously published implementation in MatLab 6, we chose to categorise total cholesterol by rounding up at 0.5 to the next integer. These boundaries could be changed by users to adapt to local practice. We caution against changing the boundaries beyond recommended guidance.
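As an illustration of the rounding rule for total cholesterol, the following minimal sketch rounds halves upward. The helper round_half_up() and the example values are hypothetical, and the function itself may implement the rule differently.

```r
# Round total cholesterol (mmol/L) half up to the nearest integer.
# floor(x + 0.5) is used here because base R's round() sends exact halves
# to the even digit, so round(4.5) gives 4 rather than 5.
round_half_up <- function(x) floor(x + 0.5)

chl <- c(4.4, 4.5, 5.5, 6.49)   # made-up measurements
chl_cat <- round_half_up(chl)
```

A user adapting the boundaries to local practice would only need to change this one step, leaving the rest of the lookup untouched.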
The "WHO_ISH_Scores.csv" file can be adapted by the user if desired. Each row of the file represents one unique combination of risk factors. The first six columns specify the risk factor values, and the last 14 columns specify the corresponding risk category for a given subregion. These risk categories can be changed by the user, but in their current state they represent the WHO/ISH risk charts as published.

Conclusion
We created a simple R implementation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world, requiring only the base R package. It has one dependency file from which it draws the WHO/ISH risk scores based on a combination of seven parameters: age, gender, systolic blood pressure, smoking status, diabetes status, total cholesterol, and epidemiological subregion. This tool can be used, and adapted, by policy-makers and researchers involved in the implementation and evaluation of WHO/ISH CVD risk charts.

Author contributions
DC conceived of the idea, extracted data, and wrote the initial code and manuscript. JL and NB extracted data and contributed to writing the manuscript. CK reviewed and adapted the code and contributed to writing the manuscript. AW and CH reviewed and contributed to writing the manuscript.

Data and software availability
Competing interests
DC has received payment from the WHO for consulting work. AW and CH have received expenses and grant income from the WHO for projects related to CVD and Self Care in NCDs, and direct a WHO Collaborating Centre. CH also receives funding from the National Institute of Health Research (NIHR) School of Primary Care Research. JL, CK, and NB declare no competing interests.

Grant information
The WHO Collaborating Centre for Self Care paid for the open access publishing fees. No other funding was provided for this work.
As mentioned by the other reviewers, RTF is far from an optimal format to distribute the code. The authors should present their code as a package hosted in some repository. Currently the RTF file contains the function definition, data loading and working example in a single view. When creating the package the authors should clearly separate the code from the documentation and provide a working example. If the authors intend this tool to be used and possibly modified by users with little R expertise then a separate file with detailed instructions should be provided.
The authors state that users can modify the WHO_ISH_Scores.csv file to represent other risk categories. That would require an explanation of the content of the file. Although abbreviations such as gdr for gender might seem obvious, they should be explained. Also, when downloading the file from F1000, the file name is changed and appears as "5ae9107XXX..._WHO_ISH_Scores.csv" so that the reading fails. This could be fixed by including the file as a dataset in the corresponding package.
When data from multiple patients are provided (as in the example) the output is a factor. I think it would help the users match input and output values if some tabular format or some patient identifier were provided in the output.
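One way such a tabular output could look is sketched below. The identifiers, the input column, and the factor standing in for the function's output are all made up for illustration and do not reflect the package's actual return value.

```r
# Hypothetical inputs and a stand-in for the function's factor output
patients <- data.frame(id = c("P01", "P02", "P03"),
                       age = c(45, 62, 58),
                       stringsAsFactors = FALSE)
risk <- factor(c("<10%", "20-30%", "10-20%"))

# Return identifier, inputs and risk category side by side
result <- data.frame(patients, risk = as.character(risk),
                     stringsAsFactors = FALSE)
```

Keeping the identifier in the same data frame lets a user look up any one patient's category without relying on row order.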
If any of the input parameters are missing (NA values) the code outputs "NA" but no information regarding the missing values is provided. It might be helpful if the code were to report which value was actually missing. Fig. 1 shows the code and contains mainly information on how some continuous variables (age, cholesterol and systolic blood pressure) are discretized. I think a more efficient way of conveying this information is by including it in Table 1.
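Reporting which inputs are missing could be sketched along these lines. The helper report_missing() and the example data frame are hypothetical; the result could, for instance, be passed to warning() before the risk lookup.

```r
# Report which inputs are missing for each row (hypothetical helper).
report_missing <- function(df) {
  miss <- is.na(df)                      # logical matrix, one column per input
  apply(miss, 1, function(r) paste(colnames(miss)[r], collapse = ", "))
}

# Made-up inputs with missing values in the second and third rows
inputs <- data.frame(age = c(55, NA, 61),
                     sbp = c(140, 150, NA),
                     chl = c(5, NA, 6))
missing_fields <- report_missing(inputs)
```

Rows with complete data yield an empty string, so only rows with a non-empty entry need a message.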
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
We created an R package, including all documentation files, which can be downloaded from GitHub.
We no longer suggest that users change the "WHO_ISH_Scores.csv" file, and we have updated the name of the file hosted by F1000 to "WHO_ISH_Scores.csv". This file, as suggested, is internal to the package and, while we report it here, does not need to be downloaded.
We have added a series of warning messages to help explain outputs of "NA" and to help catch errors in the input values (e.g. out-of-range parameters).
We have described how continuous variables are discretized in the main text.
Every Best,

Dylan Collins
Competing Interests: No competing interests were disclosed.
The code is in an .rtf file. This is very bad software practice. The authors should at the very very least put the code in a file with .R extension.
Ideally the code should be put in an R package. It is relatively straightforward to make an R package these days. See http://r-pkgs.had.co.nz/ for help. This makes it easy to add documentation (of which there is currently essentially none for the current function included in the manuscript), tests, etc. It looks like the authors have a Github repository (https://github.com/DylanRJCollins/WHO_ISH_R_Implementation); this could be made into an R package.
Another benefit of making a package, is that R packages have a way of including datasets in them. The csv file is small enough that you can include it in the package if that's appropriate for the use case here. If the `WHO_ISH_Scores.csv` dataset is not likely to change, or not likely to change very often, they can include it in the package.
The dataset `WHO_ISH_Scores.csv` has two columns that are probably row names that should be removed. If they aren't data, remove them.
The dataset `WHO_ISH_Scores.csv`: the data is not in a tidy format that is easy to work with. There shouldn't be numeric data combined with symbols (e.g. >=40%). If possible, I urge authors to find a way to make these into numeric columns while retaining the same information. However, it may be that it's too difficult to separate numeric values from the greater than / less than symbols etc.
R packages should be cited by reference, e.g. "using the irr package (version 0.84)". It is good that they cite the version, but put a reference in your references. Run `citation("irr")` in an R session to get a reference for it.
"We used RStudio (version 0.99.489) to compare the duplicate extractions ...": Rstudio is just an IDE. Say that you used R, not RStudio. It's fine to cite RStudio, but also cite R. Run `citation()` in an R session to get the citation for R.
Octave is an open source language, and thus the authors should share the Octave code.
"[...] we intentionally sought to use simple syntax in the base package": the authors' script is in fact not a package, but if they do make a package this wording is good.
In the Conclusion: "We created a simple R implementation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world, requiring only the base R package." Refer to this as "base R", not "the base R package".
If the dataset "WHO_ISH_Scores.csv" can be modified, thus different versions of it may be used by the user, it makes more sense to pass in the dataset as a parameter, instead of hard coding reading in the data in the function.
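A minimal sketch of this suggestion is shown below. The function lookup_risk(), the column names, and the toy table are hypothetical and only illustrate the interface; in practice the table might be read once with read.csv() and then passed in.

```r
# Hypothetical lookup that takes the reference table as an argument
# instead of reading a file inside the function.
lookup_risk <- function(code, subregion, ref) {
  stopifnot(subregion %in% names(ref))
  # Select the subregion column by name, then match codes to rows
  ref[[subregion]][match(code, ref$ref_code)]
}

# Made-up reference table with two placeholder subregion columns
ref <- data.frame(
  ref_code = c("A", "B"),
  REGION_1 = c("<10%", "10-20%"),
  REGION_2 = c("10-20%", ">=40%"),
  stringsAsFactors = FALSE
)
risk <- lookup_risk(c("B", "A"), "REGION_2", ref)
```

With this interface, a user who maintains an adapted table can supply it directly without editing the function body.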
I urge authors to include a license with their R package (to be created), and to submit the package to CRAN so it's easy for all to install.

Code comments:
I'm not sure the authors meant to do this, but the use of single ampersand in the `ifelse` statements (e.g., `ifelse(df$age > 17 & df$age < 50, 40, df$age)`) means that they get a vector of logicals (e.g., `df$age > 17 & df$age < 50` evaluates to `TRUE FALSE FALSE FALSE FALSE`) using their example, when I think what they want is a single logical return value. If so, use double ampersand instead: `&&` instead of `&`.
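For context on this comment, the following small sketch shows how `&` behaves inside `ifelse()`, using made-up ages. In base R, `ifelse()` takes a logical vector as its test and returns one value per element, so the element-wise `&` gives one recoded value per patient, whereas `&&` is intended for length-one conditions such as those in `if ()`.

```r
age <- c(45, 52, 16, 67)   # made-up ages for four patients

# Element-wise test: one TRUE/FALSE per patient
in_band <- age > 17 & age < 50

# ifelse() recodes each element using the matching element of the test
age_recoded <- ifelse(in_band, 40, age)
```

Here only the first patient falls in the band and is recoded to 40; the other ages pass through unchanged.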
I'm pretty sure, but have not tested thoroughly, that the long series of if statements in the section "Match the look up value with the reference value" can be replaced with just this: ref [[subregion]

Raivo Kolde
Philips Research North America, Cambridge, MA, USA
The paper describes a software tool for calculation of WHO/ISH cardiovascular risk scores for different epidemiological subregions of the world. This could be a useful piece of software for researchers working with cardiovascular epidemiological data and make the practices for calculating such scores more reproducible over multiple studies. However, in the present form I found the paper to be severely lacking in both implementation and substance.
First, the code should not be distributed as an RTF file. R packages are a well-accepted standard for reproducibly distributing R code, and the format also forces authors to provide some rudimentary documentation and examples. As an additional benefit, hosting the package in a repository like CRAN, Bioconductor or even Github makes it easy to install and update the program. There is no reason not to follow this convention for this particular project; thus, the code should be converted into package format and hosted in one of the above-mentioned services to be of any use to the community.
The argument of the code in the present form being simply adjustable is not valid. One has to spend quite a bit of time understanding what is happening in the code, especially when not an R expert. For the code to fulfill its purpose it has to be rewritten in a way that adjustable parts are presented to the user as function parameters (with reasonable defaults). This way users have to know even less R to use the code effectively.
Also, having a figure depicting the code is uncommon to say the least and not useful in any way. It would make much more sense to include a worked example in the text, where you would describe a common use case of your R package.
To enhance usefulness of the package and make it more universally applicable it would be good to see some more risk prediction models like Framingham and SCORE added to the software. This would make it easy to compare the validity of different scores over a study population.
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Thank you for your insightful review and comments. In response we have made the following changes:
As suggested, we created an R package which can be downloaded directly from GitHub.
We no longer advocate for the code to be adjustable and, through the development of a package, have made it simpler to use.
We removed the figure of the code, and have added a worked example in the text.
The main difference between scores like Framingham, SCORE, and QRISK2 is that they have underlying Cox model equations which can be implemented in R, whereas WHO/ISH risk charts do not, hence the rationale for the development of this package. While the addition of further risk scores was outside the scope of this work, we welcome collaborators who might want to contribute in this respect.
Every Best,

Dylan Collins
Competing Interests: No competing interests were disclosed.