whoishRisk – an R package to calculate WHO/ISH cardiovascular risk scores for all epidemiological subregions of the world

The World Health Organisation and International Society of Hypertension (WHO/ISH) cardiovascular disease (CVD) risk assessment charts have been implemented in many low- and middle-income countries as part of the WHO Package of Essential Non-Communicable Disease (PEN) Interventions for Primary Health Care in Low-Resource settings. Evaluation of the WHO/ISH cardiovascular risk charts and their use is a key priority and since they only existed in paper or PDF formats, we developed an R implementation of the charts for all epidemiological subregions of the world. The main strengths of this implementation are that it is built in a free, open-source, coding language with simple syntax, can be downloaded from github as a package (“whoishRisk”), and can be used with a standard computer.

This article is included in the RPackage channel.

Cardiovascular disease (CVD) is the leading cause of death worldwide, including in many low- and middle-income countries (LMIC) 1,2 . Preventing CVD is therefore a worldwide priority and the World Health Organisation (WHO) is coordinating a global strategy for LMIC to systematically prevent CVD in primary care 3 .
In 2007 the WHO and the International Society of Hypertension (ISH) published the WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world 4 . These charts are to be used as part of the WHO's Package of Essential NCD (PEN) Interventions for Primary Health Care in Low-Resource Settings in jurisdictions that do not have their own population-derived risk assessment algorithms. While these charts are a good resource for many health systems, little is known about their validity 5 . Therefore, it is important that jurisdictions that implement these charts conduct operational research and attempt to validate and optimise them for their setting.
Two paper-based versions of WHO/ISH charts are available for each subregion: one that requires measured total cholesterol and one that does not. The latter was made available for use in settings with limited access to laboratory testing or where the cost of cholesterol testing is prohibitive. Both charts require information on age, gender, diabetes status, smoking status, and systolic blood pressure to stratify people into one of five risk categories of 10-year risk of a fatal or non-fatal CVD event. Further instructions for their use have been published, including the definition and classification of the fourteen epidemiological sub-regions 3 .
Through our experience collaborating with LMIC with the implementation of WHO PEN, we identified a common need for an open-source tool to facilitate the implementation of WHO/ISH risk charts and operational research of WHO PEN at a population level. We therefore developed an R package called whoishRisk, which we describe here and make available to researchers in LMIC. R is a statistical computing language and environment which is open source and freely available to anyone 6 .

Extraction of WHO/ISH cardiovascular risk charts
We extracted all versions of the paper-based WHO/ISH CVD risk charts by hand into a standardized Microsoft Excel template, independently and in duplicate. We compared the duplicate extractions and calculated Cohen's kappa coefficient for inter-rater reliability, using the irr package in R 7 . Discrepancies were reviewed by the same two extractors and resolved by referring to the original paper chart.
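The agreement statistic described above can be sketched in base R. This is a minimal illustration rather than the authors' code: the helper name cohen_kappa and the toy ratings are hypothetical, and the study itself used the irr package.

```r
# Cohen's kappa for two raters (hypothetical helper, base R only).
cohen_kappa <- function(rater1, rater2) {
  stopifnot(length(rater1) == length(rater2))
  # Use a shared set of levels so the cross-table is square.
  lv <- sort(unique(c(rater1, rater2)))
  tab <- table(factor(rater1, levels = lv), factor(rater2, levels = lv))
  p <- tab / sum(tab)
  p_obs <- sum(diag(p))                  # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy duplicate extractions (illustrative values, not study data).
extractor_a <- c("<10%", "<10%", "10 to <20%", "20 to <30%", ">=40%", "<10%")
extractor_b <- c("<10%", "<10%", "10 to <20%", "30 to <40%", ">=40%", "<10%")
kappa_ab <- cohen_kappa(extractor_a, extractor_b)
print(kappa_ab)
```

Cells where the two extractions differ can then be listed with which(extractor_a != extractor_b) and checked against the paper chart.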
Development of the WHO/ISH risk function
One author (DC) wrote the initial code for the WHO/ISH risk function in R and created the whoishRisk package. This was reviewed and adapted by a second author experienced in the R language (CK). Two additional authors (JL, NB), new to the R language, reviewed the code to ensure the syntax was comprehensible.

Validation
A MatLab implementation of WHO/ISH risk charts for epidemiological subregion SEAR D had been previously reported 8 . We used Octave (www.gnu.org/software/octave/) version 8.3.2 to calculate the SEAR D WHO/ISH risk score for every possible combination of risk factors using the previously reported MatLab implementation, and compared the percent agreement to the risk scores generated by our R package, whoishRisk.
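An exhaustive comparison of this kind can be sketched as follows. The risk factor levels are assumptions about the chart categories (they are not listed in this article), the column names are hypothetical, and the two score vectors are placeholders standing in for the outputs of the two implementations.

```r
# Every combination of (assumed) chart categories.
grid <- expand.grid(
  age = c(40, 50, 60, 70),
  gdr = c(0, 1),
  smk = c(0, 1),
  sbp = c(120, 140, 160, 180),
  dm  = c(0, 1),
  chl = 4:8
)

# Percent of positions where two score vectors are identical.
percent_agreement <- function(a, b) {
  stopifnot(length(a) == length(b))
  100 * mean(a == b)
}

# Placeholder outputs; in practice these would come from the two implementations.
scores_ref <- rep("<10%", nrow(grid))
scores_new <- scores_ref
scores_new[1] <- "10 to <20%"   # one deliberate mismatch for illustration

agreement <- percent_agreement(scores_ref, scores_new)
print(nrow(grid))
print(agreement)
```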

whoishRisk Package
The whoishRisk package can be downloaded and installed directly from github using the install_github() command in the devtools package, with the argument "DylanRJCollins/whoishRisk" 9 . The package contains a single function, WHO_ISH_Risk(), which calculates the WHO/ISH CVD risk score for any epidemiological subregion of the world based on the parameter values passed to it.
Extraction of WHO/ISH cardiovascular risk charts
All WHO/ISH risk charts were extracted by hand into a single comma delimited file (Dataset 1). The first six columns specify the risk factor values, and the last 14 columns specify the corresponding risk category for a given subregion. whoishRisk uses these data internally to calculate the WHO/ISH risk score. Cohen's kappa for initial agreement between the independent extractors was 0.97, indicating excellent agreement. All remaining discrepancies were resolved by consensus.
Development of the WHO_ISH_Risk() Function
whoishRisk contains a single function, called WHO_ISH_Risk(), which calculates the WHO/ISH risk score for any epidemiological subregion. The function code is reported herein (Dataset 2).

Amendments from Version 1
The main update of this version is the development of an R package using the previously reported R code. The package, called whoishRisk, can be downloaded directly from github as described in the paper. The benefit of this is that users now do not need to load the R code or any dependency files in the work space; instead they must download the whoishRisk package and load it from the library.
The major update to the code itself is the addition of warning messages when parameters passed to the WHO_ISH_Risk() function appear out of range.
The title was changed accordingly to reflect the development of the whoishRisk package. Figure 1 was removed as it was no longer relevant.

Internally, WHO_ISH_Risk() requires access to the comma delimited file named "WHO_ISH_Scores.csv", which it calls automatically from within the package but is also included herein (Dataset 1).
The WHO_ISH_Risk() function creates an internal data frame of the risk factor values passed to it. Parameters can be single values or vectors of equal length. It then categorises the continuous parameters age, systolic blood pressure, and total cholesterol. Age and systolic blood pressure were categorised according to WHO guidance 10 . Total cholesterol was categorised into one of the five possible categories (<=4, 5, 6, 7, and >=8 mmol/L) according to common clinical practice, rounding values of 0.5 and above up to the next integer.
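The cholesterol rule can be illustrated with a short sketch. The helper name categorise_chl is hypothetical and the package internals may differ; floor(x + 0.5) is used because R's round() rounds halves to the even digit, so it would not always round 0.5 upwards.

```r
# Map total cholesterol (mmol/L) onto the five chart categories 4, 5, 6, 7, 8,
# where 4 stands for "<=4" and 8 stands for ">=8" (hypothetical helper).
categorise_chl <- function(chl) {
  rounded <- floor(chl + 0.5)   # round half up
  pmin(pmax(rounded, 4), 8)     # clamp to the chart range
}

chl_values <- c(3.2, 4.4, 4.5, 5.5, 7.49, 7.5, 9)
chl_cat <- categorise_chl(chl_values)
print(chl_cat)
```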
Internally, a unique identification code is generated corresponding to the combination of risk factors for each individual. This code is matched to a reference code from the "WHO_ISH_Scores.csv" file. The function stores the risk scores in a data frame that includes the risk factors, and ultimately returns a vector containing the risk scores. The output of the function is one of five different character strings, corresponding to the five different WHO/ISH risk categories: "<10%", "10 to <20%", "20 to <30%", "30 to <40%", ">=40%".
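The matching step can be sketched as a key lookup. Everything below is illustrative: the column names, the key format, and the helper names are assumptions, and the reference rows hold placeholder categories rather than real chart values.

```r
# Placeholder reference table (NOT real WHO/ISH chart values).
ref <- data.frame(
  age = c(40, 40, 50, 50), gdr = c(1, 1, 1, 1), smk = c(0, 1, 0, 1),
  sbp = c(140, 140, 160, 160), dm = c(0, 0, 1, 1), chl = c(5, 5, 6, 6),
  EMR_B = c("<10%", "<10%", "20 to <30%", "30 to <40%"),
  stringsAsFactors = FALSE
)

# Build one identification code per row from the categorised risk factors.
make_key <- function(d) paste(d$age, d$gdr, d$smk, d$sbp, d$dm, d$chl, sep = "_")

# Match each individual's code to the reference code and read off the
# category for the requested subregion; unmatched codes give NA.
lookup_risk <- function(patients, ref, subregion) {
  ref[[subregion]][match(make_key(patients), make_key(ref))]
}

patients <- data.frame(
  age = c(50, 40, 70), gdr = c(1, 1, 1), smk = c(1, 0, 0),
  sbp = c(160, 140, 140), dm = c(1, 0, 0), chl = c(6, 5, 5)
)
risk <- lookup_risk(patients, ref, "EMR_B")
print(risk)
```

Because match() is vectorised, the same call handles a single individual or a whole data set.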
Warning messages are included when parameters appear out of range. These messages, their conditions, and their intended interpretation are described in Table 2. Out of range continuous parameters (age, systolic blood pressure, total cholesterol) are non-fatal and a risk score will be generated with a warning message. Out of range dichotomous variables (gender, smoking status, diabetes status) are fatal errors and the output (NA) will be generated with a warning.
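The distinction between the two kinds of out of range input can be sketched as below. This is an assumption-laden illustration: the helper check_inputs, the age limits (40 to 79), the 0/1 coding of gender, and the message wording are all hypothetical and are not taken from the package or from Table 2.

```r
# Hypothetical input check: continuous values only trigger a warning,
# invalid dichotomous values also mark the individual as unusable.
check_inputs <- function(age, gdr) {
  if (any(age < 40 | age > 79, na.rm = TRUE)) {
    warning("age outside assumed chart range (40 to 79); score still returned")
  }
  valid <- !is.na(gdr) & gdr %in% c(0, 1)
  if (any(!valid)) {
    warning("gdr must be 0 or 1; NA returned for affected individuals")
  }
  valid
}

risk <- c("<10%", "10 to <20%")   # placeholder scores for two individuals

# Capture the warnings so they can be shown alongside the result.
ok <- withCallingHandlers(
  check_inputs(age = c(35, 50), gdr = c(1, 2)),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
risk[!ok] <- NA   # invalid dichotomous input yields NA
print(risk)
```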
Worked example
whoishRisk can be installed in one step using install_github() from the devtools package 9 . We have included a worked example of how to install whoishRisk and use the WHO_ISH_Risk() function to calculate the risk score for five individuals.
#Step 4: Pass the risk factor vectors to the WHO_ISH_Risk() function, and set subregion equal to the name of the appropriate epidemiological subregion (e.g. "EMR_B"). This will return a vector of WHO/ISH risk scores.

Validation
Comparison with the published MatLab implementation of the SEAR D risk charts 8 showed 100% agreement with our R implementation, for all possible combinations of risk factors.

Discussion
To our knowledge, this is the first publicly available R implementation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world. Our package, whoishRisk, may be used for analysis of cardiovascular risk when electronic patient data is available. The code will automatically apply WHO/ISH risk scores to patients based on age, gender, systolic blood pressure, smoking status, diabetes status, total cholesterol, and epidemiological subregion. This code could be used, for example, during a pilot implementation of WHO PEN to audit the accuracy of risk assessment by comparing documented risk scores to actual risk scores calculated using this tool. We have provided a complete worked example.
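An audit of this kind might look like the following sketch. The data frame, its column names, and the values are hypothetical; the calculated column stands in for scores that would be produced by the package.

```r
risk_levels <- c("<10%", "10 to <20%", "20 to <30%", "30 to <40%", ">=40%")

# Hypothetical audit data: category written in the clinical record
# versus category recalculated from the recorded risk factors.
audit <- data.frame(
  id = 1:5,
  documented = c("<10%", "<10%", "10 to <20%", ">=40%", "20 to <30%"),
  calculated = c("<10%", "10 to <20%", "10 to <20%", ">=40%", "30 to <40%"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documented against calculated categories.
xtab <- table(
  documented = factor(audit$documented, levels = risk_levels),
  calculated = factor(audit$calculated, levels = risk_levels)
)

# Records where the documented category differs from the calculated one.
discordant <- audit[audit$documented != audit$calculated, ]
print(xtab)
print(discordant)
```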
While WHO PEN guidance specifies the range of systolic blood pressure values for each systolic blood pressure category, it provides no such guidance for categorising total cholesterol. Based on our opinion and clinical experience, and on a previously published implementation in MatLab 8 , we chose to categorise total cholesterol by rounding up at 0.5 to the next integer.
The "WHO_ISH_Scores.csv" file is provided herein for transparency and to promote collaboration and cross validation. While the risk score values it stores are returned to the workspace as characters (e.g. "10 to <20%"), a user could simply convert these to class numeric or factor. We chose to return them as character strings that are identical to the patient charts in order to produce a literal implementation of the risk charts.
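The conversion mentioned above can be sketched in base R. The example vector is a placeholder for the function output, and the lower bound mapping is one possible numeric coding rather than a recommendation from the authors.

```r
# The five category labels, in increasing order of risk.
risk_levels <- c("<10%", "10 to <20%", "20 to <30%", "30 to <40%", ">=40%")

# Placeholder output vector, including one missing score.
risk <- c("10 to <20%", "<10%", ">=40%", NA)

# Ordered factor: keeps the labels but allows comparisons and tabulation.
risk_factor <- factor(risk, levels = risk_levels, ordered = TRUE)

# Numeric coding: lower bound of each category, in percent.
risk_lower <- c(0, 10, 20, 30, 40)[as.integer(risk_factor)]

# Flag individuals at 30% risk or higher.
high_risk <- risk_factor >= "30 to <40%"

print(table(risk_factor, useNA = "ifany"))
print(risk_lower)
print(high_risk)
```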

Conclusion
We created an R package called whoishRisk to be used for the calculation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world. It contains a single function, WHO_ISH_Risk(), that requires seven parameters: age, gender, systolic blood pressure, smoking status, diabetes status, total cholesterol, and epidemiological subregion. whoishRisk can be used to quickly calculate WHO/ISH risk scores from routinely collected electronic patient data and therefore aid in the implementation and evaluation of these risk charts in low-resource settings.

I think the developed package represents a useful addition to the field. I was able to install and run the package and indeed the package reports the absence of data points (NA). Only a minor concern: within R, getting the citation 'citation("whoishRisk")' returns the name of the package, the names of the authors and the R version. I recommend they include a reference to this work.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.

If the authors intend this tool to be used and possibly modified by users with little R expertise then a separate file with detailed instruction should be provided.
The authors state that users can modify the WHO_ISH_Scores.csv file to represent other risk categories. That would require explanation of the content of the file. Although the used abbreviations gdr for gender might seem obvious, they should be explained. Also when downloading the file from F1000, the file name is changed and appears as "5ae9107XXX..._WHO_ISH_Scores.csv" so that the reading fails. This could be fixed by including the file as a dataset in the corresponding package.
When data from multiple patients are provided (as in the example) the output is a factor. I think it would help the users match input and output values if some tabular format or some patient identifier is provided in the output.
If any of the input parameters are missing (NA values) the code outputs "NA" but no information regarding the missing values is provided. It might be helpful if the code were to report which value was actually missing.
Fig. 1 shows the code and contains mainly information on how some continuous variables (age, cholesterol and systolic blood pressure) are discretized. I think a more efficient way of conveying this information is by including it in Table 1.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Author Response 06 Mar 2017
Dylan Collins, University of Oxford, UK
Dear Maria Suarez-Diez,
Thank you for your valuable comments. In response we have made the following changes:
We created an R package, including all documentation files, which can be downloaded from github.
We no longer suggest that users change the "WHO_ISH_Scores.csv" file, and we have updated the file name hosted by F1000 to be "WHO_ISH_Scores.csv". This file, as suggested, is internal to the package and, while we report it here, it is not necessary to download.
We have added a series of warning messages to help explain outputs of "NA" and to help catch errors in the input values (e.g. out of range parameters).

The authors describe some R code for helping to calculate cardiovascular risk scores. I think the code needs significant work.
The code is in an .rtf file. This is very bad software practice. The authors should at the very very least put the code in a file with .R extension.
Ideally the code should be put in an R package. It is relatively straight-forward to make an R package these days. See http://r-pkgs.had.co.nz/ for help. This makes it easy to add documentation (of which there is currently essentially none for the current function included in the manuscript), tests, etc. Looks like the authors have a Github repository (https://github.com/DylanRJCollins/WHO_ISH_R_Implementation) -this could be made into an R package.
Another benefit of making a package, is that R packages have a way of including datasets in them. The csv file is small enough that you can include it in the package if that's appropriate for the use case here. If the `WHO_ISH_Scores.csv` dataset is not likely to change, or not likely to change very often, they can include it in the package.
The dataset `WHO_ISH_Scores.csv` has two columns that are probably row names that should be removed. If they aren't data, remove them.
The dataset `WHO_ISH_Scores.csv`: the data is not in a tidy format that is easy to work with. There shouldn't be numeric data combined with symbols (e.g. >=40%). If possible, I urge authors to find a way to make these into numeric columns while retaining the same information. However, it may be that it's too difficult to separate numeric values from the greater than / less than symbols etc.
R packages should be cited by reference. e.g. "using the irr package (version 0.84)". It is good they cite the version, but put a reference in your references. Run `citation("irr")` in an R session to get a reference for it.
"We used RStudio (version 0.99.489) to compare the duplicate extractions ...": Rstudio is just an IDE. Say that you used R, not RStudio. It's fine to cite RStudio, but also cite R. Run `citation()` in an R session to get the citation for R.
Octave is an open source language, and thus the authors should share the Octave code.
"[...] we intentionally sought to use simple syntax in the base package" -the authors' script is in fact not a package, but if they do make a package this wording is good.
In the Conclusion: "We created a simple R implementation of WHO/ISH CVD risk charts for all WHO epidemiological subregions of the world, requiring only the base R package." -refer to this as "base R", not "the base R package".
If the dataset "WHO_ISH_Scores.csv" can be modified, thus different versions of it may be used by the user, it makes more sense to pass in the dataset as a parameter, instead of hard coding reading in the data in the function.
I urge authors to include a license with their R package (to be created), and to submit the package to CRAN so it's easy for all to install.

The paper describes a software tool for calculation of WHO/ISH cardiovascular risk scores for different epidemiological subregions of the world. This could be a useful piece of software for researchers working with cardiovascular epidemiological data and make the practices for calculating such scores more reproducible over multiple studies. However, in the present form I found the paper to be severely lacking in both implementation and substance.
First, the code should not be distributed as an RTF file. R packages are well accepted standard for reproducibly distributing R code that also forces to provide some rudimentary documentation and examples. As an additional benefit, hosting the package in a repository like CRAN, Bioconductor or even Github, makes it easy to install and update the program. There is no reason to not follow this convention for this particular project, thus, the code should be converted into package format and hosted in one of the above-mentioned services to be any use to the community.
The argument of the code in the present form being simply adjustable is not valid. One has to spend quite a bit of time understanding what is happening in the code, especially when not an R expert. For the code to fulfill its purpose it has to be rewritten in a way that adjustable parts are presented to the user as function parameters (with reasonable defaults). This way users have to know even less R to use the code effectively.
Also, having a figure depicting the code is uncommon to say the least and not useful in any way. It would make much more sense to include a worked out example to the text, where you would describe a common use case of your R package.
To enhance usefulness of the package and make it more universally applicable it would be good to see some more risk prediction models like Framingham and SCORE added to the software. This would make it easy to compare the validity of different scores over a study population.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Author Response 06 Mar 2017
Dylan Collins, University of Oxford, UK
Dear Raivo Kolde,
Thank you for your insightful review and comments. In response we have made the following changes:
As suggested, we created an R package which can be downloaded directly from github.
We no longer advocate for the code to be adjustable and, through the development of a package, have made it simpler to use.
We removed the figure of the code, and have added a worked example in the text.
The main difference between WHO/ISH risk charts and scores like Framingham, SCORE, and QRISK2 is that the latter have underlying Cox model equations which can be implemented in R, whereas WHO/ISH risk charts do not, hence the rationale for the development of this package. While the addition of further risk scores was outside the scope of this work, we welcome collaborators who might want to contribute in this respect.
Every Best,

Dylan Collins
Competing Interests: No competing interests were disclosed.

Please make the R code available in plain text format.

Discuss this Article
Competing Interests: No competing interests were disclosed.