Extraction of CPRD additional clinical data using R [version 1; peer review: 1 approved with reservations]

The Clinical Practice Research Datalink is a nationwide database of primary healthcare records in England (UK), linked to several health services. A visit to a health practitioner can result in diagnostic and prescription information being stored digitally. Access to patient primary care and linked service data depends on the research in mind; however, typically several flat files that describe patient interactions with a health practitioner are delivered. Some of these files describe additional data, such as the results of medical tests and patient lifestyle information, which are collectively identified by entity values. This data is used to supplement the medical notes recorded by a general practitioner. We have made available a set of R scripts that read the clinical flat files, additional clinical flat files and entity values, and return patient clinical data linked with the requested additional data. We have also included medcode descriptions associated with several entities, along with instructions on how to extend the code for additional entities. The code is free to download under the MIT license: https://github.com/acnash/CPRD_Additional_Clinical


Introduction
The Clinical Practice Research Datalink (CPRD) is an NHS primary care service that stores clinical, referral and therapeutic data, together with data from linked medical services such as imaging and hospital records, for patients enrolled in England (UK). In its current form, CPRD has been active for over twenty years and holds up to 11 million records 1. CPRD data has been widely used to improve medical practice and to further our understanding of drug efficacy, drug safety and disease mechanisms 2. As with all longitudinal data, understanding the outcome of a disease is usually confounded by several factors 3. Patient comorbidities, therapy data and social habits (to name but a few) are all potential confounders that should be treated appropriately (see 4 for a research example of common patient characteristics using a CPRD data release). However, the curation and manipulation of CPRD data can present significant barriers for many researchers, especially those less familiar with programming or with manipulating large text-based datasets.
Data released from the CPRD Gold database is presented as a series of flat files. A patient's longitudinal data is presented across several rows, with each column denoting a record property. Patient records are linked between flat files using an anonymized patient identifier and several additional identifiers, depending on the data at hand. Clinical diagnoses and prescribed therapeutics are encoded as a medcode and a prodcode, respectively, with a corresponding description available from the CPRD data dictionary. Most requests for CPRD Gold primary care data will return several clinical flat files. These are primary care records added by the patient's General Practitioner (GP). The GP may also record additional clinical information, such as whether the patient smokes, how much they weigh, how often they drink, or the results of medical tests. This data is contained in a linked set of additional clinical flat files.
We present an R script that we have used to retrieve additional clinical data for patients. The script contains the code necessary to retrieve several patient characteristics (smoking, alcohol consumption, etc.) and is relatively straightforward to extend. We aim to continue updating our GitHub release with further additional clinical properties and any corresponding medcode descriptions whilst our projects remain active. In this manuscript, we outline the link between clinical and additional clinical patient data, the execution flow of the R scripts, and how to expand on the existing code. The R script is free to download under an MIT license on GitHub: https://github.com/acnash/CPRD_Additional_Clinical

Methods
We describe the link between patient clinical data and additional clinical data in cases where such a CPRD data release has been made available.
Clinical data and additional clinical data are stored in two separate sets of files, often denoted as head_Extract_Clinical_##.txt and head_Extract_Additional_###.txt, respectively (this may depend on the service provided by those who extract the CPRD data). Additional clinical data, stored in the second set of files, is typically accompanied by a lookup data table with lookup codes stored in text files. Additional clinical data is identified by a unique entity (enttype, the entity type) value; for example, smoking status has an entity value of 4. Both clinical files and additional clinical files have an enttype column. A researcher can retrieve all patients with a smoking status using this column. Then, depending on the particular additional information of interest, data on that entity can be retrieved from up to seven data fields, for example, whether they smoke, how many packs of cigarettes they smoke per week, whether they smoke a pipe, and how many ounces of tobacco they use. Some values are encoded and require the corresponding lookup files, which share the same names as the entries in the data lookup column. For example, smoking status is defined in the first data column and its values are found in the YND.txt lookup file, as denoted in the corresponding first data lookup column.
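The decoding step described above can be sketched in a few lines of base R. This is a minimal illustration rather than the code in the repository: the column names (patid, adid, data1), the entity value 13 and the code-to-label mapping in the lookup table are all assumptions made for the example, and the real YND.txt file may use different codes.

```r
# Hypothetical additional clinical rows; column names are assumed for illustration.
additional <- data.frame(
  patid   = c(1001, 1002, 1003, 1004),
  enttype = c(4, 4, 13, 4),   # 4 = smoking status (from the text); 13 is an arbitrary other entity
  adid    = c(1, 2, 3, 4),
  data1   = c(1, 2, 5, 3)     # encoded value held in the first data field
)

# Hypothetical stand-in for the YND.txt lookup file; the code values are assumed.
ynd <- data.frame(
  code  = c(0, 1, 2, 3),
  label = c("Data not entered", "Yes", "No", "Ex smoker"),
  stringsAsFactors = FALSE
)

# Keep only the rows for the entity of interest.
smoking <- additional[additional$enttype == 4, ]

# Decode the first data field by matching its code against the lookup table.
smoking$status <- ynd$label[match(smoking$data1, ynd$code)]

print(smoking)
```

Using match() keeps the row order of the additional data and yields NA for any code missing from the lookup table, which makes undecoded values easy to spot.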
We have built an extensible R script which parses the clinical and additional clinical text files for an entity and then returns all clinical data with the corresponding additional clinical data for those patients with an entity of interest on record. As this is longitudinal data, patients may have several additional clinical rows on record for one entity. Therefore, the adid value (present in both sets of records) is used to match the clinical data with additional clinical data. If there are additional clinical records without a matching clinical record, the code ignores them and moves on to the next patient. We have also added medcode descriptions for those entities we are currently using. As our research expands, so too will this list.
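The matching described above amounts to a join between the two record sets. The sketch below uses base R merge() on invented data; joining on patid as well as adid is an assumption about the key, and the medcode values are placeholders rather than real CPRD codes.

```r
# Hypothetical clinical rows (medcode values are invented placeholders).
clinical <- data.frame(
  patid   = c(1, 1, 2),
  adid    = c(10, 11, 20),
  medcode = c(501, 502, 501)
)

# Hypothetical additional clinical rows; patient 4 has no matching clinical record.
additional <- data.frame(
  patid = c(1, 1, 2, 4),
  adid  = c(10, 11, 20, 40),
  data1 = c(1, 3, 2, 1)
)

# Inner join: additional rows without a matching clinical row are dropped.
linked <- merge(clinical, additional, by = c("patid", "adid"))

print(linked)
```

Because patient 1 has two adid values, both of their records survive the join as separate rows, which preserves the longitudinal structure.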
The R code is easy to expand and allows the researcher to treat each entity according to their needs. As the code stands, it currently supports our active research and we have been able to retrieve the additional clinical values and paired clinical medcode descriptions for smoking, alcohol consumption, exercise, and weight/BMI. A user need only execute several short lines of R code to retrieve any of the entities. For example, to retrieve patient data on smoking habits:

    #source the R core
    source("CPRDLookups.R")
    #define the location of both sets of files
    additionalFiles <- "./head_Extract_Additional_folder"
    clinicalFiles <- "./Clinical_folder"
    #add the file locations to a list with element names
    additionalFileList <- list(additional=additionalFiles, clinical=clinicalFiles)
    #look up smoking additional data
    resultDF <- getEntityValue("smoking", additionalFileList, idList)

The idList is an R list or vector of patids (patient identifiers) of those patients of interest. The additionalFileList list contains the dedicated location of both sets of files. The content of these files is loaded into data frames and stored in the R environment. Ensure that there is enough system RAM available to open CPRD data files.
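To give a sense of what loading a folder of extract files involves, the following base R sketch reads every text file in a folder and keeps only the rows for the patients in idList. It is an illustration under stated assumptions, not the repository's implementation: the function name readExtractFolder is hypothetical, and the files are assumed to be tab-delimited with a header row and a patid column.

```r
# Hypothetical helper: read all .txt files in a folder and keep rows for idList.
readExtractFolder <- function(folder, idList) {
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  parts <- lapply(files, function(f) {
    df <- read.delim(f, stringsAsFactors = FALSE)  # assumes tab-delimited with a header
    df[df$patid %in% idList, , drop = FALSE]       # filter per file before combining
  })
  do.call(rbind, parts)
}

# Demonstration on two small fabricated files written to a temporary folder.
demoFolder <- file.path(tempdir(), "demo_extract")
dir.create(demoFolder, showWarnings = FALSE)
write.table(data.frame(patid = c(1, 2), enttype = c(4, 4)),
            file.path(demoFolder, "part_01.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(patid = c(3, 1), enttype = c(4, 13)),
            file.path(demoFolder, "part_02.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

demo <- readExtractFolder(demoFolder, idList = c(1, 3))
print(demo)
```

Filtering each file before combining keeps the combined data frame small, although each individual file must still fit in memory while it is read.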
The entity value smoking is used to instruct the algorithm to look for all clinical and additional clinical data with an entity value of 4. A list of currently supported entities can be retrieved by executing getEntityValue(). All matching clinical data with additional clinical data (matched using the adid identifier if there are multiple rows per patient) are returned as a data frame with an additional column for medcode description. The medcode descriptions have been added into the R script and are used by the entity values currently supported.
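Attaching a readable description to each medcode can be sketched as a named lookup. The function below is a hypothetical stand-in written for illustration; it is not the repository's addMedcodeDescription(), whose internals are not shown here, and the medcodes and descriptions are invented.

```r
# Hypothetical medcode descriptions; codes and text are invented for the example.
medcodeDescriptions <- list(
  "101" = "Current smoker",
  "102" = "Never smoked",
  "103" = "Ex smoker"
)

# Hypothetical helper: add a description column by looking up each medcode by name.
attachDescriptions <- function(resultDF, descriptionList) {
  lookup <- unlist(descriptionList)        # named character vector
  keys <- as.character(resultDF$medcode)
  resultDF$description <- unname(lookup[keys])  # NA where no description is stored
  resultDF
}

resultDF <- data.frame(patid = c(1, 2, 3), medcode = c(101, 103, 999))
described <- attachDescriptions(resultDF, medcodeDescriptions)
print(described)
```

Medcodes without a stored description receive NA rather than causing an error, so gaps in the description list are visible in the output.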
The user executes one function and the existing R framework steps through a series of predefined functions before entering the user's bespoke function (Figure 1). To extend the code to support additional clinical data, the researcher need only expand the if() statement in the function getEntityValue() with a function call to their own get<enttype>Data() function, following the format presented in the functions getExerciseData(), getSmokingData(), etc. Whether the researcher wishes to expand the results with a description of the medcodes is controlled by calling the defined function addMedcodeDescription(resultDF, <medcode_description_list>) within the if() statement. Both areas are illustrated below, with the potential code extension commented and a user-defined entity name denoted with the placeholder new_entity:

    if(tolower(entityString)

Unfortunately, it is not entirely possible to automate the decoding of the additional clinical data for every entity. There are currently 501 unique entities (CPRD Gold release February 2018) and 91 unique lookup text files. As presented, we have used this code to return the decoded entity data along with a corresponding clinical medcode description. However, depending on the research, a user can return a result bespoke to their needs by simply defining a new function. Up to that stage, the existing code finds and retrieves all data with a matching enttype value.
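The extension pattern described in this section, one branch per entity with each branch calling an entity-specific function, can be sketched as follows. This is a self-contained illustration and not the body of getEntityValue() in the repository: every function name ending in Sketch is hypothetical, and the handlers simply tag the data instead of decoding it.

```r
# Hypothetical entity handlers; real handlers would decode the entity's data fields.
getSmokingSketch <- function(df) { df$entity <- "smoking"; df }
getNewEntitySketch <- function(df) { df$entity <- "new_entity"; df }

# Hypothetical dispatcher: one branch per supported entity.
getEntityValueSketch <- function(entityString = NULL, df = NULL) {
  supported <- c("smoking", "new_entity")
  if (is.null(entityString)) {
    return(supported)            # no argument: list the supported entities
  }
  key <- tolower(entityString)   # make the entity name case-insensitive
  if (key == "smoking") {
    result <- getSmokingSketch(df)
  } else if (key == "new_entity") {
    # A new entity is supported by adding a branch like this one.
    result <- getNewEntitySketch(df)
  } else {
    stop("Unsupported entity: ", entityString)
  }
  result
}

supportedEntities <- getEntityValueSketch()
smokingResult <- getEntityValueSketch("Smoking", data.frame(patid = 1))
print(supportedEntities)
print(smokingResult)
```

Keeping each entity in its own function means a new entity can be added without touching the code that handles the existing ones.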
The script requires the R data.table package.

Results
We present a sample use case in which clinical records were linked with patient smoking data. The data presented in this manuscript and uploaded to the GitHub repository has been fabricated in the interest of patient confidentiality. A test data set of 2,000 clinical primary care records was parsed for additional data concerning smoking. Of those patients, 1,993 had a record for smoking. Having matched those patients, retrieved the data and decoded their smoking status as one of "Data not entered", "Yes", "No" or "Ex smoker", each medcode from a corresponding clinical record was populated with a readable description. A snapshot of fabricated patient records is presented in Figure 2.

Conclusion
We present an R script that searches for and retrieves additional clinical entity values and combines the results with patient clinical data. Although access to CPRD data requires an entrusted license holder, not all institutes will be fortunate enough to have access to the software necessary to perform CPRD data curation and analysis 5. At the time of writing, the software supports smoking, alcohol consumption, exercise and weight/BMI entity lookups. The script has been designed in such a way that most R users can extend the code and provide unique behaviour for a lookup entity of interest. We will continue to support the scripts, adding new entities and medcode descriptions as our research requires. This script is part of a family of R packages and scripts currently under development.
This project contains the following underlying data:
- smokingDF.rda: a fabricated result R data frame, as presented in the Results section.
License: MIT

RRID:SCR_018959
The script also holds medcode descriptions taken from the CPRD data dictionary for several entities. Samples of the clinical data and additional clinical data have not been provided as access to this data is only permitted following execution of the appropriate license agreements. If access to any of this data is required, we can provide medcode lists and the criteria used to extract the dataset (ISAC project 17_255R2) used to create this software, and we recommend contacting CPRD directly.