epicontacts: Handling, visualisation and analysis of epidemiological contacts

Epidemiological outbreak data is often captured in line list and contact format to facilitate contact tracing for outbreak control. epicontacts is an R package that provides a unique data structure for combining these data into a single object in order to facilitate more efficient visualisation and analysis. The package incorporates interactive visualisation functionality as well as network analysis techniques. Originally developed as part of the Hackout3 event, it is now developed, maintained and featured as part of the R Epidemics Consortium (RECON). The package is available for download from the Comprehensive R Archive Network (CRAN) and GitHub.


Introduction
In order to study, prepare for, and intervene against disease outbreaks, infectious disease modellers and public health professionals need an extensive data analysis toolbox. Disease outbreak analytics involve a wide range of tasks that need to be linked together, from data collection and curation to exploratory analyses, and more advanced modelling techniques used for incidence forecasting 1,2 or to predict the impact of specific interventions 3,4. Recent outbreak responses suggest that for such analyses to be as informative as possible, they need to rely on a wealth of available data, including timing of symptoms, characterisation of key delay distributions (e.g. incubation period, serial interval), and data on contacts between patients 5-8.
The latter type of data is particularly important for outbreak analysis, not only because contacts between patients are useful for unravelling the drivers of an epidemic 9,10, but also because identifying new cases early can reduce ongoing transmission via contact tracing, i.e. follow-up of individuals who reported contacts with known cases 11,12. However, curating contact data and linking them to existing line lists of cases is often challenging, and tools for storing, handling, and visualising contact data are often missing 13,14.
Here, we introduce epicontacts, an R 15 package providing a suite of tools aimed at merging line lists and contact data, and providing basic functionality for handling, visualising and analysing epidemiological contact data. Maintained as part of the R Epidemics Consortium (RECON), the package is integrated into an ecosystem of tools for outbreak response using the R language.

Use cases
Those interested in using epicontacts should have a line list of cases as well as a record of contacts between individuals. Both datasets must be enumerated in tabular format with rows and columns. At minimum the line list requires one column with a unique identifier for every case. The contact list needs two columns for the source and destination of each pair of contacts. The datasets can include arbitrary features of case or contact beyond these columns. Once loaded into R and stored as data.frame objects, these datasets can be passed to the make_epicontacts() function (see 'Methods' section for more detail). For an example of data prepared in this format, users can refer to the outbreaks R package.

## ... : Factor w/ 2 levels "Death","Recover": NA NA 2 1 2 NA 2 1 2 1 ...
## $ gender : Factor w/ 2 levels "f","m": 1 2 1 1 1 1 1 1 2 2 ...
## $ hospital : Factor w/ 11 levels "Connaught Hopital",..: 4 2 7 NA 7 NA 2 9 7 11 ...
# contact list
str(ebola_sim$contacts)
## 'data.frame': 3800 obs. of 3 variables:
## $ infector: chr "d1fafd" "cac51e" "f5c3d8" "0f58c4" ...
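To make the minimum input format concrete, the following is a small sketch using made-up data rather than a real outbreak. The column names (case_id, onset, sex, infector, infectee, exposure) and all values are hypothetical and chosen only for illustration; the id, from and to arguments tell make_epicontacts() which columns hold the identifiers.

```r
library(epicontacts)

# Hypothetical line list: one row per case, with a unique identifier column
linelist <- data.frame(
  case_id = c("c1", "c2", "c3", "c4"),
  onset   = as.Date(c("2020-01-01", "2020-01-04", "2020-01-06", "2020-01-09")),
  sex     = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical contact list: one row per pair, with source and destination columns
contacts <- data.frame(
  infector = c("c1", "c1", "c2"),
  infectee = c("c2", "c3", "c4"),
  exposure = c("household", "hospital", "household"),
  stringsAsFactors = FALSE
)

# Combine both tables into a single epicontacts object
toy <- make_epicontacts(
  linelist = linelist,
  contacts = contacts,
  id       = "case_id",
  from     = "infector",
  to       = "infectee",
  directed = TRUE
)
```

Any additional columns, such as onset or exposure here, are carried along as attributes of the cases or contacts.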

Amendments from Version 1
In response to suggestions provided during the peer review process, the authors have made several updates to the manuscript. The body of the text is now organized more intuitively, introducing use cases for the epicontacts package before discussing specific functionality. Furthermore, the provenance of the data set is now described in the "Data handling" sub-section. The text has also been updated to include links to additional resources that demonstrate package usage. The authors feel that these changes have improved the manuscript and would like to thank the reviewers for providing their feedback.
The data handling, visualization, and analysis methods described above represent the bulk of epicontacts features. More examples of how the package can be used as well as demonstrations of additional features can be found through the RECON learn platform and the epicontacts vignettes.

Methods
Operation
epicontacts is released as an open-source R package. A stable release is available for Windows, Mac and Linux operating systems via the CRAN repository. The latest development version of the package is available through the RECON GitHub organization. At minimum users must have R installed. No other system dependencies are required.

# install from CRAN
install.packages("epicontacts")
# install from GitHub
install.packages("devtools")
devtools::install_github("reconhub/epicontacts")
# load and attach the package
library(epicontacts)

Implementation
Data handling. epicontacts includes a novel data structure to accommodate line list and contact list datasets in a single object. This object is constructed with the make_epicontacts() function and includes attributes from the original datasets. Once combined, these are mapped internally in a graph paradigm as nodes and edges. The epicontacts data structure also includes a logical attribute for whether or not this resulting network is directed.
The package takes advantage of R's generic functions, which call specific methods depending on the class of an object. This is implemented in several places, including the summary.epicontacts() and print.epicontacts() methods, which are called when the summary() or print() functions, respectively, are used on an epicontacts object. The package does not include built-in data, as exemplary contact and line list datasets are available in the outbreaks package 16.
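As a rough illustration of the data structure and of method dispatch, the sketch below builds a tiny object from made-up data and then calls the generics on it. The data values are hypothetical, and the component names linelist, contacts and directed reflect our understanding of the object layout rather than anything stated in the text above.

```r
library(epicontacts)

# Made-up data: identifiers sit in the first column of the line list and in
# the first two columns of the contacts, matching the make_epicontacts() defaults
linelist <- data.frame(
  id  = c("a", "b", "c"),
  sex = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
contacts <- data.frame(
  from = c("a", "a"),
  to   = c("b", "c"),
  stringsAsFactors = FALSE
)

obj <- make_epicontacts(linelist, contacts, directed = TRUE)

# The object carries a class attribute that drives method dispatch
class(obj)

# Components of the object (assumed layout: nodes, edges, directed flag)
names(obj)
obj$directed

# These generics dispatch to print.epicontacts() and summary.epicontacts()
print(obj)
summary(obj)
```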
The example that follows will use the mers_korea_2015 dataset from outbreaks, which includes initial data collected by the Epidemic Intelligence group at the European Centre for Disease Prevention and Control (ECDC) during the 2015 outbreak of Middle East respiratory syndrome (MERS-CoV) in South Korea. Note that the data used here were provided in outbreaks for teaching purposes, and therefore do not include the complete line list or contacts from the outbreak.

## ... : chr "60-69" "60-69" "70-79" "40-49" ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 1 2 2 ...
## $ place_infect : Factor w/ 2 levels "Middle East",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ reporting_ctry: Factor w/ 2 levels "China","South Korea": 2 2 2 2 2 2 2 2 2 1 ...

Data visualisation. epicontacts integrates two interactive network visualisation packages: visNetwork and threejs 17,18. These frameworks provide R interfaces to the vis.js and three.js JavaScript libraries respectively. Their functionality is incorporated in the generic plot() method (Figure 1) for an epicontacts object, where the "type" parameter toggles between the two. Alternatively, the visNetwork interactivity is accessible via vis_epicontacts() (Figure 2), and threejs through graph3D() (Figure 3). Each function has a series of arguments that can also be passed through plot(). Both share a color palette, and users can specify node, edge and background colors. However, only vis_epicontacts() allows the node shape to be mapped to a line list attribute via "node_shape", and that shape can be customized with an icon from the Font Awesome icon library. The principal distinction between the two is that graph3D() is a three-dimensional visualisation, allowing users to rotate clusters of nodes to better inspect their relationships.
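The plotting and analysis calls in this section operate on an epicontacts object named x. A minimal sketch of how such an object might be built from mers_korea_2015 is given here. It assumes the outbreaks package is installed, that the identifier columns are in the default positions expected by make_epicontacts(), and that the network is treated as undirected (directed = FALSE is our assumption).

```r
# install.packages("outbreaks")  # if not already installed
library(outbreaks)
library(epicontacts)

# Combine the MERS line list and contacts into one epicontacts object
x <- make_epicontacts(
  linelist = mers_korea_2015$linelist,
  contacts = mers_korea_2015$contacts,
  directed = FALSE  # assumption: contacts treated as undirected
)

# Printing the object gives a quick overview of its contents
x
```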

plot(x)
vis_epicontacts(x, node_shape = "sex", shapes = c(F = "female", M = "male"), edge_label = "exposure")
graph3D(x, bg_col = "black")

Data analysis. Subsetting is a typical preliminary step in data analysis. epicontacts leverages a customized subset method to filter line lists or contacts based on values of particular attributes from nodes, edges or both. If users are interested in returning only contacts that appear in the line list (or vice versa), the thin() function implements such logic.

# subset for males
subset(x, node_attribute = list("sex" = "M"))
# subset for exposure in emergency room
subset(x, edge_attribute = list("exposure" = "Emergency room"))
# subset for males who survived and were exposed in emergency room
subset(x, node_attribute = list("sex" = "M", "outcome" = "Alive"), edge_attribute = list("exposure" = "Emergency room"))
# keep only contacts whose cases appear in the line list
thin(x, "contacts")
# keep only line list cases that appear in the contacts
thin(x, "linelist")

For analysis of pairwise contact between individuals, the get_pairwise() function searches the line list based on the specified attribute. If the given column is a numeric or date object, the function will return a vector containing the difference of the values of the corresponding "from" and "to" contacts. This can be particularly useful, for example, if the line list includes the date of onset of each case. The difference in onset dates within each contact pair would approximate the serial interval for the outbreak 19. For factors, character vectors and other non-numeric attributes, the default behavior is to print the associated line list attribute for each pair of contacts. The function includes a further parameter to pass an arbitrary function to process the specified attributes. In the case of a character vector, this can be helpful for tabulating information about different contact pairings with table().
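The three behaviours of get_pairwise() described above can be sketched as follows. This assumes the outbreaks package is installed, that the mers_korea_2015 line list contains a date column named dt_onset (an assumption on our part, as the column is not shown above), and that the processing function is supplied through an argument named f.

```r
library(outbreaks)
library(epicontacts)

# Rebuild the example object so that this block is self-contained
x <- make_epicontacts(
  linelist = mers_korea_2015$linelist,
  contacts = mers_korea_2015$contacts,
  directed = FALSE  # assumption: contacts treated as undirected
)

# Date attribute: difference in onset dates within each contact pair,
# which approximates the serial interval
onset_diff <- get_pairwise(x, "dt_onset")

# Non-numeric attribute: the values of "sex" for each pair of contacts
sex_pairs <- get_pairwise(x, "sex")

# Custom function: cross-tabulate the "sex" of the "from" and "to" cases
sex_table <- get_pairwise(x, "sex", f = table)
```

Pairs in which either case is missing from the line list, or has a missing value for the attribute, may produce NA, so a summary such as mean(onset_diff, na.rm = TRUE) is likely to be more robust than the bare mean.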

Benefits
While there are software packages available for epidemiological contact visualisation and analysis, none aim to accommodate line list and contact data as purposively as epicontacts 20-22. Furthermore, this package strives to solve a problem of plotting dense graphs by implementing interactive network visualisation tools. A static plot of a network with many nodes and edges may be difficult to interpret. However, by rotating or hovering over an epicontacts visualisation, a user may better understand the data.

Future considerations
The maintainers of epicontacts anticipate new features and functionality. Future development could involve performance optimization for visualising large networks, as generating these interactive plots is resource intensive. Additionally, attention may be directed towards inclusion of alternative visualisation methods.

Conclusions
epicontacts provides a unified interface for processing, visualising and analyzing disease outbreak data in the R language. The package and its source are freely available on CRAN and GitHub. By developing functionality with line list and contact list data in mind, the authors aim to enable more efficient epidemiological outbreak analyses.

Grant information
The author(s) declared that no grants were involved in supporting this work.
Consider moving the section of the article called "Use cases" to before the "Data handling" subsection of the "Implementation" section. I felt that the description of the input datasets under "Use cases" was very informative and would have been organizationally more helpful had it been placed earlier in the article.
Consider describing the sample outbreak data in a bit further detail. It appears to be data describing the MERS outbreak that occurred in South Korea in 2015. I think the description should include whether the data are simulated or from a real outbreak (if from a real outbreak, then a reference to the outbreak description should be included), the scenario of the outbreak, how many cases, how many contacts, place of the outbreak, duration of the outbreak, and a brief description of the demographic details included in the dataset. This amount of detail would allow the reader to translate the details of the outbreak from your text to the output provided by epicontacts.
Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.