o2geosocial: Reconstructing who-infected-whom from routinely collected surveillance data

Reconstructing the history of individual transmission events between cases is key to understanding what factors facilitate the spread of an infectious disease. Since conducting extended contact-tracing investigations can be logistically challenging and costly, statistical inference methods have been developed to reconstruct transmission trees from onset dates and genetic sequences. However, these methods are not as effective if the mutation rate of the virus is very slow, or if sequencing data is sparse. We developed the package o2geosocial to combine variables from routinely collected surveillance data with a simple transmission process model. The model reconstructs transmission trees when full genetic sequences are unavailable, or uninformative. Our model incorporates the reported age-group, onset date, location and genotype of infected cases to infer probabilistic transmission trees. The package also includes functions to summarise and visualise the inferred cluster size distribution. The results generated by o2geosocial can highlight regions where importations repeatedly caused large outbreaks, which may indicate a higher regional susceptibility to infections. It can also be used to generate the individual number of secondary transmissions, and show the features associated with individuals involved in high transmission events. The package is available for download from the Comprehensive R Archive Network (CRAN) and GitHub.


Introduction
The identification of transmission trees and transmission events during infectious disease outbreaks can lead to identifying factors and settings associated with subsequent transmissions [1][2][3][4] , describing super-spreading events 5,6 , or populations and areas more vulnerable to importations and transmission 7-10 , and quantifying the impact of control measures 11,12 . The most straightforward approach to reconstructing who-infected-whom is to carry out patient interviews and establish the previous contacts to connect the reported cases. However, contact-tracing investigations are costly and can be challenging to implement. Statistical methods have therefore been developed to infer transmission trees from routinely collected epidemiological data 12-17 .
The Wallinga-Teunis method was first developed to infer probabilistic transmission trees from onset dates and generation times in a maximum likelihood framework 12 . Genetic sequencing of pathogens have since become more common, and new tools such as the R package outbreaker2 were created to combine the timing of infection and the genetic sequences in order to improve the accuracy of inferred transmission trees 13,14,18-20 . Nevertheless, the accuracy of these reconstruction methods relies on the proportion of sequenced cases, the quality of the sequences, and the characteristics of the pathogen 21 . For instance, the measles virus evolves slowly, and sequences from unrelated cases can be very similar, which makes these methods ineffective for measles outbreaks 22,23 .
The package o2geosocial was designed to study outbreaks where sequences are uninformative, either because too few cases were sequenced or because the virus evolves too slowly. Building upon the framework presented in outbreaker2, o2geosocial was developed to infer who-infected-whom from variables routinely collected by surveillance systems, such as the onset date, age, location, and genotype of the cases 7 . Cases from different genotypes cannot be part of a similar transmission chain since differences in genotype illustrate major variations in their genetic sequences, 24 . Using age-stratified contact matrices and mobility models, we combined the different variables into a likelihood of connection between cases. In this paper, we first describe the structure of the package. From a use case based on simulated data, we then show how to run the model, evaluate the output, visualise the results of the inference, and customise the input functions to implement different mobility models.

Methods
Operation o2geosocial is implemented as an open-source R (version ≥ 3.5.0) package and can be run on all platforms (Windows, Mac, Linux). It incorporates C++ functions into a R framework using Rcpp 25 . Package dependencies and system requirements are documented in the o2geosocial CRAN repository. A stable version was released on Windows, Mac and Linux operating systems via a CRAN repository. The source code is available through Zenodo 26 and the latest development version is available through a Github repository.

Implementation
The general implementation of o2geosocial follows the structure of outbreaker2 and builds upon it by adding the effect of the location and the age-stratified contact data to the reconstruction of transmission clusters. However, unlike outbreaker2, o2geosocial does not take genetic sequences as input. It uses genetic groups (e.g. genotype) to exclude connections between cases, i.e. two cases cannot be from the same cluster if they belong to different genetic groups 28 . Therefore, o2geosocial is adapted to reconstructing transmission clusters from large datasets where genetic sequences are not informative, either because the mutation rate of the virus is slow, or because sequencing is scarce.
In o2geosocial, the number of independent clusters in the dataset is inferred using two different processes ( Figure 1). Firstly, the pre-clustering step aims to group cases before the MCMC runs following three criteria: Only cases reported in a radius of γ km, less than δ days before case i, and from similar or unreported genotype can be classified as potential infectors of i. Cases with overlapping potential infectors, and their potential infectors, are grouped together, and cases from different groups cannot be linked during the MCMC runs. The parameters γ and δ are defined as inputs of the function create_config(). Since surveillance datasets can include cases from unrelated outbreaks, the pre-clustering function was developed to remove impossible connections and speed up the MCMC runs.
Secondly, as cases classified in the same group after the pre-clustering step may come from different clusters, we defined a likelihood threshold λ to spot and discard unlikely connections after the MCMC runs: if the likelihood of connection from case j to case i is lower than λ, the connection is discarded and i is an import unrelated to j. In o2geosocial, the variable λ can either be an absolute (the log-likelihood threshold will be log(λ)) or a relative value (a quantile of the likelihood of all connections in all samples), and is defined by the variables 'outlier_threshold' and 'outlier_relative' in create_config().
Finally, unlikely connections between cases can alter the inferred infection dates of cases and bias the transmission trees sampled from the MCMC runs. Therefore, we first run a short MCMC to remove these unlikely connections. From this run we compute n, the minimum number of connections with a likelihood lower than λ per sampled tree. We then add n imports to the starting tree and run a longer MCMC. Lastly, we remove the Figure 1. Illustration of the process to estimate the cluster size distribution and the import status of 11 cases. In the first step, cases are split in two groups that do not have overlapping potential infectors (i.e. they were reported in different places, or different times). In step 2, we estimate the minimum number of unlikely transmissions (n) in the samples (right panel). In step 3, we remove n transmissions from the initial tree, and generate samples. Finally, we remove the unlikely connections in each sample of Step 3 and compute the inferred cluster size distribution.
connections with likelihood lower than λ in the final samples and return the infector, infection date and probability of being an import for each case (Figure 1).

Likelihood and priors
The functions custom_likelihoods() and custom_priors() can be used to edit each component of the likelihood and priors. By default, there are five components in the likelihood: Genotype component: There can be a maximum of one genotype reported per transmission tree. The genotype of a tree τ is the genotype reported for at least one of the cases belonging to τ. For each genotyped case i gen and at every iteration, only cases from trees with the same genotype as i gen , or without reported genotype can be listed as potential infectors.
Therefore, the genetic component of the likelihood that a case i of genotype g i was infected by a case j belonging to the tree τ j is defined as a binary value: Conditional report ratio: As in the package outbreaker2, we allow for missing cases in transmission chains. The number of generations between cases i and j, denoted κ ji , is equal to 1 if j infected i. We define Π as the conditional report ratio of the trees, which differs from the overall report ratio of an outbreak as only unreported cases within transmission chains impact the conditional report ratio. Entirely unreported clusters, or unreported cases infected earlier than the ancestor of a tree do not change the value of Π. By default, the probability of observing κ ji missing generation between i and j from the conditional report ratio p(κ ji |Π) follows a geometric distribution with mean (1-Π)/ Π.
The conditional report ratio is estimated during the MCMC runs using a beta distribution prior. By default, the prior distribution is parametrised as Beta(10,1), which is an informative prior of mean 0.9 and standard deviation 0.08. The two parameters of the beta prior can be changed using the variable prior_pi in create_config().

Time component:
The probability of t i being the infection date of the case i, given their reported onset date T i depends on the distribution of the incubation period f. The incubation period is defined by the variable f_dens in the function outbreaker_data().
The probability that i was infected by j given their respective inferred dates of infection t i and t j is defined by the generation time of the disease w κ ji (t it j ) (variable w_dens in outbreaker_data()), and the number of generations κ ji between i and j. The function w κ ji was defined as w κ ji = w * w * ...* w, where * is the convolution operator applied κ ji times.
This component of the likelihood follows the framework developed in the Wallinga-Teunis method, and in outbreaker2.

Spatial component:
The probability of connection between two regions k and l depends on the population sizes m k and m l , and the distance between regions d kl . Given spatial parameters a and b, s(k,l) is the probability that a case in the region l was infected by a case reported in k, and is defined using p kl , the connectivity between regions k and l: The package comes with a built-in exponential gravity model: ; or a power-law gravity model : 1 The exponential gravity model has been shown to be a better representation of short-distance mobility patterns 29 ; it is therefore the default option since o2geosocial aims at reconstructing transmission in a community or a region. The type of gravity model can be changed by setting the parameter spatial_method to "power-law": create_config(spatial_method = "power_law"). Other mobility models can be implemented by developing modules. In the use case, we give an example on how to replace the exponential gravity by Stouffer's rank model 30 .
The parameters a and b are estimated during the MCMC run via posterior sampling. This requires re-computing the matrix of spatial connectivity between regions at each iteration and is time-consuming. Therefore, if either a or b is estimated, we allow for a maximum of 1 missing generation between cases (max(κ ji ) = 2) and only compute s 1 (k,l) and s 2 (k,l) for regions that could potentially be connected. By default, the prior distribution of a and b are uniform.
Age component: Given the age group of each case α (1,..,N) and the age-stratified social contact matrix, we introduced a κ ji (α i , α j ), the probability that a case aged α j infected a case aged α i . This corresponds to the proportion of contacts to α i that came from individuals of age α j . Social contact matrices provided by large scale quantitative investigations such as the POLYMOD study quantify the probability of contact between infectors and infectees of different age groups 31 , and are imported using the R package socialmixr 32 . The contact matrix used in the MCMC run is defined by the variable a_dens in outbreaker_data().

Overall likelihood:
The overall likelihood that a case i was infected by the case j is equal to is the likelihood that a case with an onset date T i was infected on t i , and L ji (t i ,t j ,θ) is the log-likelihood of connection between i and j defined as:

Tree proposals
At every iteration of the MCMC, a set of movements is used to propose an update of the transmission trees. This update is then accepted or rejected depending on the posterior density of the proposed trees. By default, eight movements are tested at each iteration. Three of them were already part of outbreaker2 and were not modified: (cpp_move_t_inf() changes the infection date of the cases; cpp_move_pi()changes the conditional report ratio; cpp_move_kappa() changes the number of generations between cases). Two movements were edited to scan each transmission tree in order to prevent different genotypes from being in the same tree: (cpp_move_alpha() changes the infector; cpp_move_swap_cases() swaps infector and infectee). The remaining three are new movements: • cpp_move_a() and cpp_move_b() change the spatial parameters a and b and update the probability of connection between regions.
• cpp_move_ancestor() changes the ancestor of the tree. An ancestor is defined as the first case of a transmission tree. For each ancestor i, an index case is drawn from the pool of potential infectors, while another link is randomly picked and deleted.

Use case
Description of the simulated data Two simulated datasets are included in o2geosocial: toy_outbreak_short and toy_outbreak_long.
Both are lists describing simulated outbreaks and include three elements: i) cases: a data.table with the ID, location, onset date, genotype, age group, import status, cluster, generation and infector of each case; ii) dt_regions: a data table with the ID, population, longitude and latitude of each region; iii) age_contact: a numeric matrix of the proportion of contact between age groups. Both simulations were ran using distributions of the generation time and the latent period typically associated with measles outbreaks: the incubation period followed a gamma distribution of mean 11.5 days (standard deviation 2.24 days) 33 ; the generation time followed a normal distribution truncated at 0 of mean 11.7 days (standard deviation 2.0 days) 34 .
In order to assess whether the method was able to reconstruct the transmission links between cases, we needed to simulate the transmission trees. Population-level compartmental models cannot be used to generate who-infectedwhom. Therefore, we generated the dataset at an individual level, by simulating different transmission trees in the area of interest. The transmission trees were generated using the following process: 2. We drew the number of secondary cases stemming from this case.
3. If the number of secondary cases was greater than 0, the characteristics of the new cases were drawn using the distributions of the generation time, incubation periods, the spatial kernel, and the proportion of contacts between age groups. 4. We repeated steps 2 and 3, for each new case, until no more secondary cases were drawn (i.e. the random reproduction number drawn in step 2 was 0 for all new case).
5. We repeated steps 1 to 4, until we reached a maximum number of cases, or maximum number of trees, defined before running the simulation.
Numerous factors influencing the transmission dynamics are not included in this simulation framework. However, we do not aim to generate transmission trees which describe the spread of a given pathogen (here measles) in a community with complete accuracy. The main aim of this simulated dataset is to highlight the inference capabilities of the reconstruction method, and to explore causes for discrepancies between the simulations and the model fits, in an ideal setting where all parameters are known and are accounted for in the model.
In this use case, we analyse toy_outbreak_short. The dataset contains 75 simulated cases from different census tracts of Ohio in 2014 (variable cens_tract). The census tracts represent areas established by the Bureau of Census for analyzing populations and generally contain between 2,500 to 8,000 inhabitants. The variable cluster describes the transmission tree each case belongs to, and "generation" is equal to the number of generations between the first case of the tree (generation = 1) and the case.
We reconstruct the cluster size distribution of the simulated outbreaks using different models. We then evaluate the agreement between the inferred and the reference transmission clusters in each model, and compare the results obtained with each model. Finally, we assess the geographical heterogeneity of the reconstructed transmission dynamics. We use the package data. In the simulated data, 95% of the clusters contain less than five cases, 47.6% of the clusters are isolated (also called singletons). One larger cluster includes 31 cases ( Figure 2).
# Reference cluster size distribution hist(table(dt_cases$cluster), breaks = 0:max(table(dt_cases$cluster)), ylab = "Number of clusters", xlab = "Cluster size", main = "", las = 1) Set up and run the models with outbreaker() We set up the distributions the model will use to reconstruct the transmission trees. We define f_dens as the duration of the latent period, and w_dens as the generation time. These distributions have previously been described for measles outbreaks 33,34,36,37 . In this example, the same distributions were used to generate the simulated data and fit the model. In real-life, there can be discrepancies between the actual distributions and their theoretical estimates. Therefore, we also fitted the model using different distributions of the latent period and generation time, and explored the impact it had on the accuracy of the inferred transmission trees (See Extended Data).
# Distribution of the latent period f_dens <-dgamma(x = 1:100, scale = 0.43, shape = 26.83) # Distribution of the generation time w_dens <-dnorm(x = 1:100, mean = 11.7, sd = 2.0) The age specific social contact patterns can be imported from the element age_contact of the list toy_out-break_short. Alternatively, one can use the R package socialmixr to import a social contact matrix from the POLYMOD survey 32 .
Routinely collected surveillance data can include information on the importation status of the cases. In order to investigate the impact of using prior information on the importation status of the cases on cluster reconstruction, we implement two different models: in out1 the import status is inferred by the model, whereas in out2 it is set as an input parameter of the model, which only estimates who infected whom.
The first short run in out1 is run with 10,000 iterations to find the minimum number of importations, and the main run lasts for 20,000 iterations in both models. As the import status of the cases is inferred in out1, we have to set a threshold to quantify what is an unlikely likelihood of transmission between cases. We use a relative outlier threshold at 0.9, which means that the threshold will be the 9 th decile of the negative log-likelihoods L i (t i , j, t j , θ) in every sample.

Compare inferred and reference clusters
The function summary prints a summary of the data frame generated by outbreaker(). It contains a list with the number of steps, the distributions of the posterior, likelihood and priors, the parameter distributions, the most likely infector and the probability of being an import for each case, and the cluster size distribution. In order to compare the reconstructed clusters to the data in each model, we plot the median inferred cluster size distribution in out1 and out2 and the credible intervals. First, we group together clusters of similar sizes by defining the breaks of each group in the vector group_cluster. In this example, we defined the size categories as 1; 2; 3 -4; 5 -9; 10 -15; 15 -40 and 40 + cases. The inferred cluster size distributions are shown in the element cluster from the output of summary(out1), and are aggregated using the input variable group_cluster.
Using different values of λ would impact the cluster size distribution in out1. Conversely, the cluster size distribution in out2 is very similar to the data (Figure 3).
First, we retrieve the boundary files of the census tracts in Ohio to generate the background of the maps using the package tigris 40 . We import them in a format compatible with the package sf and create one background map for each model. We are interested in two outputs of the models: i) the number of imports per region, in order to highlight regions where importations of cases are most likely, and ii) the geographical distribution of the number of secondary cases per case, which gives insight into the areas most vulnerable to the spread of the disease.

Number of imports per region:
The element tree of summary(out1) contains the most likely infector, the proportion of iterations where the index is the most likely infector and the median number of generations between the two cases, the most likely infection date and the chances of being an import for each case. We add two columns to dt_cases showing the probablity of being an import in out1 and out2 for each case. As the import status is not inferred in out2, prop_import2 is a binary value, and is equal to dt_cases$import. We plot the number of imports per region in each model ( Figure 5). The right panel (out2) shows the geographical distribution of importations in the data. We observe discrepancies between the two panels. In out1, the inferred number of importations in the central areas is much lower than in the reference data. These maps highlight the uncertainty added when the import status of each case is inferred.

Average number of secondary cases per region:
In this section, we map the number of secondary cases per case in each region to identify which regions were associated with higher levels of transmission. We define the function n_sec_per_reg to compute the average number of secondary cases per case and aggregate it per region. We then extract the median number of secondary cases per case in each region.

map_data$n_sec <-as.numeric(n_sec_data[as.character(map2$GEOID)])
We now plot the geographical distribution of the median number of secondary cases in each region according to the models, and compare it with the simulations (Figure 6). Despite minor discrepancies, the maps generated by the two models are similar. Both show an important spatial heterogeneity. The eastern and central areas are associated with higher numbers of secondary cases. If we change the vectors n_sec1 and n_sec2 to plot different deciles, we show the dispersion of the number of secondary cases in the different iterations of the models. Similarly, we observe minor differences between the maps generated by the models and the simulated data. Most of the regions that repeatedly caused further transmissions in the simulations are identified by the models. In the Extended Data, we compared the regional number of secondary transmissions in the simulated data to the 95% Credible Intervals of both models, and found that the models were able to capture the input values in each region. Customise the likelihood, prior and movement lists: the Stouffer's rank model In the previous example, we ran and evaluated two different models to reconstruct transmission clusters from simulated surveillance data, and highlighted the spatial heterogeneity of measles transmission in the region. These models were run using the default likelihood, prior and movement functions. Now we develop a third model, where the spatial connection between regions is based on the Stouffer's rank method 30 .

# Edit the lists of movements and priors
We plot the inferred cluster size distribution and compare it to the reference data (Figure 7). We observe discrepancies between the inferred distribution and the data: the model over-estimates the number of clusters containing more than 15 cases and underestimates the number of small clusters and isolated individuals. Finally, we plot the proportion of perfect and close matches for each case (Figure 8). We observe that the fit obtained with the Stouffer's rank method is consistently worse than the first two models. The Stouffer's rank method did not improve the agreement between the inferred trees and the reference data.
The simulated data used in the study were generated using an exponential gravity model, which explains why introducing the Stouffer's rank method did not improve the inference. This is not representative of the performance of each mobility model at reconstructing actual transmission clusters.

Discussion
The R package o2geosocial is a new tool for data analysis building upon the framework developed in outbreaker2.
It uses routinely collected surveillance data to reconstruct transmission networks. It can be used on a broad range of diseases where genetic sequencing is not common, or informative. For instance, it has been applied on national measles surveillance data to reconstruct the cluster size distribution of outbreaks in the United States between 2001 and 2016 7 . In this study, we presented an application on a simulated dataset using detailed geographic information on the location of cases.
We implemented several models to reconstruct the cluster size distribution of the simulated outbreak.
Although each model was able to capture the overall dynamics of transmission, we observed discrepancies between the reference data and the reconstructed cluster size distribution for models where the importation status of the cases was inferred. These discrepancies are linked to the threshold set to define what is considered an unlikely connection. A looser threshold may lead to unrelated cases being connected and a lower number of inferred imports, whereas a stricter threshold increases the number of short transmission chains. Therefore, the use of epidemiological information describing importation status improves the accuracy of the transmission cluster reconstruction in o2geosocial. In case of incomplete epidemiological information, the user can set the importation status for some of the cases, and the others would be inferred. These results highlight that epidemiological investigations are crucial to improve our ability to reconstruct transmission events, particularly when unrelated importations happen concurrently.
The method described in this paper does not account for long-distance transmission, as transmission events are impossible in o2geosocial when the distance between regions is above the parameter gamma. In case of longdistance transmission, the infected case would be considered as a new importation. Nevertheless, this limitation is not critical since o2geosocial was designed to identify areas most susceptible to local transmission, i.e. regions where importations were likely to lead to local outbreaks.
The use of transmission trees and transmission clusters to assess current or future risk of outbreaks comes with various limitations. First, it relies upon the assumption that previous transmission patterns are representative of future outbreaks. Second, it requires past observed transmission events, and does not account for the number of opportunities of transmission per case. Where only sporadic isolated cases have been reported in the country, it is not possible to draw relevant conclusions on communities potentially most vulnerable to transmission. Third, partial detection of cases may bias the cluster size distribution, and under-estimate the number of secondary transmissions per case. Patterns of transmission, and characteristics associated with high-transmission events may still be observable but could introduce a bias if reporting is itself is affected by the same factors as is transmission. Finally, the use of transmission trees for real time modelling can be challenging, given the rightcensoring of transmissions caused by recent infectious individuals 42 . The default implementation of the method assumes that the generation times are independent and identically distributed throughout an outbreak, whereas in reality, depletion of susceptibles and competing risk of infection through clustering of contacts would be expected to affect the generation interval. The method can be customised to integrate time varying generation intervals set prior to running the models. However, estimating the distribution of the generation interval during the inference procedure is more challenging to implement in the current framework, which may introduce a bias in our results.
The analyses presented in this paper were run on simulated data, which partly explains the very close match between the inferred and reference cluster size distribution. Indeed, the distributions of the incubation period and generation time used to generate the simulations were the same as the ones used for cluster inference in the Main Analysis. Using imprecise or inaccurate distributions can lead to biases in the reconstruction of the transmission trees. We re-ran the inference procedure using different distributions (changing the mean or the standard deviation), the results can be seen in the Extended Data. When the distributions were set with lower standard deviations, several links were not observed in the inferred transmission trees anymore. Indeed, these connections had been made impossible since the range of likely values was narrower. In all other examples, the simulated and inferred clusters size distribution remained very close, we only observed a slight drop in the proportion of iterations that contain the right transmission links. Since the likelihood of connection is computed from several components, the discrepancies between the distributions used in the simulations and the model fits did not substantially changed the inferred trees.
We also showed how the model could be edited to implement different mobility models. Describing human mobility during infectious diseases outbreaks is challenging, and the performance of the models depends on the setting 29, [43][44][45] . Future developments in the package will focus on facilitating the integration of new variables in the likelihood of connection, such as workplace or school. Currently, such variables would have to be integrated within one of the components of likelihood. We aim to simplify the addition of new parameters and components in the inference framework. We encourage the development of extensions of o2geosocial to study a wide range of pathogens and settings where sequence data are not informative. We hope that wider use of o2geosocial can help maximise the information brought by routinely collected data and epidemiological investigations, in order to improve our understanding of outbreak dynamics.

Gerry Tonkin Hill
Universitetet i Oslo, Oslo, Oslo, Norway Robert et al. present an R package for inferring transmission chains using routinely collected surveillance data. The package is specifically designed for scenarios where integrating genomic data into the likelihood function is not as effective, such as for pathogens with extremely low mutation rates. The manuscript addresses an important problem and the R package is likely to be useful to both researchers and public health laboratories.
There are a few areas that I think the authors could expand on. The simulated outbreak consisted primarily of very small clusters with ~47% involving only a single isolate. Smaller clusters considerably simplify the problem of inferring transmission chains. It would be good if the authors could investigate how the accuracy of the approach changes with larger clusters. This could be achieved with additional simulations or by adjusting/adding to Figure 4 to indicate how cluster size impacted these results. Second paragraph of the introduction: Although outbreaker2 is acknowledged throughout the manuscript, I think it would be good to mention that outbreaker2 is also highly customisable in the introduction. For example, I think (although I am not sure) that it should be possible to customise the contact function in outbreaker2 to incorporate spatial

Eben Kenah
Biostatistics Division, College of Public Health, The Ohio State University, Columbus, OH, USA While I appreciate the quality of the programming in o2geospatial and the detailed demonstration of its use, I think the interpretation of its results in infectious disease epidemiology is far from clear. The use of cluster size distributions to investigate the transmission of the disease is difficult because the apparent clusters are highly sensitive to the detection of cases, which is a particular problem if there are many subclinical cases as in influenza or COVID-19. The use of transmission tree reconstruction to understand risk factors for transmission is equally difficult because we are often lacking denominators (e.g., how many opportunities were there to transmit) and because a tree always has a mean outdegree just below one, which means that the censoring of secondary infections caused by infectious individuals in the later part of the observation period needs to be accounted for (e.g., see Fraser, PLOS One, 2007).
The use case centers around a simulated data set whose generation is not clearly described, and it is often difficult to compare the model results to the "truth" contained in the simulation. The analyses of the simulated data use the same incubation period and serial interval distributions, so the analyses do not give us any sense of how the results would be affected by the misspecification of these distributions. This is mentioned briefly in the Discussion (to the authors' credit), but it is a fundamental shortcoming of the use case because these distributions are never known in practice.
(page 5, conditional report ratio) It would be more accurate to say that the number of missing generations follows a geometric distribution with mean (1 -\rho) / \rho, not an exponential distribution.

2.
(page 5, conditional report ratio) It is confusing to denote the conditional report ratio but then to refer to its prior distribution as prior_pi in the package. It would also be good to point out to readers that the prior distribution on the conditional report ratio is informative: It has a mean of 0.91 and a standard deviation of 0.08.

3.
(page 5, time component) The definition of f_dens given in the text does not seem to match its definition in outbreaker_data.R, where it is defined as the "colonization time, i.e. the time interval during which the pathogen can be sampled from the patient". The incubation period is usually defined as the time between infection and the onset of symptoms, so it would be helpful to specify whether T_i is meant to be the time of the report (as in the text) or the reported symptom onset date (as implied by calling T_i -t_i an incubation period).

4.
(page 5, time component) There are several problems that stem from the use of independent and identically distributed generation intervals as in the likelihood of Wallinga and Teunis (2004). First, generation intervals do not typically have the same distribution throughout an epidemic. They contract due to depletion of susceptibles and competing risks of infection from multiple sources. Thus, these likelihoods are only strictly correct in the very early stages of an epidemic (and only if there is negligible clustering of contacts). Second, it precludes an analysis of the effects of covariates on the risk of transmission between two individuals. Many critical questions in the response to an epidemic involve 5.
mechanisms of transmission and determinants of susceptibility and infectiousness. The proposed methods can address these questions only indirectly and on a coarse scale.
(page 6, description of simulated data) The simulations used to generate the data need to be described in more detail, including how the clusters were defined. The simulations themselves appear to be branching processes, which have several important differences from true stochastic epidemic models that should be mentioned.

6.
(page 7, R code above Figure 2) The use of a normal distribution for a time-to-event distribution is surprising but not truly problematic. It might be good to call it something like a "normal distribution truncated at zero". 7.
(page 8, top box) The R code in this box is inconsistent in the use of spaces when setting argument values. I believe the standard is to put spaces around the equals signs, which is followed through most of the paper. 8.
(page 15, Figure 5) In Figure 4, the results from model 2 are the truth. In this figure, there is no obvious way of comparing the results of the models to any truth. The models produce very similar results, but do the estimated numbers of secondary cases correlate with an underlying true parameter such as a reproduction number? 9.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Statistical methods for infectious disease epidemiology, mathematical models of epidemics, epidemiologic methods, causal inference, survival analysis I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Alexis Robert
Thank you for your thoughtful comments on this paper. Please find our responses below We agree that the limitations associated with the use of transmission trees should have been described in more detail. We have added a paragraph in the Discussion to list the main limitations associated with the use of cluster size distribution and transmission trees, and explain their value in infectious disease epidemiology.
"The use of transmission trees and transmission clusters to assess current or future risk of outbreaks comes with various limitations. First, it relies upon the assumption that previous transmission patterns are representative of future outbreaks and requires previous incidence in the region of interest to be informative. Second, it requires past observed transmission events, and does not account for the number of opportunities of transmission per case. Where only sporadic isolated cases have been reported in the country, it is not possible to draw relevant conclusions on communities potentially most vulnerable to transmission. Third, partial detection of cases may bias the cluster size distribution, and under-estimate the number of secondary transmissions. Patterns of transmission, and characteristics associated with high-transmission events may still be observable but could introduce a bias if reporting is itself is affected by the same factors as is transmission. Finally, the use of transmission trees for real time modelling can be challenging, given the rightcensoring of transmissions caused by recent infectious individuals (Fraser, 2007)." Despite these limitations we believe our package is an useful contribution to the field: As we outlined in the Introduction, transmission trees have repeatedly been used in the recent years to investigate various aspects of the transmission dynamics, such as superspreading events, determinants of transmission, or the impact of various policies, including for pathogens where part of the cases was not detected (Wang et al Nature 2020, Wang & Teunis Frontiers in Medicine, 2020, James et al, PLOS One 2021). The method we developed aims to improve the inference of transmission links between cases, so that information on outbreak dynamics can be deduced from transmission trees even in settings where they are not reconstructed through thorough contact tracing investigations.

1.
We agree that it is important to highlight the impact of potential misspecifications of the input distributions. The first section of the Extended Data document describes the impact of changing the mean or the standard deviation of both distributions on the accuracy of the reconstructed transmission trees. This is now also described in the Discussion; "We re-ran the inference procedure using different distributions (changing the mean 2. or the standard deviation), the results can be seen in the Extended Data. When the distributions were set with lower standard deviations, several links were not observed in the inferred transmission trees anymore. Indeed, these connections had been made impossible since the range of likely values was narrower. In all other examples, the simulated and inferred clusters size distribution remained very close, we only observed a slight drop in the proportion of iterations that contain the right transmission links. Since the likelihood of connection is computed from several components, the discrepancies between the distributions used in the simulations and the model fits did not substantially changed the inferred trees." This typo has been fixed, thank you for spotting it. 3.
We changed the distribution to a geometric distribution. 4.
We changed the definition of the conditional report ratio in the main text to Π , in order to match the commands defined in o2geosocial. We added the following to clarify that the default prior distribution was informative and could be edited before running the model: "By default, the prior distribution is parametrised as Beta(10,1), which is an informative prior of mean 0.9 and standard deviation 0.08.The two parameters of the beta prior distribution can be changed using the variable prior_pi in create_config().""

5.
Thank you for noticing this. Indeed, f_dens represents the incubation period, i.e. the time between the reported symptom date and the infection date. Therefore, T_i is the reported symptom onset date. We clarified this in the Main Text, and changed the description of the variable in outbreaker_data.R. This will be updated on CRAN at the next submission.

6.
We agree that using independent and identically distributed generation intervals is a simplification, and does not represent the complexity observed throughout an epidemic. However, the temporal component of the likelihood function could be customised to generate different distributions of the generation intervals at different points of the outbreak. This would require implementing a new temporal likelihood function, which would use different values of w_dens depending on the time and regions , and adding it to the model using custom_likelihood(timing_infecttions = new_ll_timing()). We agree that in the current version of o2geosocial, the distributions used at different points would have to be defined prior to running the model, and could not be estimated as part of the fitting procedure. This will be an avenue for future development of the package, in order to seamlessly integrate new parameters to the models. We summarised these two points at the end of the "Customise the likelihood..." section: "In this use case, we only explored customising the spatial component. However, the other components of the likelihoods can also be edited, using the functions custom_priors(), custom_likelihoods(), or custom_moves(). For instance, to account for changes in the distribution of the generation time throughout an outbreak (Svensson 7. 2007), one would have to change the element timing_infections of custom_likelihoods(). However, the distribution would need to be set prior to running the models." In the Discussion we added "The default implementation of the method assumes that generation intervals are independent and identically distributed throughout an outbreak, whereas in reality, depletion of susceptibles and competing risk of infection through clustering of contacts would be expected to affect the generation interval. The method can be customised to integrate time varying generation intervals set prior to running the models. However, estimating the distribution of the generation interval during the inference procedure is more challenging to implement in the current framework, which may introduce a bias in our results." We added the following paragraph to to the Use Case section to describe how the simulated data were generated: "In order to assess whether the method was able to reconstruct the transmission links between cases, we needed to simulate the transmission trees. Population-level compartmental models cannot be used to generate who-infected-whom. Therefore, we generated the dataset at an individual level, by simulating different transmission trees in the area of interest. The transmission trees were generated using the following process: 1. We created an imported case, with random onset date, region of origin, and age group. 2. We drew the number of secondary cases stemming from this case, using a random reproduction number. 3. If the number of secondary cases was greater than 0, the characteristics of the new cases were drawn using the distributions of the generation time, incubation periods, the spatial kernel, and the proportion of contacts between age groups. 4. We repeated steps 2 and 3, for each new case, until no more secondary cases were drawn (i.e. the random reproduction number in step 2 was 0 for all new case). 5. We repeated steps 1 to 4, until we reached a maximum number of cases, or maximum number of trees, defined before the simulation. Numerous factors influencing the transmission dynamics are not included in this simulation framework. However, we do not aim to generate transmission trees which describe the spread of a given pathogen (here measles) in a community with complete accuracy. The main aim of this simulated dataset is to highlight the inference capabilities of the reconstruction method, and to explore causes for discrepancies between the simulations and the model fits, in an ideal setting where all parameters are known and are accounted for in the model."

8.
We changed how we refer to this distribution, we now use "normal distribution truncated at zero". Thank you for this suggestion.

9.
We fixed the use of spaces in this box. 10.