Statistical models for the deterioration of kidney function in a primary care population: A retrospective database analysis

Background: Evidence for kidney function monitoring intervals in primary care is weak, and based mainly on expert opinion. In the absence of trials of monitoring strategies, combining a model for the natural history of kidney function over time with a cost-effectiveness analysis offers the most feasible approach for comparing the effects of monitoring under a variety of policies. This study aimed to create a model for kidney disease progression using routinely collected measures of kidney function. Methods: This is an open cohort study of patients aged ≥18 years, registered at 643 UK general practices contributing to the Clinical Practice Research Datalink between 1 April 2005 and 31 March 2014. At study entry, no patients were kidney transplant donors or recipients, pregnant or on dialysis. Hidden Markov models for estimated glomerular filtration rate (eGFR) stage progression were fitted to four patient cohorts defined by baseline albuminuria stage, adjusted for sex, history of heart failure, cancer, hypertension and diabetes, and annually updated age. Results: Of 1,973,068 patients, 1,921,949 had no recorded urine albumin at baseline, 37,947 had normoalbuminuria (<3 mg/mmol), 10,248 had microalbuminuria (3–30 mg/mmol), and 2,924 had macroalbuminuria (>30 mg/mmol). Estimated annual transition probabilities were 0.75–1.3%, 1.5–2.5%, 3.4–5.4% and 3.1–11.9% for each cohort, respectively. Misclassification of eGFR stage was estimated to occur in 12.1% (95%CI: 11.9–12.2%) to 14.7% (95%CI: 14.1–15.3%) of tests. Male gender, cancer, heart failure and age were independently associated with declining renal function, whereas the impact of raised blood pressure and glucose on renal function was entirely predicted by albuminuria. Conclusions: True kidney function deteriorates slowly over time, declining more sharply with elevated urine albumin, increasing age, heart failure, cancer and male gender.
Consecutive eGFR measurements should be interpreted with caution as observed improvement or deterioration may be due to misclassification.


Introduction
The National Institute for Health and Care Excellence recommends monitoring kidney function using estimated glomerular filtration rate (eGFR) in people with, or at risk of, chronic kidney disease (CKD) 1 . The guideline suggests increasing the intensity of monitoring according to the current level of eGFR and albumin-creatinine ratio, stating that monitoring should be tailored according to i) the underlying cause of CKD and ii) past patterns of eGFR and albumin-creatinine ratio, comorbidities, changes to treatments such as renin-angiotensin-aldosterone system antagonists, inter-current illness and whether the patient has chosen conservative management of CKD. One of the objectives of monitoring eGFR is to detect progression of CKD, which could precede end-stage renal disease (ESRD). ESRD is associated with substantial morbidity and mortality, with cardiovascular disease mortality rates 10 to 30 times higher in patients on dialysis than in the general population 2 . Yet, kidney function declines slowly with age and ESRD is rare, even for people with moderately impaired renal function (eGFR 30-59 ml/min/1.73 m²). In a study of 58,000 people with CKD stage 3 who were followed for 10 years, the cumulative incidence of ESRD was 40 per 1,000 people 3 . It follows that recommendations to monitor everyone annually or more frequently in a community setting for progressive kidney function loss will have a poor yield. Furthermore, as eGFR is a noisy measurement, with a within-person coefficient of variation estimated to be approximately 5.5% 4 , it is likely that two consecutive eGFR measurements may appear to indicate declining renal function when underlying renal function is stable (false positive), or stable renal function when underlying renal function has deteriorated (false negative).
Finally, it is arguable as to whether there are any actions that can be taken to halt the deterioration of renal function if progressive CKD is found, as there is currently very little evidence that "catching" CKD early produces any benefit 5 .
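The measurement-noise argument made earlier in this paragraph can be illustrated numerically. The sketch below assumes, purely for illustration, that a single eGFR reading is normally distributed around the true value with a standard deviation equal to 5.5% of that value. The function name and the example eGFR values are hypothetical and are not taken from the study.

```python
import math


def prob_observed_below(true_egfr, threshold, cv=0.055):
    """P(observed eGFR < threshold), assuming normal error with SD = cv * true_egfr."""
    sd = cv * true_egfr
    z = (threshold - true_egfr) / sd
    # Standard normal cumulative distribution function written with math.erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Illustrative true values close to the 60 ml/min/1.73 m^2 boundary.
    for true_value in (58.0, 62.0, 70.0):
        p = prob_observed_below(true_value, 60.0)
        print(f"true eGFR {true_value:.0f}: P(single reading < 60) = {p:.3f}")
```

Under this assumed error model, a person whose true eGFR sits just above a stage boundary has a substantial chance of a single reading falling below it, which is the kind of spurious change the paragraph describes.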
There have been no trials of screening or monitoring for CKD 6 and recommendations for how frequently monitoring should take place are based on expert opinion. In the absence of trials, combining a model for the natural history of kidney function over time with a cost-effectiveness analysis offers the most feasible approach for comparing the effects of monitoring under a variety of policies. The aim of this study was to create a model for kidney disease progression using routine measures of kidney function. Our approach simultaneously estimates the true rate of kidney function loss and the probability of misclassification that inevitably occurs from using eGFR. Our study is conducted in a general primary care population and our results will be useful in guiding future recommendations for the timing of monitoring eGFR in primary care.

Ethical statement
The protocol for this research was approved by the Independent Scientific Advisory Committee of the Medicines and Healthcare Products Regulatory Agency (protocol number 14_150R). Ethical approval for observational research using the Clinical Practice Research Datalink with approval from the Independent Scientific Advisory Committee has been granted by a National Research Ethics Service committee (Trent Multi Research Ethics Committee, REC reference number 05/MRE04/87).

Source and selection of participants
We used the UK Clinical Practice Research Datalink (CPRD) 7 to construct an open cohort of adults (≥18 years of age) registered at practices deemed to have "acceptable" patient records (termed "up-to-standard" in CPRD). We included patient records starting from 1 April 2005, postdating the publication of the Kidney Disease Outcomes Quality Initiative (KDOQI) guidelines for the classification of CKD in 2002 8 and the introduction of Quality and Outcomes Framework targets in UK primary care in 2004. The study end date was 31 March 2014. Eligible patients had to be registered with their practice for a minimum of 12 months before study entry to ensure adequate recording of baseline covariates. We excluded patients who, in the 12 months before study entry, were pregnant, were receiving dialysis, or were living kidney donors or recipients. Follow-up ended at the study end date, unless preceded by the date of death, transfer out of CPRD, the last available linked data, or (where applicable) pregnancy, renal transplantation/donation, or dialysis.

Amendments from Version 1
In response to the Reviewers' comments, we have added a paragraph to the Discussion explaining the potential impact of outcome measure categorisation and Markov model assumptions. Regarding the loss of information, it is typically small when the number of categories is large (as was the case in our study). As for potential violations of the Markov assumption, these were mitigated through the stratification of our models by baseline albuminuria stage (a key factor determining trajectory of kidney function decline) and through conditioning the models on updated covariates.
One of the key assumptions of our models, which was not previously highlighted, was that kidney function could only deteriorate with age. We have added a sentence to our Discussion to highlight this.
We have also edited the Discussion to clarify our study's eligibility criteria and to explain that the results are generalisable to the UK general practice population (as opposed to the UK population). Serum creatinine tests are commonly performed in UK general practice, and not necessarily for the purpose of monitoring kidney function. Therefore, our study was able to capture an unselected UK general practice population, but may have omitted patients with end-stage renal disease, or patients who did not engage with their general practice.

The HMMs comprised two components, a multi-state model governing the 'true' underlying progression of CKD, and a second model for the probability of misclassification to allow for the variability in eGFR. The underlying model for CKD was parametrised as uni-directional, in which true kidney function could only deteriorate over time (no spontaneous improvement). The outcome was eGFR stage based on the criteria used for the diagnosis of CKD, i.e. G1-G5. We combined stages G1 and G2 for the purposes of improving model fit. Death from any cause was assumed to be an absorbing state. A representation of the HMMs is depicted in Figure 1.
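As an illustration of the outcome definition above, the sketch below maps a single eGFR value to the five observed living states, with G1 and G2 combined. The cut-points are the standard GFR category thresholds (60, 45, 30 and 15 ml/min/1.73 m²). That the study applied exactly these boundaries is an assumption here, and the function name is hypothetical.

```python
def egfr_stage(egfr):
    """Map an eGFR value (ml/min/1.73 m^2) to an observed state, with G1 and G2 combined."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 60:
        return "G1/2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


if __name__ == "__main__":
    for value in (100, 62, 59, 44, 20, 10):
        print(value, egfr_stage(value))
```

Note that values of 100 and 62 land in the same combined state, so the categorised outcome does not distinguish between them.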
The HMMs were specified so that it was possible for misclassification to occur in neighbouring eGFR categories. Hence, for a person with true GFR >60 ml/min/1.73 m² we specified the model so that a single measurement of eGFR could fall within a G3a or G3b category due to measurement error and biological variation, but not G4 or G5. For a person with true eGFR in stage G3b, a single measurement of eGFR could be misclassified as either G1/2, G3a, G4 or G5. Death was the only state assumed to be always classified correctly.
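The structure described in the two paragraphs above can be written down as a pair of boolean masks, one for permitted true-state transitions and one for permitted misclassifications. This is a minimal sketch under two assumptions that the text does not state explicitly: that true kidney function moves only to the next stage or to death, and that a reading may be misclassified by at most two stages (a rule consistent with the G1/2 and G3b examples given). All names are hypothetical.

```python
STATES = ["G1/2", "G3a", "G3b", "G4", "G5", "Death"]
DEATH = len(STATES) - 1


def allowed_transitions():
    """True where an instantaneous transition of the hidden state is permitted."""
    n = len(STATES)
    mask = [[False] * n for _ in range(n)]
    for i in range(DEATH):
        if i + 1 < DEATH:
            mask[i][i + 1] = True  # deterioration to the next stage only (assumed)
        mask[i][DEATH] = True      # death is reachable from every living state
    return mask                    # the Death row stays all False: absorbing state


def allowed_misclassification(max_distance=2):
    """True where a hidden state (row) may be observed as a different state (column)."""
    n = len(STATES)
    mask = [[False] * n for _ in range(n)]
    for true_state in range(DEATH):
        for observed in range(DEATH):
            distance = abs(observed - true_state)
            if 0 < distance <= max_distance:  # assumed two-stage limit
                mask[true_state][observed] = True
    return mask                    # death is never misclassified in either direction


if __name__ == "__main__":
    for name, row in zip(STATES, allowed_misclassification()):
        print(name, [STATES[j] for j, ok in enumerate(row) if ok])
```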
To assess model fit, we used a split-sample approach. Although this is a weak procedure for low-variance methods, such as the Cox proportional hazards model or logistic regression, it is useful for a model that can be over-parametrised or exhibit convergence issues (such as an HMM). We split the data using pseudo-random numbers into equal size training and testing data sets. The model was fitted in the training data set and then used to predict trajectories of eGFR for patients in the testing data set, based on their measurement times and covariates. Calibration plots were used to compare the predicted and observed proportion of tests falling within each eGFR category over time. Annual transition rates for kidney function loss and death from any cause were estimated from the model, along with the misclassification probabilities and transition rate multipliers for age, sex, heart failure and cancer, and presented as state model diagrams. The models were used to estimate the probability of progression to a higher stage within six, 12 or 36 months, along with the probability that an eGFR test taken at that time would detect the change (true positive), and the probability that a change in eGFR stage would occur in a person in whom true kidney function had not changed (false positive), for all cohorts for baseline stages G3a and G3b; see Supplementary Tables S18-21 (Extended data) 14 .
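The progression and detection probabilities described above can be sketched for a small continuous-time model. The example below uses a reduced, hypothetical state space (G3a, G3b, G4 or worse, Death) with made-up intensities and misclassification probabilities; none of the numbers are study estimates. It computes transition probabilities as the matrix exponential of the intensity matrix multiplied by elapsed time, and its definitions of true and false positives are one simplified reading of the text, not necessarily the authors' exact calculation.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) with a Taylor series plus scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical annual intensities for [G3a, G3b, G4+, Death]; each row sums to zero.
Q = [[-0.05, 0.03, 0.00, 0.02],
     [0.00, -0.07, 0.04, 0.03],
     [0.00, 0.00, -0.06, 0.06],
     [0.00, 0.00, 0.00, 0.00]]

# Hypothetical misclassification probabilities, E[true state][observed state].
E = [[0.90, 0.10, 0.00, 0.00],
     [0.08, 0.85, 0.07, 0.00],
     [0.00, 0.10, 0.90, 0.00],
     [0.00, 0.00, 0.00, 1.00]]


def monitoring_summary(q, e, start, years, death):
    """Return (P(progressed), P(test shows higher stage | progressed),
    P(test shows higher stage | true stage unchanged))."""
    p = mat_exp(q, years)
    higher = range(start + 1, death)
    progressed = sum(p[start][s] for s in higher)
    detected = sum(p[start][s] * sum(e[s][o] for o in higher) for s in higher)
    true_positive = detected / progressed if progressed > 0 else 0.0
    false_positive = sum(e[start][o] for o in higher)
    return progressed, true_positive, false_positive


if __name__ == "__main__":
    for months in (6, 12, 36):
        prog, tp, fp = monitoring_summary(Q, E, start=0, years=months / 12, death=3)
        print(f"{months:>2} months: progressed={prog:.3f} TP={tp:.3f} FP={fp:.3f}")
```

In this simplified version the false-positive probability does not depend on the testing interval, whereas the chance of true progression grows with it, which is the trade-off that motivates comparing intervals.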
Finally, we estimated global misclassification probabilities for the four cohorts using the Viterbi algorithm 15 to find the underlying sequence of true eGFR stages with the highest probability given the observed sequence. Assuming the state predicted by the model was the truth, we calculated the proportion of times the observed state was a lower stage than predicted (under-grading) and the proportion of times the observed state was a higher stage than predicted (over-grading), and then added these together to calculate the total proportion of misclassified tests in each cohort.
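The decoding and grading steps described above can be sketched as follows. This is a discrete-time simplification with evenly spaced tests and made-up probabilities for three ordered states; the study's models are continuous-time and were fitted with the msm package, so the code illustrates the general technique rather than the authors' implementation. All names and numbers are hypothetical.

```python
import math


def _log(x):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observations, initial, transition, emission):
    """Most probable hidden-state path for a discrete-time HMM (log-space)."""
    n = len(initial)
    score = [_log(initial[s]) + _log(emission[s][observations[0]]) for s in range(n)]
    back = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: score[r] + _log(transition[r][s]))
            new_score.append(score[best] + _log(transition[best][s]) + _log(emission[s][obs]))
            pointers.append(best)
        score = new_score
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def grading_proportions(observed, decoded):
    """Proportions of tests graded below (under) and above (over) the decoded state."""
    n = len(observed)
    under = sum(o < d for o, d in zip(observed, decoded)) / n
    over = sum(o > d for o, d in zip(observed, decoded)) / n
    return under, over, under + over


# Hypothetical parameters: states 0 < 1 < 2 in severity, deterioration only.
INITIAL = [0.80, 0.15, 0.05]
TRANSITION = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
EMISSION = [[0.85, 0.15, 0.00], [0.10, 0.80, 0.10], [0.00, 0.15, 0.85]]

if __name__ == "__main__":
    observed = [0, 1, 0, 0, 1, 1]
    decoded = viterbi(observed, INITIAL, TRANSITION, EMISSION)
    print("decoded:", decoded)
    print("under, over, total:", grading_proportions(observed, decoded))
```

Because backward transitions have zero probability in this sketch, the decoded path can never improve, so an observed improvement is always attributed to misclassification.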
All analyses were performed in R version 3.6.1 ("Action of the Toes") 16 , with HMMs fit using version 1.6.7 of the msm package 17 . Scripts used in these analyses are available (see Software availability) 18 .

Results
The initial data set comprised 3,338,526 patients. A total of 1,365,458 patients whose records contained fewer than three eGFR tests were excluded, leaving 1,973,068 patients eligible for analysis: 1,921,949 without a urine albumin test on record, 37,947 with normoalbuminuria (<3 mg/mmol), 10,248 with microalbuminuria (3-30 mg/mmol), and 2,924 with macroalbuminuria (>30 mg/mmol). Each of the four cohorts was split into two halves, nominated as training and testing data sets. Due to the computational demands of the statistical method used, we randomly selected a sub-cohort of 50,000 patients to fit the model in the cohort without a urine albumin test on record. Summary statistics of patient characteristics from the four cohorts are presented in Table 1.
Six-state continuous-time HMMs adjusted for sex, heart failure, cancer, hypertension and diabetes, and annually updated age were fitted to the four training data sets. Hypertension and diabetes were subsequently removed from the models as they were unable to predict eGFR stage progression or death. All models converged to their respective maximum likelihood estimates, with positive definite Hessian matrices permitting confidence interval estimation for all parameters. Intensity, transition and misclassification matrices for these models are given in Supplementary Tables S2-13 (Extended data) 14 . Figure 2 shows the annual transition and misclassification probabilities for a woman, aged 60, without heart failure or a previous diagnosis of cancer and with no urine albumin test on record. The figure shows that if kidney function is normal (G1/G2) then the probability of her true kidney function deteriorating to stage G3a in one year is estimated to be 1.1%. The probability that a single eGFR test will be misclassified as G3a is 2.9%, while the probability that it will correspond to her true stage is 97.1%. The probability that this woman dies within a year is estimated to be 0.7%. The probability that her kidney function remains in this category is 98.2%. For each additional year of age, transition probabilities should be multiplied by 1.08 for kidney function and 1.09 for death. For example, the annual transition probability from stage G3b is 1.0% for a 60-year-old woman, but 1.0 × 1.08^10 = 2.16% for a 70-year-old woman and 1.0 × 1.08^20 = 4.66% for a woman who is 80 years old. Multipliers in which the confidence interval overlapped "no effect" are set to 1.00. Figure 3 represents annual transitions for a woman with the same characteristics, but who has had her urine albumin tested and found to be in the normoalbuminuric range.
Corresponding annual transition probabilities for kidney function are nearly twice those of an equivalent woman without a urine albumin test on record. Respective transition rates to death from each stage are also higher, illustrating that this cohort represents women in poorer health. Misclassification probabilities and transition probability multipliers are broadly similar to Figure 2. Figure 4 and Figure 5 show results for women with micro- and macroalbuminuria, respectively. Kidney function transition probabilities are higher, as are annual transition probabilities for death. Fewer transition multipliers are significant for these cohorts but this probably reflects the smaller cohort sizes and correspondingly reduced statistical power. Table 2 shows the results from applying the Viterbi algorithm to the four cohorts. Under-grading of eGFR stage occurs more often than over-grading in all cohorts but over-grading tends to increase for cohorts having urine albumin tests. In total, 12.1% (11.9-12.2%) of all tests done in the unmeasured urine albumin cohort are misclassified, 13.1% (13.0-13.3%) in patients
Mean sojourn time, i.e. the average time spent in each state, decreased with increasing severity of eGFR and albuminuria stage (Table 3). One exception was for macroalbuminuric patients in eGFR stage G5, for whom the mean sojourn time was greater than for microalbuminuric patients in eGFR stage G5. However, few patients were present in the more severe disease states and the 95% confidence intervals of the two estimates substantially overlap.
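Two quantities used in the Results can be reproduced with a few lines of arithmetic: the compounding of the per-year age multiplier, and the mean sojourn time of a state in a continuous-time Markov model, which is the reciprocal of the total rate of leaving that state. The sketch below uses the 1.0% baseline and 1.08 multiplier quoted for Figure 2; the exit rate in the sojourn example is a hypothetical value, and the function names are illustrative.

```python
def age_adjusted_probability(base_percent, multiplier_per_year, years_older):
    """Scale an annual transition probability (%) by a per-year age multiplier."""
    return base_percent * multiplier_per_year ** years_older


def mean_sojourn_time(total_exit_rate):
    """Mean time spent in a state whose total rate of leaving is total_exit_rate per year."""
    if total_exit_rate <= 0:
        raise ValueError("an absorbing state has no finite sojourn time")
    return 1.0 / total_exit_rate


if __name__ == "__main__":
    for age in (60, 70, 80):
        value = age_adjusted_probability(1.0, 1.08, age - 60)
        print(f"age {age}: {value:.2f}% per year")
    # Hypothetical exit rate of 0.05 per year, not a study estimate.
    print(f"mean sojourn time: {mean_sojourn_time(0.05):.1f} years")
```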

Discussion
We have developed a statistical model for kidney function monitoring over time, using a large clinical database of longitudinal kidney function measurements from an unselected primary care cohort. This model takes into account that observed kidney function is measured with error and uses statistical methodology to estimate the underlying 'true' rate of progression. We stratified our models by albuminuria stage in accordance with the findings of previous studies that showed that urine albumin excretion is a significant risk factor for the progression of CKD and the development of ESRD 19-21 . Our analyses suggest that kidney function declines more rapidly in men than in women, independent of other risk factors. Existing evidence for differences in the rates of progression between men and women is conflicting 3,22,23 . Our analysis supports the observations of others, that men are over-represented in the latter stages of CKD 24 , with our model predicting a slower progression of kidney disease for women in the unmeasured urine albumin and normoalbuminuria cohorts. The fact that women are over-represented at CKD stage 3 may be due to the fact that women tend to live longer than men.
We estimated the probability of misclassification conditioning on true eGFR stage. A consistent pattern is seen across the different baseline urine albumin levels and by eGFR stage. Our model suggests that on average, change in underlying kidney function is slow with mean sojourn times in stage G3a and G3b being between 15 and 25 years for patients without elevated urine albumin. Given the slow rate of change and the high chance that observed eGFR misclassifies the true eGFR stage, frequent testing of eGFR in these populations will inevitably lead to the detection of more spurious change than real change.
We assessed whether our models of kidney disease progression would be improved by adjusting for clinical characteristics that were a priori considered to be associated with increased risk, and therefore, faster progression. Our analysis did not support the notion that diabetes, hypertension, peripheral vascular disease, ischaemic heart disease, stroke or transient ischaemic attack are independently associated with deterioration of kidney function once albuminuria stage and updated eGFR are accounted for. We conclude that conditioning on eGFR stage and urine albumin levels, knowledge of diabetes status is less important, but we cannot rule out that our study may be under-powered to detect small but real effects on transition rates.
A major strength of this study is that we have taken a very large and unselected sample of patients from a database that has been shown to be representative of the wider UK primary care population 7 . Inclusion into the study was conditional upon having three or more serum creatinine measurements, but creatinine is commonly measured in UK general practice and not necessarily for the purpose of monitoring kidney function or diagnosing kidney disease. Our model for progression takes into account multiple stages of kidney function and the competing risk of death from any cause. We have also employed a method that
Table 2. Probability (%) that any eGFR test is under-graded and/or over-graded, by albuminuria stage. 95% confidence intervals shown in brackets.
Our study has a number of limitations. Our data were not collected for the purpose of conducting a study about modelling progression of kidney function. As a consequence, we do not know the reasons tests were conducted, and for many patients, records were incomplete and examination times were irregular. The extent to which this could bias our findings is unclear as it depends on our understanding of the examination scheme used by the doctors. We recognise three potential mechanisms for these tests to occur in a primary care setting. A significant number of creatinine tests will be 'random' with respect to the kidney function, because they would have been ordered as part of a routine check-up and not specifically to monitor or diagnose kidney disease. This could be a result of the co-reporting of serum creatinine as part of 'test batches' in which other biomarkers would have been of primary interest, or because serum creatinine may have been requested prior to the initiation of a potentially nephrotoxic drug. For some patients, the timing of the next measurement will have been influenced by the current kidney function level. This is likely to have happened if the purpose of the test is to monitor CKD and current clinical guidelines are followed 1 . This mechanism has been referred to as 'doctor's care' in the literature. The third scenario is when a patient initiates the timing of their test themselves, so-called 'patient self-selection'. Of the three scenarios, we consider the self-selection scenario possible but less likely than the other schemes due to the asymptomatic nature of kidney function loss in all but the end stages of the disease. Grüger et al. 26 showed that estimated transition rates are biased only under the 'patient self-selection' examination scheme, and that transition rate estimates are unbiased, if inefficient, under the 'doctor's care' scheme. In the case of random timing, the estimates are both efficient and unbiased.
We could be criticised for using an approach that categorises kidney function rather than a method that models continuous eGFR, such as generalised linear mixed models. This is because categorisation can lead to loss of information and reduced statistical power. However, such information loss is typically small when the number of categories is large 27 , as was the case in our study. Furthermore, the use of categories that naturally aligned with clinically meaningful eGFR stages added to the interpretability of our findings. In addition, HMMs assume that all individuals within a state are interchangeable, and that the chance of progression to subsequent states depends only on the current state. This assumption may not hold if a patient whose kidney function has previously rapidly declined continues along this trajectory. To mitigate this, we have included updated risk factor information in our model and stratified by baseline albuminuria status. Further research could include assessing the impact of the Markov assumption on predicting kidney function decline using HMMs.
We attempted to include a state to represent transient and acute loss of kidney function (acute kidney injury) as this is a contributing factor to CKD, but the addition of this non-absorbing state with pathways back to each state resulted in over-parametrisation of the model. Furthermore, data on urine albumin, body mass index and ethnicity are missing for a large number of patients in CPRD. To overcome this, we created a sub-group of patients in whom urine albumin was not recorded. The omission of ethnicity in this model is a limitation as kidney function decline is considered to differ between ethnic groups. We were not able to adjust our models for ethnicity, as historically, ethnicity has been poorly recorded in CPRD.
It is likely that once a patient's kidney function has been observed in stage 4 or 5, they are referred to specialist care, with subsequent kidney function testing occurring outside the CPRD database. Hence, these patients' records are missing from our study, which potentially explains why transition rates slow down rather than increase, as might be expected. Our study design means that we would also miss patients with ESRD who had not engaged in primary care. In a study of electronic health records data from Pennsylvania, a similar model was fitted to eGFR records, and the authors reported that transition probabilities between kidney function stages generally increased as stage increased for all but stage 3 25 . Even so, our model calibrates well with reports of progression to ESRD from different stages. For example, Tangri et al. 28 reported that three of 2,014 people with CKD stage 3 at baseline progressed to ESRD after three years of follow-up. Assuming this population contained an equal proportion of people with CKD stage 3a and 3b, then our model, based on the unmeasured urine albumin cohort, would predict that just one person would reach stage 5 after three years. Using the model for patients with normoalbuminuria, it would be three people. From the same study, 22 of 826 people progressed from stage 4 at baseline to kidney failure after three years. Our models predict 25 people with unmeasured urine albumin and 46 people with normoalbuminuria would reach stage 5. In a study reporting on sex differences in CKD progression, the rate of ESRD per 100 person-years was 3.1 in women and 3.8 in men. Based on our model for patients with normoalbuminuria our equivalent estimates are 1.9 and 2.3, but 2.07 and 2.13 for patients with microalbuminuria and 3.0 and 3.2 for patients with macroalbuminuria. Our study shows that kidney function deteriorates slowly in most patients with average sojourn times in decades rather than years.
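The calibration comparison against Tangri et al. amounts to multiplying a cohort size by a stage-weighted progression probability. The sketch below shows that arithmetic under the equal 3a/3b split assumed in the text; the three-year progression probabilities in the example are placeholders chosen for illustration and are not the model's estimates. The function name is hypothetical.

```python
def expected_progressors(n_people, stage_shares, progression_probs):
    """Expected number reaching the end state, weighting each baseline stage by its share."""
    if abs(sum(stage_shares.values()) - 1.0) > 1e-9:
        raise ValueError("stage shares must sum to 1")
    weighted = sum(stage_shares[s] * progression_probs[s] for s in stage_shares)
    return n_people * weighted


if __name__ == "__main__":
    shares = {"G3a": 0.5, "G3b": 0.5}            # equal split, as assumed in the text
    placeholder = {"G3a": 0.0002, "G3b": 0.0010}  # hypothetical 3-year probabilities
    predicted = expected_progressors(2014, shares, placeholder)
    observed_rate = 3 / 2014                      # reported: 3 of 2,014 progressed
    print(f"predicted: {predicted:.1f} people; observed rate: {observed_rate:.4f}")
```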
Whilst eGFR is widely used to measure kidney function we estimate that the potential for misclassification is large and clinically relevant, with implications for monitoring for rapid kidney function loss or pharmacovigilance. For example, of 1,741 people with CKD stage 3 recruited for a study from 32 primary care practices in the UK 29 , 496 were in remission at baseline (although qualifying at the recruitment stage) and of these, 157 were back to CKD stage 3 at one year, with a further 132 returning to stage 3 CKD by five years. This type of pattern is consistent with our model, in which underlying kidney function only deteriorates but is observed with error. If our model is correct, then it is easy to see how periodic monitoring of CKD could cause confusion and might lead to inappropriate action. The assumption that true underlying kidney function only deteriorates with age is a fundamental part of the model and further research could investigate alternative models for underlying kidney progression and their impact on monitoring recommendations.

Conclusions
We have developed a model to predict decline in kidney function and used it to assess different monitoring strategies and screening programmes. The model takes into account stage progression and test error, which were recently identified as important for future economic evaluations of CKD testing 30 . Future work in this field could look to validate this model in another primary care population, ideally one in which patients are followed throughout including stages 4 and 5.

Underlying data
The data used in this study are not publicly available and were obtained under licence. This project contains the following extended data: • Extended Data.pdf (document containing Tables S1-20 and Figures

Open Peer Review
The authors' rationale for their use of their Hidden Markov Model is that it permits adjustment for the well-documented degree of CKD stage misclassification that is revealed when estimates of eGFR are compared with directly measured GFR (itself measured with significant "error" when evaluated by replicate measurements). But their Hidden Markov Model also makes assumptions that may well not hold in reality. There are two specific issues with the authors' Markov assumptions: a. that eGFR once "lost" cannot be recovered, and b: that the history of prior changes in renal function is not informative once a Markov transition has occurred. This model seems to be based on an assumption that reduction of eGFR is mostly associated with irreversible changes in renal structure or physiology. This assumption is particularly risky in subjects with relatively well preserved renal function and absent or trace proteinuria --demonstrably the large majority of subjects in the authors' cohort and evidently the primary focus of their broader project. In these subjects, the likelihood of a transient reduction of eGFR due to any episode leading to dehydration may well be greater than of an event causing an irreversible loss of function (conflicting with the assumption that "true" [hidden] eGFR cannot increase following a drop in eGFR). Further, subjects with prior episodes of AKI are known to have greater subsequent chance of progressing to ESRD, and subjects receiving treatment with any of the three classes of treatment shown to be effective in delaying progression (renin/angiotensin/aldosterone inhibitors, reduction of systolic blood pressure, and more recently SGLT2 inhibitors) are all associated with immediate reductions of eGFR that are associated with later slower risk of subsequent progression. These facts conflict with the assumption that past changes in eGFR leading to a Markov stage transition are not informative of the chance of subsequent loss of eGFR. 
I suggested using an alternative model --joint modeling --that deals similarly with the issue of errors in measurement of proposed predictors by estimating a true but "hidden" path of eGFR over time from the observed values and uses estimates from that model at relevant time points as the estimated "true" current value for the joint survival model. This technique permits use of all the continuous primary data without collapsing it into categories, makes no assumption excluding increases of eGFR, and permits many options for assumptions about the impact of earlier values. Additionally, this approach would permit direct evaluation of the concordance and the rate of classification error of the relevant primary outcomes --ESRD or death. In my Version 1 report, I noted that "Given the huge amount of information available for this analysis, the inefficiency of the hidden Markov model may be overcome in the analysis of the overall community estimates of rate of progression." This seems to be the case in the paper cited in their response to my critique, and I agree that their analytic method is useful for evaluation of choice of overall or group policy. I suspect, though, that alternative analytic methods permitting use of all of the continuous data might be more powerful in refining predictions, and therefore permitting individualized recommendations or guidelines applicable to specific patients.
In their response to my Version 1 report, the authors have included new text recognizing the limitations of the Markov approach and acknowledging that "Further research could include assessing the impact of the Markov assumption on predicting kidney function decline using HMMs." Given this addition, I now approve this manuscript.

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Nephrology, Epidemiology, Statistics, Clinical trial design, performance, and analysis.
The authors of this manuscript report on the long-term rate of progression of chronic kidney disease (CKD) in a very large cohort of UK general practice patients followed as part of the Clinical Practice Research Datalink in the time between 1 April 2005 and 31 March 2014. Progression of kidney disease over each patient's course was assessed as the rate of progression from one stage to the next across the six currently standard KDOQI/KDIGO "stages" of renal insufficiency (Stages 1, 2, 3a, 3b, 4, and 5), which are defined by ranges of estimated glomerular filtration rate (eGFR). The rate of progression from one stage to the next was modeled as a Hidden Markov Model, which assumes that subjects are at "true" eGFR stages at each evaluation point, but that these stages may be misclassified based on random error of the eGFR estimate, so that the "observed" transitions may be misidentified. The overall conclusions of the authors are: a) that progression of renal disease to renal failure occurs quite slowly in general, with only a small fraction of patients that progress to kidney failure, and b) that progression of kidney disease as evidenced by [exclusive] use of progression in the KDOQI/KDIGO Stage of CKD, though generally reflective of the average course of kidney disease in the community, is associated with significant error when applied to an individual. Though the authors don't say this, the implication is that use of their analysis may not be very helpful in establishing guidelines as to in whom, and at what repeated intervals, eGFR determinations would be cost effective.
The major strength of this study is the almost unique test bed offered by access to the data of almost 2,000,000 UK qualified general practice patients from Clinical Practice Research Datalink with "up-to-standard" data collection. Given the choice of analytic method, the authors' analysis is technically entirely competent. The assumptions and methods are well described. The results are clearly explained and put in appropriate context in the Discussion Section. The major limitation of the available data is that patients with the more advanced Stages of kidney disease are likely to have been referred to specialist providers, so that data on the more advanced stages of kidney disease, where progression is generally both more rapid and more clinically important, are more limited and more susceptible to bias. The authors clearly acknowledge this. Further, the issue addressed by this manuscript is potentially important, at least from a health economics point of view. Current policy recommendations uniformly encourage periodic assessment of renal function "in people with, or at risk of, chronic kidney disease (CKD)." But as noted above, in general kidney disease progresses quite slowly and only a small minority of subjects even with documented kidney disease ever progress to chronic renal failure. There is almost no empirical data upon which to make any data-based recommendation about in whom or how often the "periodic assessments" should be done.
But the choice of analytic method is subject to criticism. The continuous variable eGFR may range from over 100 mL/min/1.73 m² to close to 0. The reduction of this range of values to a classification with 6 levels is associated with massive information loss. Markov models deal with chances of transition from one state to another, and all members of one state are considered to be interchangeable, so that a patient with an eGFR of 100 (at Stage 1/2 with the combination of these two stages) is treated as equivalent to one with an eGFR of 62. But surely the chance of progression from an eGFR of 100 to an eGFR < 60 (Stage 3a) is much lower than the chance of progression from a value of 62 to < 60. Similarly, it is intuitively likely (though not at all certain) that a subject with a history of rapid loss of eGFR in the past will have more rapid progression in the future. But with a Markov model, all history is lost, and only the current stage, not the prior history, is taken into account. Given the huge amount of information available for this analysis, the inefficiency of the hidden Markov model may be overcome in the analysis of the overall community estimates of rate of progression. But these results will be of almost no use in prediction of the progression of kidney disease in the individual patient. And it is the individual patient's progress, not the average rate of progression in the community, that we are trying to clarify in recommending individual eGFR testing and setting repeat testing intervals.
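The information loss described above can be illustrated with a short sketch of the staging step. The cut-points are the standard KDIGO GFR category thresholds; the function name and the `combine_g1_g2` flag are hypothetical, introduced here only to mirror the combined Stage 1/2 mentioned in the text.

```python
def egfr_stage(egfr, combine_g1_g2=True):
    """Map an eGFR value (mL/min/1.73 m^2) to its KDIGO GFR category.

    combine_g1_g2 is a hypothetical convenience flag that merges G1 and G2
    into a single "G1/2" category.
    """
    if egfr >= 60:
        if combine_g1_g2:
            return "G1/2"
        return "G1" if egfr >= 90 else "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


# Two clinically different patients collapse into the same state ...
print(egfr_stage(100), egfr_stage(62))
# ... while a small difference across a threshold separates two similar ones.
print(egfr_stage(62), egfr_stage(59))
```

A stage-based Markov model sees only the returned category, so the 38-unit difference between the first two patients is invisible to it, whereas the 3-unit difference between the second pair counts as a transition.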
An alternative approach would have been to use a hierarchical ("mixed") analytic method (or perhaps even better, joint analysis combining the hierarchical analysis with a time to event analysis for ESRD and death) to evaluate the overall community rate of progression while estimating the distribution of individual rates of progression. Mixed models deal automatically with the problem that eGFR is measured with error (whether due to variability of the underlying "true" GFR, error due to the misestimation of GFR by eGFR --up to 35% CKD Stage misclassification using the CKD-Epi formula: Levey et al. (2009) 1 , error due to the underlying determination of serum creatinine (Scr) which is increased at low levels of Scr because of its inverse relationship with eGFR, or variability in timing of the underlying measurements of serum creatinine). Mixed methods also permit specification of alternative models of change over time, permitting exploration of the hypotheses that progression is not linear. They would also facilitate development of time dependent models, allowing input of changes to patient's diabetes, hypertension, and cancer status over time, rather than limiting those variables to the baseline analysis. Most important, this sort of analysis might permit identification of the subset of individuals for whom the risk of progression is the highest --that might be most benefited by careful follow-up of eGFR, and might give better guidance as to cost-effective testing intervals.
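For concreteness, the simplest form of the hierarchical approach referred to above is a linear random-intercept, random-slope model. The notation below is an illustrative assumption rather than a specification taken from the manuscript or from this review, and the linear time trend is only a starting point, since non-linear alternatives are explicitly contemplated above.

```latex
\begin{aligned}
\mathrm{eGFR}_{ij} &= (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \varepsilon_{ij}, \\
(b_{0i},\, b_{1i})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \Sigma), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0,\, \sigma^{2}).
\end{aligned}
```

Here $i$ indexes patients and $j$ indexes measurements taken at times $t_{ij}$. The fixed slope $\beta_1$ is the community average rate of change, $b_{1i}$ is the deviation of patient $i$ from that average, and $\varepsilon_{ij}$ absorbs measurement error in eGFR. The distribution of individual slopes is summarised by $\Sigma$.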
The authors' choice of analytic methods in this analysis should be put into perspective. The authors were evidently motivated to undertake this analysis so that cost-effectiveness studies could be done specifically in terms of the KDOQI/KDIGO CKD staging mechanism --presumably to make it easier to develop easily followed guidelines for testing frequency. (See the authors' comment and reference 29 in their final Conclusion.) If the authors felt constrained by these requirements, they have performed a reasonable, if inefficient, analysis. But they and the authors of reference 29 might have done better to request and perform a more efficient analysis using the underlying eGFR data rather than the derived CKD Stages to develop their model, and then abstracted stage specific recommendations from the more efficient model, rather than the other way around.
In sum, the analysis as performed using the specified analytic model was competently done, and the overall conclusions are justified: that the rate of progression of renal insufficiency is slow on average, and that dependence on this model would be associated with potentially significant error. But these limited conclusions don't exclude the possibility that more clinically reliable and useful data might be obtained by use of a method that didn't throw away so much of the available data.
Reviewer:
The authors of this manuscript report on the long-term rate of progression of chronic kidney disease (CKD) in a very large cohort of UK general practice patients followed as part of the Clinical Practice Research Datalink in the time between 1 April 2005 and 31 March 2014. Progression of kidney disease over each patient's course was assessed as the rate of progression from one stage to the next across the six currently standard KDOQI/KDIGO "stages" of renal insufficiency (Stages 1, 2, 3a, 3b, 4, and 5), which are defined by ranges of estimated glomerular filtration rate (eGFR). The rate of progression from one stage to the next was modeled as a Hidden Markov Model, which assumes that subjects are at "true" eGFR stages at each evaluation point, but that these stages may be misclassified based on random error of the eGFR estimate, so that the "observed" transitions may be misidentified. The overall conclusions of the authors are: a) that progression of renal disease to renal failure occurs quite slowly in general, with only a small fraction of patients that progress to kidney failure, and b) that progression of kidney disease as evidenced by [exclusive] use of progression in the KDOQI/KDIGO Stage of CKD, though generally reflective of the average course of kidney disease in the community, is associated with significant error when applied to an individual. Though the authors don't say this, the implication is that use of their analysis may not be very helpful in establishing guidelines as to in whom, and at what repeated intervals, eGFR determinations would be cost effective.
Authors: We disagree. We believe our model (unlike others that assume the observed eGFR is equal to true eGFR) permits analysis that correctly accounts for the misclassification that commonly occurs when monitoring kidney function using estimated GFR. This is important because when a person is misclassified into a higher stage (false positive), treatment may incur a cost but provide less benefit than expected; in the opposite scenario, misclassification into a lower stage (false negative) would not incur costs from direct treatment but would leave the patient at higher risk for longer. Attempts to model the cost-effectiveness of monitoring without taking this into account are potentially misleading. The work presented in this manuscript formed part of a wider research project, which is intended to assess the cost effectiveness of kidney function monitoring. The health economics study using the outputs from this paper can be found at https://doi.org/10.1371/journal.pmed.1003478. The Health and Technology Assessment programme this work feeds into can be found at https://www.journalslibrary.nihr.ac.uk/programmes/pgfar/RP-PG-1210-12003.
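The size of the misclassification problem near a stage boundary can be sketched with a simple calculation. This is an illustration only: it assumes that the observed eGFR is normally distributed around the true value with a standard deviation of 5 mL/min/1.73 m², a figure chosen for demonstration and not taken from the manuscript.

```python
from statistics import NormalDist


def prob_observed_below(true_egfr, threshold=60.0, error_sd=5.0):
    """Probability that a single observed eGFR falls below `threshold`.

    Assumes observed = true + Normal(0, error_sd) noise; error_sd is an
    illustrative assumption, not an estimate from the study.
    """
    return NormalDist(mu=true_egfr, sigma=error_sd).cdf(threshold)


p_62 = prob_observed_below(62.0)  # true value just above the G3a boundary
p_80 = prob_observed_below(80.0)  # true value well inside G2

print(f"true eGFR 62: P(observed < 60) = {p_62:.3f}")
print(f"true eGFR 80: P(observed < 60) = {p_80:.6f}")
```

Under these assumptions, a patient whose true eGFR is 62 would be observed in Stage 3a on roughly one test in three, while a patient at 80 would almost never be, which shows how strongly the misclassification risk depends on distance from the threshold.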

Reviewer:
The major strength of this study is the almost unique test bed offered by access to the data of almost 2,000,000 UK qualified general practice patients from Clinical Practice Research Datalink with "up-to-standard" data collection. Given the choice of analytic method, the authors' analysis is technically entirely competent. The assumptions and methods are well described. The results are clearly explained and put in appropriate context in the Discussion Section. The major limitation of the available data is that patients with the more advanced Stages of kidney disease are likely to have been referred to specialist providers, so that data on the more advanced stages of kidney disease, where progression is generally both more rapid and more clinically important, are more limited and more susceptible to bias. The authors clearly acknowledge this. Further, the issue addressed by this manuscript is potentially important, at least from a health economics point of view. Current policy recommendations uniformly encourage periodic assessment of renal function "in people with, or at risk of, chronic kidney disease (CKD)." But as noted above, in general kidney disease progresses quite slowly and only a small minority of subjects even with documented kidney disease ever progress to chronic renal failure. There is almost no empirical data upon which to make any data-based recommendation about in whom or how often the "periodic assessments" should be done.

Authors:
We would like to thank the Reviewer for the very positive comments, and we agree that patients with more advanced stages of kidney disease might have been referred to specialist care. We would like to emphasise that the main aim of this study was to estimate the rate of decline in kidney function before patients enter end-stage renal disease requiring specialist care. Therefore, there is a plethora of data on which to make data-based recommendations regarding the frequency of periodic assessments. Tables S17-20 show that where such recommendations have been made, these have 'only' been for patients in eGFR stages G3a and G3b. Patients in these stages are still monitored within primary care, as opposed to specialist care. Moreover, as general practices are financially incentivised by the Quality and Outcomes Framework (QOF) to monitor such patients, there is no shortage of data for these patients as demonstrated in Table 1. Furthermore, it's also worth noting that Table 1 Table 1 a poor reflection of the general quality of the testing data.
Reviewer: But the choice of analytic method is subject to criticism. The continuous variable eGFR may range from over 100 mL/min/1.73 m² to close to 0. The reduction of this range of values to a classification with 6 levels is associated with massive information loss. Markov models deal with chances of transition from one state to another, and all members of one state are considered to be interchangeable, so that a patient with an eGFR of 100 (at Stage 1/2 with the combination of these two stages) is treated as equivalent to one with an eGFR of 62. But surely the chance of progression from an eGFR of 100 to an eGFR < 60 (Stage 3a) is much lower than the chance of progression from a value of 62 to < 60. Similarly, it is intuitively likely (though not at all certain) that a subject with a history of rapid loss of eGFR in the past will have more rapid progression in the future. But with a Markov model, all history is lost, and only the current stage, not the prior history, is taken into account.
Given the huge amount of information available for this analysis, the inefficiency of the hidden Markov model may be overcome in the analysis of the overall community estimates of rate of progression. But these results will be of almost no use in prediction of the progression of kidney disease in the individual patient. And it is the individual patient's progress, not the average rate of progression in the community, that we are trying to clarify in recommending individual eGFR testing and setting repeat testing intervals.

Authors:
The Reviewer is correct in stating that the Markov assumption means that each subject's transition to a more severe eGFR stage depends only upon their current stage. We agree that this assumption may be an oversimplification of reality, as it is likely that an individual who has progressed rapidly through previous stages will continue to do so, and vice versa. As for the Reviewer's point about the likelihood of progression being different if eGFR was 100 vs 62 ml/min/1.73m², we do not disagree that progression to the next stage would differ, but we would emphasise the problem of knowing what the true eGFR is. For example, for someone whose true eGFR is 80 ml/min/1.73m², the observed eGFR could be anywhere between 70 and 90 ml/min/1.73m². So, by categorising eGFR into six levels there is some loss of information, but we gain a more robust inference. We strongly disagree that these models will be of almost no use. Our models provide average progression times conditional on eGFR stage, presence of albuminuria, and other co-morbidities, e.g., heart failure. The same problem would exist for mixed effect models in as much as you can only provide aggregated rates of decline. We have added the following paragraph to the Discussion section: "We could be criticised for using an approach that categorises kidney function rather than a method that models continuous eGFR, such as generalised linear mixed models. This is because categorisation can lead to loss of information and reduced statistical power.

Reviewer:
An alternative approach would have been to use a hierarchical ("mixed") analytic method (or perhaps even better, joint analysis combining the hierarchical analysis with a time to event analysis for ESRD and death) to evaluate the overall community rate of progression while estimating the distribution of individual rates of progression.
Mixed models deal automatically with the problem that eGFR is measured with error (whether due to variability of the underlying "true" GFR, error due to the misestimation of GFR by eGFR -up to 35% CKD Stage misclassification using the CKD-Epi formula: Levey et al. (2009) 1 , error due to the underlying determination of serum creatinine (Scr) which is increased at low levels of Scr because of its inverse relationship with eGFR, or variability in timing of the underlying measurements of serum creatinine). Mixed methods also permit specification of alternative models of change over time, permitting exploration of the hypotheses that progression is not linear. They would also facilitate development of time dependent models, allowing input of changes to patient's diabetes, hypertension, and cancer status over time, rather than limiting those variables to the baseline analysis. Most important, this sort of analysis might permit identification of the subset of individuals for whom the risk of progression is the highest --that might be most benefited by careful follow-up of eGFR and might give better guidance as to cost-effective testing intervals.

Authors:
The Reviewer is correct: there are alternative statistical models to the approach we took in this paper. Our initial intention was to model eGFR (or log eGFR) using mixed effect models, but we decided against it for two reasons. First, the trajectories of eGFR are not linear or necessarily even monotonic in time. Second, more flexible modelling is limited by the fact that many people had only a few eGFR measurements in their health record. The models that we did fit did not meet the assumptions of normality implicit in the random effects approach and this, we believe, led to poor predictive performance of these models. It is possible that these models could be fitted in an alternative data set, but this was not our experience using routinely collected GP data. We would like to stress that we used updated covariates in all the HMMs used in this study and did not condition on only the baseline values of these.

Reviewer:
The authors' choice of analytic methods in this analysis should be put into perspective. The authors were evidently motivated to undertake this analysis so that cost-effectiveness studies could be done specifically in terms of the KDOQI/KDIGO CKD staging mechanism, presumably to make it easier to develop easily followed guidelines for testing frequency. (See the authors' comment and reference 29 in their final Conclusion.) If the authors felt constrained by these requirements, they have performed a reasonable, if inefficient, analysis. But they and the authors of reference 29 might have done better to request and perform a more efficient analysis using the underlying eGFR data rather than the derived CKD Stages to develop their model, and then abstracted stage specific recommendations from the more efficient model, rather than the other way around.

Authors:
The Reviewer is not entirely correct. This model was one part of a larger funded project which included a cost-effectiveness study using the outputs from this model, but it is not true that we felt or were constrained by this. Our initial plan was to model eGFR as a continuous measure, but we did not do so for the reasons explained in the previous response. We feel the objective of our work is quite the opposite of the Reviewer's, and we would prioritise models that fit clinical practice over and above statistical efficiency.

Reviewer:
In sum, the analysis as performed using the specified analytic model was competently done, and the overall conclusions are justified: that the rate of progression of renal insufficiency is slow on average, and that dependence on this model would be associated with potentially significant error. But these limited conclusions don't exclude the possibility that more clinically reliable and useful data might be obtained by use of a method that didn't throw away so much of the available data.

Authors:
We disagree with the assertion made by the reviewer that our method "threw so much of the data away". We agree that categorisation can result in loss of information, but this is most likely to occur when continuous measures are split in two (dichotomisation) and applied to small data sets. The thresholds for categorisation should also not be defined post-hoc. None of these applies in our case.

The authors used CPRD records to inform a multistate model on progression of kidney disease. They appropriately considered that there could be measurement error and allowed for this to some extent.
They used outcome variables that have not been validated for incident analyses in CPRD (dialysis/transplantation).
The underlying model for 'true' but unobserved kidney disease assumed that kidney disease could only get worse over time but not better. This is not supported by actual data, which suggest that transient decreases can improve over time and need not be permanent (though such changes may be a risk marker for later decline). Here the authors should discuss further the chronicity assumption within the CKD definition and how they parametrised this in their data.
Then there was a model for measurement error, but this seems to be a static model, i.e. assuming that measurement error does not vary over time. Did I understand this correctly? During the study period, the way creatinine is measured and reported in the UK changed dramatically. This change in reporting of calibrated creatinines in effect hides progression over time when crude analyses are used, as different labs shifted to calibration and reporting of creatinine to IDMS at different points in time, thereby shifting the entire creatinine distribution down by 5% over the years prior to 2014 (i.e. hiding a decrease in kidney function over time).
Did the authors recalculate eGFR from the creatinines (which then requires some thought about time-dependent measurement error, which varies by lab and time), or use reported eGFRs (which then means they lost a lot of measurements based on thresholds with informative missingness, as eGFR does not get reported uniformly: some labs only report values in a range of >15 and <60, others report up to 90 ml/min/1.73m²)? Depending on whether the authors used creatinine or eGFR, they should discuss further biases in their design and explicitly allow for such biases in their model.
Then there is the overall study design, which yields a somewhat odd cohort, as inclusion depends on having three or more eGFR (or creatinine?) tests, and that is not representative of the general population. The survey and CPRD validated prevalence of reduced eGFR in the UK population is about 6%, and here the numbers are much higher (3-7 fold depending on albuminuria category) simply because these represent an enriched sample: GPs had a reason to test not just more than once but three times. Does this enriched sample really represent 'a model for kidney disease progression'? The authors should discuss this and whether people who should have been tested but weren't tested may be a high risk group. The sample here represents a group of patients who engage with the health service, but the people who truly progress may not be fully captured. This needs to be discussed more.
There were real financial reasons for testing, e.g. annual testing for diabetes (started in 2004), and less so for other illnesses, but risk factors for CKD determine testing rates and especially repeat testing as reported by the UK National CKD Audit, and there is some understanding from this audit how testing schemes are carried out. So I would disagree with "A major strength of this study is that we have taken a very large and unselected sample of patients from a database that has been shown to be representative of the wider UK population" as stated in the discussion: this is a selected sample and not representative of what happens overall in terms of kidney function decline, as not all are tested the same way. I would have stratified by underlying comorbidity and not simply adjusted for it.
I totally agree with the authors about the selective loss to follow-up with loss of people who are managed by other specialities including renal in secondary care.
Overall this is an interesting analysis, but more work is needed to convince me that this model should be used for economic modelling.

If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? No

Are the conclusions drawn adequately supported by the results? Partly
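Competing Interests: No competing interests were disclosed.

Reviewer:
Then there was a model for measurement error, but this seems to be a static model, i.e. assuming that measurement error does not vary over time. Did I understand this correctly? During the study period, the way creatinine is measured and reported in the UK changed dramatically. This change in reporting of calibrated creatinines in effect hides progression over time when crude analyses are used, as different labs shifted to calibration and reporting of creatinine to IDMS at different points in time, thereby shifting the entire creatinine distribution down by 5% over the years prior to 2014 (i.e. hiding a decrease in kidney function over time).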
Authors: This is a valid point, relating to a problem we have encountered in routine health record data sets. However, we feel that this issue is likely to have minimal impact on our study because, apart from the last 3.25 years, our data pre-date the use of IDMS, as the changes took place in 2010. Furthermore, we conducted this study in an open cohort (CPRD-Vision) that has been decreasing in size (https://doi.org/10.1136/bmjopen-2017-020738), further reducing the pool of potentially affected tests.
Reviewer: Did the authors recalculate eGFR from the creatinines (which then requires some thought about time-dependent measurement error, which varies by lab and time), or use reported eGFRs (which then means they lost a lot of measurements based on thresholds with informative missingness, as eGFR does not get reported uniformly: some labs only report values in a range of >15 and <60, others report up to 90 ml/min/1.73m²)? Depending on whether the authors used creatinine or eGFR, they should discuss further biases in their design and explicitly allow for such biases in their model.

Authors:
In this study, we used all testing sources available to obtain eGFR, including pre-derived eGFR values. Unfortunately, the data source used provides no information on the techniques used to measure serum creatinine levels or to estimate eGFR. We also no longer have access to these data. Hence, we cannot say whether the calculation of a particular eGFR value was performed using laboratory methods or via an estimating equation in a GP practice. However, we estimated the overwhelming majority of eGFR values from raw serum creatinine values. Hence, any bias present from the use of pre-derived eGFR values is likely to be minimal. The method we used to calculate eGFR from serum creatinine was the CKD-EPI equation. We are aware that better methods for the calculation of eGFR exist; however, the one we used would have been the one predominantly in use in UK clinical practice throughout the study period.
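For readers unfamiliar with the calculation, the following sketch shows how eGFR can be derived from serum creatinine. It assumes the 2009 CKD-EPI creatinine equation (Levey et al., 2009); the response above names CKD-EPI but does not state the version, the creatinine units, or whether the race coefficient was applied, so those details, along with the function and parameter names, are assumptions made for illustration and this is not the authors' code.

```python
UMOL_PER_MG_DL = 88.4  # creatinine: 1 mg/dL = 88.4 umol/L


def ckd_epi_2009(scr_umol_l, age, female, black=False):
    """eGFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine equation.

    scr_umol_l: serum creatinine in umol/L (unit assumed; UK labs commonly
    report umol/L, so it is converted to mg/dL for the published formula).
    """
    scr_mg_dl = scr_umol_l / UMOL_PER_MG_DL
    kappa = 0.7 if female else 0.9          # sex-specific creatinine knot
    alpha = -0.329 if female else -0.411    # exponent applied below the knot
    ratio = scr_mg_dl / kappa
    egfr = (
        141.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.209
        * 0.993 ** age
    )
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Example: 60-year-old man, creatinine 100 umol/L versus the same value lowered by 5%.
baseline = ckd_epi_2009(100.0, 60, female=False)
shifted = ckd_epi_2009(95.0, 60, female=False)
print(f"eGFR at 100 umol/L: {baseline:.1f}")
print(f"eGFR at  95 umol/L: {shifted:.1f}")
```

The usage example also shows why a calibration-related shift in reported creatinine matters: lowering creatinine by a few percent raises the calculated eGFR, which can move a patient across a stage threshold without any change in true kidney function.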

Reviewer:
Then there is the overall study design, which yields a somewhat odd cohort, as inclusion depends on having three or more eGFR (or creatinine?) tests, and that is not representative of the general population. The survey and CPRD validated prevalence of reduced eGFR in the UK population is about 6%, and here the numbers are much higher (3-7 fold depending on albuminuria category) simply because these represent an enriched sample: GPs had a reason to test not just more than once but three times. Does this enriched sample really represent 'a model for kidney disease progression'? The authors should discuss this and whether people who should have been tested but weren't tested may be a high risk group. The sample here represents a group of patients who engage with the health service, but the people who truly progress may not be fully captured. This needs to be discussed more.

Authors:
The requirement of three serum creatinine tests was imposed to establish a reliable baseline eGFR status (requiring two tests) and to estimate change over time. We concede that the population present in this sample is not representative of a general UK population so much as a UK general practice population, and we have amended the wording of the fourth paragraph of the Discussion section of the manuscript to reflect this.