Keywords
Imaginary QALY, ordinal scores, impossible models
This article is included in the Research on Research, Policy & Culture gateway.
The value framework advocated by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) is quite clear: “Leaders in the field of economic evaluation in health care have long recommended that analysts seeking to inform resource allocation decisions approximate the value of interventions in terms of incremental cost per Quality Adjusted Life Year (QALY) gained”1. The application of this value framework is probably best exemplified in the reference case technology assessment guidelines put in place by groups such as the National Institute for Health and Care Excellence (NICE) in the UK, the Canadian Agency for Drugs and Technologies in Health (CADTH) and the Institute for Clinical and Economic Evaluation (ICER) in the US. In each case, pharmaceutical manufacturers and others (including ICER itself) are asked to make a case for comparative cost-effectiveness. This is done by constructing an imaginary (yet apparently believably ‘realistic’) simulation model extending, in the default case, for the lifetime of persons with a chronic disease. The costs and benefits of comparator interventions for the defined hypothetical population are then calculated. Benefits are expressed in terms of incremental cost-per-QALY claims. There is no intention that the resulting claims should meet the standards of normal science for credibility, evaluation and replication2. The model is not about the discovery of new facts; it is purely speculative. This is made clear in the latest version of the Canadian guidelines, which state: “Economic evaluations are designed to inform decisions. As such they are distinct from conventional research activities, which are designed to test hypotheses”3. By rejecting the construction of empirically verifiable theories and hypotheses, the imaginary simulated worlds of economic evaluations fail the demarcation test; they are pseudoscience, not science4.
There is no ‘gold standard’ measure that can be used to generate QALYs. Several generic multiattribute instruments have been developed for this purpose. These differ considerably and produce markedly dissimilar scores for the same health states. The most used measures are the EQ-5D-3L and EQ-5D-5L, the HUI Mk2 and Mk3 and the SF-6D. These are designed to generate utility or value metrics on a scale from 0 = death to 1 = perfect health. Unfortunately, in the case of the EQ-5D-3L, the most widely used instrument, the algorithms applied to create utility scores can generate negative utility. The same argument, the production of negative utilities, applies to the other instruments. With the EQ-5D-3L utilities are allowed to range from −0.59 to 1.0. The negative utilities generated are considered to indicate states ‘worse than death’. The zero value in each measure is arbitrary, and it is not clear whether a utility of zero or lower makes any sense. The utility value is then applied to the simulated time spent in various hypothetical disease states over the course of a disease and a value adjusted time spent measure created: the QALY. QALYs are then aggregated (and discounted) over the simulated course of the disease to generate lifetime QALYs. Given estimated lifetime costs, the analyst can then produce lifetime cost-per-QALY, and eventually a simulated incremental cost-per-QALY claim.
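The arithmetic described above can be sketched in a few lines. This is only an illustration of the reference case calculation, not an endorsement of it: the function names, the annual cycle length, the 3% default discount rate and all utilities and costs are hypothetical assumptions, not values taken from any guideline or instrument.

```python
def lifetime_qalys(annual_utilities, discount_rate=0.03):
    """Sum one-year QALYs (utility x 1 year), discounting year t by (1 + r)**t.

    The annual cycle and the default rate are illustrative assumptions.
    """
    return sum(u / (1 + discount_rate) ** t for t, u in enumerate(annual_utilities))


def incremental_cost_per_qaly(cost_new, qaly_new, cost_old, qaly_old):
    """Incremental cost divided by incremental QALYs (undefined if the gain is zero)."""
    delta_qaly = qaly_new - qaly_old
    if delta_qaly == 0:
        raise ZeroDivisionError("no incremental QALYs")
    return (cost_new - cost_old) / delta_qaly


# Hypothetical three-year simulated utilities for comparator and new therapy.
qaly_old = lifetime_qalys([0.60, 0.50, 0.40], discount_rate=0.0)
qaly_new = lifetime_qalys([0.70, 0.65, 0.65], discount_rate=0.0)

# Hypothetical simulated lifetime costs.
claim = incremental_cost_per_qaly(50_000, qaly_new, 20_000, qaly_old)
print(round(qaly_old, 2), round(qaly_new, 2), round(claim))
```

Every step multiplies, adds or divides the utility score, so the calculation is only meaningful if the score supports those operations.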
For the utility value to support these operations it has to meet the axioms of fundamental measurement5. Four main types of measurement scale are recognized: nominal, ordinal, interval and ratio. Each satisfies one or more of the properties of: (i) identity, where each value has a unique meaning; (ii) magnitude, where each value has an ordered relationship to other values; (iii) interval, where the distances between scale units are equal to one another; and (iv) ratio, where there is a ‘true zero’ below which no value exists. Nominal scales are purely descriptive and have no inherent numerical value in terms of magnitude. Ordinal scales have both identity and magnitude in an ordered relation, but the distance between the ranks can differ considerably, so they support only medians and modes (e.g., EQ-5D scales). The interval scale has identity, magnitude and equal intervals. It supports the mathematical operations of addition and subtraction. A ratio scale satisfies all four properties, supporting the additional operations of multiplication and division.
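Why ordinal scores support medians but not means can be shown with a small sketch. The two groups and the recoding below are hypothetical; the only assumption is that any order-preserving relabeling of ordinal categories is as legitimate as the original coding.

```python
from statistics import mean, median

# Hypothetical ordinal responses coded 1..3 for two groups (illustrative only).
group_a = [1, 1, 3, 3, 3]
group_b = [2, 2, 2, 2, 2]

# An order-preserving recoding: ranks are unchanged, only the labels differ.
recode = {1: 1, 2: 5, 3: 6}
a_recoded = [recode[x] for x in group_a]
b_recoded = [recode[x] for x in group_b]

# The comparison of means reverses under the recoding ...
mean_a_higher_before = mean(group_a) > mean(group_b)
mean_a_higher_after = mean(a_recoded) > mean(b_recoded)

# ... while the comparison of medians is preserved.
median_a_higher_before = median(group_a) > median(group_b)
median_a_higher_after = median(a_recoded) > median(b_recoded)

print(mean_a_higher_before, mean_a_higher_after)
print(median_a_higher_before, median_a_higher_after)
```

A conclusion that depends on which of two equally valid codings is chosen is not a conclusion about the attribute being measured.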
The question that must be addressed for those supporting QALYs is whether the utility value has ratio measurement properties. If we consider the EQ-5D-3L, there is no evidence that it measures at an interval level, let alone that it has ratio measurement properties5. Quite the opposite: it can generate negative utilities and hence negative QALYs. Put simply, it does not have a true zero. As the EQ-5D-3L is based on symptoms defined by ordinal response levels, the resulting EQ-5D-3L score can only have ordinal properties, not ratio properties. The same argument applies to the other instruments. There is no evidence to suggest that the question of fundamental measurement was considered in the development of the EQ-5D-3L. The principal objective was a simple, functionally based capture of five symptoms with three ordinal response levels. Across any disease state, patients respond to the same five symptoms. Community preference weights are then applied and an algorithmic value is produced. The result is an ordinal score. Multiplying this score by time spent in a disease state is not a permissible mathematical operation.
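The role of the true zero can be illustrated with a sketch. Assume, purely for the sake of argument, that the score were at least interval-level; the two utility values and the re-anchoring function are hypothetical, and only the lower bound of −0.59 comes from the text.

```python
LOWEST = -0.59  # lowest EQ-5D-3L score cited in the text


def shift_origin(u, lowest=LOWEST):
    """Hypothetical re-anchoring that places zero at the lowest attainable score."""
    return u - lowest


u_a, u_b = 0.2, 0.4  # hypothetical utilities for two health states

# Ratio statements ("twice as good") depend on where zero is placed ...
ratio_original = u_b / u_a
ratio_shifted = shift_origin(u_b) / shift_origin(u_a)

# ... whereas differences survive the change of origin.
difference_original = u_b - u_a
difference_shifted = shift_origin(u_b) - shift_origin(u_a)

print(ratio_original, round(ratio_shifted, 3))
print(round(difference_original, 3), round(difference_shifted, 3))
```

Since the zero point is arbitrary, any statement that relies on the ratio of two scores, or on their product with time, changes with the choice of origin.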
Unless it can be demonstrated that the EQ-5D-3L (or any other value scale) has ratio properties for any target patient population, the concept of a generic utility QALY collapses; it defies measurement. The implications are interesting: the reference case incremental cost-per-QALY value framework is unintelligible, simulated QALY-based cost-effectiveness claims with willingness-to-pay thresholds are redundant, and some 30 years of advocating the construction of simulated imaginary worlds are irrelevant. Rather than seeking real-world evidence, we are locked into a paradigm for imaginary world evidence.
Can the QALY be rescued or, more to the point, do we want to put in the effort to rescue it? Certainly, it could be possible to start from scratch and develop a new measure from first principles, employing modern rather than classical test theory. This would mean applying Rasch measurement theory (RMT) and its implementation of conjoint simultaneous measurement (CSM). However, even with the application of RMT, we are unable to develop a scale with ratio properties unless there is a clear specification equation guiding its content6. At best we might develop a value set with interval properties, but this would preclude relating health status to time spent in a disease state (a multiplicative function) to create a QALY.
Do we need a QALY? Is there really a need to talk in terms of incremental cost-per-QALY claims? If we are concerned with quality of life and not the more narrowly defined health-related quality of life that characterizes almost all patient-reported outcome measures (PROMs), then we should consider disease-specific measurements. This is overdue, for we can say unequivocally that PROMs developed using classical test theory will not meet Rasch measurement standards. Quite simply, they were not designed to reflect an underlying latent construct with items selected to conform to Rasch measurement requirements. In some cases, it is possible, ex post facto, to ‘rescue’ an instrument through item assessment and possible removal of misfitting items7,8. A more positive approach would be to go back to first principles, as put forward by Rasch some 60 years ago, and meet the requirements of CSM in the development of instruments9.
A further obstacle to rescuing the QALY is the fact that the utility manifest score can take negative values. This has been shown for both the EQ-5D-3L and the EQ-5D-5L across many disease states10,11. In the former, the lowest possible manifest score, as noted above, is −0.59; in the latter the lowest score is −0.29. These negative scores, assuming we ignore the standards of fundamental measurement, lead to the intriguing possibility of negative QALYs. In other words, over a hypothetical lifetime, patients can conceivably hop into and out of negative QALY disease stages. With aggregate lifetime QALYs, the time spent in these positive and negative QALY states could cancel out. It is not clear how we would interpret this ordinal score construction of negative time, particularly where the lifetime summation of QALYs by disease stage is negative: cost per negative QALY?
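The cancellation can be made concrete with a sketch. The stages, durations and the utility of −0.20 are hypothetical; ±0.59 is used only because −0.59 is the EQ-5D-3L lower bound cited in the text, and discounting is omitted for simplicity.

```python
# (utility, years) for hypothetical disease stages over a five-year simulated course.
stages = [
    (0.59, 2.0),   # two years in a positively valued state
    (-0.59, 2.0),  # two years in a state scored 'worse than death'
    (-0.20, 1.0),  # one final year in a mildly negative state
]

# QALYs per stage are utility x time; lifetime QALYs are their sum.
qalys_by_stage = [utility * years for utility, years in stages]
lifetime_total = sum(qalys_by_stage)
years_lived = sum(years for _, years in stages)

print(qalys_by_stage, round(lifetime_total, 2), years_lived)
```

Under these assumptions five simulated years of life sum to a negative QALY total, so any cost divided by that total yields a cost per negative QALY.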
It is a puzzle why those developing PROMs that are focused on functional status and symptom response should ignore the interests of the patient and, often, caregivers. After all, there is no reason why a physician’s view of response to therapy should necessarily be concordant with that of the patient or caregiver. If quality of life has any meaning it should focus on the patient as the principal ‘beneficiary’ of therapy interventions. A patient-centric approach, where life maintains its quality if patient needs are fulfilled, is not a new concept. It was first proposed in the early 1990s and has been the driving force in disease-specific instrument development within the Rasch measurement framework12,13.
Measurement is critical for the advancement of science. The focus, as in the physical sciences, should be on the development of unidimensional indices rather than profiles. We need to focus on one attribute at a time (e.g., temperature14 or pain), not conflate several attributes into a meaningless single score. Despite this, fundamental measurement scales are rare in medicine. If measures are to advance beyond ordinal raw scores, they must meet the axioms of invariance and sufficiency15. Where the object to be measured is a latent construct, such as quality of life, we require a framework for identifying, if they exist, inherent measurement structures with interval properties. This is provided in the application of the axioms of conjoint simultaneous measurement developed independently by Rasch, and Luce and Tukey in the early 1960s16,17. To reflect an underlying unidimensional latent construct such as need-based quality of life, the CSM model argues that two requirements must be met by any outcome measure: (i) item difficulty (the easier the item in a questionnaire, the more likely it is to be affirmed), and (ii) respondent ability (the more able the respondent, the more likely they are to affirm the item).
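The two requirements can be sketched with the standard dichotomous form of the Rasch model, in which the probability of affirming an item depends only on the difference between respondent ability and item difficulty. The text does not say which form of the model is intended, so the dichotomous form is an assumption here, and the ability and difficulty values (in logits) are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(affirm) = exp(a - d) / (1 + exp(a - d))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


abilities = (1.5, 0.5)  # two hypothetical respondents, in logits


def person_gap(difficulty):
    """Log-odds difference between the two respondents on one item."""
    p_high, p_low = (rasch_probability(a, difficulty) for a in abilities)
    return log_odds(p_high) - log_odds(p_low)


# The comparison of the two respondents is the same on an easy and a hard item.
gap_easy_item = person_gap(-1.0)
gap_hard_item = person_gap(2.0)
print(round(gap_easy_item, 6), round(gap_hard_item, 6))
```

Easier items and more able respondents both raise the probability of affirmation, and the gap between two respondents does not depend on which item is used; this is the invariance that allows scores to be placed on a common interval scale when the data fit the model.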
If we consider quality of life measures, where the latent construct is need fulfillment, the items are generated by qualitative patient interviews in a specific disease state. Where data generated by the measure fit the Rasch model, a single index with interval properties is produced that captures response to therapy. QALYs and imaginary lifetime models are irrelevant. In other words, a patient-centric quality of life measure is generated, not a multi-attribute outcome such as the EQ-5D-3L that confuses a clinically based set of symptoms and responses to produce a meaningless outcome.
This is not to say that the Rasch model has been ignored. There are now several need-based disease-specific quality of life instruments available for clinical trials and for evaluating the impact of competing interventions on quality of life18.
Science can only make significant advances if measures are developed that have the required measurement properties: unidimensionality and ratio-level measurement. Utility measures produce composite scores, as they add together several different types of outcome, for example, pain, emotional distress and physical mobility. Composite measurement cannot replace unidimensional measurement.
We have known how to develop unidimensional measures for the last 60 years, through the application of RMT. However, this also requires developing theoretical models that explain the nature of the outcome to be measured and generating relevant content from the people who are the true experts (patients, in the case of quality of life). Such measurement is rare. Fitting measure data to the Rasch model is also a challenge, because of its strict requirements. For this reason, researchers continue to use dated methodologies and look for measurement models that are less demanding. Unfortunately, failing to meet the requirements for fundamental measurement implies that the cost-per-QALY construct is an analytical dead end and that much of the utility modeling conducted in the past 30 years has been profitless.
Abandoning the QALY would be, to say the least, embarrassing. A centerpiece of health technology assessment would be shown to have no discernible value. It is not just a question of pointing to the shortcomings of QALYs, but of making it clear that the QALY, as exemplified in incremental cost-per-QALY modeled claims, is an impossible construct. Claims for pricing and access for pharmaceutical products and devices that rest on this construct must be rejected; they are not realistic.
This article is intended to demonstrate that, in failing to appreciate the axioms of fundamental measurement, the utility values included in QALY analyses are an analytical dead end. If we are to assess the impact on patients of emerging therapies accurately, we need a disease-specific framework that provides a coherent assessment of the comparative benefits to patients and caregivers. We cannot include approximate information as an element in the evidence (real or imaginary) presented to formulary committees. Just as claims based on phase 3 clinical trials are recognized as robust, so should claims for quality of life and utility meet the same standards. This would free us to return to normal science and hypothesis testing.
No data are associated with this article.
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Health economics and biostatistics
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Methodology in health decision making
Version 1 (26 Aug 20): reports from invited reviewers 1 and 2.