Measurement properties of the translations of instruments evaluating the subjective effects of tobacco- and nicotine-containing products: a systematic review of the literature

Background: Several instruments are widely used for assessing dependence, craving, withdrawal symptoms, and reinforcing effects in users of tobacco- and nicotine-containing products (TNP), including the Fagerström Test for Nicotine Dependence (FTND); Questionnaire of Smoking Urges, original (QSU) and brief (QSU-b) versions; Minnesota Nicotine Withdrawal Scale, original (MNWS) and revised (MNWS-R) versions; and Cigarette Evaluation Questionnaire, original (CEQ) and modified (mCEQ) versions. Although these instruments have been translated extensively, their translations and corresponding measurement properties have not been systematically assessed. This study aimed to (1) identify the translations of these instruments for which psychometric properties have been published, (2) describe the methods used for translation, and (3) describe the measurement properties and the context in which these translations were evaluated (e.g., target population and TNP used).
Methods: Embase and MEDLINE databases were systematically searched.
Results: While no information could be found for the CEQ/mCEQ, several translations were available for the remaining instruments: FTND, 25; QSU and QSU-b, 4 each; QSU (12-item version), 1; MNWS, 4; and MNWS-R, 1. Cigarette smokers represented the main target population in which the validation studies were conducted. Information about the translation process was reported for 25 translations. In most cases, the properties of the translations mirrored those of the originals. Differential item functioning was explored in only one case.
Conclusions: There are few publications describing the measurement properties of the translations of the FTND, QSU/QSU-b, and MNWS/MNWS-R. None of these translations have been validated for TNPs other than conventional cigarettes.
F1000Research 2019, 8:2056. First published: 04 Dec 2019 (https://doi.org/10.12688/f1000research.20595.1). Reviewer status: awaiting peer review.


Introduction
On June 22, 2009, the US Congress enacted legislation (US Congress, 2009) that granted the US Food and Drug Administration (FDA) the authority to regulate tobacco products and the advertising and promotion of such products. In March 2012, the FDA Center for Tobacco Products (CTP) issued a draft guidance regulating applications for modified risk tobacco products (MRTPs) (US Department of Health and Human Services, 2012). This draft guidance mandates that applications must include scientific evidence about the effects of the products on tobacco-use behavior among current tobacco users. In particular, the guidance clearly states that submissions should present "nonclinical and/or human studies to assess the abuse liability and the potential for misuse of the product as compared to other tobacco products on the market." In this guidance, the FDA defines abuse liability as "the likelihood that individuals will develop physical and/or psychological dependence on the tobacco product." Physical dependence encompasses a growing tolerance to product use and/or the onset of withdrawal symptoms when product use ceases. Psychological dependence is mainly characterized by craving and persistent tobacco-seeking and tobacco-use behaviors.
Several authors (Carter et al., 2009; Hanson et al., 2009; Institute of Medicine, 2012) have extensively reviewed measures and methods for assessing dependence, craving, withdrawal symptoms, and reinforcing effects in tobacco- and nicotine-containing product (TNP) users. They have identified some measures either widely used or recommended in tobacco research for the evaluation of tobacco products in general and MRTPs in particular. The most commonly quoted are the Fagerström Test for Nicotine Dependence (FTND) (Fagerström, 1978; Fagerström, 2012; Heatherton et al., 1991), Questionnaire of Smoking Urges (QSU) (Kozlowski et al., 1996; Tiffany & Drobes, 1991), Minnesota Nicotine Withdrawal Scale (MNWS) (Cox et al., 2001; Hughes, 2017; Hughes & Hatsukami, 1986; Hughes, 1992; Hughes & Hatsukami, 1998), and Cigarette Evaluation Questionnaire (CEQ) (Cappelleri et al., 2007; Rose et al., 1998; Westman et al., 1992). In terms of tobacco dependence assessment, the US Institute of Medicine report (2012) acknowledges that the FTND appears to contribute to a more precise estimation of dependence than the "Diagnostic and Statistical Manual of Mental Disorders" criteria. Regarding withdrawal symptoms, the same report mentions the MNWS as a well-characterized measure for assessing reduction of withdrawal symptoms. In their paper describing traditional tools and methods for abuse liability assessment, Carter et al. (2009) refer to the FTND for assessing the magnitude of nicotine dependence, the MNWS for assessing nicotine withdrawal signs and symptoms, and the QSU for measuring craving. In their review on questionnaires for measuring the subjective effects of potential reduced exposure products (PREP), Hanson et al. (2009) conclude that the most widely used scale has been the MNWS or its revised version (MNWS-R), followed by the QSU. They recommend that, at a minimum, these two scales should be included in a battery of assessment tests for PREPs. The authors also mention the CEQ and its modified version (mCEQ) as being widely used. Table 1 describes these measures (FTND, QSU, MNWS, and CEQ) and their evolution over time (QSU-brief [QSU-b], MNWS-R, and mCEQ).

Columns: Measure; History/content; Response scale

FTND
Revised version of the Fagerström Tolerance Questionnaire (FTQ), which was developed in 1978 to provide a short, convenient self-reported measure of nicotine dependence (Fagerström, 1978). It includes 6 questions.
In 2012, in an effort to integrate the total dependence panorama, and given that the FTND has not been validated against all forms of tobacco use (from cigarettes to smokeless tobacco), Dr. Fagerström suggested that the FTND be renamed the Fagerström Test for Cigarette Dependence (FTCD) (Fagerström, 2012).

QSU (32 items)
Developed in 1991 (Tiffany & Drobes, 1991) to assess the potential multidimensional nature of craving report. It originally consisted of 32 items. A two-factor item structure was shown, with factor 1 representing a desire and intention to smoke, with smoking anticipated as pleasurable (15 items of which 10 are negatively keyed), and factor 2 representing an anticipation of relief from negative affect and nicotine withdrawal, with an urgent desire to smoke (11 items positively keyed). The type of desire represented on the first factor was characterized by items such as "I have an urge for a cigarette," and "I have no desire for a cigarette right now" (negatively keyed). In contrast, the second factor seemed to represent a more pressing and urgent state of desire as indicated by items such as "All I want right now is a cigarette," and "My desire to smoke seems overpowering."
Response scale: Seven-point Likert-type scale (1 = strongly disagree and 7 = strongly agree)

QSU (12 items)
Kozlowski et al. (1996) proposed an alternative model using the 12 most robust items from the original analysis.

QSU Brief Version (QSU-b) (10 items)
Cox et al. (2001) developed a 10-item version, which they called the QSU-Brief (QSU-b), to facilitate use in laboratory and clinical settings. When used to derive a global measure of craving, QSU-b displayed high internal consistency across settings, providing a reliable assessment of desire to smoke. Factor analyses showed two distinct manifestations of verbal report of craving. Factor 1 represented a strong desire and intention to smoke, with smoking perceived as rewarding for active smokers, while factor 2 reflected an anticipation of relief from negative affect and an urgent desire to smoke.
Response scale: 100-point scale (0 = strongly disagree and 100 = strongly agree)

MNWS / MNWS revised version (MNWS-R)
The MNWS was developed in 1986 when Hughes & Hatsukami (1986) provided a detailed description of tobacco withdrawal and listed several signs and symptoms to be assessed (seven to nine items) rated on a 4-point scale (not present, mild, moderate, or severe). This measure has evolved over the years (Hughes, 1992), and the scale is now composed of eight symptoms associated with nicotine withdrawal (i.e., craving, irritability, anxiety, difficulty concentrating, restlessness, increased appetite or weight gain, depression, and insomnia). In a short communication, Hughes & Hatsukami (1998) encouraged researchers to use a scale that includes only seven DSM items: depression, insomnia, irritability/frustration/anger, anxiety, difficulty concentrating, restlessness, and increased appetite/weight gain.
Finally, a revised version was proposed, the MNWS-R (Hughes, 2017), which includes 15 items. The first eight symptoms are well-validated items (and the ones to be used if calculating a total withdrawal discomfort score), with the first seven being the DSM original items and the eighth investigating craving. The remaining seven symptoms were considered promising candidate symptoms (impatience, constipation, dizziness, increased coughing, increased dreaming or nightmares, nausea, and sore throat).
Response scale: Items can be rated on an ordinal scale (0 = not present, 1 = mild, 2 = moderate, and 3 = severe) or on a 0-4 scale with the additional descriptor of "slight" between not present and mild, or by using a 100-mm visual analogue scale.

CEQ / Modified CEQ (mCEQ)
The CEQ is a self-reported questionnaire containing 11 items covering both the reinforcing effects (i.e., smoking satisfaction, psychological reward, and enjoyment of respiratory tract sensations) and aversive effects (i.e., dizziness and nausea) of smoking (Rose et al., 1998; Westman et al., 1992).

The objectives of this paper were:
1. To identify translations of the FTND, QSU/QSU-b, MNWS/MNWS-R, and CEQ/mCEQ for which psychometric properties are available;
2. To describe the methods used for translation;
3. To describe the measurement properties and the context in which these translations were evaluated (i.e., study design, target population, and TNP used by the study population).

(1) and (2) was limited to Abstract, Human research, and English. We screened reference lists to identify supplemental pertinent studies.

Selection criteria
Abstracts retrieved through the search strategy were reviewed and excluded if they (1) did not refer to the instruments of interest; (2) referred to the original version of the instruments of interest; or (3) referred to a translation used (a) in an epidemiological or behavioral context (i.e., not reporting measurement properties) or (b) for validating another measure and not for assessing/reporting the internal consistency or structural validity of the instruments of interest. Conference abstracts were excluded.
The reference lists of the papers considered for inclusion were reviewed, and articles of interest were included if they (a) referred to a translation for which internal consistency or structural validity was assessed at minimum or (b) provided additional information on an existing translation identified through the first round of review.
Two independent reviewers performed the selection. Initial data were extracted by one reviewer and then reviewed (and complemented if needed) by another.

Measurement properties
We used the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) categorization (Mokkink et al., 2010a; Mokkink et al., 2010b; Mokkink et al., 2010c; Mokkink et al., 2018) to classify the measurement properties as follows: reliability, validity, and responsiveness to change. A fourth category, sensitivity and specificity, was added where appropriate (e.g., when the instrument was used for screening).
Reliability is described as the overall consistency of a measure, i.e., the degree to which scores for subjects who have not changed are the same when the measurement is repeated over time (test-retest reliability), is done with different evaluators on the same occasion (inter-rater reliability), or is done with the same evaluator on different occasions (intra-rater reliability). Internal consistency reliability, in turn, assesses the consistency of scores across items within a measurement instrument.
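Internal consistency is typically summarized with Cronbach's alpha, the statistic behind the alpha values reported for the translations in this review. The following is a minimal sketch of that calculation, not the procedure used by any of the included studies; the function name and the response data are hypothetical and serve only as illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(responses) < 2:
        raise ValueError("at least two respondents are required")
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item (column) and of the summed score (row totals).
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from five respondents to a six-item questionnaire.
hypothetical_responses = [
    [3, 2, 3, 1, 2, 3],
    [1, 0, 1, 0, 1, 1],
    [2, 2, 3, 2, 2, 2],
    [0, 1, 0, 0, 1, 0],
    [3, 3, 2, 3, 3, 3],
]
print(round(cronbach_alpha(hypothetical_responses), 2))
```

Values closer to 1 indicate that the items vary together, which is what the high alpha values reported for the QSU factors reflect.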
Validity is the degree to which an instrument measures what it is supposed to measure and includes the following:
• Content validity: The extent to which the content of a questionnaire is an appropriate manifestation of the construct to be assessed. The key features are whether or not the items are relevant and that no important concept is missing, i.e., that the measure is comprehensive. As this review deals with translations, this part will include a description of the translation process and whether or not, on a qualitative level, the content of some items was changed to reflect cultural aspects.
• Construct validity: The extent to which the scores of a measure are in accordance with hypotheses based on the assumption that the questionnaire accurately measures the construct to be measured (Mokkink et al., 2010a). We have included the following aspects in construct validity:
o Structural validity: The degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured (Mokkink et al., 2010a). Factor analysis should be performed to confirm the number of subscales present in a questionnaire.
o Hypothesis testing: The extent to which an instrument relates to other instruments in a way that is expected if it is accurately measuring the supposed construct (i.e., in accordance with predefined hypotheses about the correlation or differences between the measures). We have included the following aspects in this category:
- The degree to which the instrument scores correlate with changes in instruments assessing similar constructs, connected but dissimilar constructs, or unconnected constructs.
- The degree to which the instrument scores correlate with biomarkers or measures of TNP consumption (consumption patterns).
- Predictive validity: The degree to which the considered instrument score is predictive of a future outcome or event.
o Cross-cultural validity: The extent to which the performance of the items in a translated or culturally adapted instrument appropriately reflects the performance of the items in the original version of the instrument (Mokkink et al., 2010a). This is evaluated using multi-group factor analysis or differential item functioning (DIF) by utilizing data from populations who completed the original version of the questionnaire and its translations.
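To make the idea of DIF concrete, the sketch below screens one dichotomized item with the Mantel-Haenszel common odds ratio. This is a simpler, non-IRT technique swapped in for illustration; it is not the multi-group factor analysis or IRT-based approach referred to in this review. Respondents are stratified by a matching variable (here a hypothetical dependence level), and the function name, group labels, and data are all assumptions.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio of endorsing an item, reference vs. focal group.

    records: iterable of (group, stratum, endorsed) where group is "ref"
    (e.g., original-language version) or "focal" (e.g., translation),
    stratum is the matching level, and endorsed is True or False.
    A value near 1 suggests no DIF at matched levels of the construct.
    """
    # Per stratum: [ref endorsed, ref not endorsed, focal endorsed, focal not endorsed]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, endorsed in records:
        if group not in ("ref", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        index = (0 if group == "ref" else 2) + (0 if endorsed else 1)
        tables[stratum][index] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def make_records(stratum, ref_yes, ref_no, focal_yes, focal_no):
    """Expand one hypothetical 2x2 table into individual records."""
    return (
        [("ref", stratum, True)] * ref_yes
        + [("ref", stratum, False)] * ref_no
        + [("focal", stratum, True)] * focal_yes
        + [("focal", stratum, False)] * focal_no
    )


# Hypothetical counts for two dependence strata.
hypothetical_records = make_records("low", 10, 30, 18, 22) + make_records("high", 25, 15, 32, 8)
print(round(mantel_haenszel_odds_ratio(hypothetical_records), 2))
```

An odds ratio below 1 in this setup would mean that, at the same matched level, the focal (translated-version) group endorses the item more often than the reference group.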
Responsiveness to change (Hays & Hadorn, 1992) is the ability of an instrument to detect change over time in the construct to be measured. Responsiveness to change is considered an aspect of validity in a longitudinal context.
Sensitivity and specificity are used to evaluate the screening performance of a measure. Sensitivity is the proportion of true positives that are correctly identified, whereas specificity is the proportion of true negatives that are correctly identified. Generally, an optimal cutoff point for the score is selected to minimize the sum of false-positive and false-negative results.
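The definitions above can be sketched as follows. This is an illustrative example under stated assumptions: a score at or above the cutoff counts as a positive screen, the reference classification is given as true/false labels, and the function names and data are hypothetical rather than taken from any included study.

```python
def sensitivity_specificity(scores, is_case, cutoff):
    """Return (sensitivity, specificity) when score >= cutoff screens positive."""
    tp = fn = tn = fp = 0
    for score, case in zip(scores, is_case):
        positive = score >= cutoff
        if case and positive:
            tp += 1
        elif case:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both cases and non-cases are required")
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, is_case):
    """Observed score that minimizes false positives plus false negatives."""

    def misclassified(cutoff):
        false_pos = sum(1 for s, c in zip(scores, is_case) if s >= cutoff and not c)
        false_neg = sum(1 for s, c in zip(scores, is_case) if s < cutoff and c)
        return false_pos + false_neg

    # Ties are resolved in favor of the lowest candidate cutoff.
    return min(sorted(set(scores)), key=misclassified)


# Hypothetical total scores (0-10) and a hypothetical reference diagnosis.
hypothetical_scores = [1, 2, 2, 3, 4, 4, 5, 6, 7, 8, 9]
hypothetical_cases = [False, False, False, False, True, False, False, True, True, True, True]

cutoff = best_cutoff(hypothetical_scores, hypothetical_cases)
sensitivity, specificity = sensitivity_specificity(hypothetical_scores, hypothetical_cases, cutoff)
print(cutoff, round(sensitivity, 2), round(specificity, 2))
```

Minimizing the raw count of misclassifications weights both error types equally; a study that considers one error more costly than the other would choose its cutoff differently.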

Results
The search retrieved 193 articles (Table 2), of which 47 were selected for data extraction. While 46 of these articles described individual investigations on the measurement properties of translated versions of the FTND, QSU/QSU-b, and MNWS/MNWS-R, one was a review of the psychometric properties of the FTND (original and translations) (Meneses-Gaya et al., 2009). No references were found on the CEQ or mCEQ. More details are presented in Figure 1 and Supplementary Material 1 (Table S1), which provides a list of references retrieved and reasons for inclusion/exclusion.

Measurement properties
Measurement equivalence was explored for only one translation, the Chinese version for immigrants to the US (Yamada et al., 2009), for which DIF was examined by using item response theory (IRT). Question (Q) 2 (difficult to refrain) showed statistically significant and substantial DIF, indicating that users of the Chinese version were more likely to endorse this item and to report more difficulty in refraining from smoking in several public places, even after controlling for the nicotine-dependence level. As this DIF item in the Chinese version contributed minimally at the aggregate level, the authors concluded that its impact on scale scores was negligible. Neither unidimensional nor multidimensional results showed DIF for Q1 (time to first cigarette) or Q3 (cigarette hated most to give up), indicating that these two items are DIF-free. The authors concluded that these two items should be retained in the FTND to enable comparison between Chinese- and English-speaking smokers.
(1994) showed that the FTND could predict smoking cessation to a small degree (Study 2: 16-month follow-up, r = -0.11).
Sensitivity/specificity (Se/Sp) was explored for 28% of the translations. Responsiveness to change has never been assessed.

QSU/QSU-b results
We retrieved four translations for the QSU, four for the QSU-b, and one for the QSU-12 (Table 4).
Translation process. The translation process (Table 6) et al., 2015). The description for the Chinese QSU-b was minimal. Only the Brazilian team provided some insight into the problems that arose during translation and the solutions they found.
Abbreviations: F: female; M: male; NA: not available in this paper (reference to another publication); NS: not specified in this paper; SD: standard deviation; S1: Sample 1; S2: Sample 2.

Table 6. Description of translation processes used (steps and people involved if mentioned) for the Questionnaire of Smoking Urges (QSU)/QSU-brief version (QSU-b) translations.
Measure language/country of study
✓ (20 subjects)
Numbers 1 to 7 were added above the Likert scale points that would visually be related to these numbers in the original scale.
Due to differences in the meaning of urge and craving, the initials in the English language (QSU) were used in the name of the scale, with the phrase "Brazilian version" being added. The term "craving" was not translated as "fissura" because of the latter being a popular term that suffers from regional influences and because its use is uncommon (according to the judges of this study) when reference is made to the desire to smoke. Therefore, "craving" was translated as "strong desire" (forte desejo).
F2: Anticipation of relief from negative affect and nicotine withdrawal, with an urgent desire to smoke (11 items, all positively worded)
Craving VAS:
- Non-deprived smokers: F1 and craving VAS were significantly correlated with each other before (r = 0.55) and after (r = 0.60) smoking.
F2 and craving VAS were also significantly correlated with each other before (r = 0.45) and after (r = 0.44) smoking.
F2 and craving VAS were significantly correlated with each other only after smoking (r = 0.50); before smoking, their correlation was r = 0.
(2,3,5,7,12,13,14,15,18,19,20,23,24,25,29,30,31)
F2: Desire to smoke and the anticipation of smoking pleasure - 13 items (4,6,8,10,11,16,17,21,22,26,27,28,32)
2,4,7,9,10,12)
F2: Intention and desire to smoke - 5 items (3, 5, 6, 8, 11)
No. of cig./day: Total score:
(1) a better fit was found with the four-factor and two-factor models than with the one-factor model, and (2) the two-factor model provided a better fit than the four-factor model in both samples. In addition, their data suggested that the presence of mostly negatively worded items in F1 contributed largely to the two-factor structure of the QSU. Analysis with only negative items in F1 greatly improved the model fit in both data sets. According to the authors, these findings question the original interpretation of the nature of the dimensions measured by the two factors of the QSU.
• QSU-b: The authors of the Dutch and Malay versions reported differences from the original QSU-b (Cox et al., 2001), which, when used to derive a global measure of craving, showed high internal consistency across settings and provided a reliable assessment of the desire to smoke. In the original, factor analyses yielded two distinct manifestations of verbal report of craving: F1 represented a strong desire and intention to smoke, with smoking perceived as satisfying for active smokers, whereas F2 reflected an anticipation of relief from negative affect and an urgent desire to smoke.
The first factor (F1) of the Dutch version (Littel et al., 2011) corresponded with the second factor (F2) of the English QSU-b (items 2, 4, 5, 8, and 9). F2 comprised items 1, 3, 6, 7, and 10. Items 2 and 5 loaded strongly on F1, whereas they had originally cross-loaded. The authors attributed this discrepancy to language differences. Items 2 and 5 (i.e., "nothing would be better than smoking a cigarette right now" and "all I want right now is a cigarette") communicate quite extreme statements, especially when literally translated into Dutch. F2 corresponded with the first factor of the original QSU-b, although, in the Dutch study, items 1 and 6 loaded on two factors. Again, Dutch language might be an explanation for these items loading on both factors. Items 1 and 6 include the words "desire" and "urge." Although phrases such as "I have a strong desire or urge for a cigarette," might be used in Dutch, it is far more common to use less potent expressions (e.g., "I would like/fancy a cigarette"). Nevertheless, items 1 and 6 are less extreme than the items assigned to F1. The authors did not add "anticipation of pleasure from smoking" to the name of this factor, because the subscale was not significantly correlated with either positive or negative affect.
In the Malay version (Blebil et al., 2015), factors 1 and 2 corresponded with those in the original version, with items 2 and 5 strongly loading on F2. The authors attributed this cross loading to the phrase "strong urge" conveying extreme utterances when literally translated into Malay.
Internal consistency was explored for all translations of the QSU. The alpha values for QSU F1 and F2 ranged from 0.89 (Guillin et al., 2000) to 0.96 (Araujo et al., 2006).
Responsiveness to change, predictive validity, sensitivity, and specificity were not assessed.

MNWS results
Four studies were retrieved for the MNWS (Table 8) Table 9).
Most subjects reported previous attempts to quit, except in the Malay sample (Blebil et al., 2014), where 77% of the subjects had not attempted to quit previously.
All studies were run with samples of, on average, moderate smokers, except for the study involving Koreans living in the US, which had recruited light smokers (Kim et al., 2007). In comparison, the original MNWS was developed with heavy smokers (Hughes & Hatsukami, 1986). Mean participant age ranged from 34 to 47.7 years. Men were predominant in all studies except the Italian study, where women were slightly preponderant (59%) (Svicher et al., 2017).
Translation process. All four papers provided a description of the translation process used to develop each translation (Table 11). Only the Korean version (Kim et al., 2007) presented a brief report of the difficulties encountered and solutions found. References to guidelines or recommendations were given for all translations except for the Italian version (Svicher et al., 2017). Descriptions of the translation process were detailed for all translations except the Chinese version (Yu et al., 2010).

Measurement properties.
Table 12 reports the measurement properties explored for each translation. All translations were assessed for structural validity, with a one-factor structure reported for the Italian MNWS eight-item version (Svicher et al., 2017) and the Malay nine-item version (Blebil et al., 2014). A two-factor structure was reported for the Chinese nine-item version (Yu et al., 2010) and the Korean nine-item version (Kim et al., 2007). The structure of the Chinese version was identical to the two-factor structure of the original version reported by Cappelleri et al. (2005): negative affect (F1, four items: depressed mood; irritability, frustration, or anger; anxiety; and difficulty concentrating), insomnia (F2, two items: difficulty going to sleep and difficulty staying asleep), and three single items (craving, restlessness, and increased appetite). A review of the items showed a slight discrepancy in those used as originals, with impatience listed in the Korean version but not in the Chinese, where insomnia represents two items (difficulty going to sleep and difficulty staying asleep). For the Korean version, F1 represented early-occurring disorders in mental functioning, and F2 represented disorders in physiological functioning and late-occurring disorders in mental functioning (i.e., increased appetite, disturbed sleep, depression, and impatience), explaining 66% of the variance. A two-factor structure was also reported in the Italian MNWS-R. Measurement equivalence using DIF was never assessed.
Correlations with consumption patterns and biomarkers were reported for three translations: Chinese, Italian, and Malay. Correlations with self-reported measures of dependence, craving, and anxiety were explored for all translations (Table 12).
Responsiveness to change, predictive validity, sensitivity, and specificity were not assessed.

Discussion
Given the globalization of tobacco research and control, we expected to retrieve more than 25,9,4
*?: Not clearly specified in the paper, but recommended in quoted guidelines; **Translation Process referred to as Linguistic Validation Process.
-Gaya et al., 2009; Osório Fde et al., 2013], and validity of Q3 (hated the most to give up). Translation measurement outcomes that call into question the validity of the original instrument may raise questions about the need to modify the content of the original. In this context, there is a well-known precedent: The International Quality of Life Assessment Project (Aaronson et al., 1992) is a prominent example of the development of translations of a PRO measure (i.e., the Short Form-36 [SF-36] Health Survey) that led the developer to change the original US instrument. The development and validation of the translated versions contributed to improvements in item wording and response categories and to the creation of the SF-36v2 Health Survey (Ware, 2007).
Our review showed that cross-cultural validity is rarely explored. Measurement equivalence using an IRT-based approach for examining DIF is almost never applied. This is a concern, as it might make it difficult to know if the scores obtained with the translations of these measures are comparable across languages and cultures and whether or not it is relevant to aggregate data from studies conducted in different countries.
Based on their extensive experience in cross-cultural evaluation, researchers from the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group have suggested that DIF should be part of the validation of questionnaire translations (Petersen et al., 2003;Scott et al., 2006;Scott et al., 2009). In their research, DIF analyses were conducted to identify items answered differently by language administration, reflecting either linguistic issues (e.g., imperfect translation) or cultural differences. Overall, they showed that, although most of the EORTC QLQ-C30 items seemed to have good linguistic equivalence, several scales presented highly conflicting results for some translations. They implied that some of these effects might be substantial enough to affect the outcomes of clinical studies, as translation differences in an item could result in clinically important differences at the scale score level.
Finally, our review showed that none of the translations has been validated with candidate MRTPs, indicating that more research is needed to comply with regulatory recommendations on the development of self-reported measures for use in labeling claims (US Department of Health and Human Services, 2009).
The main limitation of our research lies in its descriptive design. We did not provide insights on the quality of the translated versions (i.e., ratings on the translation process and the quality of the measurement properties) (Schellingerhout et al., 2011;Thoomes-de Graaf et al., 2016). Further research is needed to critically appraise the quality of the translations and guide researchers in their search for the best translation for their studies.
These results showing (1) discrepancies between the number of translations available, with and without documented information about their measurement properties, (2) heterogeneity in the scope of measurement properties explored and in the characteristics of the samples recruited, and (3) lack of validation with TNPs other than conventional cigarettes raise the need for generating a new initiative with two main goals (i.e., information and development).
First, implementation of a centralized repository for measurement instruments (original version and translations) with a licensing structure (endorsed by the developers of the originals) would enable researchers to have access to the most up-to-date information about measures (i.e., development history and psychometric properties). By identifying existing translations and documenting them, this implementation might also help prevent the development of multiple translations for the same language and avoid concerns about which translation to use (Anfray et al., 2009). Furthermore, engaging the developers of the original versions in this process might help protect the integrity of each measurement instrument included (Anfray et al., 2018).
Second, if the original versions and translations of these measures are not appropriate for candidate MRTPs, fit-for-purpose measurement instruments (i.e., concept-driven instruments providing interpretable outcomes for the intended purpose) should be developed to enable comparison of combustible and noncombustible products on the same risk continuum. A similar initiative was launched several years ago, which led to the development of the ABOUT™ Toolbox (Assessment of Behavioral OUtcomes related to Tobacco and nicotine products) (Chrea et al., 2018). The measurement instruments included in this Toolbox are at different degrees of development. With their dissemination on ePROVIDE™, researchers will be able to use instruments that are (1) developed and validated with state-of-the-art scientific methods to be psychometrically sound, straightforward to implement in clinical and population-based studies, and easy to interpret; (2) created to be relevant and applicable across the whole spectrum of TNPs and across various populations; and (3) designed to enhance standardization and comparison of data on perception and behaviors toward MRTPs across academic, industry, and public health research communities.

Data availability
Underlying data
All data underlying the results are available as part of the article and no additional source data are required.

Extended data
This project contains the following extended data:
• Supplementary file 1: List of the 193 references retrieved during the literature search.
• Supplementary file 2: Tables presenting detailed information on the FTND translations.
- Table S2.1. Sociodemographic/design characteristics and targeted country/language of studies evaluating the measurement properties of the translations.
- Table S2.2. Description of translation processes used (steps and people involved if mentioned) for the FTND translations.
- Table S2.3. Measurement properties of the translations of the FTND.

Reporting guidelines
Open Science Framework: PRISMA checklist for: "Measurement properties of the translations of instruments evaluating the subjective effects of tobacco- and nicotine-containing products: a systematic review of the literature" https://doi.org/10.17605/OSF.IO/3Z2EV (Acquadro, 2019).
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Author contributions
CA, CC and NM conceived the idea of the manuscript. CDG performed the literature search. CA and CDG reviewed the abstracts and selected the articles for review. CA developed and drafted the manuscript. CC, CDG, MH, NM, and RW critically reviewed the manuscript. All authors read and approved the final version of the manuscript.