Harmonisation of welfare indicators for macaques and marmosets used or bred for research

Background: Accurate assessment of the welfare of non-human primates (NHPs) used and bred for scientific purposes is essential for effective implementation of obligations to optimise their well-being, for validation of refinement techniques and novel welfare indicators, and for ensuring the highest quality data is obtained from these animals. Despite the importance of welfare assessment in NHP research, there is little consensus on what should be measured. Greater harmonisation of welfare indicators between facilities would enable greater collaboration and data sharing to address welfare-related questions in the management and use of NHPs. Methods: A Delphi consultation was used to survey attendees of the 2019 NC3Rs Primate Welfare Meeting (73 respondents) to build consensus on which welfare indicators for macaques and marmosets are reliable, valid, and practicable, and how these can be measured. Results: Self-harm behaviour, social enrichment, cage dimensions, body weight, a health monitoring programme, appetite, staff training, and positive reinforcement training were considered valid, reliable, and practicable indicators for macaques (≥70% consensus) within a hypothetical scenario context involving 500 animals. Indicators ranked important for assessing marmoset welfare were body weight, NHP induced and environmentally induced injuries, cage furniture, huddled posture, mortality, blood in excreta, and physical enrichment. Participants working with macaques in infectious disease and breeding identified a greater range of indicators as valid and reliable than did those working in neuroscience and toxicology, where animal-based indicators were considered the most important. The findings for macaques were compared with a previous Delphi consultation, and the expert-defined consensus from the two surveys used to develop a prototype protocol for assessing macaque welfare in research settings. 
Conclusions: Together the Delphi results and proto-protocol enable those working with research NHPs to more effectively assess the welfare of the animals in their care and to collaborate to advance refinement of NHP management and use.

Research highlights
Scientific benefit(s)
• Harmonises welfare indicators for macaques, enabling inter-lab comparative studies and also greater data sharing to boost sample sizes for welfare-focused research.
• Ranks welfare indicators for macaques and marmosets and narrows the field for further investigation of those considered most important by experts.
3Rs benefit(s)
• Identifies context-appropriate welfare indicators that are valid, reliable and practicable, allowing better assessment of welfare, minimisation of harm and evaluation of the impact of refinement techniques.
• Potentially benefits the welfare of an estimated 100,000 non-human primates (NHPs) used globally per year in biomedical research.

Introduction
Globally, an estimated 100,000 non-human primates (NHPs) are used annually in biomedical research and testing, with a far larger number housed in breeding facilities (Carlsson et al., 2004;Lankau et al., 2014;Zhang et al., 2014;Vermeire et al., 2017;Grimm, 2018). Accurate assessment of the welfare of these animals is essential for fulfilling ethical and legal obligations to minimise any harm caused by scientific or veterinary procedures, and for the effective implementation of refinement techniques such as analgesia and humane endpoints (Rennie & Buchanan-Smith, 2006;Jennings & Prescott, 2009;Hawkins et al., 2011;Descovich et al., 2019). It is also important for evaluating enhancements to animal management aimed at promoting positive welfare states and good psychological well-being, such as environmental enrichment and training for cooperation with husbandry (Chamove, 1989;Segal, 1989;Bassett et al., 2003;Lutz & Novak, 2005;Buchanan-Smith et al., 2005;Buchanan-Smith, 2010a;Coleman & Maier, 2010;Coleman & Novak, 2017). In some countries, there is a requirement for in vivo researchers to report to regulators the 'actual severity' experienced by the animals used in their experiments (European Union, 2010;Home Office, 2014;USDA APHIS, 2018), which is predicated on the ability to recognise and accurately measure pain and distress. Welfare assessment is also a component of the scientific method, because physiological and psychological responses to suffering can significantly affect data quality (Poole, 1997;Institute for Laboratory Animal Research, 2008). Minimising avoidable suffering is therefore necessary to ensure the validity of the scientific research performed (Novak & Petto, 1991;Graham & Prescott, 2015;Hannibal et al., 2017;Prescott et al., 2021).
Most NHP facilities have dedicated and highly trained animal care staff who go to great efforts to optimise the well-being of the NHPs in their care (Coleman, 2011), and effective welfare assessment tools will enable them to better accomplish this. It is recognised that welfare assessment should encompass both physical health and psychological well-being (National Research Council, 1998;Wolfensohn & Honess, 2005;Jennings & Prescott, 2009). However, working evaluations of laboratory NHP welfare are often based on measurements of various indicators presumed to be related to the extent of failure to cope, or difficulty in coping, with the environment (Lutz et al., 1991;European Commission, 2002). Modern welfare assessments should also aim to evaluate positive as well as negative states of individuals (Hawkins et al. 2011;Wolfensohn et al., 2018). Social play, allogrooming, food sharing, exploration, and relaxed gait have been suggested as behavioural indicators of positive NHP welfare in the laboratory, though relatively few have been validated (University of Stirling, 2011;Blois-Heulin et al., 2015;NC3Rs, 2015;Ahloy-Dallaire et al., 2018;Miller et al., 2020).
Most facilities that house or breed NHPs for research (i.e. laboratories, breeding centres, etc.) utilise a combination of animal-based indices, as this gives the best estimate of an individual NHP's welfare state (Novak & Suomi, 1988;National Research Council, 1998;Jennings & Prescott, 2009). These include physical or somatic observations (e.g. susceptibility to disease; growth rate; coat and body condition), physiological measurements (e.g. heart rate; body temperature; plasma cortisol), and structured behavioural assessments (e.g. behavioural repertoire; activity budgets; presence of quantitative or qualitative behavioural abnormalities) (Poole, 1988;European Commission, 2002;Wolfensohn & Honess, 2005;Gottlieb & Pomerantz, 2021;Novak & Meyer, 2021). Some animal-based indices used in practice, such as stereotyped behaviour (e.g. pacing), have been criticised for their lack of validity or validation (Poirier & Bateson, 2017;Polanco et al., 2021) and specificity (Descovich et al., 2019). Regardless, animal-based indices can be used to assess the outcome of providing resources for animal care, such as cage space and a varied diet.
A variety of resource-based indicators, which are variables measured not in the animals but in the environment, are also used to assess welfare (e.g. size and design of enclosures; provision of environmental enrichment; health monitoring programmes). These input-based, engineering criteria are attractive because they are objective, less time intensive, and easy to measure (e.g. during site inspection) (Johnsen et al., 2001;Mench 2003); however, they are often indirect measures of welfare and can be experienced differently by individuals (e.g. Izzo et al., 2011;Velarde & Dalmau, 2012). Used alone, they do not effectively evaluate the welfare state of individual animals; but used alongside animal-based outcome indices, resource-based input indices can usefully contribute to welfare assessments, and are important for standardising within and between facilities, especially if founded on validated welfare needs (Beaver & Bayne, 2014;Bennett et al., 2018) (Figure 1).
Despite the importance of welfare assessment in NHP research, there are few established welfare assessment tools, and little is known about the level of consensus within the research community on whether the available indices are considered valid (i.e. genuinely measuring an aspect of an animal's welfare state), reliable (i.e. can be measured consistently across and between users), and practicable (i.e. can be measured with limited time, resources, and within facility constraints). Truelove et al. (2020) conducted a Delphi consultation, an iterative, multi-stage survey technique, to identify laboratory macaque welfare measures and their relative importance. A list of 115 potential indicators for use in welfare assessment of macaques (54 animal-based and 61 environment-based items) was provided to a panel of macaque experts, predominantly from North America. Experts indicated which indicators were valid, reliable, and practicable to measure using the provided on-site scenario (Table 1a) and a composite percentage agreement score was assigned to each indicator, allowing subsequent ranking. Among the 39 experts who completed the two rounds of the survey, resource/environment-based measures were considered better suited than animal-based ones for on-site welfare assessment, with the presence of self-harm behaviours and provision of social enrichment considered the most important indicators for assessing macaque welfare; a total of 56 indicators were selected as being valid, reliable, and practicable. The ten indicators with the highest composite respondent percentage agreement score following two rounds of ratings included only one animal-based indicator (self-harm behaviour). These 56 indicators were presented as part of the current study, in part to gauge validity of the measures found in Truelove et al. (2020), as well as to uncover any indices that a different group of experts might accept or reject as useful in assessing macaque welfare and whether any of the indices can be applied to a different primate species used in research (marmosets).
Figure 1. Some resource-based inputs and animal-based outputs that can be used to assess non-human primate (NHP) welfare.
If there was a broader consensus on appropriate indicators of suffering and well-being in NHPs used for research, and widely applicable welfare assessment tools, then this would help researchers, veterinarians, and other animal care staff better fulfil their obligations to optimise the welfare of the animals in their care. Importantly, it could also facilitate greater collaboration and data sharing between research facilities to address welfare-related research questions, such as the impact of common procedures and putative refinements. Not only would this boost sample sizes for welfare-focused studies, especially those which must piggy-back onto ongoing scientific procedures conducted primarily for another research purpose, but it would also enable inter-laboratory comparative studies to identify how variation in management practices influences animal welfare (Bliss-Moreau et al., 2021); doing so across an international audience might also identify practices diverging due to differences in culture and research specific to a region (e.g. McMillan et al., 2017; Baker & Prescott, unpublished work). In 2017, the United Kingdom's national 3Rs centre (NC3Rs) led an international data crowdsourcing project to establish the prevalence and potential triggers for aggression-related injury in group-housed male laboratory mice (Lidster et al., 2019), a significant problem affecting the murine research community. In total, 143 animal technicians from 44 facilities collected aggression and husbandry data on over 137,000 mice using a common data collection framework. By comparing the prevalence of aggression and husbandry variables between facilities, the key factors that influence levels of aggression in male mice were identified, leading to recommendations for practical changes to husbandry to minimise aggressive behaviour and improve mouse welfare.
This work illustrates the potential for welfare improvements when tapping into the expertise of a large group, regardless of the approach taken (e.g. crowdsourcing, Delphi).
To achieve broad consensus for NHP welfare indicators, and to develop a practical protocol for assessing macaque welfare, advantage was taken of the assembly of a group of NHP experts at the 2019 NC3Rs Primate Welfare Meeting. This international event supports laboratory and breeding centre staff working directly with NHPs to develop, share, and implement evidence-based refinements in NHP use and care. At the 2019 meeting, a hybridised Delphi consultation was undertaken to help harmonise NHP welfare assessment by gaining agreement amongst the experts on a list of macaque welfare indices that are valid, reliable, and practicable. Macaca experts either rejected or accepted the indices as measures of welfare for macaques, as did Callithrix experts for marmosets, revealing whether welfare indicators identified for one species are applicable to other NHP species used or bred for research. Additionally, participants were surveyed about the methods used to measure each of these indices.
A classical Delphi consultation is an iterative, multi-stage survey technique that involves controlled feedback to a panel of anonymous subject experts; the consultation results in statistical group consensus on a selected topic as indicated by response stability between rounds (Van Zolingen & Klaassen, 2003). This is in contrast to the group Delphi/expert workshop approach, in which a panel of experts work together, rather than independently, on a topic to arrive at consensus (Webler et al., 1991); all other elements are identical. We integrated both approaches for this study, using a classical Delphi in one round and a group Delphi in another round. Achieving consensus between experts increases the validity of the welfare assessment protocol and ensures that it incorporates a wide range of expert opinions, so that it is not perceived as an imposition from a single group of people (Boulkedid et al., 2011).

Methods
Online survey software from Qualtrics was used to survey the delegates of the NC3Rs Primate Welfare Meeting (8 November 2019, London) about their views on welfare indices for macaques and marmosets, as part of a hybridised Delphi consultation process. The inclusion criteria were being a delegate of the meeting and being directly involved in the care, use or breeding of NHPs for research, which all participants met. The survey was constructed and administered by the authors. Participation was voluntary and responses were submitted using personal mobile devices. The link to the survey was emailed on the day of the meeting and also displayed at the event. Participants completed a consent statement online at the start of the survey. Additionally, if any participant wished to withdraw consent at any time, they were asked to contact the NC3Rs team who would then remove the data they had supplied. All delegates provided consent and no delegates subsequently retracted consent. Quasi-anonymity was maintained: responses remained unknown to other participants but were known to the authors and response data were coded by username after receipt so that individuals' responses could not be readily linked. Data collection procedures were approved by the Ethics Committee of the Faculty of Science, Agriculture and Engineering at Newcastle University. All data were managed according to a data management plan for NC3Rs office-led data sharing projects.
Participants were researchers, veterinarians, and animal technologists working directly with NHPs in nine countries (United Kingdom [UK], France, Germany, Hungary, Italy, Netherlands, Sweden, Switzerland, and the United States of America [USA]), with three-quarters based within the UK. Respondents were asked to identify their species of focus (macaque or marmoset) and area of specialty (neuroscience, infectious disease, toxicology, breeding, or other). In this way, we were able to actively control for species as a potential source of bias in the study. The survey method (hybridized Delphi) also addressed two potential biases (dominance effect and Von Restorff effect) through the use of multiple rounds and anonymity (Hallowell, 2009).
Multiple steps were required to complete the hybridised Delphi consultation process (Figure 2). First, participants were presented with the scenario in Table 1a. They were then presented with the top ten welfare indices identified as valid, reliable, and practicable (≥70% consensus) for assessing macaque welfare in the Delphi consultation of Truelove et al. (2020) (Table 1b) and asked:
Q1 "Which of the following indices do you think are the most valid and reliable for assessing NHP welfare? (select as many indices as you feel are appropriate)"
Q2 "How practical are the indices you selected for assessing NHP welfare from the top ten?" (with the options: "Very impractical; Impractical; Neither; Practical; Very practical")
Next, the participants were presented with a more extensive list of 56 welfare indices from the aforementioned Delphi consultation and asked:
Q3 "Of the 56 indices, which do you think are the most valid and reliable for assessing NHP welfare? (select as many indices as you feel are appropriate)"
Q4 "How practical are the indices you selected for assessing NHP welfare from the 56 indices?" (with the options: "Very impractical; Impractical; Neither; Practical; Very practical")
Finally, participants had the opportunity to suggest additional indices that they considered to be valid, reliable, and practicable for assessing welfare (Q5; free text responses; 5% threshold).
In a second round of the consultation, the participants were split into pre-assigned groups according to the scientific disciplines they worked in and whether their work involved marmosets or macaques. Working as a group and bearing in mind the same scenario (Table 1a), participants were provided feedback from the first round as to which welfare indicators were considered valid and reliable, and which were considered practicable. They were asked to discuss and then define how they would measure each of the indices identified as being valid, reliable, and practicable in round 1 (i.e. at or exceeding 70% consensus, as per Truelove et al., 2020 and Leach et al., 2008). Specifically, they were asked to consider the following and then respond as a group:
Q6 "Are you recording this measure at an individual, group/cage, room or unit level?" (with the option: "Other [please specify]")
Q7 "How would you record this measure? i.e. what method and equipment (if any) would you use?" (free text responses)
Q8 "How long would you spend recording this measure? i.e. would you measure this intermittently, and how frequently or constantly and over what period of time etc.?" (free text responses)
Q9 "What proportion of animals/groups/rooms/units would you assess in order to get a meaningful assessment?" (selected choice in 10% intervals from <10% to 100%; with the option: "Other [please specify]")
The top ranked welfare indicators for macaques identified during the two Delphi consultations and the information obtained regarding their measurement were then used, along with the expertise of the authors, to construct a prototype protocol for assessing macaque welfare in research settings. There were too few marmoset data to generate a welfare assessment protocol for this species.

Statistical analysis
Many Delphi studies have used percentage measures as their primary indication of consensus, despite disagreement as to whether this is adequate (Hsu & Sanford, 2007). We set an a priori agreement level of 70% or greater for consensus, as has been done in other animal welfare and healthcare studies (e.g. Leach et al., 2008;Keeney et al., 2011).
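As an illustration of the percentage-agreement criterion, the following minimal Python sketch classifies indicators against the 70% consensus threshold. The 65% lower bound for "approaching consensus" follows the band reported in the Results; the function names and the example counts are hypothetical and are not taken from the study data.

```python
from typing import Dict

CONSENSUS = 70.0    # a priori consensus threshold (percent), as stated in the text
APPROACHING = 65.0  # lower bound of the "approaching consensus" band (65-69.99%)


def percent_agreement(selected: int, respondents: int) -> float:
    """Percentage of respondents who selected an indicator."""
    if respondents <= 0:
        raise ValueError("respondents must be positive")
    return 100.0 * selected / respondents


def classify(pct: float) -> str:
    """Map a percentage agreement score onto a consensus category."""
    if pct >= CONSENSUS:
        return "consensus"
    if pct >= APPROACHING:
        return "approaching"
    return "no consensus"


def summarise(counts: Dict[str, int], respondents: int) -> Dict[str, str]:
    """Classify every indicator given how many respondents selected it."""
    return {
        name: classify(percent_agreement(k, respondents))
        for name, k in counts.items()
    }


# Hypothetical selection counts, for illustration only.
example_counts = {"indicator A": 60, "indicator B": 45, "indicator C": 30}
result = summarise(example_counts, respondents=67)
print(result)
```

With 67 respondents, an indicator needs at least 47 selections to reach the 70% threshold (47/67 is about 70.1%).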
Descriptive statistics were used to summarize the participants' responses per round. For all completed surveys, there were no missing data. For those surveys started but without any collected data (i.e., no answers provided but an identifier was issued), these were removed so as to not inflate the number of participants. Data were imported into Microsoft Excel for Microsoft 365 (2021) and summarised for analysis; participant identifiers were removed to maintain anonymity. Free text comments were analysed qualitatively and were grouped by similar idea by one coder (MAT).
To complete the Delphi process, group stability (i.e. consistency of response between rounds) must be demonstrated (von der Gracht, 2012); this was assessed using Krippendorff's alpha coefficient (α) (Hayes & Krippendorff, 2007). For interpretation, a value of 1 indicates perfect agreement whereas a value of 0 indicates agreement no better than that expected by chance; a value of 0.667 or more permits (tentative) conclusions to be made (Krippendorff, 2004).
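For readers unfamiliar with the statistic, the sketch below computes Krippendorff's α for nominal data from a coincidence matrix, using α = 1 − D_o/D_e (observed over expected disagreement). It is an illustrative implementation, not the authors' analysis code; treating each survey round as an "observer" and each indicator as a "unit" is an assumption about how stability might be coded, and the example data are the two-observer binary example commonly used to illustrate the calculation, not study data.

```python
from collections import defaultdict
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds one sequence per unit, containing the value assigned by
    each observer (None marks a missing value).
    """
    # Coincidence matrix: each ordered pair of values from different
    # observers within a unit contributes 1 / (m - 1), where m is the
    # number of values available for that unit.
    coincidences = defaultdict(float)
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # units with fewer than two values are not pairable
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values.
    totals = defaultdict(float)
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable values")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1.0 - observed / expected


# Two observers rating ten units as 0 or 1 (illustrative data only).
first = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
second = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal(list(zip(first, second)))
print(round(alpha, 3))
```

For these data the observers agree on six of ten units, yet α is close to 0 because that level of agreement is roughly what chance alone would produce given the marginal frequencies.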

Generalised macaque welfare assessment protocol (GEN-MAC)
Our generalised macaque welfare assessment protocol aims to offer a practical and context appropriate tool for laboratory staff caring for macaques (Table 2). It provides a quantitative set of criteria to support staff in monitoring and maximising macaque health and well-being, based on expert consensus. The tool encompasses all four domains of potential welfare compromise (i.e. nutritional, environmental, health, behavioural) identified by the Welfare Quality ® project (Blokhuis et al., 2013) and Mellor et al. (2009). Taken together, the chosen indicators should provide an assessment of an individual animal's welfare, and hence, when repeatedly measured over time, provide an assessment of its quality of life (Fraser, 2008). We acknowledge that good animal welfare is more than the mere absence of negative experiences and recognise that the tool incorporates few indicators of positive welfare state currently; however, validation of these is proving difficult (e.g. see Ahloy-Dallaire et al., 2018 for a discussion of the relationship between play and positive affective states). As new indices of positive state are validated, they can be incorporated into this tool.
This tool is not intended to replace welfare assessment protocols tailored to specific scientific disciplines, projects, procedures, and adverse effects. Rather it presents an appropriate number of valid, reliable, and practicable indicators for a generalised assessment of "wellness" that can inform and augment existing specific tools. This generalised tool is particularly suited for high level assessments of the outcome/quality of institutional behavioural management programmes and comparisons between laboratories. Where appropriate, facilities working in specific disciplines may wish to supplement this core set of indicators with additional ones listed in Figure  Where physiological measurements are required, the least invasive and most refined method that will provide the necessary data should be used (e.g. Davenport et al., 2006;Rennie & Buchanan-Smith, 2006;Smiley Evans et al., 2015). Awareness of the context for the assessment is important; for example, food intake can be reduced following administration of anaesthetic and analgesic medication, as well as due to pain or illness.
Like other welfare assessment tools, this one combines animal-based measures of welfare with indirect resource- and staff-based ones, which are more amenable for assessing the welfare of large numbers or groups of animals when under time constraints. There is evidence that the resource- and staff-based measures included are closely associated with outcomes indicative of good animal welfare in macaques, even if they do not guarantee that any one animal is experiencing a good quality of life (Jennings & Prescott, 2009; Schapiro et al., 2014). For example, it can be time consuming to measure affiliative social interactions, but an acceptable alternative is to check the macaques are at least socially housed with the opportunity for normal social behaviour and there is no evidence of NHP induced injury. Our approach to scoring of staff-based indicators allows a degree of flexibility and rewards programmes which incorporate elements of good practice, though users can choose to focus on the other indicator types if they wish. Most of the animal-based indicators can be directly and objectively measured after only a short period of staff training, and they do not overly disturb the animal. The information can be gathered during site inspections, daily observations, physical exams, and other activities, such as handling for scheduled scientific procedures. Where there is the option to conduct more detailed, extended behavioural observations (e.g. analysis of closed-circuit television [CCTV] recordings), we would encourage this as it will provide greater insight into an animal's welfare state, especially if compared against a baseline measurement of normal behaviour for individual animals during their active phase and prior to any study (Council on Animal Care, 2019). A pilot study is underway to assess the time commitment required for completion of assessments using the tool, for a variety of group and colony sizes. It is possible that emerging approaches for automated recording and analysis of NHP behaviour will help to reduce the time required in the future (Rushen et al., 2012; Witham, 2018; Bain et al., 2021).

Table 2. Generalised welfare assessment protocol for laboratory-housed macaques.
Animal-based (score for each animal): 1. Self-harm behaviour. E.g. on inspection or recorded in daily logs. Where seen, more frequent and detailed follow-up observations can be made to assess incidence, severity, and impact.
The multi-dimensional assessment should be performed by experienced staff, ideally with a knowledge of the individual animals, so that changes in welfare status can be more readily identified. The indicators included in the protocol do not require veterinary diagnostic expertise or specialist animal behaviour knowledge to be accurately recorded, but the involvement of such experts in implementing the welfare assessment tool is encouraged, particularly in the interpretation of the findings of assessments using this tool. A team approach, with good communication among those involved and periodic testing of inter-observer reliability, will help to ensure reliable assessments and consistent use of the tool (Clingerman & Summers, 2012;Lambeth et al., 2013). Individual animal records can be combined to give an overview of the colony, which can be reviewed periodically or compared with data from other colonies. If the tool is used as part of daily health checks, then scores can be compared over time and a greater severity score assigned where there is repeated evidence of impaired welfare.
To facilitate use of the GEN-MAC protocol in practice, an Excel version is available to download in our Extended data (Leach et al., 2022). The file incorporates the formulae for calculating the welfare scores. We encourage users of the protocol to provide us with feedback, so that the tool can be enhanced; please email mark.prescott@nc3rs.org.uk. When reporting use or adaptation of GEN-MAC in the literature, please use the following citation:

Macaque respondents
Round 1, Phase 1: rating of top ten indices
Percentage scores for validity and reliability, and practicability, of the top ten macaque welfare indicators in the Truelove et al. (2020) Delphi were compared with the scores for the same indices in the current Delphi (Figure 3). Considering respondents working with macaques only, there was agreement between the two consultations that presence of self-harm behaviour, provision of social, food, and physical enrichment, and health and behaviour monitoring are valid, reliable, and practicable welfare indicators for NHPs (>70% consensus). However, whilst included in the top ten indicators in Truelove et al. (2020), cage furniture, humane euthanasia, hear other NHPs, and room ventilation failed to reach the consensus criterion in the current survey.

Round 1, Phase 2: rating of 56 indices
The percentage agreement scores for the more extensive list of macaque welfare indicators presented in Round 1, Phase 2 are given in Table 3. For Table 3 (macaques) and Table 5 (marmosets), "Practical" reflects two practicability categories that have been collapsed (practical + very practical). Those indices reaching less than 70% agreement for practicability have been shaded grey. The 56 indices have also been categorised into the following symbol-coded indicator types:
▪ Animal-based: behavioural #, physiological/physical ##
▪ Environment-based: micro ^ (i.e. cage), macro ^^ (i.e. ambient)
▪ Staff-based: procedural and development +, husbandry ++
Across the 56 welfare indicators presented to the macaque respondents (n=67), only eight (14.3%) met the a priori agreement level of ≥70% for validity and reliability; these were also considered practicable measures. Three were animal-based (self-harm behaviour, body weight, appetite), whilst the other five were environment- or staff-based (social enrichment, cage dimensions, health monitoring, staff training, positive reinforcement training). Three of these eight    ., 2022)). Consensus for three additional indicators was nearly reached, with agreement levels between 65-69.99%; one was animal-based (NHP induced injuries) and the remainder were staff- and environment-based (behavioural management programme, vertical space).
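The collapsing of the two practicability categories can be sketched as follows. This is a minimal illustration only: the response labels follow the options given for Q2 and Q4, but the function name, the choice of denominator (all ratings supplied for an indicator), and the example ratings are assumptions rather than details taken from the study.

```python
from collections import Counter
from typing import Iterable

# The two response options collapsed into a single "Practical" category.
PRACTICABLE = {"Practical", "Very practical"}
THRESHOLD = 70.0  # percent agreement required for consensus


def practicable_percent(ratings: Iterable[str]) -> float:
    """Percentage of ratings that fall in the collapsed 'Practical' category."""
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings supplied")
    hits = sum(counts[label] for label in PRACTICABLE)
    return 100.0 * hits / total


# Hypothetical ratings for one indicator, for illustration only.
ratings = (
    ["Very practical"] * 3
    + ["Practical"] * 4
    + ["Neither"] * 2
    + ["Impractical"] * 1
)
pct = practicable_percent(ratings)
shaded_grey = pct < THRESHOLD  # indices below 70% are shaded grey in the tables
print(pct, shaded_grey)
```

Here seven of ten ratings are "Practical" or "Very practical", so the indicator sits exactly at the threshold and would not be shaded.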
Items were deemed more practicable to measure than they were valid and reliable, with 48 indicators (85.7%) meeting the threshold for consensus for practicability and two additional indicators approaching consensus (room cleaning frequency, frequency of moves during lifetime). Six indicators did not meet the consensus threshold for practicability when considering the hypothetical scenario; four of these were environment- or staff-based (field of view, intentional exposure to novelty, frequency of chair restraint, cage position) and two were animal-based (fear of other NHPs, environmentally induced injuries).

Round 2: measurement of selected welfare indicators
Respondents working with macaques were classified into five categories: neuroscience (n=34), toxicology (n=9), infectious disease (n=4), breeding (n=12), and other (n=8). The 'other' category included the disciplines of reproduction, surgery, metabolic disease, and ethology. Figure 4 shows the indicators chosen as reliable and valid by two-thirds of macaque respondents (i.e. at or approaching consensus) by discipline and indicator type. Items approaching consensus (65-69.99%) are included in these results as some of the groups had small sample sizes and items would perhaps reach consensus with additional participants. Refer to Extended data Supplementary Table 2 for a complete list of the respondent agreement scores for validity and reliability of the indices by discipline category (Leach et al., 2022).
More of the 56 welfare indices were considered valid and reliable by at least two-thirds of respondents in the infectious disease category (33 indices) and breeding category (24 indices), than in the other three categories. Whilst the number of animal-focused indices is relatively similar across the main disciplines (6-8 indices), the infectious disease and breeding categories also include procedural and developmental indices, as well as more micro-environment level indicators. Self-harm behaviour is one of the top four indices in all discipline categories but toxicology, and body weight appears as one of the top four as well in all except breeding. Provision of social enrichment appears in each discipline except toxicology and other. Potential explanations for the variation between the disciplines are given in the Discussion section.
Discipline groups discussed and reported how they would measure the potential welfare indicators deemed valid, reliable, and practicable. The top ten indices from the current survey and how they might be measured are given in Table 4. For every indicator, participants recommended measurement in 91-100% of the population being assessed, whether the unit of measurement was the individual or the facility, to get a meaningful assessment. All indicators except staff training could be measured at the individual level, and there was a mix of methods for recording the indices, with observation, records, or both being recommended.

Marmoset respondents
Round 1, Phase 1: rating of top ten indices
As was the case for macaque respondents, when presented with the top ten indices from Truelove et al. (2020), most respondents working with marmosets considered social, physical, and food enrichment to be indices that can be used to assess welfare, along with self-harm behaviours and the presence of a health monitoring programme; cage dimensions was also considered important. Hearing other NHPs and ventilation were not rated highly (selected by only 50% of respondents) (Figure 5). A smaller proportion of respondents working with marmosets considered a behavioural management programme to be useful for assessing welfare (50%) than did those working with macaques (82.6%). Of these top ten indices, eight were considered practicable by all respondents, and all ten by at least two-thirds of respondents (potentially a consequence of the small sample size; n=6).

Round 1, Phase 2: rating of 56 indices
Overall, 21 of the more extensive list of 56 indicators were considered valid and reliable measures for assessing marmoset welfare (Table 5); eight met consensus, while the other 13 approached consensus. Of these 21, nine are animal-based, with the remainder comprised of environment-based (9) or staff-based (3) indicators. Of the 56 indicators, 50 (89.3%) were rated as practicable by at least two-thirds of the respondents working with marmosets.
The few respondents working with marmosets were classified into three categories: infectious disease (n=4), neuroscience (n=1) and breeding (n=1). Indicators chosen as reliable and valid by at least two-thirds of these respondents are shown by discipline and indicator type in Figure 6. Items approaching consensus (65-69.99%) are included in these results as the groups had very small sample sizes and items would perhaps reach consensus with additional participants. More of the 56 welfare indices were considered valid and reliable by at least 70% of respondents in the infectious disease category (21 indices) and neuroscience category (16 indices) than in breeding (9 indices). Seven indices were selected in all three disciplines (stereotypy, body weight, mortality, NHP induced injuries, cage furniture, animal care observations, and disease surveillance) and all 16 indices in the neuroscience category were also selected by the infectious disease group.

Round 2: measurement of selected welfare indicators
The marmoset experts also discussed and reported how they would measure the potential welfare indices deemed valid, reliable, and practicable. The top eight indices from the current survey and how they might be measured for this species are given in Table 6. Participants recommended that each indicator be measured in 91-100% of the population being assessed, whether the unit of measurement is the individual or the facility, to give a meaningful assessment. All indicators could be measured at either the individual or cage level, as appropriate, with NHP induced injuries also being assessed at the room level. There was a mix of methods for recording these indices, with observation, records, or both being recommended.

Macaque and marmoset respondents
Considering both macaque and marmoset respondents (N=73), the whole group's level of disagreement about the validity and reliability of the top ten indicators identified in Truelove et al. (2020) was high in both phases (Phase 1, α=0.1993; Phase 2, α=0.0915); however, levels remained relatively consistent between phases (Δ 0.1078), indicating group stability. The movement that occurred between phases was in the direction of disagreement (signifying divergence). This was likely a result of the increased options provided to the respondents between phases (i.e., more potential indices in Phase 2) and of not requiring ranking of the same ten items across each phase; respondent fatigue is also a possibility. When respondents were asked to indicate which of 56 potential macaque welfare indicators are valid and reliable (Q3) for assessing NHP welfare, the ten initially presented (Q1) shifted in importance, as evidenced by the proportion of respondents who selected an item as important for assessing welfare (Table 7). Across the two phases, the respondents' average inter-rater agreement was 78%; however, 12 respondents had scores below 70%.
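The between-phase change in α reported above can be reproduced with simple arithmetic. The sketch below uses only the two α values given in the text; the variable names are hypothetical, and it assumes, as the text implies, that a lower α corresponds to less agreement.

```python
# Agreement coefficients as reported in the text.
alpha_phase1 = 0.1993
alpha_phase2 = 0.0915

# Change between phases, rounded to the precision reported.
delta = round(alpha_phase1 - alpha_phase2, 4)

# Assumption: a fall in alpha signals movement towards disagreement.
direction = "divergence" if alpha_phase2 < alpha_phase1 else "convergence or no change"
```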
Across the two Delphi consultations, four indicators were identified as important for both marmosets and macaques: body weight, NHP induced injuries, physical enrichment, and cage furniture (Table 8).

Discussion
This study aimed to achieve consensus on effective indices of welfare for macaques and marmosets bred and used for research through expert consultation about the validity, reliability and practicability of a range of potential indicators. It builds upon the previous Delphi consultation of Truelove et al. (2020) by surveying a larger population of macaque experts working within a broader range of countries (predominantly within the EU) and collecting information on how top-ranking indices should be measured. The larger population also enabled us to explore differences between disciplines (though sample sizes were small for some discipline categories). By combining data from the two consultations, we were able to develop a generalised protocol for welfare assessment of macaque species.
We chose a Delphi process owing to the ability to survey a large number of experts anonymously and independently, without the opinions of any one respondent/group dominating the discussion, and to provide controlled feedback, helping to reduce noise and converge upon quality indicators. The systematic Delphi approach is more rigorous than other group consensus approaches, like case studies or focus groups (Boulkedid et al., 2011). However, it does have limitations which impact on our interpretation of the results (outlined below).
Considering first respondents working with common marmosets, of the 56 indicators presented in Round 1 Phase 2, only 37.5% (21/56) were considered valid and reliable for assessing marmoset welfare, and 89.3% (50/56) were rated as practicable by at least two-thirds of the respondents. This is not surprising given these indicators were those furnished from the macaque literature and experts in Truelove et al. (2020). Of the six indicators rated as not practicable, five were staff-based and one was a micro-environment indicator. Of note in terms of species differences is the observation that ambient environment indicators such as humidity, room temperature, room ventilation, and light intensity were considered valid and reliable by two-thirds of marmoset respondents but not by macaque respondents, probably reflecting the physical needs of these tropical New World monkeys, which are different to those of macaques and temperate living humans (Buchanan-Smith, 2010b).
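The proportions quoted above follow directly from the indicator counts. A minimal Python check, using only the counts given in the text (the helper name is hypothetical):

```python
TOTAL_INDICATORS = 56  # indicators presented in Round 1, Phase 2


def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


valid_and_reliable = percent(21, TOTAL_INDICATORS)  # marmoset respondents
practicable = percent(50, TOTAL_INDICATORS)
not_practicable = TOTAL_INDICATORS - 50  # indicators rated as not practicable
```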
Indicators ranked important for assessing the welfare of common marmosets were body weight, NHP and environmentally induced injuries, cage furniture, huddled posture, mortality, blood in excreta, and physical enrichment. These findings should be viewed as preliminary given only six of our 73 respondents (8%) worked with this species and the indices were from Truelove et al. (2020). Nonetheless, whilst they cannot be said to be indicative of the marmoset research community, these findings have value in identifying potential effective indicators that can be further explored in a subsequent Delphi process involving a larger population of subject experts. We consider it important to conduct this exercise, given the resurgence in the use of this species in biomedical research (Colman et al., 2021). Some of the chosen indicators may reflect signs typically associated with marmoset wasting syndrome, a disease which causes morbidity and mortality in captive colonies (Ludlage & Mansfield, 2003). We note also there was some inter-species application, as indicated by the percentage agreement scores of marmoset experts for seven of the top ten macaque indices in Truelove et al. (2020). It is valuable to have identified welfare assessment indicators that could be applied to multiple NHP species, particularly when conducting a limited-time, on-site assessment.
When presented with 115 potential indicators for macaques, participants in the Truelove et al. (2020) Delphi selected 56 of these as valid, reliable, and practicable within the context of a hypothetical scenario involving 500 animals; more than three times as many environment-based and staff-based indicators (44) were selected as animal-based indicators (12). In the current study and with the same scenario, of the 56 indices, only eight were found to be valid, reliable, and practicable by at least 70% of the macaque respondents. Three of the eight were animal-based (self-harm behaviour, body weight, appetite); the remainder were either environment-based (social enrichment, cage dimensions) or staff-based (health monitoring programme, staff training, positive reinforcement training). In addition, NHP induced injuries (animal-based) and presence of a behavioural management programme (staff-based) approached consensus at 68.7%.
It is notable that no physiological indicators and only one behavioural indicator (self-harm behaviour) are included in the top ten of Truelove et al. (2020), probably reflecting the greater effort required in collecting animal-based data to assess welfare (though 12 animal-based indicators, including body weight, appetite, and NHP induced injuries did reach consensus in Truelove et al., 2020). We speculate that the predominantly European participants of the current Delphi were more open to animal-based indicators than the predominantly North American participants in Truelove et al. (2020) because European macaque colonies tend to be smaller and there is perhaps more staff resource available to obtain information requiring direct measurement. Environment- and staff-based indicators can generally be assessed with more immediacy and ease, and without specialist equipment or judgement (e.g. whether cage furniture is present in the enclosure). However, it should be noted that the mere presence of something does not give a full picture of its contribution to NHP welfare; the quality of the item, and how much it is used, and by which animals, are also important factors.
Of the eight indicators selected in Phase 2, three of these also appeared in the top ten of Truelove et al. (2020) presented in Phase 1 (self-harm behaviour, social enrichment, health monitoring programme), strongly suggesting that these indicators are considered critical for the assessment of macaque welfare. Also exceeding or approaching the 70% threshold in Phase 1 were a behavioural management programme, physical enrichment, and food enrichment, suggesting their consideration as well. Social enrichment was rated the top indicator in Phase 1 (>94% of respondents), reflecting the importance of companionship for psychological well-being in these animals. Social enrichment and self-harm behaviour are well known as important indicators of good and poor welfare, respectively, in NHPs, and there is a large literature on their incidence and relevance to macaque well-being, so it is not surprising that there was consensus agreement on their importance in both Delphi consultations. Fewer than two-thirds of respondents felt a humane euthanasia programme, hearing other NHPs, cage furniture, and ventilation could be used to assess welfare. Opinion was split on how practical cage furniture and ventilation are as welfare indices for macaques. Agreement scores for practicability were generally higher in Truelove et al. (2020) than in the current study, possibly reflecting differences in expert demographics, sample size, and methodology.
A greater number of the 56 indicators were considered valid and reliable by respondents working with macaques in infectious disease research and breeding, than in neuroscience, toxicology and other disciplines, perhaps reflecting a greater awareness of the impact of surgical and husbandry procedures, and the cage environment, on the welfare of these animals (Figure 4). Macaques in neuroscience also undergo frequent sedations, surgeries, and medical procedures, so it is curious these did not approach consensus in this category but did so in infectious disease. Within neuroscience, body weight and appetite are included in the top four indices, reflecting that many macaques used in neuroscience undergo food or fluid control to motivate them to work on behavioural or cognitive tasks whilst brain activity is measured. Self-harm behaviour and stereotypies are within the top seven, reflecting the practice of monitoring, between experimental manipulations, behavioural changes that could compromise the validity of the NHP model. Although they did not meet the set threshold for inclusion as items to rate by all participants, additional indicators suggested by respondents within this discipline included activity level (including the presence of depressive-like behaviour or non-alert inactive behaviour), engagement with and performance on experimental tasks, and water intake (again reflecting the scientific procedures involved).
Within infectious disease, more than half of the indicators (33/56) were considered valid and reliable by three-quarters of respondents, with over 40% (14/33) of these being cage-based. Appetite, body weight, and mortality are in the top five, reflecting that many of the macaques used in such studies will experience disease (Prescott et al., 2021). Stereotypy and self-harm behaviour round out the top five. As in neuroscience, these indices reflect the practice of monitoring behavioural changes between experimental manipulations over prolonged periods of time.
Ten indicators were selected by more than two-thirds of respondents in toxicology, drawn from a range of categories: behavioural, physiological, husbandry-based, and cage-based. Body weight and mortality checks are routinely performed as part of regulatory toxicology studies. Most animals are euthanised for pathology when assigned to toxicology studies, which might account for the appearance of a humane euthanasia programme in the top four. Huddled posture may reflect sickness due to test drug administration.
Within breeding, 24 of the 56 indices (42.9%) were selected by more than two-thirds of respondents. Of the top ten, five are cage-based and four husbandry-based, reflecting the large group sizes and number of animals to be monitored in breeding units, as well as the relative lack of need for scientific procedures and for welfare data for a scientific purpose. The inclusion of indicators such as social density, vertical space, environmental complexity, cage furniture, and visual barriers probably reflects the more spacious environments often afforded to breeding animals.
That some indices are contextualized differently across specializations is not surprising. For example, those who require chair restraint for handling of monkeys to do their research might find frequency of such restraint to be a more useful indicator than those who do not. This brings to light the difficulty in assessing animal welfare and the complexities of using indicators: they must be well defined, validated for the species for which they are applied, and need to be not only practical to measure but also reliable across time and raters. Agreement on which of these is most important is coloured by culture and work experience as well as discipline perspective (Duijvesteijn et al., 2014), and when working as a group, the results of the process will also be subject to the composition of the panel. In the current study, the majority of respondents working with macaques did so in neuroscience, and most respondents were primarily from the UK and EU, whereas those in the Truelove et al. (2020) study were primarily from the USA. Differences exist between these regions in how laboratory NHPs are housed and managed, partly due to the oversight regulations. For example, minimum cage space requirements for NHPs are considerably smaller in the USA than in the UK and EU member countries. Under Directive 2010/63/EU (European Union, 2010), the minimum volume for macaques from three years of age is 1.8 m³/64 ft³ per animal, reflecting the value placed on providing housing that allows for exercise and the expression of ethologically relevant behaviours, such as running, climbing, leaping and hiding from companions (NC3Rs, 2017). Under the ILAR Guide (National Research Council, 2011), the minimum volume per macaque up to 10 kg is 0.25 m³/9 ft³, and this space allocation was not increased in the 2011 revision. One possible reason for this disparity in minimum cage space is that UK and EU NHP facilities tend to house considerably fewer animals than those in the USA.
Irrespective of these regional differences, our two Delphi consultations have been able to identify critically important indices for macaque welfare.
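The minimum cage volumes quoted in the Discussion are given in both metric and imperial units, and the size of the regional gap is easier to appreciate as a ratio. The following Python sketch converts the two metric figures from the text and compares them; the function and variable names are hypothetical.

```python
FEET_PER_METRE = 1 / 0.3048  # the international foot is defined as 0.3048 m


def cubic_metres_to_cubic_feet(volume_m3: float) -> float:
    """Convert a volume from cubic metres to cubic feet."""
    return volume_m3 * FEET_PER_METRE ** 3


eu_min_volume_m3 = 1.8     # Directive 2010/63/EU, macaques from three years of age
ilar_min_volume_m3 = 0.25  # ILAR Guide, macaques up to 10 kg

eu_ft3 = round(cubic_metres_to_cubic_feet(eu_min_volume_m3))      # about 64 ft³
ilar_ft3 = round(cubic_metres_to_cubic_feet(ilar_min_volume_m3))  # about 9 ft³

# How many times larger the EU minimum is than the ILAR minimum.
volume_ratio = round(eu_min_volume_m3 / ilar_min_volume_m3, 1)
```

On these figures the EU minimum is roughly seven times the ILAR minimum per animal.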

Conclusions
We have identified context appropriate indicators that are valid, reliable, and practicable for assessing the welfare of macaques and marmosets bred and used for research, including in toxicology, neuroscience, and infectious disease. These could benefit far in excess of 100,000 NHPs used globally per year by improving welfare assessment, minimisation of harm and evaluation of the impact of refinement techniques. In ranking potential welfare indicators, we have identified those indicators considered the most important by experts and narrowed the field for further investigation and validation of both species-specific and general indicators. We have used the top-ranking indicators for macaques identified by experts in our two Delphi consultations, and agreement on how these should be measured, to develop a practical and generalised welfare assessment protocol to support laboratory staff in monitoring and optimising macaque health and well-being (GEN-MAC). There were too few marmoset data to generate a welfare assessment protocol for this species, and we recommend further data are collected. Our work to harmonise welfare indicators and assessment should facilitate inter-lab comparative studies, data-sharing to boost sample sizes in research asking welfare-focused questions, and benchmarking of welfare standards between facilities. It would be good to build upon this momentum and achieve further consensus and harmonisation globally, involving Pacific Rim countries in addition to North America and Europe. Further validation of the prototype protocol, and of the top-ranking welfare indicators, would also be welcome. Funding opportunities are available from the NC3Rs, other bioscience organisations, and animal welfare charities.
This project contains the following extended data: • GEN-MAC protocol.xlsx (Generalised macaque welfare assessment protocol aims to offer a practical and context appropriate tool for laboratory staff caring for macaques. It provides a quantitative set of criteria to support staff in monitoring and maximising macaque health and well-being, based on expert consensus.) Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Open Peer Review
Are the conclusions drawn adequately supported by the results? Yes
In the section "Results - Demographics" we have 67 colleagues working with macaques, and 6 working with marmosets. It appears to me greatly unbalanced. I would then delete the responses from the marmosets' people, for later wider participation. Table 6 reports the results for marmosets, however it is based on a very limited sample size. For example, just one person replied in the area of neuroscience - I would suggest to delete it.
In the "Conclusions" section, there is no mention of the results on marmosets.
Therefore, I recommend the approval of the present paper, but I consider it a "macaque paper", so: i) if the authors delete any reference to the marmoset survey; ii) or they play down very much that part, perhaps mentioning the results in the final discussion, underlining the necessity and need to compare different species, looking at their preliminary data on marmosets.
Are the 3Rs implications of the work described accurately? Yes

Are a suitable application and appropriate end-users identified? Yes
Is the work clearly and accurately presented and does it cite the current literature? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: animal behaviour, animal welfare, primatology, ethics of research I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.