Measuring bias, burden and conservatism in research funding

Background: Grant funding allocation is a complex process that in most cases relies on peer review. A recent study identified a number of challenges associated with the use of peer review in the evaluation of grant proposals. Three important issues identified were bias, burden, and conservatism, and the work concluded that further experimentation and measurement is needed to assess the performance of funding processes.
Methods: We have conducted a review of international practice in the evaluation and improvement of grant funding processes in relation to bias, burden and conservatism, based on a rapid evidence assessment and interviews with research funding agencies.
Results: The evidence gathered suggests that efforts so far to measure these characteristics systematically by funders have been limited. However, there are some examples of measures and approaches which could be developed and more widely applied.
Conclusions: The majority of the literature focuses primarily on the application and assessment process, whereas burden, bias and conservatism can emerge as challenges at many wider stages in the development and implementation of a grant funding scheme. In response to this we set out a wider conceptualisation of the ways in which this could emerge across the funding process.


Introduction
Peer review is a core part of the academic research process, with the majority of academic research funding allocated through peer review. A recent review (Guthrie et al., 2018) explored some of the potential limitations and outcomes of the use of peer review to allocate research funding. A key finding of that work was that there is a need for further experimentation and evaluation in relation to peer review and the grant award process more widely. However, measuring the performance of funding allocation processes can be challenging, and there is a need to better share learning and approaches. This paper aims to address this gap by providing a review of existing practice in the measurement of research funding processes in relation to three of the main challenges identified by Guthrie et al.: bias, burden and conservatism. The intention of this work is to provide a review of ideas and approaches that funders can use to better analyse their own funding processes and to help facilitate a more open and analytical review of funding systems. Through our interviews with funders, we also explored current practice internationally in attempting to reduce burden and bias and to facilitate innovation and creativity in research.

Methods
We undertook a Rapid Evidence Assessment (REA) that built on previous work, such as that by Guthrie et al. (2018), encompassing methods for evaluating programs, the challenges faced in evaluation, issues associated with research evaluation and the importance of responsible metrics. We focused specifically on metrics and measurement approaches that address bias, burden and conservatism. We restricted our search to literature in English from the 10 years between 2008 and 2018 to ensure we focused on the latest developments in the field and current best practice. We covered academic literature in Scopus as well as grey literature, e.g. policy reports and studies, government documents and news articles.

Search strategy
We identified relevant literature through three routes:
1. Academic literature search: Scopus search using the search terms in Table 1 for publications from 2008 onwards. To identify literature that focused on bias, burden and conservatism, we operationalised these search strings as follows: [Group 1 AND Group 2 AND (Group 3 OR Group 4 OR Group 5 OR Group 6 OR Group 7)].
2. Grey literature search: search on the websites of the funding bodies considered in this study (Table 2).
3. Snowballing: Snowballing refers to the continuous, recursive process of gathering and searching for references from within the bibliographies of the shortlisted articles. We performed snowballing from the reference lists of publications identified following screening.

Table 1. Search terms for the rapid evidence assessment.

Organisation | Country
European Research Council (ERC) | International

Screening strategy
Using the search strings described above, the Scopus database yielded 1,741 results. We performed an initial test of our strategy by checking that specific key papers we were already aware of appeared in the results, for example Guthrie et al. (2018). Once satisfied the search strategy was performing effectively, we implemented a filtering process to determine the inclusion or exclusion of articles based on their relevance to the primary objectives of this task, as set out in Figure 1.
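The Boolean structure of the search string described above can be illustrated with a short sketch. The terms inside each group are hypothetical placeholders, since the actual terms are those listed in Table 1, and the helper names `or_block` and `build_query` are introduced here purely for illustration.

```python
# Hypothetical placeholder terms: the real search terms are those in Table 1.
GROUPS = {
    1: ["grant", "research funding"],
    2: ["peer review"],
    3: ["bias"],
    4: ["burden"],
    5: ["conservatism"],
    6: ["innovation"],
    7: ["creativity"],
}


def or_block(terms):
    """Join the terms of one group with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups):
    """Combine groups as: Group 1 AND Group 2 AND (Group 3 OR ... OR Group 7)."""
    required = [or_block(groups[i]) for i in (1, 2)]
    optional = " OR ".join(or_block(groups[i]) for i in range(3, 8))
    return " AND ".join(required + [f"({optional})"])


query = build_query(GROUPS)
print(query)
```

Keeping the groups in a dictionary makes it straightforward to rerun the same structure with a different set of terms when testing whether known key papers appear in the results.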

Data extraction
Data extraction was performed using a data extraction template, a pre-determined coding framework based on the study aims (i.e. bias, burden, and conservatism). The headers of this template against which data was extracted for each article (where available) were:
• Researcher extracting data (initials)
• Source (author, date)
• Title
• Year of publication
• URL
• Type of document (journal article, review, grey report, book, policy, working paper, etc)
• Objectives (aims of the work)
• Area of peer review (journals, grants, other)
• Evaluation framework or model (to evaluate funding program)
• Evidence on and measures of burden (on researchers, institutions, funding bodies)
• Evidence on and measures of bias (gender, career stage, research type, institution)
• Evidence on and measures of innovation
• Datasets and collection (any datasets used for evaluation purposes or information on data collection)
• Metrics and indicators (any specified metrics used for evaluation)
• Challenges (any identified challenges for evaluations)
• Strengths and weaknesses (of the study)
• Quality of study design and conduct (if appropriate assign red, amber, or green)
• Strength and generalisability of the findings (assign red, amber, or green)

Three researchers performed the full extraction of 100 articles in parallel. During this process, each researcher was instructed to add key references to a 'snowballing database'. The snowballing database was populated with 15 articles, which were passed through the filtering processes described above, yielding an additional eight papers that were fully extracted. We also considered additional articles using a combination of targeted web searches and suggestions from our key informant interviews. These methods yielded an additional 18 articles that were included in our REA.

Key informant interviews
We conducted key informant interviews with one representative from each research funding organisation in Table 2 in order to understand how evaluation methods are employed in practice and to explore evaluation approaches that may not be documented in the literature. We identified respondents with relevant expertise at key biomedical and health research funders internationally and contacted them by email to request their participation. We focused on developed research systems that may be comparable with the Australian system, primarily in Europe and North America. We also interviewed researchers working on the analysis of peer review and grant funding approaches and their challenges. Initially 12 individuals were contacted. Of those, 6 agreed to participate; 5 did not respond to our request; 1 declined to participate; 2 further identified colleagues to participate in their place, who were contacted by email and in both cases accepted.
Interviews were conducted by telephone and lasted for approximately one hour. Interviews were recorded and field notes taken. One interview was conducted per participant. Interviews were conducted following a semi-structured protocol (see Table 3) to enable consistent evidence collection while providing the opportunity to explore emerging issues. As the interviews were designed to be semi-structured, we encouraged the interviewees to explore areas they thought were important that may not have been directly covered in our interview protocol.
To protect the anonymity of the interviewees, the analysis that we report does not make any specific reference to individuals; we use the interview identifiers INT01, INT02, etc. to make references to specific interviews in our analysis.

Data analysis
The analysis took a framework analytic approach, aiming to capture information on processes and metrics used in practice across organisations in relation to the aim of this study: to identify how bias, burden and innovation in the funding process can be measured. Data from each interview was coded into an Excel template by each individual conducting interviews (GM, DRR), with one row per interview (and hence organisation). The column headers were as follows:

Analysis was primarily focused on capturing information on practice at each of these organisations to provide a picture of the methods currently being used by research funding organisations to measure and to alleviate burden, bias and conservatism in peer review-based funding processes. However, we also reviewed evidence on challenges, strengths and weaknesses to identify any information to inform our wider analysis and discussion.

Ethical approval
This study was recommended for exemption by the RAND Human Subjects Protection Committee. Participant consent was obtained orally at the start of each interview. The precise detail of consent sought is set out in the interview protocol (Table 3).

Table 3. Interview protocol.

Section: Introduction and consent
Protocol content/questions:
Thank you for agreeing to participate in our study. On behalf of the National Health and Medical Research Council of Australia we are developing an evaluation framework for their new grant program. As part of our work, we are conducting key informant interviews to help build an understanding of the program evaluation landscape.
The project will be written up as a public report. Do you have any questions about the project? With your permission I would like to record this interview, but the recordings, any notes and transcripts will be kept strictly confidential and never be made available to any third party, including the National Health and Medical Research Council.
Any quotes included in RAND Europe's final report will not be explicitly or directly attributed to you without your permission. Should we wish to use a quote which we believe that a reader would reasonably attribute to you or your organisation, a member of the RAND Europe project team will contact you to inform you of the quote we wish to use and obtain your separate consent for doing so. All records will be kept in line with the General Data Protection Regulation (GDPR) 2018. Further information about RAND Europe's data security practices can be provided upon request.
To keep all processes in line with the GDPR 2018, we would like to ask you to confirm a few data protection statements:
1. Do you agree that the interview can be recorded by RAND Europe and that these recordings can then be transcribed for the purpose of providing an accurate record of the interviews?

Gender bias has been the primary area of study within applicant characteristics, perhaps having gained significant visibility in an early study that showed that females needed to be 2.5-fold more productive to achieve the same scores as males in the Swedish Medical Research Council's peer review process (Wennerås & Wold, 1997).
Following this initial study, gender bias has been explored in several different countries. In The Netherlands, researchers funded by The Netherlands Organisation for Scientific Research (NWO) examined 2,823 applications between 2010 and 2012 from early career scientists, analysed gender as a statistical predictor of funding rate, and examined the success rate throughout the process (application, pre-selection, interview, award) (Van Der Lee et al., 2015). The authors found that there was a gender disparity, with males receiving higher scores in 'quality of researcher' evaluations but not in 'quality of proposal' evaluations, particularly in disciplines with equal gender distribution among applicants.
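As an illustration of what examining the success rate throughout the process can involve, the sketch below computes stage-by-stage progression rates for two applicant groups. The counts are invented for demonstration and are not the NWO data; the function names are likewise hypothetical, and a real analysis would add a statistical model rather than compare raw rates.

```python
STAGES = ["application", "pre-selection", "interview", "award"]

# Hypothetical numbers of applicants remaining at each stage (NOT the NWO data).
counts = {
    "female": [1000, 600, 300, 150],
    "male": [1500, 950, 500, 270],
}


def cumulative_rates(remaining):
    """Share of the original applicants still in the process at each stage."""
    start = remaining[0]
    return [n / start for n in remaining]


def conversion_rates(remaining):
    """Share of applicants at one stage who progress to the next stage."""
    return [nxt / cur for cur, nxt in zip(remaining, remaining[1:])]


for group, remaining in counts.items():
    overall = cumulative_rates(remaining)[-1]
    print(f"{group}: overall success rate {overall:.1%}")
    for stage, rate in zip(STAGES[1:], conversion_rates(remaining)):
        print(f"  progressed to {stage}: {rate:.1%}")
```

Comparing the per-stage conversion rates, rather than only the overall rate, helps to locate the stage at which any disparity between groups arises.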
Another study in the US looked at bias in the Research Project (R01) grants from the National Institutes of Health (NIH), and found this grant program exhibited gender bias in Type 2 renewal applications (Kaatz et al., 2016). The authors analysed 739 critiques of both funded and unfunded applications, using text analysis and regression models. The study found that reviewers gave worse scores to female applicants even though the critiques of their applications more often contained standout adjectives. A second piece of work from the same authors employed more state-of-the-art text mining algorithms to discover linguistic patterns in the critiques (Malikireddy et al., 2017). The algorithms showed that male investigators were described in terms of leadership and personal achievement while females were described in terms of their working environments and 'expertise', potentially suggesting an implicit bias where reviewers more easily view males as scientific leaders, which is a criterion of several grant funding programs.
In a longitudinal study, researchers followed the careers of an elite cohort of PhDs who started postdoctoral fellowships between 1992 and 1994 (Levitt, 2010). The study found that 16 years after the fellowships, although 9 per cent of males had stopped working in a scientific field, compared with 28 per cent of females, there was no significant difference in the fractions obtaining associate or full professorships. However, females whose mentors had an h-index in the top quartile were almost three times more likely to receive grant funding, whereas males' success had no such correlation with their mentors' publication record.
In a Canadian Institutes of Health Research (CIHR) funded study, researchers evaluated all grant applications submitted to CIHR in the years 2012-2014 (Tamblyn et al., 2018). Descriptive statistics were used to summarise grant applications, along with applicant and reviewer characteristics. The dataset was then interrogated with a range of statistical approaches (2-tailed F-test, Wald χ² test), which showed that higher scores were associated with having previously obtained funding and with the applicant's h-index, and lower scores with applicants who were female or working in the applied sciences.

Exploration in relation to racial bias has also been performed, though there is a smaller body of work than on gender bias. In 2011 researchers funded by the NIH showed that black applicants were ten percentage points less likely to obtain R01 funding than their white peers, after extensively controlling for external factors (educational background, country of origin, training, previous research awards, publication record and employer characteristics) (Ginther et al., 2011). A funding gap between white/mixed-race applicants and minority applicants has been a persistent feature of NIH grant funding between 1985 and 2013 (Check Hayden, 2015). According to a preprint article from mid-2018, racial bias in the NIH system may have diminished (Forscher et al., 2018). The researchers report on an experiment where 48 NIH R01 proposals were modified to contain white male, white female, black male and black female names before being sent for review by 412 scientists. The authors found no evidence, at the level of 'pragmatic importance', of white male names receiving better evaluations than any other group; however, they note there may be bias present at other stages of the granting process.

Table 4. Summary of approaches taken to measure bias in grant funding programs.

Measurement approach | Area of potential bias investigated
Statistical evaluation of funding data | • Gender
Career stage. Career stage is another potential source of bias in the peer review process. There are other approaches to defining career stage, for example focusing on necessary competences rather than time elapsed. The European Framework for Research Careers has four stages (first stage researcher, recognised researcher, established researcher and leading researcher) and provides a classification system that is independent of career path or sector (EC-DGRI, 2011).

Research field.
There may be biases between research fields and also against research that falls between, or combines, those fields. While interdisciplinary research is often considered fertile ground for innovation, there is a belief among researchers that interdisciplinary proposals are less likely to receive funding (Bromham et al., 2016). Defining and identifying interdisciplinary research is a challenge that has hindered the evaluation of this potentially damaging belief. A recent study sought to address this challenge by developing a biodiversity metric, the interdisciplinary distance (IDD) metric, to capture the relative representation of different research fields and the distance between them (Bromham et al., 2016). Using data from 18,476 proposals submitted to the Australian Research Council's Discovery Program over a five-year period, the authors found that the greater the degree of interdisciplinarity, the lower the probability of an application being funded.
Institution. There is also some evidence that characteristics of the institution may be a source of bias in the grant application process. For example, a 2016 study of Canada's Natural Sciences and Engineering Research Council (NSERC) Discovery Grant program found that funding success and quantity were consistently lower for applicants from small institutions, and that this finding persisted across all levels of applicant experience as well as three different scoring criteria (Murray et al., 2016). The authors analysed 13,526 proposal review scores, using logistic regression to determine patterns of funding success and developing a forecasting model that was parameterised using the dataset. The authors note that some differences between institutions may be due to differences in merit and differences in research environments; they recommend that more needs to be done to ensure funds are distributed appropriately and without bias.
Reviewers. Reviewers may have overt or implicit biases that can affect their scoring of grant proposals, some of which are noted above. The level of expertise that reviewers have relating to an application can affect their evaluations, with studies finding both advantageous and disadvantageous effects. Li examined this issue by constructing and analysing a dataset of almost 100,000 applications evaluated in over 2,000 meetings (Li, 2017). The study found an applicant was 2.2 per cent more likely to receive funding, the equivalent of one-quarter of the standard deviation, if evaluated by an intellectually closer reviewer, as measured by the number of permanent reviewers who had cited the applicant's work in the five years prior to the meeting. Conversely, another study found that reviewers with more expertise in an applicant's field, as measured by a self-assessment of their level of expertise relating to an application, were harsher in their evaluations (Gallo et al., 2016).

The study found inter-rater reliability to be 83 per cent, which is comparable to the previous studies. The authors suggest that the slight reduction in disagreement may be due to the nature of early career applications or differences in the scoring and assessment criteria.

Strategies for reducing bias
As the research community has gained an increasing awareness of bias, steps have been taken to develop fairer processes and programs. There is some emerging evidence that training can reduce bias and increase the inter-rater reliability of reviewers. The CIHR introduced a reviewer training program following the discovery that its new grant system focusing on applicants' track records was disadvantaging women, while a program focusing on the research proposal was not. In the grant cycle following the introduction of a training module on unconscious biases, female and male scientists had equal success rates (Guglielmi, 2018). Additionally, an online training video was found to increase the inter-rater reliability for both novice and experienced NIH reviewers, with correlation scores rising from 0.61 to 0.89 following training (Sattler et al., 2015).
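To make the idea of a correlation-based inter-rater reliability score concrete, the sketch below computes a Pearson correlation between the scores that two reviewers give to the same set of proposals. This is an assumption made for illustration: the text above does not state which coefficient was used in the cited study, and the reviewer scores shown are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a reviewer gives identical scores")
    return cov / (spread_x * spread_y)


# Hypothetical scores from two reviewers for the same six proposals.
reviewer_a = [3, 5, 7, 4, 8, 6]
reviewer_b = [4, 5, 6, 3, 9, 6]
print(f"inter-rater correlation: {pearson(reviewer_a, reviewer_b):.2f}")
```

Computing the same statistic before and after a training intervention, on comparable sets of proposals, gives a simple way to track whether reviewer agreement has changed.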
Blinding the identity of applicants from reviewers has been studied as a mechanism for increasing the fairness of peer review systems. In the context of journal peer review, the journal Behavioural Ecology found that its introduction of double-blind review increased the representation of female authors by 33 per cent, to reach a level that reflects the composition of the life sciences academic workforce (Budden et al., 2008). The US National Science Foundation (NSF) has trialled a blinded application process called 'The Big Pitch', which involves applicants submitting an anonymised two-page research proposal alongside a full conventional proposal (Bhattacharjee, 2012). The NSF reported that there was only 'a weak correlation' between the success outcomes of the full and the brief, anonymous applications.
Burden in the funding process

Measuring burden in the funding process
The burden of the grant application process has been measured for applicants, reviewers, funders and research institutions using a variety of methods. A list of the different approaches used to evaluate burden of the application process can be found in Table 6.

Strategies for reducing burden
In recent years, different strategies have been developed to try to reduce the burden of grant applications. Table 7 provides a summary of approaches used by a range of international funders, both to reduce burden in their funding processes, and to measure the level of burden. These measures could allow researchers to focus on their research, save reviewers time, and potentially reduce the cost of grant review by reducing the labour required to review grant applications (Bolli, 2014).

Online survey by invitation. Researchers were asked if they were the lead researcher on the proposal, how much time (in days) they spent on the proposal, whether the proposal was new or a resubmission, and their salary in order to estimate the cost of proposal preparation. Researchers who had submitted more than one proposal were asked to rank their proposals in the order in which they perceived them to be most deserving of funding. Researchers were also asked about their previous experience with the grant peer-review system as an expert panel member or external peer reviewer. The number of days spent preparing proposals was estimated based on the data collected. The authors also used a logistic regression model to estimate the prevalence ratio of success according to the researchers' experience and time spent on the proposal. The authors also examined potential non-linear associations between time and success, as well as comparing the researchers' ranking of their proposals with their outcome through peer review. This study found an estimated 550 working years of researchers' time was spent preparing the 3,727 proposals submitted for NHMRC funding in 2012, accounting for an estimated AU$66 million per year. The authors also found that more time spent on the proposal did not increase the chance of a successful outcome. A slight yet not statistically significant increase in success rate was associated with experience with the peer-review system.

Restricting resubmission has also been adopted by the European Research Council (ERC) (Council, 2017).
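The headline burden figures reported above (550 working years, 3,727 proposals, AU$66 million) can be broken down into per-proposal quantities with simple arithmetic. The sketch below does this; the number of working days per year is an assumption introduced here for illustration and does not come from the study.

```python
# Figures reported in the text above.
WORKING_YEARS = 550
PROPOSALS = 3727
TOTAL_COST_AUD = 66_000_000

# Assumption for illustration only: working days in one working year.
WORKING_DAYS_PER_YEAR = 230

years_per_proposal = WORKING_YEARS / PROPOSALS
cost_per_proposal = TOTAL_COST_AUD / PROPOSALS
cost_per_working_year = TOTAL_COST_AUD / WORKING_YEARS
days_per_proposal = years_per_proposal * WORKING_DAYS_PER_YEAR

print(f"working years per proposal: {years_per_proposal:.3f}")
print(f"cost per proposal: AU${cost_per_proposal:,.0f}")
print(f"implied cost per working year: AU${cost_per_working_year:,.0f}")
print(f"working days per proposal (assumed {WORKING_DAYS_PER_YEAR} days/year): "
      f"{days_per_proposal:.1f}")
```

On these figures the implied cost is AU$120,000 per working year, or roughly AU$17,700 per proposal; the days-per-proposal value scales directly with the assumed length of the working year.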

Two-stage application process.
The ERC has combined the application limit with a two-stage application that involves awarding project proposals a score during the first stage of the application process (A, B or C). If the project is awarded an A, the proposal will proceed onto the next assessment stage. If the proposal receives a B, the applicant must wait one year before reapplying. If the proposal is graded with a C, the applicant must wait two years before reapplying to any of the ERC-funded programs. This approach has led to a decrease in the number of applications received for evaluation by the ERC (INT04). The NIHR has also adopted a two-stage application process that has led to a decrease in the number of applications sent for peer review and a shorter time between application submission and outcome notification (INT01).
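The first-stage grading rules described above can be written down as a small lookup, which may be useful when modelling how resubmission restrictions affect application volumes. This is a sketch of the rules as summarised in the text, not an official ERC specification; the function and field names are hypothetical, and the text does not say what happens to reapplication after an A grade, so that case is left as `None`.

```python
# Waiting period (in years) before reapplying, by first-stage grade.
# Grade A proceeds to the second stage, so no waiting period is recorded.
WAIT_YEARS = {"A": None, "B": 1, "C": 2}


def first_stage_outcome(grade):
    """Return the consequence of a first-stage grade (A, B or C)."""
    grade = grade.upper()
    if grade not in WAIT_YEARS:
        raise ValueError(f"unknown grade: {grade!r}")
    return {
        "proceeds_to_second_stage": grade == "A",
        "wait_years": WAIT_YEARS[grade],
    }


def earliest_reapplication_year(grade, application_year):
    """Earliest year an applicant graded B or C may reapply, else None."""
    wait = first_stage_outcome(grade)["wait_years"]
    return None if wait is None else application_year + wait


for g in ("A", "B", "C"):
    print(g, first_stage_outcome(g), earliest_reapplication_year(g, 2017))
```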

Multiple calls per year.
Moreover, the NIHR has multiple calls for proposals throughout the year, which reduces the burden not only on reviewers, by decreasing the number of applications to review per round (INT01), but also on applicants, by allowing ongoing grant applications (Herbert et al., 2014).

Grant application length.
In 2012, the NIH reduced the length of most grant applications from 25 pages to 12 pages, with the aim of reducing the administrative burden on both applicants and reviewers, and to focus on the concept and impact of the proposed research rather than the details of the methodological approach (Wadman, 2010). However, a study by Barnett et al. found that shortening the length of an application slightly increased the time spent by applicants preparing the proposal (Barnett et al., 2015).

Funding period.
Extending the funding period to five years (Bolli, 2014) has also been suggested to reduce the burden of grant application.

Another feature of innovative research is its uncertain and potentially controversial nature. While many funding agencies aim to support innovative research, the body of work on peer review suggests it is an inherently conservative process (Langfeldt & Kyvik, 2010).

Measuring innovation and creativity
Since defining and identifying innovative, creative research is challenging, it can be difficult to measure the level of innovation and creativity within a research portfolio, or the extent to which a research funding program fosters innovation. However, there are examples in the literature of potential approaches to measuring innovation, as set out in Table 8.
By definition, new ideas are likely not to be met with consensus. It has been suggested that innovation could be measured through a metric based on lack of agreement between reviewers, measuring controversy as a surrogate for innovation, with new metrics, including variance or negative kurtosis, the degree to which observations occur in the tails of the grading distribution (Kaplan, 2007).
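A disagreement-based metric of this kind can be sketched by computing the variance and kurtosis of the scores that reviewers assign to a single proposal. The example below uses population moments and excess kurtosis (kurtosis minus 3), which is one possible convention and an assumption here; the scores are invented and the function names are hypothetical.

```python
def central_moment(scores, order):
    """Population central moment of the given order."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** order for s in scores) / len(scores)


def variance(scores):
    """Population variance of reviewer scores (spread of opinion)."""
    return central_moment(scores, 2)


def excess_kurtosis(scores):
    """Fourth standardised moment minus 3; negative for flat or polarised scores."""
    m2 = central_moment(scores, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all scores are identical")
    return central_moment(scores, 4) / m2 ** 2 - 3.0


# Hypothetical reviewer scores (1-9 scale) for two proposals.
consensus = [6, 6, 7, 6, 7]
contested = [2, 9, 3, 9, 2]

for name, scores in (("consensus", consensus), ("contested", contested)):
    print(f"{name}: variance={variance(scores):.2f}, "
          f"excess kurtosis={excess_kurtosis(scores):.2f}")
```

With only a handful of reviewers per proposal, both statistics are noisy, so in practice they would more plausibly be used to flag proposals for further discussion than as a score in their own right.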
Productivity is another potential approach to measuring innovation. One study assessed the careers of researchers funded by two distinct mechanisms, investigator-initiated R01 grants from the NIH and the investigator program from the Howard Hughes Medical Institute (HHMI), with the aim of determining whether HHMI-style incentives result in a higher rate of production of valuable ideas (Azoulay et al., 2011). The authors estimated the effect of the program by comparing the outputs of HHMI-funded scientists with those of NIH-funded scientists within the same area of research who had received prestigious early career awards. Using a combination of propensity-score weighting and difference-in-differences estimation strategies, the authors found that HHMI investigators produced more high-impact journal articles than the NIH-funded researchers, and that their research was more prone to changes.
Another study looked at the relation between the knowledge contained in an application proposal and a reviewer's expertise and the outcome of proposals focusing on innovative research and area of expertise (Boudreau et al., 2016). In this study, the authors designed and executed a grant proposal process for research, and randomised how proposals and reviewers were assigned, generating 2,130 evaluator-proposal pairs. The authors found that evaluators give lower scores to research proposals [...]

• Some organisations include innovation as one of the assessment criteria, accounting for a percentage of the overall score of the proposal. In AHA, innovative research gets scored on the following questions:
  • Does the project challenge existing paradigms and present an innovative hypothesis or address a critical barrier to progress in the field?
  • Does the project develop or employ novel concepts, approaches, methodologies, tools or technologies for this area?

Strategies for improving the assessment of innovation and creativity
In recent years, different strategies have been developed to improve the assessment of innovation in grant review. Table 9 provides a summary of approaches used by a range of international funders, both to increase innovation and creativity and to evaluate the level of innovation and creativity across their funding streams.
To ensure innovative research is being funded some agencies, including the NIH, adopt an 'out of order funding' approach (Lindner & Nakamura, 2015). In this approach, a number of applications for innovative research are chosen for funding despite receiving lower scores than other funded research based purely on the peer review process. In the NIH, this strategy has led to approximately 15 per cent of applications selected 'out of order'.
The NIH has also made additional changes to the peer review process in order to increase the emphasis on innovation and decrease the focus on methodological detail (Lindner et al., 2016). These changes included reducing the length of the methodological description (from 25 to 12 pages), with guidance to focus away from routine methodological details towards describing how their application is innovative. Including innovation as a criterion for grant assessment could incentivise researchers to include innovative ideas and new approaches into their proposals (Guthrie et al., 2018).
Many funding agencies have also adopted the strategy of having a separate scheme to fund innovative research, allocating smaller funds with a shorter time frame to these specific streams. The NIH has developed the New Innovator Award, committing $80 million to the award, and two others that specifically encourage innovation, the Pioneer and Transformative R01 Awards (Alberts, 2009). ZonMW have designed an 'off-road' program aimed at high-risk, high-reward projects, providing €100,000 for 1.5 years (INT02). NIHR has also designed different funding tiers to promote funding for innovative projects, providing £150,000 funding for 18 months (INT01). However, this strategy could include longer funding periods to encourage a culture of innovation among young researchers who [...]

Table 9. Approaches to increasing and evaluating innovation and creativity used by a selection of international research funders.

Discussion
Our review of international practice regarding the characterisation and measurement of bias, burden and conservatism (innovation and creativity) in the grant funding process demonstrated that efforts by funders so far to measure these characteristics systematically have been limited. However, in each area there were examples of existing practice we can draw upon, as summarised in Table 10.
It is also worth noting the challenges in defining each of these elements, partly reflecting the diversity within each of these areas. In terms of bias, we note that biases can emerge in a range of areas, with five main areas highlighted in the literature: applicant characteristics (e.g. gender, ethnicity); career stage; research field; institution; and reviewer characteristics. Burden can be characterised in terms of where the burden is experienced: by applicants, reviewers, the funding agency and institutions. Efforts to address burden and ways of measuring their effectiveness may differ across these groups. A key challenge in measuring innovation is providing a definition of innovative or creative research that can be operationalised. Often funders do this based on expert judgement, but this is challenging to use for portfolio assessment and analysis.
Finally, a key limitation of the work is that since this is a review of the existing literature and practice, we are constrained by what has so far been reported, which in some areas is fairly limited. In particular, the majority of the literature focuses on the application and peer review process, which only forms a part of the overall funding scheme that starts from the initial establishment of the structure of the funding scheme through to the monitoring and evaluation of ongoing and completed funding awards. We set out in Table 11 a wider conceptualisation of some of the ways in which challenges could theoretically emerge in relation to funding schemes at different stages throughout this process. This is intended to illustrate the potential breadth of scope for this work beyond the literature: as such it is neither exhaustive nor driven by existing evidence of those challenges or opportunities emerging in practice. Rather it acts as an aid to thinking through the full process of the development and implementation of funding schemes. We suggest that further research and evaluation efforts are needed to more fully conceptualise and measure effectively the concepts of bias, burden and innovation in research across the full scope of the research funding process.

The author talked about detailed F tests and χ² tests without referring to the nature of the underlying multivariate analysis, which seems strange as these are only tests of statistical significance and not the actual underlying process used to estimate the association. It would be helpful to have a statistical reviewer as part of the review team for these articles to assist in the interpretation of the study results and the respective statistics.
Are all the source data underlying the results available to ensure full reproducibility? Yes.
Are the conclusions drawn adequately supported by the results? Partly… Just in terms of commentary. The Results section was very informative and the tables summarizing different approaches were also very useful. It was difficult to understand the choice of funders to review and it was very surprising to see the NIH excluded from the interviews, although the research that has been done on NIH peer review is included in the results section. It is also strange to make the comment that findings from one funder are not generalizable to another. The literature shows that there are trends that cut across funding agencies, countries and even between manuscript and grant review, and I think that this statement is not at all supported by the results that are presented.
Finally, the section on innovation was very important and will be an addition to the literature. The results that were presented do not support the statement that funding agencies are not really doing much work in this area, although I think that is likely true. Basically, the author outlined what the funding agencies were doing and it was left to the reader to interpret that some of these are very generic approaches that really do not address the key issues such as burden or bias.

If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Adrian G. Barnett
School of Public Health, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Queensland, Australia

This is important research given the impact of research funding on what science is funded and the flow-on benefits of research to the public. The authors examined the recent literature, but very usefully they also interviewed funders to find out what is happening in practice. I think this review could be very useful for funders and for researchers working in meta-research.
My main issue was that important detail was lacking in a number of places concerning the strategies and changes used by funders. This is a list of places where I wanted more detail, which are mostly based on text in the tables:
○ Table 5, the German Research Foundation have 300 equality measures. Is this a typo? That many measures seems to be far too many to be useful, and the funders could just end up drowning in information.
○ Table 5, the German Research Foundation evaluate bias using "Progression of careers" and "Institutional bias" but some idea of how they do this would be useful. Do they only examine gender bias? Do they take any action if they detect bias?
○ Table 5, NIHR, do they provide more money for the higher tiers?
○ Table 5, NIHR, they evaluate bias using "Diversity of applications", but in terms of what? Gender and race?
○ Table 5, ARC, does the ROPE statement have any impact? Especially as some Australian researchers have said that ROPE is "largely perceived as a tokenistic gesture put on forms and never taken into account by the people who make decisions and evaluate work". 1
○ Table 5, ARC, what does "ongoing access to all research projects" mean? Access for who? The ARC or the public?
○ Table 5, ARC, they look at the discrepancy in scores, but what action do they take when they find a discrepancy?
○ Table 5, ARC, is international benchmarking sensible given that funders in other countries could also be struggling with biases in terms of gender or race? Wouldn't it be better to set targets/benchmarks based on consultation with the research community?
○ Table 9, ARC, how does a continuous funding round increase innovation or creativity? I can see how this could increase diversity by being more accommodating of researchers with family commitments.
○ Table 9, ERC, "External program evaluation" needs more detail.
○ Table 10, more detail is needed for "Longitudinal", does this mean following those who did and did not win funding?
○ Page 8, more cautious language may be needed when using the Levitt study given that it was an exploratory study that does not mention a protocol. In particular I am concerned about the use of quartiles as a threshold, as it's not clear if other thresholds were tried (e.g., tertiles, quintiles).
○ Page 10, in terms of the Clarke et al study (for which I was an author), we actually concluded that the inter-rater agreement appeared higher than in previous studies, so I would not say 'comparable'. I think the most likely reason for this greater agreement was that our study concerned people-funding, where the main idea is to rank past performance, whereas all other previous studies had examined project-funding, where the main task is to predict future performance, which is far harder.
○ Table 6, order the rows by date or first author name?
○ Table 6, Schroter et al 2010 row, the first paragraph in the "Methodology" column ends abruptly.
○ Table 6, Barnett et al 2015 study (for which I am an author), the streamlined funding did not lead to an increase in success rates because all the rounds were streamlined. The increase in success rates over time occurred because the scheme had far fewer proposals over time. Many of the proposals in the initial rounds were ineligible because it was a new funding scheme and a lot of researchers simply "had a go".
○ Page 16, first column, first paragraph, the NIH "two-chances" policy has since been relaxed, see work by Kaiser. 3
○ Page 16, paragraph "Grant application length", it's fairer to say that a shorter form "was associated with" an increased application time, as this was a non-randomised study.
○ Page 17, some funding schemes have used a "wildcard" option for panels, where each panel member was allowed to save one proposal from a culling stage. I cannot find the reference for this, but I can find where we have discussed this idea. 4
○ There was little on the transparency of funders. If funders opened up the anonymised application data for scrutiny, then this could be a great way of assessing innovation and diversity.