Keywords
Peer Review, Linguistic Characteristics, MSCA program
All views expressed in this study are strictly those of the authors and may in no circumstances be regarded as an official position of the European Research Executive Agency or the European Commission.
The process of evaluating research grant proposals has attracted considerable attention in the past decade. With the increasing amount of funding for research, there is a constant need to improve the procedures for allocating funding to the most promising project proposals. Recent scoping reviews on peer review for research funding recommend, among other propositions, identifying interventions that consistently resolve peer review issues in proposal evaluation (Recio-Saucedo et al., 2022; Shepherd et al., 2018). Studies on grant peer review have mostly focused on the criteria used by expert reviewers when assessing proposals (Abdoul et al., 2012; van Arensbergen and van den Besselaar, 2012; Hug and Aeschbach, 2020). Other studies have investigated the linguistic content of review reports (van den Besselaar et al., 2018; Hren et al., 2022). However, to the best of our knowledge, there is no evidence on how the requirement to score a grant proposal numerically, i.e. to attribute a numerical/quantitative score to a proposal, affects the way a reviewer comments on or expresses opinions about that proposal.
The evaluation of grant proposals submitted to EU research programs, the so-called Framework Programmes for research and innovation, usually consists of two consecutive steps: each proposal goes through (1) an individual evaluation by (typically three) different expert reviewers, and (2) a consensus phase, in which those reviewers agree on the final evaluation of the proposal. In both steps, the evaluation normally focuses on three criteria: a) research Excellence, b) Impact, and c) Implementation, for which comments must be given separately. Each criterion is given a score, and these scores determine the total score of the proposal. The result of the consensus stage is an evaluation summary report (ESR), consisting of the consolidated opinions of the group of expert reviewers. Previous studies have established this approach as a stable procedure in the evaluation of research grant proposals (Pina et al., 2021; Buljan et al., 2021).
In the previous Framework Programme, Horizon 2020 (H2020), some of the grant schemes underwent changes in their scoring process. This was the case for the Marie Skłodowska-Curie Actions (MSCA), the flagship funding program dedicated to promoting researchers’ mobility and career development at all stages of their careers. In the past, expert reviewers were asked to provide comments and numerical scores for each of the three evaluation criteria, both in their individual evaluations (the so-called Individual Evaluation Report, IER) and then at the consensus level, resulting in the final score of the evaluation summary report (ESR). For some MSCA funding schemes, this approach was discontinued, and numerical scores were no longer attributed at the level of individual evaluations (IER). Only textual comments were required at the IER stage, and numerical scores were used only at the consensus stage for the ESR.
A recent study indicated that proposal weaknesses have a greater effect on the ranking than proposal strengths (Hren et al., 2022). Based on this finding, the ranking of proposals would depend greatly on the reviewers’ ability to identify and describe the weaknesses. When there is a large number of proposals, qualitative methods of analysis can be inefficient. Quantitative analyses of the text, i.e. tools that assess quantitative characteristics of the text, can therefore be a solution, as they have been shown to be relevant for proposal evaluation (Luo et al., 2022).
The objective of this study was to compare the linguistic characteristics of the comments related to the Excellence, Impact, and Implementation criteria in the evaluation reports of MSCA Innovative Training Networks (ITN) proposals submitted in 2019 and 2020 under H2020, in order to assess whether the removal of numerical scoring affected the structure of IER textual comments and whether this change was associated with the evaluation outcome at the consensus stage, i.e. the ESR. We chose the ITN granting scheme because, with around 1,500 annual submissions and a success rate below 10%, it is among the most oversubscribed and competitive schemes of the whole framework programme.
This study was preregistered on Open Science Framework: https://osf.io/t84ba.
We worked on anonymized datasets, without insight into the actual content of the proposals or the names of the applicants or expert evaluators, so that the regulations on personal data protection were not applicable.
The data analyzed consisted of the IERs and the ESRs of all ITN proposals evaluated in the 2019 and 2020 calls. Each report includes textual comments referring to the different evaluation criteria. IER scores were only available for 2019. The anonymized quantitative data used for the analysis in this article are available on the Open Science Framework: https://osf.io/6bpvu/?view_only=.
Linguistic characteristics of experts’ comments were assessed using the Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2015a, 2015b), a program that counts words related to different psychological states and phenomena and gives, for each category, a score representing the proportion of that category in the entire text.
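Although LIWC itself is proprietary, the dictionary-based scoring it relies on can be illustrated with a short R sketch; the word list and function below are toy examples for illustration, not LIWC's actual dictionaries or code.

```r
# Toy illustration of dictionary-based scoring: a category score is the
# percentage of words in a text that belong to the category's word list.
score_category <- function(text, dictionary) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  100 * sum(words %in% dictionary) / length(words)
}

# Hypothetical mini-dictionary for a "positive emotion" category
pos_emotion <- c("excellent", "strong", "innovative", "convincing", "clear")

comment <- "The training programme is excellent and the consortium is strong."
score_category(comment, pos_emotion)  # share of positive-emotion words (%)
```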
We collected data on the proposal status after evaluation (“Main list”, “Reserve list” or “Rejected”), the call in which the proposal was submitted (2019 or 2020), the research area, the total evaluation scores, and the numerical scores for the Excellence, Implementation and Impact criteria, together with the corresponding comments, which separately described proposal strengths and weaknesses. IERs and ESRs were analysed separately.
For evaluation purposes, proposals are categorized into eight panels: Economics (ECO), Social sciences (SOC), Mathematics (MAT), Physics (PHY), Chemistry (CHE), Engineering (ENG), Environmental sciences (ENV) and Life sciences (LIF). For this study, we clustered MAT, PHY, CHE, ENG and ENV into a single research domain, PHYENG, and ECO and SOC into ECOSOC, so that the three research domains in this study were PHYENG, ECOSOC and LIF.
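As a minimal sketch of this clustering step (assuming a data frame with one row per report and an illustrative `panel` column; these names are not taken from the actual dataset), the recoding could be done in R as follows:

```r
library(dplyr)

# Hypothetical data frame with one row per evaluation report; the column
# name `panel` is illustrative, not from the actual dataset.
reports <- tibble(panel = c("PHY", "ENG", "SOC", "LIF", "CHE"))

# Collapse the eight evaluation panels into the three research domains
# used in this study: PHYENG, ECOSOC and LIF.
reports <- reports %>%
  mutate(domain = case_when(
    panel %in% c("MAT", "PHY", "CHE", "ENG", "ENV") ~ "PHYENG",
    panel %in% c("ECO", "SOC")                      ~ "ECOSOC",
    panel == "LIF"                                  ~ "LIF"
  ))
```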
LIWC variables were calculated separately for strengths and weaknesses for each of the criteria assessed. They included the word count and the text tone of the evaluation report (Kaatz et al., 2015; Kacewicz et al., 2014; Pennebaker et al., 2015a):
a) Analytical tone: higher scores indicate a logical and hierarchical style of writing;
b) Clout tone: higher scores indicate confidence or leadership, while lower scores indicate insecure writing;
c) Authenticity: higher scores indicate honest and humble writing, with views expressed as personal opinions;
d) Emotional tone: higher scores indicate a higher proportion of words related to a positive emotional tone.
To eliminate potential sampling bias, we collected data for the whole cohort of submitted MSCA ITN proposals in 2019 and 2020.
Descriptive data are presented as frequencies and percentages for proposal status and research domain. Text characteristics are presented as means and standard deviations, or as means and 95% confidence intervals in the figures. We first compared the two calls on all variables using t-tests or chi-squared tests, depending on the nature of the variables. A P value below 0.001 was considered significant for the t-tests, and variables that were not significant were excluded from further analysis. We then used logistic regression to compare the two call years, with the proposal variables (proposal status, word count for research excellence weaknesses, word count for implementation strengths, and negative affect levels for implementation strengths) as predictors and the year of the call as the criterion. The level of significance was set to 0.05. The analysis was done using the R statistical program (R Core Team, 2021) and the jamovi statistical software (The jamovi project, 2023).
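The analysis described above could be sketched in R roughly as follows; the data frame `dat` and its column names are hypothetical stand-ins rather than the study's actual variables, and synthetic data are generated only to make the sketch runnable.

```r
set.seed(1)

# Synthetic stand-in data: one row per evaluation report (illustrative only).
dat <- data.frame(
  year   = factor(sample(c(2019, 2020), 200, replace = TRUE)),
  status = sample(c("Rejected", "Main list", "Reserve list"), 200,
                  replace = TRUE, prob = c(0.88, 0.09, 0.03)),
  wc_excellence_weak        = rpois(200, 150),  # word count, excellence weaknesses
  wc_implementation_str     = rpois(200, 120),  # word count, implementation strengths
  negemo_implementation_str = runif(200, 0, 3)  # negative affect, implementation strengths
)

# Step 1: screen variables with t-tests (continuous) and chi-squared tests
# (categorical), retaining only those significant at the chosen threshold.
t.test(wc_excellence_weak ~ year, data = dat)
chisq.test(table(dat$status, dat$year))

# Step 2: logistic regression with the retained report characteristics as
# predictors of the call year; McFadden's R2 summarizes explained variance.
fit  <- glm(year ~ status + wc_excellence_weak + wc_implementation_str +
              negemo_implementation_str, family = binomial, data = dat)
null <- glm(year ~ 1, family = binomial, data = dat)
mcfadden_r2 <- 1 - as.numeric(logLik(fit) / logLik(null))
mcfadden_r2
```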
The number of evaluated proposals was similar in 2019 (n=1554) and 2020 (n=1503). There were 1367 (87.9%) rejected proposals in 2019 and 1333 (88.6%) in 2020, 128 (8.2%) and 148 (9.8%) funded proposals, and 59 (3.8%) and 22 (1.5%) proposals on the reserve list, respectively. The majority of the proposals were from physical sciences and engineering (n=1004, 64.6% for 2019 and n=947, 63.0% for 2020), followed by life sciences (n=391, 25.1% for 2019 and n=387, 25.8% for 2020) and economics and social sciences (n=159, 10.2% for 2019 and n=169, 11.2% for 2020).
Overall, review comments were written predominantly in analytic and objective language, indicated by the high level of Analytical tone and low levels of Authenticity; this suggests that only a small proportion of reviewers formulated their arguments as personal opinions rather than as objective comments (Figures 1 and 2).
Clout and emotional tone were more present in the descriptions of a proposal’s strengths than of its weaknesses; in particular, the comments describing strengths contained more words related to a positive emotional tone (Figures 1 and 2).
The acceptance of a proposal was predicted by the linguistic characteristics of the comments related to its weaknesses: specifically, a lower analytical tone across the weaknesses of all three criteria, a higher negative emotional tone for research excellence and impact weaknesses, and higher clout for research excellence weaknesses (Table 1). In total, these predictors explained around 30% of the variance of the criterion (McFadden R2=0.30). On the other hand, the differences between 2019 and 2020 were negligible, explaining around 3% of the variance (McFadden R2=0.03) (Table 2). These predictors included the number of words in excellence strengths (both in individual reviews and consensus reports), a lower analytical tone for excellence and impact in individual reports, and a higher emotional tone in consensus reports.
With regard to the differences in textual characteristics between consensus reports and individual evaluations in 2019 and 2020, the observed pattern, with the greatest difference between consensus and individual evaluations being in emotional tone, was stable across proposal statuses (Table 3) and research domains (Table 4). Emotional tone was overall higher in consensus reports.
In this study, which included all ITN proposals from the 2019 and 2020 calls, we aimed to assess whether the changes in the evaluation procedure were related to differences in the characteristics of review reports. We found that the differences in linguistic characteristics between reports from the two calls (2019 and 2020) were small and negligible from a practical point of view, indicating that the removal of numerical scores did not result in meaningful changes in the reports’ comments, as assessed by quantitative text analysis. For both calls, the comments were written objectively, with weaknesses described with less emotion and more analytically than strengths. On the other hand, we found that the final status of a proposal (i.e. main-listed or rejected) can be predicted from the linguistic characteristics of the reviewers’ comments, especially the tone related to the identified weaknesses, indicating that weaknesses may be crucial in proposal evaluation.
The comments were written mostly in formal language, indicated by high levels of analytical tone for both strengths and weaknesses. The same feature was observed in a previous study of journal peer review reports (Buljan et al., 2021). Our results also support general advice to applicants to focus on the objective structure of their proposal (Baumert et al., 2022). When emphasizing proposal strengths, reviewers rarely used personal pronouns such as “I” or “we” (reflected in the low levels of authenticity), probably to present the strengths as factual information rather than as personal opinion. This finding contrasts with the study of Thelwall et al. (2023), which reported that greater use of first-person pronouns in reviews is related to higher proposal quality. In the description of weaknesses, on the other hand, reviewers more often presented the information as their personal opinion. The presentation of strengths as factual information is further supported by the high levels of clout in the description of strengths: clout, the tone that indicates writing from a position of power, was much higher in the description of a project’s strengths than of its weaknesses, from which we can conclude that reviewers were more certain in those parts of their evaluation. The emotional tone was more positive in the description of strengths, probably because of the use of words related to the project’s probable success. The greater presence of clout and emotional tone in the description of strengths indicates that reviewers wrote with less confidence when discussing the potential flaws of a proposal than when mentioning its strengths. In that respect, it should be noted that the EC services instruct reviewers that evaluation reports should not express opinions, but rather evaluate factual elements of the proposals.
The principal difference between consensus and individual evaluations was in the emotional tone. Across the different categories, emotional tone was higher in consensus reports than in individual evaluations, indicating a more positive tone in the outcome of the consensus process. This may be because only the consensus report is sent out to applicants; the IER remains an internal (intermediate) report and is not sent to applicants. In a previous study, we found that the agreement between reviewers was very high (Pina et al., 2021). At the time of the individual evaluation, reviewers do not know whether the other reviewers will agree with them. It is possible that, when reviewers write the consensus evaluation, they are no longer constrained to purely objective language, since it is established that the other reviewers agree with their opinion, so the tone is more relaxed and positive.
Our previous study of the predictive value of comments on proposals’ strengths and weaknesses in the ITN evaluation process used both qualitative and quantitative (machine learning) approaches (Hren et al., 2022). Our present results partially confirm the results of that study, which found that a proposal’s weaknesses are more predictive of its evaluation outcome (Hren et al., 2022). However, we found that only some elements of the weaknesses are predictive of the proposal status. Specifically, a higher analytical tone and fewer negative evaluation words in comments related to a proposal’s weaknesses were associated with a more favorable funding outcome. It should be noted that the themes that served as predictors in the regression model of that study were identified qualitatively and were better predictors (explaining around 55% of the variance) than our quantitative text analysis (around 30%). However, given the large number of proposals, linguistic characteristics of reviewers’ comments may serve as an additional tool in proposal evaluation, as advised by others (Luo et al., 2022).
The finding that we did not observe meaningful differences in the tone of reviewers’ comments needs to be interpreted in the light of several limitations. Our entire quantitative text analysis was based on dictionary-based text analysis algorithms, which may deviate slightly from manual analysis while still being predictive of proposal funding outcomes (Luo et al., 2022), a finding partially reproduced in our study. One aspect of quantitative text analysis, sentiment analysis (analysis of the text tone), could serve as a useful tool to determine whether there were any differences in the evaluations performed after the removal of individual numerical scores, as reviewers were a common part of both procedures. We focused only on the linguistic characteristics of the comments related to the positive and negative sides of the proposals. A qualitative analysis of the proposals would give insight into potential differences between the two calls, but, given the number of proposals, the practical value of such an approach is questionable. We also do not have information about who the reviewers were, which may be relevant, since individual characteristics such as experience in research or reviewing may influence the review process (Seeber et al., 2021). Based on our evaluation, we found no evidence that the removal of numerical scoring produced any differences in the evaluation output.
This study assessed whether the removal of numerical scores had a significant effect on the evaluation procedure. The findings indicate that the removal of numerical scores did not lead to meaningful differences in the evaluation procedure of H2020 ITN proposals or in its outcome. These results support the finding that the procedure used for the evaluation of MSCA grant proposals is very robust and stable.
Ivan Buljan, Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing; David G Pina, Conceptualization, Data Curation, Investigation, Methodology, Project Administration, Resources, Writing – Original Draft Preparation, Writing – Review & Editing; Antonija Mijatović, Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Writing – Original Draft Preparation, Writing – Review & Editing and Ana Marušić, Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Resources, Supervision, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing.
Underlying data: Open Science Framework: Are numerical scores important for grant evaluation?, https://osf.io/6bpvu/?view_only= (Buljan et al., 2021).
Extended data: Open Science Framework: STROBE checklist for “Are numerical scores important for grant evaluation? A cross sectional study”, https://osf.io/6bpvu/?view_only= (Buljan et al., 2021).
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).