Keywords
co-authorship network, academic success, scientific impact, scientific career
This article is included in the Research on Research, Policy & Culture gateway.
As international competition in science accelerates, there has been a growing interest in the determinants of individual success in academia (Sinatra et al., 2016; Clauset et al., 2017; Fortunato et al., 2018). A general notion is that success breeds success because recognized publications open new opportunities for funding and collaboration (Bol et al., 2018). This has directed attention to young scholars because achievements at an early stage might generate future success (Wang et al., 2019). The relation between scientific collaboration and success has been investigated for some time (de Solla Price & Beaver, 1966; Luukkonen et al., 1993; Melin & Persson, 1996; Katz & Martin, 1997; Sonnenwald, 2007; Hood & Wilson, 2001; Sarigöl et al., 2014). Understanding early career success is especially useful because young scholars need mentors to learn from and are more likely to stand out if they work with successful supervisors (Sekara et al., 2018; Ma et al., 2020; Li et al., 2019). Although early career researchers might learn from many people, it is less understood how their future success depends on the team they work with during their early career.
Teams in scientific research are gaining dominance across all fields (Wuchty et al., 2007; Ziman, 1994). Research teams typically include postdocs, graduate and undergraduate students who collaborate with the principal investigator and other seniors of the group (Mali et al., 2012). Such collaborations can be observed in co-authorship, which is frequently used to map collaboration networks in teams (Beaver, 2001; Glänzel & Schubert, 2004), and are thought to influence the success of projects (Uzzi & Spiro, 2005). Two counteracting mechanisms are important in this respect. On the one hand, projects can create more novelty when co-authors have not collaborated before because they can combine diverse expertise (De Vaan et al., 2015; Vedres, 2017; Zeng et al., 2021). On the other hand, team cohesion generated by shared co-authors, strong and persistent collaboration, trust and previous success can provide an environment in which knowledge sharing is efficient (Uzzi & Spiro, 2005; Aral & Van Alstyne, 2011; Mukherjee et al., 2019). Thus, the question of whether early career researchers benefit more from diverse or from cohesive teams is important because striving for novelty in scientific research and efficient learning are difficult to achieve at the same time.
In this paper, we take a social network analysis approach to investigate co-authorship networks of early career researchers. To quantify diversity and cohesion in the collaboration network of students and across the author teams they belong to, we apply the network constraint measure developed by Burt (1992, 2000, 2001). This measure ranges between 0 and 1: high values around 1 show that co-authors of the PhD student work frequently together; low values around 0 show that the PhD student works with co-authors who are otherwise not collaborating with each other. This measure has been widely used to capture access to diverse knowledge through connections in various contexts, including creative industries (Juhász et al., 2020) and innovation (Tóth & Lengyel, 2021), and to capture the role of network cohesion in knowledge transfer (Reagens & McEvily, 2003; Tortoreillo et al., 2012).
Our empirical case concerns researchers who successfully defended a thesis in any Hungarian doctoral school between 1993 and 2010. Our data contains information on the dissertation, including the scientific field and year of defense, while bibliometric information comes from publication records of egos and their co-authors. We estimate the accumulated number of citations at the eighth year following defense, which gives us a simple measure of success at the end of the early phase of an academic career (Van Balen et al., 2012).
Cross-sectional linear regressions with year and scientific field dummies show that the number of papers published until the second year after the defense correlates negatively with accumulated citations, but the impact of these papers correlates strongly with future impact. This finding indicates that thorough work focusing on a few but important papers is a much better strategy than producing many papers during doctoral studies. We find that in the case of life science students, both the number of co-authors and, most importantly, the constraint measure correlate positively with future impact. These latter two coefficients are not significant for other scientific fields. These results provide new evidence that PhD students in life sciences can benefit from working in a cohesive research team, probably because such teams provide a good learning environment.
The ad-hoc ethical committee of the Library and Information Centre, Hungarian Academy of Sciences discussed the use of personal data for the purpose of the research project “Team cohesion of PhD students in life sciences”, and approval to use the data was given. Consent from the participants to use their data was waived by the committee. The Library and Information Centre of the Hungarian Academy of Sciences as data handler of the Hungarian Scientific Bibliography Database is authorized to use the data for scientific research as per law 1994/XL and 10/2021.(II. 24.) regulation decreed by the President of the Hungarian Academy of Sciences.
All other data used in the study was publicly available.
We combine two data sources to collect information about early-career scholars. Data on doctoral defenses have been collected from www.doktori.hu, an openly available collection of all successful PhD theses defended in Hungarian doctoral schools starting from 1993, the year when the PhD system was introduced in the country. We downloaded data from the website in January 2017. This data contains 16,151 Hungarian PhD students who defended their theses until that date. Information provided includes the ID and name of every PhD student, the title of their thesis, the year of defense, scientific area, and the name of supervisors. Our second data source is the Hungarian Scientific Bibliography Database (MTMT) that contains the scientific publications’ metadata of all active Hungarian researchers.
The two databases can be matched at the level of individual students. A total of 23% of students were matched between the doktori.hu database and MTMT data using their student IDs. The rest were matched by hand using name and scientific field. In total, we could identify about 60% of the PhD students in the MTMT data: 9,415 students could be matched with an MTMT profile, and Figure 1A shows their number by year of defense. In our regression exercise, we focus on the future impact of PhD holders and thus restrict the analysis to the 2,061 PhD students who defended their theses in the 1993-2010 period.
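The two-stage matching described above can be sketched as follows. This is a minimal illustration in Python (the study itself reports using R), not the authors' procedure: the record keys `student_id`, `name` and `field`, the helper `match_students` and the toy records are all hypothetical. Records that share an ID are matched automatically; for the rest, the sketch only lists same-name, same-field candidates that a person would then check by hand.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings agree."""
    return " ".join(name.lower().split())


def match_students(theses, profiles):
    """Return (id_matches, manual_candidates).

    theses / profiles: lists of dicts with the hypothetical keys
    'student_id' (may be None), 'name' and 'field'.
    """
    by_id = {p["student_id"]: p for p in profiles if p["student_id"] is not None}
    by_name_field = {}
    for p in profiles:
        key = (normalize(p["name"]), p["field"])
        by_name_field.setdefault(key, []).append(p)

    id_matches, manual_candidates = [], []
    for t in theses:
        if t["student_id"] is not None and t["student_id"] in by_id:
            # Stage 1: exact match on the shared student ID.
            id_matches.append((t, by_id[t["student_id"]]))
        else:
            # Stage 2: collect name-and-field candidates for manual review.
            candidates = by_name_field.get((normalize(t["name"]), t["field"]), [])
            if candidates:
                manual_candidates.append((t, candidates))
    return id_matches, manual_candidates


# Toy records, invented for illustration only.
theses = [
    {"student_id": 1, "name": "Anna Kovacs", "field": "biology"},
    {"student_id": None, "name": "Bela  NAGY", "field": "physics"},
    {"student_id": None, "name": "Csaba Toth", "field": "history"},
]
profiles = [
    {"student_id": 1, "name": "Anna Kovacs", "field": "biology"},
    {"student_id": None, "name": "Bela Nagy", "field": "physics"},
]
id_matches, manual_candidates = match_students(theses, profiles)
```

In this toy input one thesis is matched by ID, one is queued for manual review, and one has no candidate profile at all.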
Figure 1. A. The number of PhD students by year in www.doktori.hu (light blue) and the successfully identified PhD students in the MTMT database (navy blue). Data includes all students who defended between 1993 and 2017 but the analysis focuses on those who defended in the 1993-2010 period. B. The number of publications by the Hungarian PhD holders (year of defense between 1993 and 2010) and their co-authors between 1990 and 2019.
Bibliometric data was downloaded from MTMT by a colleague at the data handler of the Hungarian Academy of Sciences after the identification of PhD students in MTMT. This happened in two steps. First, we downloaded all 272,954 publication records of the identified 9,415 PhD students in 2017. Then, we identified 20,139 co-authors of PhD students in MTMT and downloaded their publication records in 2020. This final bibliometric dataset contains records of 1,205,184 papers published by 43,485 authors between 1990 and 2019. Note that only authors who are affiliated with Hungarian institutions and have registered on MTMT are included. There are around 50,000 MTMT accounts altogether, meaning that our data collection has covered around 86% of the total scientific community in the country. Figure 1B illustrates the number of all publications from the entire career of those PhD students who defended between 1993 and 2010 and from their co-authors. The data processing and analysis were carried out in R (version 4.0.0).
Measuring scientific success, especially individual scientific performance, is a complex problem. Traditionally, it is based on production (publication numbers), scientific impact (citation numbers) and structural measures such as the network characteristics of authorship (Van Balen et al., 2012; Glänzel et al., 2019). However, the raw citation number depends on several factors, such as the year of publication, research field, document type (e.g. research article, review article or proceedings), and journal characteristics (e.g. publication frequency, number of articles in the journal). For example, the earlier an article appeared, the more time it has had to receive citations. Citation habits differ across research fields, so two citation measures should be compared within the same research area. Document types differ in their number of references, so the comparison is more accurate when it is made within the same document type. Journal characteristics can also bias raw citation numbers. The solution to these problems is to use normalized citation numbers.
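One common form of normalization divides each paper's citation count by the mean count of papers from the same field and publication year. The sketch below illustrates that idea in Python; it is not a step of this study, which worked with raw counts, and the function name, record keys and numbers are invented for the example.

```python
from collections import defaultdict


def field_year_normalized(papers):
    """Divide each paper's citations by the mean of its (field, year) group."""
    totals = defaultdict(lambda: [0, 0])  # (field, year) -> [citation sum, paper count]
    for p in papers:
        group = totals[(p["field"], p["year"])]
        group[0] += p["citations"]
        group[1] += 1

    scores = []
    for p in papers:
        cit_sum, n = totals[(p["field"], p["year"])]
        mean = cit_sum / n
        # A group without any citations offers no meaningful baseline.
        scores.append(p["citations"] / mean if mean > 0 else 0.0)
    return scores


# Invented toy records: two fields with very different citation habits.
papers = [
    {"field": "biology", "year": 2005, "citations": 30},
    {"field": "biology", "year": 2005, "citations": 10},
    {"field": "history", "year": 2005, "citations": 3},
    {"field": "history", "year": 2005, "citations": 1},
]
scores = field_year_normalized(papers)
```

After normalization, the more cited paper in each field receives the same score even though the raw counts differ by a factor of ten, which is the kind of cross-field comparability the paragraph above refers to.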
In our case, the MTMT database contains only raw citation counts as of 2020, the year in which we downloaded the data (Figure 2 illustrates the publication and citation distributions over the examined period). To handle this problem, we compared PhD holders in cumulative two-year periods, using the year of their PhD defense as a starting point, and within their research field. The MTMT database distinguishes document types such as articles, books and others, but we were unable to separate research articles from review articles. Therefore, we did not consider document types. Figure 3A shows the cumulative number of citations of the examined PhD students, while Figure 3B shows the cumulative number of their publications.
A. The distribution of number of publications by Hungarian PhD holders (year of defense 1993-2010) between 1990 and 2019. B. The distribution of number of citations between 1990 and 2019 by Hungarian PhD holders (year of defense 1993-2010) in 2020.
To answer our research question, whether a cohesive or a diverse co-authorship network structure favors the success of a young researcher, we analyzed the weighted and dynamic ego-networks of PhD students. Such networks were generated from the publication records. These ego-networks include the PhD student in the center (ego), to which co-authors (alters in the ego-network terminology) are connected. Links are undirected but weighted by the number of co-authored papers. The networks are dynamic: we add new collaborators and new links to the ego-network of individual PhD students as new papers are published, but do not delete ties over the years. Since we have access to the publications of co-authors, the links between alters also include publications that were not authored by the PhD student.
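The construction of such a cumulative, weighted ego-network can be sketched as follows. This is an illustrative Python version (the study used R), and the function name `ego_network`, the `(year, authors)` record layout and the toy papers are assumptions. The sketch also assumes that an alter-alter tie counts every joint paper of the two alters, whether or not the ego is among the authors.

```python
from collections import Counter
from itertools import combinations


def ego_network(papers, ego, cutoff_year):
    """Cumulative weighted ego network of `ego` up to and including `cutoff_year`.

    papers: iterable of (year, authors) tuples.
    Returns a Counter mapping frozenset({a, b}) -> number of joint papers.
    """
    past = [(year, set(authors)) for year, authors in papers if year <= cutoff_year]

    # Alters are everyone who has shared at least one paper with the ego so far.
    alters = set()
    for _, authors in past:
        if ego in authors:
            alters |= authors - {ego}

    members = alters | {ego}
    weights = Counter()
    for _, authors in past:
        # Keep network members only; papers without the ego still add
        # weight to the ties between alters.
        present = sorted(authors & members)
        for a, b in combinations(present, 2):
            weights[frozenset((a, b))] += 1
    return weights


# Invented toy publication list.
papers = [
    (2001, ["ego", "A", "B"]),
    (2002, ["A", "B"]),          # alters publish together without the ego
    (2003, ["ego", "C"]),
    (2005, ["ego", "A", "D"]),   # published after the cutoff used below
]
net = ego_network(papers, "ego", cutoff_year=2003)
```

Because ties are only ever added, calling the function with a later `cutoff_year` returns a network that contains the earlier one, which mirrors the cumulative design described above.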
Cohesive networks are dense and include strong, high-bandwidth ties (Aral, 2016). That is, co-authors frequently publish with each other. Such network structures are thought to capture an environment in which shared work experience and developed trust facilitate learning from peers. In cohesive networks, knowledge transfer is faster and more efficient, so the PhD student can learn complex knowledge more easily (Reagens & McEvily, 2003). In contrast, diverse networks, in which co-authors have not worked with each other but only with the PhD student, capture an environment that provides the student with the diverse capabilities of co-authors. In such networks, innovation and novel combinations are more likely (Burt, 2001). If the student can integrate distinct pieces of knowledge, diverse networks might help them to publish papers with a high degree of novelty.
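We used Burt's (2000) constraint indicator, which places ego-networks on the cohesive-diverse continuum, using the formula:

$$C_i = \sum_{j \neq i} \left( p_{ij} + \sum_{q \neq i,j} p_{iq}\, p_{qj} \right)^2 \qquad (1)$$

where $p_{ij}$ and $p_{iq}$ measure the papers that PhD student $i$ has co-authored with colleagues $j$ and $q$, and $p_{qj}$ measures the papers that $j$ and $q$ have co-authored without $i$. In Burt's formulation, each of these tie weights enters as a proportion of the first-indexed author's total tie weight. The indicator takes high values when co-authors publish intensively together and low values when co-authors do not publish together.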
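To make the constraint indicator concrete, the following Python sketch computes Burt's constraint for the ego of a small weighted, undirected network. It is a hand-rolled illustration, not the authors' R code; the function name, the `frozenset` edge layout and the two toy networks are assumptions. Tie weights are converted to proportions of each author's total tie weight, as in Burt's formulation.

```python
def constraint(weights, ego):
    """Burt's network constraint of `ego` in an undirected weighted network.

    weights: dict mapping frozenset({a, b}) -> tie weight (e.g. joint papers).
    """
    # Symmetric adjacency map: node -> {neighbour: weight}.
    adj = {}
    for pair, w in weights.items():
        a, b = tuple(pair)
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w

    def p(u, v):
        """Share of u's total tie weight that is invested in v."""
        total = sum(adj[u].values())
        return adj[u].get(v, 0) / total if total else 0.0

    total_constraint = 0.0
    for j in adj[ego]:
        # Indirect investment in j through every other contact q of the ego.
        indirect = sum(p(ego, q) * p(q, j) for q in adj[ego] if q != j)
        total_constraint += (p(ego, j) + indirect) ** 2
    return total_constraint


# Diverse case: three co-authors who never publish with each other.
star = {
    frozenset(("ego", "A")): 1,
    frozenset(("ego", "B")): 1,
    frozenset(("ego", "C")): 1,
}
# More cohesive case: co-authors A and B also share one paper.
mixed = dict(star)
mixed[frozenset(("A", "B"))] = 1

c_star = constraint(star, "ego")
c_mixed = constraint(mixed, "ego")
```

Adding a single tie between two co-authors raises the ego's constraint, which is the sense in which higher values indicate a more cohesive co-author network.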
As the size of ego-networks grows, the probability that co-authors are connected might decrease, which has often been found in co-author networks (see for example Tóth & Lengyel, 2021). Thus, one must consider the degree of PhD students, that is, their number of co-authors, alongside constraint.
The distributions of degree and constraint are depicted in Figure 4. As expected, the two indicators move in opposite directions. The number of relations rises over time (Figure 4A), which is expected because we used a cumulative ego network and did not erase former co-authorships. The spread of the degree distribution also widens over time: while some researchers expanded their co-authorship networks after their PhD, others kept a narrow scientific network. Constraint decreases slightly (Figure 4B), and its median also falls over time. The reason is that ego networks grow over time, and PhD holders who gain more co-authors also have more diverse collaboration networks.
Figure 4. A. The distribution of degree in cumulative ego networks of the Hungarian PhD holders (year of defense 1993-2010) between 1990 and 2019. B. The distribution of constraint in cumulative ego networks of the Hungarian PhD holders (year of defense 1993-2010) between 1990 and 2019.
We calculated further measures that might also be used to characterize cohesion and diversity in ego-networks. Betweenness centrality quantifies diversity in the network of PhD students by measuring the number of shortest paths in the network that go through the ego: the higher the betweenness centrality of the ego, the more diverse the network. Global clustering quantifies the fraction of closed triangles in the network among all possible triangles, while network density measures the fraction of observed ties among all possible ties in the ego-network. The higher these two measures, the more cohesive the ego-network.
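The three additional measures can be illustrated on a small unweighted ego-network. The Python sketch below uses textbook definitions (density over all node pairs, clustering as closed triples over connected triples, unnormalized betweenness); these are assumptions and may differ in detail from the variants used in the study. The function name and the toy edge list are hypothetical.

```python
from itertools import combinations


def ego_measures(edges, ego):
    """Return (density, global clustering, ego betweenness) of an unweighted ego network.

    edges: iterable of 2-tuples; every alter is assumed to be tied to `ego`.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2

    # Density: observed ties over all possible ties among the n nodes.
    density = m / (n * (n - 1) / 2)

    # Global clustering: closed triples over all connected triples.
    closed = triples = 0
    for nb in adj.values():
        for u, v in combinations(sorted(nb), 2):
            triples += 1
            closed += v in adj[u]
    clustering = closed / triples if triples else 0.0

    # Ego betweenness: unconnected alters are two steps apart, and the ego
    # carries one of the equally short paths between them.
    betweenness = 0.0
    for u, v in combinations(sorted(adj[ego]), 2):
        if v not in adj[u]:
            betweenness += 1 / len(adj[u] & adj[v])
    return density, clustering, betweenness


# Invented example: three co-authors, of whom only A and B publish together.
edges = [("ego", "A"), ("ego", "B"), ("ego", "C"), ("A", "B")]
density, clustering, betweenness = ego_measures(edges, "ego")
```

Removing the A-B tie from this example lowers density and clustering and raises the ego's betweenness, which matches the interpretation given above: the first two measures track cohesion, the last one diversity.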
Table 1 reports Pearson correlation coefficients between network parameters at the second and eighth year after PhD defense. As expected, we find a negative correlation between degree and all other network indices. Constraint is strongly correlated with network density. We ran alternative regression specifications with the network measures listed in Table 1 but only found significant results for constraint.
Table 1. Pearson correlation coefficients between network measures two years (below diagonal) and eight years (above diagonal) after the PhD defense (year of defense 1993-2010).
Our data enables us to capture the impact of publications as a snapshot in 2020 by the total number of citations received until then. This allows for a cross-sectional specification, in which we can compare students who finished in the same year and consider publications that they produced until a certain year after defense. This way, we can avoid the problem that earlier publications have more time to collect citations.
To answer the question whether cohesive co-authorship networks of PhD students during their studies help their future success, we estimate the number of accumulated citations ($CIT_{i,t+8}$) of student $i$ to the papers that they published until the 8th ($t+8$) year following defense at year $t$ with the following equation:

$$CIT_{i,t+8} = \alpha + \beta_1 CIT_{i,t+2} + \beta_2 PAP_{i,t+2} + \beta_3 \Delta PAP_{i,t+8} + \beta_4 DEG_{i,t+2} + \beta_5 CON_{i,t+2} + \gamma_s + \delta_t + \varepsilon_i \qquad (2)$$

where $CIT_{i,t+2}$ denotes citations to papers published until the second year after defense, $PAP_{i,t+2}$ is the number of papers published until the second year after defense, $\Delta PAP_{i,t+8}$ is the number of papers published between the second and eighth year after defense, $DEG_{i,t+2}$ is the degree and $CON_{i,t+2}$ is the constraint measure of the student’s co-author network, $\gamma_s$ is a scientific area-specific fixed effect, $\delta_t$ denotes year dummies and $\varepsilon_i$ is the error term.
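The publication and citation variables defined above can be derived from a student's paper list and defense year. The short Python sketch below shows one way to do this; the function name, the `(publication_year, citations)` layout, the output keys and the toy numbers are assumptions rather than the authors' R code. Citations are taken from a single snapshot, as in the study.

```python
def window_variables(papers, defense_year):
    """papers: list of (publication_year, citations_in_snapshot) tuples."""
    early = [c for year, c in papers if year <= defense_year + 2]
    upto8 = [c for year, c in papers if year <= defense_year + 8]
    return {
        "CIT_t2": sum(early),                # citations to papers published until t+2
        "PAP_t2": len(early),                # papers published until t+2
        "dPAP_t8": len(upto8) - len(early),  # papers published after t+2 and until t+8
        "CIT_t8": sum(upto8),                # citations to papers published until t+8
    }


# Invented record of one student who defended in 2000.
papers = [(1998, 40), (2001, 12), (2004, 5), (2007, 9), (2012, 3)]
row = window_variables(papers, defense_year=2000)
```

Because the later window contains the earlier one, `CIT_t8` can never be smaller than `CIT_t2` in this construction.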
To estimate Eq.2, we used ordinary least squares (OLS) linear regression models. The scientific area fixed effects are specified by research fields of doctoral schools. These latter refer to 54 categories of research fields defined by the Hungarian Accreditation Committee (HAC, 2018): exactly one research field has been assigned to each doctoral school. All variables are log-transformed.
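As a rough illustration of this estimation step, the sketch below fits an OLS model with log-transformed covariates and field dummies in plain Python. It is not the authors' R code. The data are synthetic and noise-free, `log(1 + x)` is an assumed way of handling zero counts, year dummies (which would be added like the field dummies) and robust standard errors are left out, and all function and variable names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def design_row(covariates, field, fields):
    """Constant, log(1 + x) of each covariate, then dummies for all but the first field."""
    logs = [math.log1p(v) for v in covariates]
    dummies = [1.0 if field == f else 0.0 for f in fields[1:]]
    return [1.0] + logs + dummies


# Synthetic data: the outcome is built directly on the log scale from
# `true_beta`, so the fit should recover those coefficients.
rng = random.Random(0)
fields = ["science", "life", "engineering"]
true_beta = [2.0, 0.9, -0.2, 0.3, 0.1, 0.25, 0.5, -0.4]
X, y = [], []
for i in range(60):
    covariates = [rng.uniform(0, 50) for _ in range(5)]  # stand-ins for CIT, PAP, dPAP, DEG, CON
    row = design_row(covariates, fields[i % 3], fields)
    X.append(row)
    y.append(sum(b * v for b, v in zip(true_beta, row)))
beta_hat = ols(X, y)
```

The first field serves as the baseline category, so each dummy coefficient is read as a difference from that field, which is how fixed effects enter a dummy-variable OLS specification.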
Table 2 reports results of an OLS regression of estimating Eq.2. In columns 1-3, we estimate citations to papers that were published until the 8th year following defense with variables that capture publications and co-authorship until the 2nd year following defense. We introduce variables in a stepwise manner such that a baseline model is run in column 1 and network variables are introduced in columns 2 and 3.
Table 2. OLS regressions with year and scientific field fixed effects and robust standard errors.

Variable | (1) | (2) | (3) |
---|---|---|---|
CIT (log) | 0.899*** (0.009) | 0.894*** (0.009) | 0.893*** (0.009) |
PAP (log) | −0.211*** (0.017) | −0.217*** (0.019) | −0.226*** (0.020) |
Δ PAP (log) | 0.317*** (0.010) | 0.312*** (0.011) | 0.313*** (0.011) |
DEG (log) | | 0.043** (0.018) | 0.087*** (0.028) |
CON (log) | | | 0.244** (0.121) |
Constant | 2.327*** (0.374) | 2.324*** (0.372) | 2.141*** (0.383) |
N | 2,061 | 1,948 | 1,948 |
R² | 0.919 | 0.917 | 0.918 |
Throughout the models, we found a very strong positive correlation between $CIT_{t+2}$ and $CIT_{t+8}$. This relation is trivial but important in our empirical exercise: because citations are collected for all publications in 2020, $CIT_{t+8}$ includes $CIT_{t+2}$. However, the very high correlation also means that most of the citations at the end of the early career stage go to publications that appeared during or shortly after PhD studies. $PAP_{t+2}$ is negatively correlated while $\Delta PAP_{t+8}$ is positively correlated with the dependent variable. These findings suggest that, because citations accumulate over time, the best strategy for PhD students is to produce a few high-impact papers that will help them to collect citations in their early career.
In column 2, we introduce degree, which leaves the coefficients of the other covariates almost unchanged. DEG is positively correlated with $CIT_{t+8}$, suggesting that the number of co-authors facilitates citations. Note that there might be various mechanisms at play: citations might grow with the number of co-authors because they can also cite the paper or spread the word, and alternatively, the project and the PhD student can gain from working with and learning from many collaborators.
Constraint is positively correlated with $CIT_{t+8}$. Controlling for DEG and the number of publications, and including year and scientific field dummies, CON quantifies the extent to which co-authors of the PhD student have collaborated in publications that appeared until the second year after the defense of the student. Our finding suggests that such cohesive ego-networks are beneficial for PhD students. Because we also control for the citations to papers, this finding confirms that PhD students benefit the most from working in cohesive collaboration networks because these create efficient learning environments.
Correlations of independent variables indicate that the models are not affected by multicollinearity. The highest value of the Pearson correlation coefficients is ρ = 0.41, between DEG and $PAP_{t+2}$. We document the correlation between DEG and CON in Table 1 (ρ = 0.61), but the joint inclusion of these variables is conceptually motivated, as described above. Further, the inclusion of CON in Model 3 does not substantially influence the coefficient of DEG.
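A pairwise correlation check of this kind can be sketched in a few lines of Python. The covariate values below are invented toy numbers, not the study's data, and the helper name `pearson` is hypothetical; the point is only to show how each pair of regressors is screened before they enter the same model.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented toy covariates for five students.
covariates = {
    "DEG": [1, 2, 3, 4, 5],
    "PAP": [2, 1, 4, 3, 5],
    "CON": [0.9, 0.7, 0.6, 0.4, 0.2],
}

# Correlation of every pair of regressors.
pairwise = {
    (a, b): pearson(covariates[a], covariates[b])
    for a, b in combinations(covariates, 2)
}
strongest = max(pairwise, key=lambda pair: abs(pairwise[pair]))
```

Pairs with a high absolute correlation, such as DEG and CON in this toy input, are the ones whose joint inclusion needs a substantive justification.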
In Table 3, we report full regression models estimated separately for four broad scientific areas: Sciences, Life Sciences, Engineering and Social Sciences. To obtain these areas, we grouped the 54 scientific fields defined by the Hungarian Accreditation Committee. We found that DEG and CON are significant for the Life Sciences subsample, while only DEG is weakly significant in Engineering. Thus, a cohesive research environment is important for Life Sciences students and less so for students in other fields.
Table 3. OLS regressions by scientific area with year and scientific field fixed effects and robust standard errors.

Variable | Sciences (1) | Life Sciences (2) | Engineering (3) | Social Sciences (4) |
---|---|---|---|---|
CIT_t2 (log) | 0.919*** (0.020) | 0.870*** (0.017) | 0.917*** (0.034) | 0.910*** (0.026) |
PAP (log) | −0.294*** (0.046) | −0.220*** (0.033) | −0.222*** (0.072) | −0.267*** (0.058) |
Δ PAP (log) | 0.357*** (0.023) | 0.296*** (0.018) | 0.255*** (0.034) | 0.388*** (0.033) |
DEG (log) | 0.068 (0.063) | 0.162** (0.054) | 0.191* (0.110) | 0.039 (0.075) |
CON (log) | 0.173 (0.255) | 0.697*** (0.274) | 0.654 (0.401) | 0.060 (0.303) |
Constant | 2.149*** (0.438) | 2.324*** (0.372) | 1.124*** (0.422) | 0.434 (0.536) |
N | 437 | 1,948 | 155 | 279 |
R² | 0.919 | 0.917 | 0.942 | 0.910 |
In this study we examined the success of students who defended theses in Hungarian doctoral schools between 1993 and 2010 by looking at their publication records and the citations they had accumulated by 2020. Our bibliometric database also contains the PhD students’ publications and their co-authors’ publications between 1990 and 2019. We analyzed whether a cohesive or a diverse co-author network structure gives a young researcher a better chance to stand out in terms of citations eight years after defense. Linear regression models suggest that students who participate in cohesive collaboration networks receive significantly more citations by the end of their early career. This result highlights the need for strong collaborations and an effective learning environment during doctoral studies. However, our results regarding the structure of co-author networks are specific to Life Sciences students. Thus, cohesion is mostly important in areas where new knowledge is produced in teamwork.
The present paper contributes to a growing literature in which studies try to determine factors that support the future success of young researchers. Li and co-authors (2019) demonstrate that students who publish with top scientists have a greater chance of being more successful 20 years later. Moreover, this effect is more important in the case of PhD students affiliated with a less prestigious PhD school. Sarigöl et al. (2014) illustrate a similar phenomenon: a paper gets more citations if its authors are central in the large co-author network of their field. We add to this discussion by studying the co-author ego-networks of PhD students. Our findings confirm that the structure of the group collaboration matters for the future academic career of students.
We also find that students who focus on a few papers are more successful, measured in citations. These results are robust across all large scientific fields. By concentrating their efforts on a small number of publications, students can achieve higher quality papers that might be accepted by better journals. Because citations typically take several years to accumulate, students need high-impact papers at the beginning of their career to stand out later, when they are at the end of the early-career stage. This can help them in research proposals and thus facilitate academic careers in the long run.
Raw data, as accessed through the Hungarian Scientific Bibliography Database, are not publicly available due to privacy considerations of the authors. Access can be requested through the Library and Information Centre of the Hungarian Academy of Sciences; contact details can be found on the library’s website https://www.mtmt.hu/kapcsolat. Access to the data will be provided under the following conditions: the researcher presents legitimate research purposes to the ad-hoc ethical committee of the Library and Information Centre of the Hungarian Academy of Sciences and after the committee’s positive decision, signs a non-disclosure agreement.
Zenodo: Data and Code for the manuscript PhD students in life sciences can benefit from team cohesion, https://doi.org/10.5281/zenodo.5129288 (Vida et al., 2021).
This project contains the following underlying data:
- Phd_data.csv (contains anonymized author ID, network and publication variables after 2-4-6-8 years of PhD defense, and discipline of PhD thesis).
- Phd_regressions_code.R (analysis code used in the study).
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The authors acknowledge the kind help of Gusztáv Ladányi in collecting the bibliometric data and Dániel Horváth in matching the bibliometric data with thesis defense data.
Version 1: 29 Jul 21. Invited reviewers: 2.