PhD students in life sciences can benefit from team cohesion [version 1; peer review: 1 approved with reservations]

Background: Scientific progress during doctoral studies is a combination of individual effort and teamwork. A growing body of interdisciplinary literature has investigated the determinants of early career success in academia, in which learning from supervisors and co-authors plays a great role. Yet, it is less understood how the collaboration patterns of the research team in which the doctoral student participates influence the future career of students. Here we take a social network analysis approach to investigate this and define the research team as the co-authorship network of the student. Methods: We use the Hungarian Scientific Bibliography Database, which includes all publications of PhD students who defended theses from 1993 onwards. The data also include thesis information and the publications of the students' co-authors. Using these data, we quantify cohesion in the ego-network of PhD students, impact measured by citations received, and productivity measured by the number of publications. We run multivariate linear regressions to measure the relation of network cohesion and publication outputs during doctoral years with future impact. Results: We find that those students in life sciences, but not in other fields, who have a cohesive co-author network during their studies and two years after defence receive significantly more citations in eight years. We find that the number of papers published during PhD years and shortly after the defence correlates negatively, while the impact of these papers correlates positively, with the future success of students in all fields. Conclusions: These results highlight that research teams are effective learning environments for PhD students where collaborations create a tightly knit knowledge network.


Introduction
As international competition in science accelerates, there has been a growing interest in the determinants of individual success in academia (Sinatra et al., 2016; Clauset et al., 2017; Fortunato et al., 2018). A general notion is that success breeds success because recognized publications open new opportunities for funding and collaboration (Bol et al., 2018). This has directed attention to young scholars because achievements at an early stage might generate future success (Wang et al., 2019). The relation between scientific collaboration and success has been investigated for some time (de Solla Price & Beaver, 1966; Luukkonen et al., 1993; Melin & Persson, 1996; Katz & Martin, 1997; Sonnenwald, 2007; Hood & Wilson, 2001; Sarigöl et al., 2014). It is especially useful for understanding early career success because young scholars need mentors to learn from and are more likely to stand out if they work with successful supervisors (Sekara et al., 2018; Ma et al., 2020; Li et al., 2019). Although early career researchers might learn from many people, it is less understood how their future success depends on the team they work with during their early career.
Teams in scientific research are gaining dominance across all fields (Wuchty et al., 2007; Ziman, 1994). Research teams typically include postdocs and graduate and undergraduate students who collaborate with the principal investigator and other senior members of the group (Mali et al., 2012). Such collaborations can be observed in co-authorship, which is frequently used to map collaboration networks in teams (Beaver, 2001; Glänzel & Schubert, 2004), and are thought to influence the success of projects (Uzzi & Spiro, 2005). Two counteracting mechanisms are important in this respect. On the one hand, projects can create more novelty when co-authors have not collaborated before because they can combine diverse expertise (De Vaan et al., 2015; Vedres, 2017; Zeng et al., 2021). On the other hand, team cohesion generated by shared co-authors, strong and persistent collaboration, trust and previous success can provide an environment in which knowledge sharing is efficient (Uzzi & Spiro, 2005; Aral & Van Alstyne, 2011; Mukherjee et al., 2019). Thus, the question of whether early career researchers benefit more from diverse than from cohesive teams is important because striving for novelty in scientific research and efficient learning are difficult to achieve at the same time.
In this paper, we take a social network analysis approach to investigate co-authorship networks of early career researchers. To quantify diversity and cohesion in the collaboration network of students and across the author teams they belong to, we apply the network constraint measure developed by Burt (1992, 2000, 2001). This measure ranges between 0 and 1: high values around 1 show that co-authors of the PhD student frequently work together; low values around 0 show that the PhD student works with co-authors who otherwise do not collaborate with each other. This measure has been widely used to capture diverse knowledge access through connections in various contexts, including creative industries (Juhász et al., 2020) and innovation (Tóth & Lengyel, 2021), and to capture the role of network cohesion in knowledge transfer (Reagans & McEvily, 2003; Tortoriello et al., 2012).
Our empirical case concerns researchers who successfully defended their theses in any Hungarian doctoral school between 1993 and 2010. Our data contain information on the dissertation, including the scientific field and year of defense, while bibliometric information comes from the publication records of egos and their co-authors. We estimate the accumulated number of citations at the eighth year following defense, which gives us a simple measure of success at the end of the early phase of an academic career (Van Balen et al., 2012).
Cross-sectional linear regressions with year and scientific field dummies show that the number of papers published until the second year after the defense correlates negatively with accumulated citations, but the impact of these papers correlates strongly with future impact. This finding indicates that thorough work focusing on a few but important papers is a much better strategy than producing many papers during doctoral studies. We find that, in the case of life science students, both the number of co-authors and, most importantly, the constraint measure correlate positively with future impact. These latter two coefficients are not significant for other scientific fields. These results provide new evidence that PhD students in life sciences can benefit from working in a cohesive research team, probably because such teams provide a good learning environment.

Ethical approval
The ad-hoc ethical committee of the Library and Information Centre, Hungarian Academy of Sciences discussed the use of personal data for the purpose of the research project "Team cohesion of PhD students in life sciences", and approval to use the data was given. Consent from the participants to use their data was waived by the committee. The Library and Information Centre of the Hungarian Academy of Sciences as data handler of the Hungarian Scientific Bibliography Database is authorized to use the data for scientific research as per law 1994/XL and 10/2021.(II. 24.) regulation decreed by the President of the Hungarian Academy of Sciences.
All other data used in the study was publicly available.

Data
We combine two data sources to collect information about early-career scholars. Data on doctoral defenses have been collected from www.doktori.hu, an openly available collection of all successful PhD theses defended in Hungarian doctoral schools starting from 1993, the year when the PhD system was introduced in the country. We downloaded data from the website in January 2017. This data contains 16,151 Hungarian PhD students who defended their theses until that date. Information provided includes the ID and name of every PhD student, the title of their thesis, the year of defense, scientific area, and the name of supervisors. Our second data source is the Hungarian Scientific Bibliography Database (MTMT) that contains the scientific publications' metadata of all active Hungarian researchers.
The two databases can be matched at the individual student level. A total of 23% of students were matched between the doktori.hu database and MTMT data using their student IDs. The rest of the students were matched by hand using name and scientific field. We could identify 60% of the PhD students in the MTMT data. The number of PhD students who could be matched with an MTMT profile was 9,415 and is illustrated in Figure 1A by the year of defense. In our regression exercise, we focus on the future impact of PhD students and thus restrict the analysis to the 2,061 PhD students who defended theses in the 1993-2010 period. Bibliometric data were downloaded from MTMT by the data handler colleague of the Hungarian Academy of Sciences after the identification of PhD students in MTMT. This happened in two steps. First, we downloaded all 272,954 publication records of the identified 9,415 PhD students in 2017. Then, we identified 20,139 co-authors of PhD students in MTMT and downloaded their publication records in 2020. This final bibliometric dataset contains records of 1,205,184 papers published by 43,485 authors between 1990 and 2019. Note that only those authors are included who are affiliated with Hungarian institutions and have registered on MTMT. There are around 50,000 MTMT accounts altogether, meaning that our data collection has covered around 86% of the total scientific community in the country. Figure 1B illustrates the number of all publications from the entire career of those PhD students who defended between 1993 and 2010 and from their co-authors. The data processing and analysis were carried out in R (version number 4.0.0).

Publication variables
Measuring scientific success, especially individual scientific performance, is a complex problem. Traditionally, it is based on production (publication numbers), scientific impact (citation numbers) and structural measurements, for example the network characteristics of authorship (Van Balen et al., 2012; Glänzel et al., 2019). However, the raw citation number depends on several factors, such as the year of publication, research field, document type (e.g. research article, review article or proceedings), and journal characteristics (e.g. publication frequency, number of articles in the journal). For example, the earlier an article has appeared, the more time it has had to receive citations. Citation habits differ across research fields, so two citation measures can only be compared within the same research area. Document types differ in their typical number of references, so the comparison is more accurate if it is made within the same document type. Moreover, journal characteristics can also bias raw citation numbers. The usual solution to these problems is to use normalized citation numbers.
In our case, the MTMT database contains only raw citation numbers as of the year when we downloaded the data, 2020 (Figure 2 illustrates the publication and citation distributions over the examined period). To handle this problem, we compared PhD holders in two-year periods, in a cumulative way, using the year of their PhD defense as a starting point. We compared PhD holders within their research field. The MTMT database classifies document types as articles, books and others, but we were unable to distinguish research articles from review articles. Therefore, we did not consider document types. Figure 3A shows the cumulative number of citations of the examined PhD students, while Figure 3B shows the cumulative number of their publications.
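The cumulative two-year comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (`year`, `citations`), the function name `cumulative_outputs` and the inclusive cutoff are assumptions, and the citation counts stand for the single snapshot available at download time.

```python
def cumulative_outputs(papers, defense_year, horizons=(2, 4, 6, 8)):
    """Return {horizon: (n_papers, n_citations)} for each horizon.

    Papers are counted cumulatively: everything published up to and
    including defense_year + horizon is included (assumed cutoff rule).
    """
    result = {}
    for h in horizons:
        cutoff = defense_year + h
        selected = [p for p in papers if p["year"] <= cutoff]
        result[h] = (len(selected), sum(p["citations"] for p in selected))
    return result


# Hypothetical publication record of one PhD holder who defended in 2005.
papers = [
    {"year": 2003, "citations": 12},
    {"year": 2006, "citations": 30},
    {"year": 2009, "citations": 4},
    {"year": 2012, "citations": 7},
]
outputs = cumulative_outputs(papers, defense_year=2005)
print(outputs[2], outputs[8])
```

Because every PhD holder is aligned on the year of defense, two students who defended in the same year are compared over publication windows of the same length.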

Network variables
To answer our research question, whether cohesive or diverse co-authorship network structure favors the success of a young researcher, we analyzed the weighted and dynamic ego-networks of PhD students. Such networks were generated from the publication records. These ego-networks include the PhD student in the center (ego), to which co-authors (alters in the ego-network terminology) are connected. Links are undirected but weighted by the number of co-authored papers. The networks are dynamic, such that we add new collaborators and new links to the ego-network of individual PhD students as new papers are published, but do not delete ties over the years. Since we have access to the publications of co-authors, the links between alters contain those publications that were not authored by the PhD student.
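The construction of such a cumulative, weighted ego-network can be sketched as below. This is an illustration under stated assumptions: the input format (a list of `(year, authors)` pairs), the function name `build_ego_network`, and the reading that alter-alter links count only papers written without the ego are ours, not taken from the authors' code.

```python
from collections import Counter
from itertools import combinations


def build_ego_network(papers, ego, until_year):
    """Return (ego_ties, alter_ties) as Counters of co-authored paper counts.

    papers: iterable of (year, list_of_author_ids).
    Ties accumulate over time and are never deleted.
    """
    window = [(year, set(authors)) for year, authors in papers if year <= until_year]

    # Ego-alter ties: number of papers the ego wrote with each co-author.
    ego_ties = Counter()
    for _, authors in window:
        if ego in authors:
            for alter in authors - {ego}:
                ego_ties[alter] += 1

    # Alter-alter ties: papers that two alters wrote together without the ego.
    alters = set(ego_ties)
    alter_ties = Counter()
    for _, authors in window:
        if ego in authors:
            continue
        for a, b in combinations(sorted(authors & alters), 2):
            alter_ties[(a, b)] += 1
    return ego_ties, alter_ties


# Hypothetical publication list.
papers = [
    (2004, ["ego", "A", "B"]),
    (2005, ["A", "B"]),
    (2006, ["ego", "A"]),
    (2006, ["B", "C"]),
    (2010, ["ego", "C"]),
]
ego_ties, alter_ties = build_ego_network(papers, "ego", until_year=2007)
print(dict(ego_ties), dict(alter_ties))
```

In this example, author C is not yet an alter in 2007 because the joint paper with the ego appears only in 2010, so the 2006 paper of B and C adds no link to the ego-network at that point.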
Cohesive networks are dense and include strong, high-bandwidth ties (Aral, 2016). That is, co-authors frequently publish with each other. Such network structures are thought to capture an environment in which shared work experience and developed trust facilitate learning from peers. In cohesive networks, knowledge transfer is faster and more efficient, such that the PhD student can learn complex knowledge more easily (Reagans & McEvily, 2003). In contrast, diverse networks, in which co-authors have not worked with each other but only with the PhD student, capture an environment that provides the student with the diverse capabilities of co-authors. In such networks, innovation and novel combination are more likely (Burt, 2001). If the student can integrate distinct pieces of knowledge, diverse networks might help them to publish papers with a high degree of novelty.
We used Burt's (2000) constraint indicator, which places ego-networks on the cohesive-diverse continuum, using the formula:

$$C_i = \sum_{j} \left( p_{ij} + \sum_{q,\; q \neq i,j} p_{iq}\, p_{qj} \right)^{2} \qquad (1)$$

where $p_{ij}$ and $p_{iq}$ are the numbers of papers that PhD student $i$ has co-authored with colleagues $j$ and $q$, and $p_{qj}$ is the number of papers that $j$ and $q$ have co-authored without $i$. The indicator takes high values when co-authors publish intensively together and low values when co-authors do not publish together.
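A minimal sketch of the constraint calculation is given below. It assumes that paper counts are first converted into proportional tie strengths (each tie divided by the sum of that author's ties), as in Burt's standard formulation; the text does not spell out this normalization, so it is an assumption here, as are the function name `burt_constraint` and the `EGO` label.

```python
from collections import defaultdict

EGO = "ego"  # hypothetical label for the focal PhD student


def burt_constraint(ego_ties, alter_ties):
    """Burt's constraint of the ego.

    ego_ties: {alter: papers co-authored with the ego}
    alter_ties: {(alter_a, alter_b): papers co-authored by the two alters}
    """
    # Symmetric weighted adjacency of the whole ego-network.
    adj = defaultdict(dict)
    for alter, w in ego_ties.items():
        adj[EGO][alter] = w
        adj[alter][EGO] = w
    for (a, b), w in alter_ties.items():
        adj[a][b] = w
        adj[b][a] = w

    def p(a, b):
        # Share of a's total tie weight that is invested in b (assumed normalization).
        total = sum(adj[a].values())
        return adj[a].get(b, 0) / total if total else 0.0

    constraint = 0.0
    for j in ego_ties:
        # Indirect investment in j through every other co-author q.
        indirect = sum(p(EGO, q) * p(q, j) for q in ego_ties if q != j)
        constraint += (p(EGO, j) + indirect) ** 2
    return constraint


# Two co-authors who never publish together versus two who do.
open_team = burt_constraint({"A": 1, "B": 1}, {})
closed_team = burt_constraint({"A": 1, "B": 1}, {("A", "B"): 1})
print(open_team, closed_team)
```

The closed triad yields the higher value, in line with the interpretation of constraint as cohesion. Note that in very small networks such as this one the raw index can slightly exceed 1.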
As the size of ego-networks grows, the probability that co-authors are connected might decrease, which has often been found in co-author networks (see for example Tóth & Lengyel, 2021). Thus, alongside constraint, one must also consider the degree of PhD students, that is, their number of co-authors.
The distributions of degree and constraint are depicted in Figure 4. As expected, these two indicators change in opposite directions. The number of relations rises over time (Figure 4A), which follows from our use of a cumulative ego-network in which former co-authorships are not erased. The spread of the degree distribution also increases over time, which means that while some researchers expanded their co-authorship networks after their PhD, others retained a narrow scientific network. The distribution of constraint slightly decreases (Figure 4B), and the median of constraint also falls over time. The reason is that the size of ego-networks grows over time, and those PhD holders who gain more and more co-authors also have a more diverse collaboration network.
We calculated further measures that might also be used to characterize cohesion and diversity in ego-networks. Betweenness centrality quantifies diversity in the network of PhD students by measuring the number of shortest paths in the network that go through the ego. The higher the betweenness centrality of the ego, the more diversity in the network. Global clustering quantifies the fraction of closed triangles in the network among all possible triangles, while network density measures the fraction of observed ties among all possible ties with the ego. The higher these measures, the higher the cohesion in the ego-network. Table 1 reports Pearson correlation coefficients between network parameters at the second and eighth year after PhD defense. As expected, we find a negative correlation between degree and all other network indices. Constraint is strongly correlated with network density. We have run alternative regression specifications with the network measures listed in Table 1 but only found significant results for constraint.
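Two of these alternative cohesion measures can be sketched in a few lines. This is an illustrative implementation on an unweighted version of the ego-network (the ego included as a node); the function name and the example network are hypothetical, and the exact definitions used in the study may differ in detail.

```python
from itertools import combinations


def density_and_clustering(nodes, edges):
    """Return (density, global clustering) of an undirected simple graph."""
    edge_set = {frozenset(e) for e in edges}
    n = len(nodes)
    possible = n * (n - 1) // 2
    density = len(edge_set) / possible if possible else 0.0

    neighbors = {v: set() for v in nodes}
    for e in edge_set:
        a, b = tuple(e)
        neighbors[a].add(b)
        neighbors[b].add(a)

    closed = 0   # connected triples whose two end points are also linked
    triples = 0  # all connected triples (paths of length two)
    for v in nodes:
        for a, b in combinations(sorted(neighbors[v]), 2):
            triples += 1
            if frozenset((a, b)) in edge_set:
                closed += 1
    clustering = closed / triples if triples else 0.0
    return density, clustering


# Hypothetical ego-network: three co-authors, two of whom also publish together.
nodes = ["ego", "A", "B", "C"]
edges = [("ego", "A"), ("ego", "B"), ("ego", "C"), ("A", "B")]
density, clustering = density_and_clustering(nodes, edges)
print(density, clustering)
```

Both values rise as more co-authors of the ego publish with each other, which is why they move together with constraint.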

Regression framework
Our data enables us to capture impact of publications as a snapshot in 2020 by the total number of citations received until then. This allows for cross-sectional specification, in which we can compare students who finished in the same year and consider publications that they produced until a certain year after defense. This way, we can avoid the problem that earlier publications have more time to collect citations.
To answer the question of whether cohesive co-authorship networks of PhD students during their studies help their future success, we estimate the number of accumulated citations ($CIT_{i,t+8}$) of student $i$ to the papers that they published until the 8th ($t+8$) year following defense at year $t$ with the following equation:

$$CIT_{i,t+8} = \beta_1 CIT_{i,t+2} + \beta_2 PAP_{i,t+2} + \beta_3 \Delta PAP_{i,t+8} + \beta_4 DEG_{i,t+2} + \beta_5 CON_{i,t+2} + \theta_i + t_i + \varepsilon_i \qquad (2)$$

where $CIT_{i,t+2}$ denotes citations to papers published until the second year after defense, $PAP_{i,t+2}$ is the number of papers published until the second year after defense, $\Delta PAP_{i,t+8}$ is the number of papers published between the second and eighth year after defense, $DEG_{i,t+2}$ is the degree and $CON_{i,t+2}$ is the constraint measure of the student's co-author network, $\beta_1, \dots, \beta_5$ are the coefficients to be estimated, $\theta_i$ is the scientific area-specific fixed effect, $t_i$ denotes year dummies and $\varepsilon_i$ is the error term.
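The specification above can be sketched as an ordinary least squares fit with dummy variables. The example below is only an illustration on synthetic data: the field and year labels, the coefficient values, the use of `log1p` as the log transform, and the plain normal-equations solver are all assumptions, and robust standard errors are not computed.

```python
import math
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


FIELDS = ["life", "engineering"]  # hypothetical research fields
YEARS = [2005, 2006]              # hypothetical defense years


def design_row(s):
    """Intercept, log-transformed covariates, then field and year dummies."""
    row = [1.0,
           math.log1p(s["cit2"]), math.log1p(s["pap2"]), math.log1p(s["dpap8"]),
           math.log1p(s["deg2"]), math.log1p(s["con2"])]
    row += [1.0 if s["field"] == f else 0.0 for f in FIELDS[1:]]  # first field is the baseline
    row += [1.0 if s["year"] == t else 0.0 for t in YEARS[1:]]    # first year is the baseline
    return row


random.seed(0)
students = [
    {"cit2": random.randint(0, 50), "pap2": random.randint(1, 20),
     "dpap8": random.randint(0, 30), "deg2": random.randint(1, 40),
     "con2": random.random(), "field": random.choice(FIELDS),
     "year": random.choice(YEARS)}
    for _ in range(200)
]

# Invented coefficients, used only to generate a noise-free outcome.
true_beta = [0.5, 0.9, -0.2, 0.3, 0.1, 0.4, 0.05, -0.05]
X = [design_row(s) for s in students]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in X]  # stands for log citations at t+8
beta_hat = ols(X, y)
print([round(b, 3) for b in beta_hat])
```

Because the synthetic outcome is generated without noise, the fit should recover the invented coefficients up to numerical error; with real data, the signs and significance of the estimates on degree and constraint are the quantities of interest.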
To estimate Eq. 2, we used ordinary least squares (OLS) linear regression models. The scientific area fixed effects are specified by research fields of doctoral schools. These latter refer to 54 categories of research fields defined by the Hungarian Accreditation Committee (HAC, 2018): exactly one research field has been assigned to each doctoral school. All variables are log-transformed.

Results
Table 2 reports results of an OLS regression estimating Eq. 2. In columns 1-3, we estimate citations to papers that were published until the 8th year following defense with variables that capture publications and co-authorship until the 2nd year following defense. We introduce variables in a stepwise manner such that a baseline model is run in column 1 and network variables are introduced in columns 2 and 3.
Throughout the models, we found a very strong positive correlation between $CIT_{t+2}$ and $CIT_{t+8}$, which is a trivial relation but has importance in our empirical exercise. Because citations are collected for all publications in 2020, $CIT_{t+8}$ includes $CIT_{t+2}$. However, the very high correlations also mean that most of the citations at the end of the early career stage are received by the publications that were published during or shortly after PhD studies. $PAP_{t+2}$ is negatively correlated, while $\Delta PAP_{t+8}$ is positively correlated, with the dependent variable. These findings suggest that, due to the accumulation of citations, the best strategy for PhD students is to produce a few but high-impact papers that will help them to collect citations in their early career.
In column 2, we introduce degree, which leaves the correlations of other covariates almost unchanged. DEG is positively correlated with $CIT_{t+8}$, suggesting that the number of co-authors facilitates citations. Note that there might be various mechanisms at play: citations might grow with the number of co-authors because they can also cite the paper or spread the word; alternatively, the project and the PhD student can gain from working with and learning from many collaborators.
Constraint is positively correlated with $CIT_{t+8}$. Controlling for DEG and the number of publications, and including year and scientific field dummies, CON quantifies the extent to which co-authors of the PhD student have collaborated in publications that are published until the second year after the defense of the student. Our finding suggests that such cohesive ego-networks are beneficial for PhD students. Because we also control for the citations to papers, this finding confirms that PhD students benefit the most from working in cohesive collaboration networks because these create efficient learning environments.
Correlations of independent variables indicate that the models are not affected by multicollinearity. The highest value of the Pearson correlation coefficients is ρ = 0.41 between DEG and $PAP_{t+2}$. We document the correlation between DEG and CON in Table 1 (ρ = 0.61), but the inclusion of these variables together is conceptually motivated, as described above. Further, the inclusion of CON in Model 3 does not substantially influence the coefficient of DEG.
In Table 3, we report full regression models decomposed into four broad scientific areas: Science, Life Science, Engineering and Social Science. To obtain these areas, we grouped the 54 scientific fields (as defined by the Hungarian Accreditation Committee). We found that DEG and CON are significant for the Life Science subsample, while only DEG is weakly significant in Engineering. Thus, a cohesive research environment is important for Life Science students and less so for students in other fields.

Table 2. Estimates for citations eight years after defense. OLS regressions with year and scientific field fixed effects and robust standard errors.

Discussion
In this study we examined the success of students who defended theses in Hungarian doctoral schools between 1993 and 2010 by looking at their publication records and accumulated citations as of 2020. Our bibliometric database also contains the PhD students' publications and their co-authors' publications between 1990 and 2019. We analyzed whether a cohesive or a diverse co-author network structure gives a young researcher a better chance to stand out in terms of citations eight years after defense. Linear regression models suggest that those students who participate in cohesive collaboration networks receive significantly more citations at the end of their early career. This result highlights the need for strong collaborations and an effective learning environment during doctoral studies. However, our results regarding the structure of co-author networks are specific to Life Science students. Thus, cohesion is mostly important in areas where new knowledge is produced in teamwork.
The present paper contributes to a growing literature in which studies try to determine factors that support the future success of young researchers. Li and co-authors (2019) demonstrate that those students who publish with top scientists have a greater chance of being more successful 20 years later. Moreover, this effect is more important in the case of PhD students affiliated with a less prestigious PhD school. Sarigöl et al. (2014) illustrate a similar phenomenon: a paper gets more citations if its authors are central in the large co-author network of their field. We add to this discussion by studying the co-author ego-networks of PhD students. Our findings confirm that the structure of the group collaboration matters for the future academic career of students.
We also find that those students who focus on few papers are more successful, measured in citations. These results are robust across all large scientific fields. By concentrating efforts on a small number of publications, students can achieve higher quality papers that might be accepted by better journals. Because citations typically take several years to accumulate, students need high-impact papers at the beginning of their career to stand out later when they are at the end of the early-career stage. This can help them in research proposals and thus facilitate academic careers in the long run.

Data availability
Underlying data
Raw data, as accessed through the Hungarian Scientific Bibliography Database, are not publicly available due to privacy considerations of the authors. Access can be requested through the Library and Information Centre of the Hungarian Academy of Sciences; contact details can be found on the library's website https://www.mtmt.hu/kapcsolat. Access to the data will be provided under the following conditions: the researcher presents legitimate research purposes to the ad-hoc
This project contains the following underlying data:
- Phd_data.csv (contains anonymized author ID, network and publication variables after 2-4-6-8 years of PhD defense, and discipline of PhD thesis).
- Phd_regressions_code.R (analysis code used in the study).

Koen Frenken
Copernicus Institute of Sustainable Development, Utrecht University, Utrecht, The Netherlands
The paper presents an interesting study on team cohesion and its effects on PhD students' success based on a novel dataset. While the paper is well-written, I list suggestions for further development and three textual issues.
Embedding in the literature: There is a sizeable literature by now on the (collaborative) conditions that support PhD students. I list below some references. I do not suggest that you have to integrate all these studies in your paper, but rather that you read the papers and see which ones are relevant to your study. Taking some of these papers into account would better embed your study in the current literature and also better highlight what your contribution is.

Limitations: You explain your data and methodology well, but some of the limitations are not well spelled out. First, you look only at PhD students who finished by writing the thesis. Hence, the ones who did not finish, arguably the least successful ones, are left out. This means that, in a way, you "sample on the dependent variable" in the sense that your dependent variable is a success variable, while you analyse only those who successfully defended the thesis. I do not see this as a major issue, but I think it should be highlighted as a limitation.
Second, on page 4 you report that 60 percent of the PhD students could be identified in the MTMT dataset. Here, I expected more information on why the remaining 40 percent could not be matched. I understand from page 5 that the co-author network of a PhD student that is collected comprises only authors in MTMT, which I reckon excludes foreign co-authors. Please clarify. And, if my comment is correct, you could also highlight this as a data limitation. Third, it is also unclear whether the "raw citation numbers" in the MTMT database include only citations by authors in the MTMT. Again, please clarify. And, if my comment is correct, you could also highlight this as a data limitation.

Results: page 8, table 1: why do you report a correlation matrix including multiple variables that are NOT included in the regression analysis later on? Rather uncommon. Instead, I suggest you show first a table with the descriptive statistics only of the variables included in the regression analysis and then also a correlation matrix, again, only for the variables included in the regression analysis later on.