Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.173178.1

Case Study

Articles

Assessing data management and compliance in large research collaborations via knowledge bases: A semi-structured interview approach

[version 1; peer review: 2 approved with reservations]

Maryam Mohammadi (a, 1): Data Curation, Investigation, Visualization, Writing – Review & Editing

Katja Politt (b, 1, 2): Data Curation, Investigation, Methodology, Writing – Review & Editing

Annett Jorschick (c, 1), https://orcid.org/0009-0004-0776-7113: Funding Acquisition, Resources, Writing – Review & Editing

1 Department of Linguistics, Bielefeld University, Bielefeld, Germany

2 Institut für Germanistik, Rostock University, Rostock, Germany

a maryam.mohammadi@uni-bielefeld.de

b katja.politt@uni-bielefeld.de

c annett.jorschick@uni-bielefeld.de

No competing interests were disclosed.

9 1 2026

2026

17 12 2025

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Large-scale research collaborations rely not only on robust research strategies but also on structured data management and systematic knowledge exchange. Ensuring compliance with ethical and legal requirements is essential from the beginning. This includes obtaining informed consent and adhering to data protection laws such as the GDPR (for Europe), as well as following Open Science and FAIR principles, particularly when working with personal data. Additionally, the systematic assessment and documentation of project objectives, data characteristics, and other project-specific features are essential for advancing the scientific contribution and long-term development of such collaborations.

Methods

In this paper, we introduce a methodology designed to identify commonalities across research projects and to enhance data governance within large research consortia. The approach consists of three components: (1) a semi-structured interview that served a dual purpose: first, to raise awareness among researchers regarding ethical obligations, data protection requirements, and open science principles; second, to systematically collect metadata on planned studies, data types, participant groups, and methodological procedures; (2) structured processing and organization of the collected information; and (3) visualization of project interrelations through knowledge graphs. The methodology was piloted within a collaborative research centre in linguistics.

Results

The collected metadata were systematically structured and used to construct knowledge graphs capturing interrelations among projects, data types, methodologies, and participant groups. These visualizations enable research consortia to make informed decisions about collaboration, infrastructure planning, and data reuse.

Conclusions

The proposed methodology offers a systematic way to assess data management practices, while also fostering a culture of compliance and transparency from the ground up. Knowledge graph visualizations provide a practical tool for identifying synergies, promoting data reuse, and strengthening transparency across projects. This approach can serve as a foundation for developing sustainable research infrastructures in consortia working with diverse empirical data.

Keywords: data management, data exchange, knowledge graph, open science, structured data

Deutsche Forschungsgemeinschaft, CRC 1646

512393437

This research has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – CRC 1646, project No 512393437, project INF.

1. Introduction

Establishing large-scale research collaborations, i.e., projects or research groups united by a common research question, requires the development of well-planned research strategies, as well as secure management, transfer, and long-term sustainability of generated knowledge and collected data (Mittal, 2023). To foster and facilitate inter-project cooperation within such collaborations, it is essential to collect and systematically organize relevant metadata of the planned studies (e.g., data types, participant groups, methods of analysis) in order to (1) provide a comprehensive overview of individual projects, (2) highlight their interconnections, and thereby (3) facilitate possibilities for data sharing and reuse across projects.

It is crucial to highlight the importance of these topics early, especially when working with data collected from humans, as compliance with European data protection law (GDPR, 2016) is mandatory when handling personal data. Transferring such knowledge to projects ensures adherence to ethical guidelines, including understanding the principles of informed consent, developing consent forms, submitting ethics applications, and promoting Open Science principles such as the FAIR principles (Wilkinson et al., 2016). Moreover, raising awareness of legal obligations within projects is essential. This includes informing researchers about the Record of Processing Activities, a key requirement of the General Data Protection Regulation (Art. 30, GDPR, 2016), as well as the Technical and Organizational Measures, which must be implemented to protect personal data (Art. 32, GDPR, 2016). An early assessment of relevant information serves to (1) raise researchers’ awareness of legal and data sharing considerations, (2) identify knowledge resources and gaps, and (3) uncover potential connections between projects and resource-sharing opportunities.

The structured collection and integration of project-level metadata can serve as a strategic tool to strengthen collaboration and inform the future development of the research consortium. Constructing knowledge graphs of these metadata helps to make methodological, linguistic, and participant-related commonalities and differences across projects visible, thus supporting evidence-based decisions about coordination, shared resources, and infrastructure planning. Especially within large research consortia, this enables researchers and coordinators to identify meaningful connections between projects, foster interdisciplinary exchange, and reduce redundant efforts, thereby increasing both scientific impact and operational efficiency.

This paper outlines an approach to addressing these challenges and introduces a methodology for systematically collecting, restructuring, and visualizing relevant information and metadata. We developed a semi-structured interview protocol and conducted the interviews with all research projects within a large Collaborative Research Centre (CRC) in linguistics in Germany. The interviews were conducted as part of the infrastructure project (INF), which is meant to ensure the “systematic management of data relevant in the context of the Collaborative Research Centre […] intended to facilitate scientific synergies […] through shared data platforms and/or communication forums as well as through efficient use of data.” (form 50.06, Forschungsgemeinschaft, 2022, p. 10–11). Based on the knowledge gained from the interviews, INF is developing a metadata infrastructure platform designed to handle the management, storage, (re-)use, and sharing of the diverse empirical linguistic data collected within the CRC (Jorschick et al., 2024). Many data types contain personal information or sensitive health data, which imposes legal and ethical constraints on this project (Berez-Kroeker et al., 2022). The careful collection, processing, and visualization of this information and its interconnections, as described here, therefore constitute the first steps toward the construction of this platform. The interviews outlined in the following sections lay the groundwork for these objectives.

The interviews covered key aspects of data collection, management, storage, protection, reuse, and sharing. Additionally, they served to raise awareness among project researchers regarding legal and technical aspects, evaluated existing knowledge, and helped define training priorities. Subsequently, the collected information was transcribed, filtered for relevance, and systematically structured before being used to build a knowledge base and to visualize interconnections through knowledge graphs.

In what follows, we first describe the development and implementation of the semi-structured interviews, with the full list of questions provided in Appendix A. The subsequent section outlines the processing and (re-)coding of the data to construct a knowledge base and a visualization of the relevant information.

2. Developing and conducting the interviews

The development of the interview questions and methodology was based on the goals of the INF project outlined in the previous section. In this section, we first describe the development and piloting of the interview questions, followed by a description of how the interviews were conducted.

2.1 Preparation

In the preparation phase, the interview questions were designed by first reviewing the project proposals to identify possible groupings of projects with regard to, e.g., the kind of data they work with or the type of analysis they plan to run. Additionally, the goals of the interview were defined clearly in order to develop an effective interview schedule that ensured we could collect rich data on the topics the INF project was interested in (Bearman, 2019). Questions were grouped by their general topic to make the conversation flow more naturally during the interviews. The questions were designed for a semi-structured interview (SSI), cf., e.g., Karatsareas (2022), which contains closed questions (e.g., Are there any PIs in your project that are not from Own University Name?), open questions (e.g., How do you plan to analyse your data? Are special analysis methods required?), or a combination of both (e.g., If you are running experiments: Do you plan to compensate your participants? If so, how?).

This led to the development of the following sets of questions:

1. general requirements for data management and data protection,

2. data collection and documentation,

3. data storage, processing, and analysis,

4. archiving and reuse according to FAIR principles, ¹

5. required support from the INF project, and

6. space for additional comments or questions.

Each set consisted of several questions, which can be found in full in the appendix of this paper. The following example consists of the questions regarding (1) general requirements for data management and data protection.

(a) Are there any PIs in your project that are not from Own University Name? If not, are there any collaborators from other universities that you are planning to share personal data with?

(b) Who is responsible for the data management in the project?

(c) What type of data does the project work with (audio, video, text generation, perception, ratings, etc.)?

(d) If you are running experiments: How many experiments do you plan to run?

(e) If you are running experiments: Who are your target participants (children, elderly, clinically-oriented, etc.)?

(f) If you are running experiments: Do you plan to compensate your participants? If so, how (cash, voucher, course credit; on site, via third party)?

(g) Do you have personal, pseudonymized, or anonymous data? (Note that this can change depending on the stage of the project)

(h) What language are you working on? Have you considered offering a consent form in that language (other than English and German), too?

These questions serve as an example of how the SSI was meant to guide researchers through the whole project development and setup phase. In the beginning, the projects are supposed to write data management plans and to think about what data they need, whether they plan on collecting personal data or data from vulnerable groups, and how often they want to collect data. Table 1 gives an overview of the alignment of the interview goals with the questions asked, i.e., the association between questions, schedule, and rationale suggested by Bearman (2019).

Table 1. Interview schedule illustrating the correspondence between interview questions and their rationale.

Question	Rationale
Are there any PIs in your project that are not from Own University Name? If not, are there any collaborators from other universities that you are planning to share personal data with?	To elicit whether any sharing of data under special regulations between universities/countries is planned.
Who is responsible for the data management in the project?	To (i) ensure that projects are aware of responsibilities and (ii) to have one contact person in the project.
What type of data does the project work with (audio, video, text generation, perception, ratings, etc.)?	To (i) elicit data types, (ii) connect projects working with similar data, and (iii) become aware of special needs e.g. in regard to data protection, anonymization, analysis, or storage.
If you are running experiments: How many experiments do you plan to run?	To ensure projects have a clear experimental plan.
If you are running experiments: Who are your target participants (children, elderly, clinically-oriented, etc.)?	To ensure data of vulnerable groups are adequately protected.
If you are running experiments: Do you plan to compensate your participants? If so, how (cash, voucher, course credit; on site, via third party)?	To ensure projects are aware of data protection measures for e.g. collecting signatures or bank details.
Do you have personal, pseudonymized, or anonymous data? (Note that this can change depending on the stage of the project)	To (i) elicit whether data protection measures for personal data apply and to (ii) raise awareness of storage needs, e.g. not to store pseudonymization key lists in the same place as pseudonymized data.
What language are you working on? Have you considered offering a consent form in that language (other than English and German), too?	To (i) connect projects working on the same languages and (ii) to ensure participants are given consent forms in an appropriate language.

The table presents the correspondences between an example set of SSI questions and the rationale behind asking the questions.

Sending the list of questions to the projects beforehand ensured that they had talked about their plans and responsibilities and were able to ask clarifying questions during the interview itself.

Although the later analysis does not directly process any personal data, project details can be easily inferred due to the known identities of individuals involved in the projects. Therefore, ethics approval from the university ethics committee was obtained before starting the interviews. All projects signed a written consent form prior to their interview session, which detailed the data handling procedures (see Ethics statement). ²

After reviewing the initial project proposals, we conducted a pilot phase involving three projects: one that required special data protection measures for sensitive health data from participants with aphasia, one working with written corpus data free of copyright constraints, and one handling pseudonymized or anonymized experimental data from psycholinguistic studies.

Following the pilot phase, we revised the list of questions based on feedback from the pilot projects and the interviewers’ experience. Although individual questions remained unchanged, we reordered some to improve conversational flow (see Appendix A for a full list of questions). Subsequently, we contacted all remaining projects via email, except one, which has no research objectives but a coordinating function. The email outlined the interviews’ objectives and included the questionnaire as well as the consent form for prior within-team review. Projects were given the opportunity to ask clarifying questions prior to choosing a time slot. Each project selected a two-hour slot from a predefined list via the non-tracking planning tool Nuudel. We deliberately allotted a generous two hours per interview to accommodate any additional questions arising during the discussion. SSIs typically last no longer than one hour, and, as reported below, this was almost always the case. The two-hour window also accounted for potential technical issues in hybrid setups or delays caused by traffic.

2.2 Conducting the interviews

Interviews were conducted on-site or in a hybrid format, depending on project members’ availability, with one interview held entirely online. In total, sixteen projects were interviewed. Project teams were encouraged to participate fully, and at least one principal investigator (PI) was required to attend to ensure representation by a member responsible for the project. When scheduling conflicts prevented individual project members from attending, teams discussed the questions beforehand to ensure all attendees were informed of their colleagues’ perspectives.

At least two INF project members conducted each interview, except in three cases where illness reduced participation to one. In sessions with two INF members present, one led the interview and took notes, while the other provided support and posed follow-up questions. We held sessions in a dedicated meeting room using a 360° video conferencing device (Meeting Owl 3) with Zoom for hybrid formats, to ensure high audio quality and seamless on-site and remote participation.

At the start of each interview, participants were reminded of the interview’s purpose, and it was confirmed that the consent form was understood and signed by a project leader. The interview procedure and note-taking process were explained, and participants could ask clarifying questions in advance. The interview questions were then discussed in the order listed in Appendix A. If necessary, discussions shifted to other relevant questions, with each such shift mentioned explicitly by the interviewer. Although written responses were not mandatory, teams were strongly encouraged to review the questions in advance and to ensure that the members familiar with data management and analysis attended.

One interviewer took real-time notes directly in a GitLab markdown file, which were later reviewed for typographical errors and then forwarded to each project for validation of contents. All subsequent corrections and comments were incorporated into GitLab directly to ensure version control. Most interviews lasted between 60 and 75 minutes; the shortest was 45 minutes and the longest two hours. Sessions with projects that handle sensitive data and have a large number of international collaborations took longer than, e.g., interviews with projects that do not elicit data from participants but work with data from existing corpora without any copyright restrictions.

3. Building a knowledge representation

The qualitative insights from the interview notes were particularly valuable in informing research teams about essential (data management) rules and strategies to consider during their studies. Furthermore, the interviews produced rich datasets and metadata from the individual projects, offering resources for identifying potential collaborative opportunities and enhancing data reusability across teams.

These opportunities are not typically apparent at the surface level, as projects often appear independent and disconnected from one another, making their potential interrelations unclear. However, through deeper investigation into various dimensions of each project, such (indirect) connections can be uncovered, revealing avenues for future collaboration and integrated research efforts.

To systematically represent and explore this information, we employed knowledge graphs, structured frameworks that model entities (in this case, individual projects) as nodes and their relationships as edges. This approach enables the organization and integration of data sources, revealing indirect or hidden connections. Knowledge graphs facilitate semantic search, enrich data exploration, and simplify decision-making by surfacing relationships that may not be immediately apparent.

The next section outlines the development of the coding schema and the process of transforming descriptive interview notes into a structured dataset. We then present illustrative examples of the resulting knowledge graphs.

3.1 Data processing and coding scheme development

The coding schema was developed after completing all interviews, ensuring consistency across the dataset. In the first step, we filtered the data by excluding entries that did not offer generalizable insights applicable to other projects, including: (i) personal data that may change, e.g., the designated data steward of the projects, (ii) information that has not yet been finalized, such as archiving processes, and (iii) items primarily intended to inform teams about essential (data management) rules and strategies, such as completing specific data protection forms or procedures.

In the second step, we constructed a structured dataset from the remaining information, assigning one column to each data point. The coding scheme was developed based on questions concerning both the data and metadata that the projects intended to engage with. For instance, it included questions about the type of data being used (e.g., audio, video, text generation), participant type (e.g., children, adults, elderly, autistic individuals), and the languages used in experiments (e.g., English, German, Farsi). Table 2 presents a toy example of the dataset (for details, see Appendix B).

Table 2. The dataset schema.

Project ID	Area	DataType	Language	ParticipantTypes	Collaborators	Identification
X00	A	Audio; Rating	English; German	Adults	Y01; X05	Unanonymized
X01	B	Rating; Video	German; English; Farsi	Adults; Children	X02; Z05; Y02	Pseudonymized

The table presents a toy example of the dataset constructed from the interview notes.

In the third step, we established a set of rules to ensure consistent and dynamic data conversion, aiming to create a dataset that is both machine- and human-readable and that allows new data to be added without requiring modifications to the analysis code. The coding schema followed the standardized .csv format, in which columns are separated by commas (,). For columns containing multiple values (e.g., Language), individual values were separated by semicolons (e.g., English;German). During the visualization stage (see the next section), semicolon-separated values were dynamically converted into lists to support analytical tasks. Since the analysis scripts handled these conversions automatically, the order of values was inconsequential. For instance, entries such as “English;German” and “German;English” in the Language column were treated as equivalent.
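The conversion rules above can be illustrated with a short sketch. The authors' scripts are written in R and are not reproduced here; the following standard-library Python code is only an illustration under stated assumptions. The rows are modelled on the toy example in Table 2, the column name `ProjectID` follows the spelling used in the running text, and the helper names `split_values` and `read_projects` are hypothetical.

```python
import csv
import io

# Toy rows modelled on Table 2; the values are illustrative, not real project data.
TOY_CSV = """ProjectID,Area,DataType,Language,ParticipantTypes,Collaborators,Identification
X00,A,Audio;Rating,English;German,Adults,Y01;X05,Unanonymized
X01,B,Rating;Video,German;English;Farsi,Adults;Children,X02;Z05;Y02,Pseudonymized
"""

# Columns assumed to hold semicolon-separated lists (hypothetical selection).
MULTI_VALUE_COLUMNS = ("DataType", "Language", "ParticipantTypes", "Collaborators")


def split_values(cell):
    """Turn a semicolon-separated cell into a set, so value order is irrelevant."""
    return {value.strip() for value in cell.split(";") if value.strip()}


def read_projects(text):
    """Read comma-separated text and convert every multi-value column into a set."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in MULTI_VALUE_COLUMNS:
            row[column] = split_values(row[column])
        rows.append(row)
    return rows


projects = read_projects(TOY_CSV)
print(sorted(projects[0]["Language"]))
print(split_values("English;German") == split_values("German;English"))
```

Representing each multi-value cell as a set is one simple way to make “English;German” and “German;English” compare as equal; a sorted list would serve the same purpose.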

We avoided abbreviations and ensured that column names and values were meaningful and easy to interpret in the generated graphs and analysis. Furthermore, we introduced meaningful placeholder values such as None and NotApplicable to prevent empty entries in the dataset. These placeholders helped maintain data consistency and avoid potential whitespace issues in data analysis (see also the description of the data cleaning in the next section).

The dataset was designed with three types of columns: (i) Open Values, where no predefined list of values was specified, allowing annotators to add entries dynamically during data conversion. For example, the Language column included a list of languages extracted from interview notes. (ii) Exhaustive Values, which had a fixed set of predefined values. For instance, the Identification column was limited to the values {anonymized, unanonymized, pseudonymized}. (iii) Non-Exhaustive Values, where an initial set of predefined values was provided, but the list could be expanded as needed. For example, the DataType column initially included {audio, video, text, rating}, but new values could be added during the conversion. An overview of these types of columns is provided in Table 2.
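The distinction between the three column types can be sketched as a small validation routine. This is an illustrative Python sketch, not part of the authors' R scripts: the value sets are copied from the examples in the text, while the function name `check_cell`, the message wording, and the case-insensitive comparison are assumptions made for the illustration.

```python
# Value sets copied from the examples in the text; everything else is hypothetical.
EXHAUSTIVE = {"Identification": {"anonymized", "unanonymized", "pseudonymized"}}
NON_EXHAUSTIVE = {"DataType": {"audio", "video", "text", "rating"}}
PLACEHOLDERS = {"None", "NotApplicable"}


def check_cell(column, cell):
    """Return a list of messages for one semicolon-separated cell.

    Exhaustive columns reject unknown values, non-exhaustive columns only
    report them as new, and open columns accept any non-empty value.
    """
    messages = []
    for value in (part.strip() for part in cell.split(";")):
        if not value:
            messages.append(f"{column}: empty value, use a placeholder such as None")
        elif value in PLACEHOLDERS:
            continue
        elif column in EXHAUSTIVE and value.lower() not in EXHAUSTIVE[column]:
            messages.append(f"{column}: '{value}' is not an allowed value")
        elif column in NON_EXHAUSTIVE and value.lower() not in NON_EXHAUSTIVE[column]:
            messages.append(f"{column}: '{value}' is a new value")
    return messages


print(check_cell("Identification", "Pseudonymized"))
print(check_cell("DataType", "Audio;EEG"))
print(check_cell("Language", "English; "))
```

Only the exhaustive case is treated as an error here; reporting new values for non-exhaustive columns merely gives annotators a chance to confirm that an addition was intended.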

This approach ensured that the dataset remained adaptable, enabling projects to incorporate new data without requiring changes to the dataset structure or the accompanying R scripts. The scripts were designed to dynamically integrate newly added values into the visualizations (see the next section). Finally, the interviewer who initially took the notes during the meetings converted them into a structured dataset, following the schema and rules outlined above. Having the same person take the notes and code them supported both consistency and accuracy in the resulting dataset.

3.2 Creating the knowledge graphs

The interviews contributed not only to the conceptual development of the data management software but also served as the foundation for building knowledge bases for the projects. These knowledge bases enable researchers to identify potential connections between projects and uncover opportunities for resource sharing, within their current studies or in future collaborations. This approach helps minimize redundant efforts by reducing the need to conduct new experiments when relevant data has already been collected by other teams, ultimately conserving both time and financial resources.

Data analysis was conducted using RStudio (Version 2024.12.0+467) (R studio team, 2025). First, we developed a script to dynamically convert multi-value columns into individual rows, restructuring the dataset into long format. Next, a data-cleaning step was performed to eliminate empty columns and rows. Although placeholders like None were used to fill empty fields, empty values could still result from inadvertent whitespace in semicolon-separated columns, despite the annotators’ efforts to avoid spaces within or at the end of lists. Finally, we generated several knowledge graphs from the processed dataset, using the visNetwork package (Version 2.1.2) (Almende B.V. and Contributors and Thieurmel, 2025).
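The restructuring and cleaning steps can be sketched as follows. The original analysis was carried out in R; this Python version is only an illustration, and the function name `to_long`, the toy rows, and the decision to strip whitespace before discarding empty values are assumptions rather than a description of the authors' script.

```python
def to_long(rows, id_column, multi_column):
    """Expand one multi-value column into one (id, value) pair per value.

    Values are stripped, and empty strings left behind by stray whitespace
    or trailing semicolons are dropped, in the spirit of the cleaning step.
    """
    long_rows = []
    for row in rows:
        for value in row[multi_column].split(";"):
            value = value.strip()
            if value:
                long_rows.append((row[id_column], value))
    return long_rows


# Toy rows; the second cell contains the kind of inadvertent whitespace
# and trailing separator that the cleaning step is meant to absorb.
rows = [
    {"ProjectID": "X00", "Language": "English;German"},
    {"ProjectID": "X01", "Language": "German; English;Farsi; "},
]

for pair in to_long(rows, "ProjectID", "Language"):
    print(pair)
```

Each project then contributes one row per language, which is the shape needed to draw one edge per project-value pair.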

The following sample graphs were generated using the visNetwork function, with ProjectID providing the list of nodes and the relations to other entities, such as Collaborators, Language, or ParticipantType, represented as edges. The default “barnesHut” solver in the visPhysics setting was employed, which positions nodes by approximating the forces between distant nodes rather than computing all pairwise interactions, thereby optimizing computational efficiency. This means that for the visualizations in the following figures, spatial proximity does not imply closer relationships between projects. For a zoomable and interactive version of the graphs, please see OSF.
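To make the node and edge structure concrete, the following Python sketch derives both lists from a project table. It does not call visNetwork and is not the authors' code; it only mirrors the general shape of the input such a network function works with, namely nodes identified by an `id` and edges running `from` one node `to` another. The function name `build_graph`, the `group` field, and the toy rows are assumptions.

```python
def build_graph(rows, id_column, relation_column):
    """Build node and edge lists from a project table.

    Every project and every related value becomes a node; each
    (project, value) pair becomes one edge.
    """
    nodes, edges = {}, []
    for row in rows:
        project = row[id_column]
        nodes[project] = {"id": project, "label": project, "group": "project"}
        for value in row[relation_column].split(";"):
            value = value.strip()
            if not value:
                continue
            # setdefault keeps an existing project node from being relabelled
            nodes.setdefault(
                value, {"id": value, "label": value, "group": relation_column}
            )
            edges.append({"from": project, "to": value})
    return list(nodes.values()), edges


rows = [
    {"ProjectID": "X00", "Language": "English;German"},
    {"ProjectID": "X01", "Language": "German;English;Farsi"},
]
nodes, edges = build_graph(rows, "ProjectID", "Language")
print(len(nodes), "nodes,", len(edges), "edges")
```

Because values shared by several projects map to a single node, projects working on the same language end up connected through that node.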

Figure 1 represents a knowledge graph illustrating the interconnections among projects. In this visualization, nodes (circles labelled with project numbers) represent individual projects, while edges denote research collaborations. Projects are color-coded according to their respective research areas (A, B, C) to enhance visual distinction. Edges are represented by arrows, indicating either unidirectional or bidirectional collaborative relationships. The number of arrows connected to a node reflects the extent of that project’s involvement within the network; the more arrows, the more collaborative links it maintains. For example, project Ö, as the central project, is directly connected to nearly all other projects. While some projects primarily benefit from collaborating with Ö, others engage in reciprocal partnerships. Thus, the bidirectional edge between Ö and B04 indicates mutual collaboration, whereas the unidirectional edge from Ö to A02 suggests that the collaboration benefits A02 without a reciprocal contribution.

Figure 1. Knowledge graph of project interconnections.
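The difference between unidirectional and bidirectional edges can be sketched as follows. This Python illustration assumes, purely for the sake of the example, that a directed link exists whenever one project names another as a collaborator; the text does not state how edge direction was derived. The project labels are taken from the example above, while the link data and the function name `classify_collaborations` are hypothetical.

```python
def classify_collaborations(collaborators):
    """Split directed collaboration links into mutual and one-way pairs.

    `collaborators` maps each project to the set of projects it names.
    """
    mutual, one_way = set(), set()
    for source, targets in collaborators.items():
        for target in targets:
            if source in collaborators.get(target, set()):
                mutual.add(frozenset((source, target)))
            else:
                one_way.add((source, target))
    return mutual, one_way


# Hypothetical link data reproducing the two cases discussed for Figure 1.
links = {"Ö": {"B04", "A02"}, "B04": {"Ö"}, "A02": set()}
mutual, one_way = classify_collaborations(links)
print(mutual)
print(one_way)
```

Mutual links are stored as unordered pairs so that each reciprocal collaboration is counted once.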

One advantage of this kind of visualization is that it can reveal ‘intermediate projects’, which might benefit from a collaboration with projects that already share an edge. One example for this is the connection between Ö and B05 (cf. Figure 1). These two projects do not share an edge, meaning that they do not collaborate directly. However, they both share edges with C05, which is the intermediate project between them. Identifying such interconnections highlights potential for future collaborative opportunities.
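The search for intermediate projects can be expressed as a look-up of shared neighbours between projects that are not directly connected. The sketch below is in Python and is not taken from the authors' scripts; the edge list is a hypothetical fragment built around the Ö, C05, and B05 example, and the function name `intermediate_projects` is an assumption.

```python
from itertools import combinations


def intermediate_projects(edges):
    """Map each unconnected project pair to the projects linking them.

    `edges` is an iterable of (project, project) pairs, treated as undirected.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    result = {}
    for a, b in combinations(sorted(neighbours), 2):
        if b in neighbours[a]:
            continue  # these two projects already collaborate directly
        shared = neighbours[a] & neighbours[b]
        if shared:
            result[(a, b)] = shared
    return result


# Hypothetical fragment of the collaboration graph.
edges = [("Ö", "C05"), ("B05", "C05"), ("Ö", "B04")]
candidates = intermediate_projects(edges)
print(candidates)
```

In this fragment, B05 and Ö are reported with C05 as their intermediate project, matching the case described above; B04 and C05 are reported as well, since both are linked to Ö.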

Figures 2 and 3 illustrate the distribution of languages and participant types across the experimental projects. In large-scale collaborative research, such metadata serve as valuable resources for other projects seeking data involving specific languages or particular participant groups. This facilitates the finding of relevant datasets and supports the identification of potential opportunities for data sharing and reuse across projects.

Figure 2. Knowledge graph of working languages.

Figure 3. Knowledge graph of participant types.

A visualization like this can highlight both similarities and differences between projects. For example, two projects focusing on different languages may still have potential for collaboration if they both involve adult participants. Conversely, multiple projects working with the same language, e.g., Hungarian, share a point of connection even if they involve different participant groups or study distinct phenomena. Thus, identifying these connections is crucial not only for tailoring support in areas such as ethical compliance or data management processes, but also for the development of an effective metadata schema for the data management platform (see Section 1).
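The same kind of overlap can also be read off the long-format data without drawing a graph. The following Python sketch groups projects by a shared value; it is an illustration only, with toy project labels and a hypothetical function name `projects_by_value`.

```python
def projects_by_value(long_rows):
    """Group project IDs by attribute value, keeping only shared values."""
    index = {}
    for project, value in long_rows:
        index.setdefault(value, set()).add(project)
    return {value: ids for value, ids in index.items() if len(ids) > 1}


# Toy (project, participant type) pairs in long format.
participants = [
    ("X00", "Adults"),
    ("X01", "Adults"),
    ("X01", "Children"),
    ("X02", "Elderly"),
]
shared = projects_by_value(participants)
print(shared)
```

Applied to the Language column instead, the same function would list the projects that share a working language.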

4. Conclusion

This paper presented a systematic approach to collecting, restructuring, and visualizing (meta)data on project methodology, data types, and requirements in a large CRC in linguistics. For this, the construction and implementation of a semi-structured interview was described. The goal of the interviews was twofold: First, they helped raise awareness among projects regarding legal, ethical, and technical aspects. The projects of the CRC process, e.g., personal information or sensitive health data, which imposes legal and ethical constraints on data management and processing. Second, the information in the interviews was transcribed and re-coded comprehensively via a coding scheme in order to develop a structured knowledge base of intra-CRC connections between projects. The dataset was converted into knowledge graphs, a structured visualization of information where entities (nodes) and their relationships (edges) were modelled to capture and organize knowledge. One of the key advantages of this visualization is its ability to reveal potential interconnections via intermediate projects, a benefit often overlooked in large-scale projects. In the future, we aim to refine the model into a conceptual framework by developing a relevant ontology based on specific knowledge bases. The ontology models will enhance consistency and establish higher standards across a broader domain.

Limitations

The method described here can be improved on by extending it to other research collaborations and thereby being able to highlight potential for additional collaborations. A further limitation is that we could not share the full dataset underlying the described method due to confidentiality constraints.

Ethical considerations

The present study was reviewed and approved by the ethics committee of Bielefeld University on November 4th, 2024, ethics application number 2024-305. All participants provided their written informed consent to participate in this study.

Data availability

The data underlying this study cannot be shared publicly due to confidentiality restrictions. Researchers may request access from the corresponding author ³ and must provide institutional approval and ethics clearance to obtain the data.

Underlying data

OSF: INF-Interviews-Graphs (2025). https://osf.io/pmwyb/overview (Mohammadi et al., 2025)

This project contains the following underlying data:

• INF-interview-R-codes.rmd. R Markdown script containing the statistics and graph-generation codes. Please note that the figures are generated by the script on a non-anonymized dataset, which cannot be shared due to confidentiality constraints.

• toy-interview-dataset.csv. Sample dataset template with illustrative toy example records. Please note that the sample dataset includes only illustrative examples, intended to replicate the methods outlined in the scripts.

• INF-interview-R-codes.html. HTML output of the R Markdown, displaying the resulting knowledge graphs.
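As a rough illustration of how tabular interview records can be turned into graph input, the sketch below converts rows with multi-valued cells into a node table and an edge table. It is a minimal Python sketch, not the deposited R Markdown script. The column names (`project`, `languages`, `participants`), the `;` separator, and the sample rows are assumptions made for illustration and are not taken from toy-interview-dataset.csv.

```python
import csv
import io

# Hypothetical stand-in for a toy CSV; column names and the ";" separator are assumptions.
TOY_CSV = """project,languages,participants
A01,German;English,adults
B02,German,children;adults
"""


def rows_to_graph(csv_text, id_column="project", sep=";"):
    """Convert rows with multi-valued cells into a node table and an edge table."""
    nodes = {}  # node id -> group ("project" or the name of the source column)
    edges = []  # one dict per link, with "from", "to" and "label" keys
    for row in csv.DictReader(io.StringIO(csv_text)):
        project = row[id_column].strip()
        nodes[project] = "project"
        for column, cell in row.items():
            if column == id_column:
                continue
            # Split multi-valued cells and skip empty fragments.
            for value in filter(None, (v.strip() for v in cell.split(sep))):
                nodes.setdefault(value, column)
                edges.append({"from": project, "to": value, "label": column})
    return nodes, edges


nodes, edges = rows_to_graph(TOY_CSV)
print(nodes)
print(edges)
```

The `from`/`to` edge layout and the per-node group are chosen to resemble the node and edge tables that network visualization libraries such as visNetwork take as input, so the same structure could be rebuilt as data frames in R.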

Extended data

OSF: INF-Interviews-Graphs (2025). https://osf.io/pmwyb/overview (Mohammadi et al., 2025)

This project contains the following extended data:

• Appendix-A. List of the interview questions.

• Appendix-B. List of dataset columns used to convert the interviews into a structured format.

• Supplementary Figure 1. Knowledge graph illustrating project interactions within the CRC teams.

• Supplementary Figure 2. Knowledge graph illustrating the languages used in the projects’ experiments within the CRC teams.

• Supplementary Figure 3. Knowledge graph illustrating the types of participants in the experiments within the CRC teams.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC BY 4.0).

References

Almende, Contributors: visNetwork: Network Visualization using ‘vis.js’ Library. 2025. Reference Source

Bearman: Eliciting rich data: A practical approach to writing semi-structured interview schedules. Focus on Health Professional Education: A Multi-Professional Journal. 2019;20(3):1–11. 10.11157/fohpe.v20i3.387

Berez-Kroeker, McDonnell, Koller: The Open Handbook of Linguistic Data Management. Cambridge, MA: The MIT Press;2022. 10.7551/mitpress/12200.001.0001

Forschungsgemeinschaft D: Collaborative Research Centres. 2022. Reference Source

GDPR: Regulation (EU) 2016/679 of the European Parliament and of the Council. Off. J. Eur. Union. 2016;L 119:1–88.

GO FAIR Initiative: FAIR Principles. 2025. CC BY 4.0. Accessed 05.08.2025. Reference Source

Jorschick, Schrader, Buschmeier: What Can I Do with this Data Point? Towards Modeling Legal and Ethical Aspects of Linguistic Data Collection and (Re-)use. Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies@ LREC-COLING. 2024;2024: pp.47–51.

Karatsareas: Semi-Structured Interviews. Kircher, Zipp, editors. Research Methods in Language Attitudes. Cambridge: Cambridge University Press;2022; pp.99–113. 10.1017/9781108867788.010

Mittal, Mease, Kuner: Data management strategy for a collaborative research center. GigaScience. 2023;12:giad049. 37401720 10.1093/gigascience/giad049 PMC10318494

Mohammadi, Jorschick, Politt: Assessing data management and compliance in large research collaborations via knowledge bases: A semi-structured interview approach. 2025. Reference Source

Mons: Data Stewardship for Open Science: Implementing FAIR Principles. Chapman and Hall/CRC; 1st ed. 2018. 10.1201/9781315380711

RStudio Team: RStudio: Integrated Development Environment for R. 2025. Reference Source

Wilkinson, Dumontier, Aalbersberg: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016;3:160018. 26978244 10.1038/sdata.2016.18 PMC4792175

The FAIR Guiding Principles for scientific data management and stewardship improve the findability, accessibility, interoperability, and reuse of research data (GO FAIR Initiative, 2025). Making researchers aware of these principles early on is beneficial, e.g., for ensuring that they think about metadata collection, possible licenses, and open formats in the early stages of their project lifecycle (Mons, 2018).

The interview process lasted approximately three months, from 18.11.2024 to 13.02.2025.

Please contact the corresponding author, maryam.mohammadi@uni-bielefeld.de, to discuss the possibility of sharing the dataset.

10.5256/f1000research.190969.r463095

Reviewer response for version 1

Reichmann

Stefan

1 Referee https://orcid.org/0000-0003-1544-5064 1Graz University of Technology, Graz, Austria

Competing interests: No competing interests were disclosed.

23 3 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve-with-reservations

The article describes a methodology designed for use by large research consortia to identify commonalities across research projects and to enhance data governance. The resulting visualizations enable research consortia to make informed decisions about collaboration, infrastructure planning, and data reuse. The proposed methodology offers a systematic way to assess data management practices, while also fostering a culture of compliance and transparency from the ground up. It consists of three components:

(1) a semi-structured interview to a) raise awareness among researchers regarding ethical obligations, data protection requirements, and open science principles, and b) systematically collect metadata on planned studies, data types, participant groups, and methodological procedures;

(2) structured processing and organization of the collected information; and

(3) visualization of project interrelations through knowledge graphs.

The methodology was piloted within a collaborative research centre in linguistics. The collected metadata were systematically structured and used to construct knowledge graphs capturing interrelations among projects, data types, methodologies, and participant groups.

The methodology described in the paper has a clear practical remit. However, while the article describes a useful methodology and practical tool, I think it could profit from a deeper engagement with data management strategies of large-scale collaborations, as they are described by information science, sociology of science, and other fields, to make some of the claims of the paper sounder. In particular, the paper starts from the (correct) observation, described in the background section of the abstract, that large-scale research collaborations have specific needs with respect to data management. This diagnosis, while technically correct, seems a bit haphazard in that it is founded in a rather spurious engagement with the available literature. For instance, Christine L. Borgman (and others) have documented data sharing practices extensively, finding large variance across research settings in the ways research data are defined, produced, shared, reused, etc. Further, there is a large body of literature documenting large variance in data sharing practices, metadata practices, and RDM practices more generally, identifying discipline/field, region, and institution as the primary variables (among others). At present, the article proposes to organize the data collected with the interviews along a single dimension (commonalities and differences). While this may be enough to define a use case for the scientific knowledge graph developed, it is insufficient to reflect the breadth and depth of data-handling and management practices. I do agree that the resulting graph serves its intended purpose in principle, but in order to be useful it would need to account for the complexity of research data management which, I think, could have been derived from a deeper engagement with the vast empirical literature on the subject.

Further, given that the methodology was piloted among linguistics consortia, as a reader I would expect to learn more about the specificities of linguistics data involved in the relevant projects. My concern here is mainly that the resulting methodology/tool might not be readily transposable to other fields. In addition, I would recommend spending more time on explicating the needs that follow from the literature review for the target group – I am not (yet) convinced that large-scale collaborations of the sort described have any specific problems that require the described methodology to solve.

Regarding the interviews, I was very taken by the claim that these served to “raise awareness among researchers regarding ethical obligations, data protection requirements, and open science principles”. Unless the authors have any evidence from the interview materials to back this up, I would recommend toning it down or removing it altogether. Table 1 explicates the “rationale” for each of the interview questions, but exposes a striking disconnect between the two for some cases; for instance, I don’t think the first question represents its associated rationale clearly; nevertheless, it constitutes a good starting point for an interview about (meta)data practices, so I don’t think anything is won by including the “rationale” table.

Regarding the resulting knowledge graphs, I think they are easy to work with, but I would recommend thinking more deeply about the intended target audience for these knowledge graphs, what they are expected to use them for, and what benefits the authors expect for this group. In particular, the methodology promises to help research consortia “assess data management practices”; as a reader, I was left wondering, against which norms (of good RDM practice, presumably) does the methodology help assess RDM practices?

Minor Points: Some of the in-text citations need to be fixed; in particular, when referring to an article without naming the author(s) in the text, the full citation (NAME YEAR) should be inside brackets.

Is the case presented with sufficient detail to be useful for teaching or other practitioners?

Yes

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Are the conclusions drawn adequately supported by the results?

Partly

Is the background of the case’s history and progression described in sufficient detail?

Partly

Reviewer Expertise:

Sociology of Science, Research Infrastructures, Open Science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

10.5256/f1000research.190969.r451745

Reviewer response for version 1

Mohr

Alicia Hofelich

1 Referee https://orcid.org/0000-0002-7644-4105 1College of Liberal Arts, University of Minnesota, Minneapolis, Minnesota, USA

Competing interests: No competing interests were disclosed.

3 2 2026

2026

recommendation

approve-with-reservations

This article describes an approach to capturing information about project commonalities, regulations, and tools using structured interviews with research teams. The interview responses are then converted to datasets, and visualized to reveal relationships and potential connections via knowledge networks.

While this article describes a helpful methodology and visualization tool for mapping features and relationships across projects, some of the methods and claims could be better clarified and/or supported to fully make them sound.

- The authors describe the interviews as benefiting individual projects/researchers by helping to raise awareness about legal, ethical, and technical aspects of the projects. However, the questions asked would require more explicit education and connection to the rationale during the interview to be beneficial to the team. It is unclear whether the authors did this during the interview process. For example: “do you have external collaborations” would require quite a bit of explanation/follow-up to help them identify other regulations or agreements that are required; asking who the participants are would also require several steps to ensure data involving vulnerable populations are adequately secured or restricted. It’s not clear from the description if the interviewers provided education on these points during the interview or if they were merely collecting data. If no follow-up or education was provided during the interviews, I would strongly recommend scaling back the descriptions of this benefit and instead focusing on making connections across projects, rather than benefits to individual projects being interviewed. It would be helpful to have more information about the extent to which the rationale in Table 1 was made known to the researchers when meeting.

- Did researchers or administrators from the center review the graphs that were produced? Were there particular things that were helpful to them or that were new knowledge? It is difficult to assess how much new information was revealed through these graphs versus being an alternative way to document known relationships/opportunities. This is especially important to clarify given the claims about the graphs revealing “deeper” information than would be available at the surface.

- There is a lot of speculation about helpful connections in the graph (e.g., based on language or participant group), but the extent to which those similarities translate into common metadata structures or resource uses is unclear. It would be helpful to have more specific examples of the applications mentioned and whether they are specific to this center or would be more broadly useful across more heterogeneous research groups.

Minor points:

- Are the graphs presented only of the exhaustive values? Were graphs or useful visualizations created for open and non-exhaustive values?

- What was the time spent on translating the interviews to the data format? Were there agreements or training involved in the coding?

- Interview timing is not consistently reported: the end of section 2.1 says that interviews typically lasted no longer than an hour, whereas section 2.2 says the shortest was 45 minutes, most were between 60 and 75 minutes (over an hour), and one was 2 hours.

- Appendix B on OSF has text cut off at the bottom.

Is the case presented with sufficient detail to be useful for teaching or other practitioners?

Partly

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Are the conclusions drawn adequately supported by the results?

Partly

Is the background of the case’s history and progression described in sufficient detail?

Yes

Reviewer Expertise:

Psychology, data management, reproducible research, institutional networks