Design and implementation of semester long project and problem based bioinformatics course

Background: Advancements in 'high-throughput technologies' have inundated us with data across disciplines. As a result, there is a bottleneck in addressing the demand for analyzing data and training of 'next generation data scientists'. Methods: In response to this need, the authors designed a single semester "Bioinformatics" course that introduced a small cohort of students at the University of South Carolina to methods for analyzing data generated through different 'omic' platforms using a variety of model systems. The course was divided into seven modules with each module ending with a problem. Results: Towards the end of the course, the students each designed a project that allowed them to pursue their individual interests. These completed projects were presented as talks and posters at the ISCB-RSG-SEUSA symposium held at the University of South Carolina. Conclusions: An important outcome of this course design was that the students acquired the basic skills to critically evaluate the reporting and interpretation of data of a problem or a project during the symposium.


Introduction
Bioinformatics is a rapidly growing interdisciplinary field because of advances in both computer science and the life sciences. Rapid advances in sequencing technologies have led to a deluge of biological data, creating a need for expeditious, efficient, and effective analyses. Practitioners of bioinformatics now add techniques from statistics, information science and engineering to develop algorithms and build predictive models to understand the dynamics within a biological system. This paradigm shift in how bioinformatics is perceived has resulted in an evolutionary model of growth across both of its root disciplines 1 . Bioinformatics as a field also enjoys a degree of duality: "episteme" (scientific knowledge) and "techne" (technical know-how), leading to the idea of 'Science informing the tools and the tools enabling science' 1 . In a 2017 survey of 704 NSF principal investigators, more than 90% of respondents replied that they would soon be working with data sets that required high-performance computing, and they identified bioinformatics data analysis as the most urgent unmet need for successful completion of their projects 2 . Increased exposure of students at the undergraduate level will help address the need for specialists working in this field and also make the students attractive for opportunities in industry or in graduate school 3-5 . The Global Organization for Bioinformatics Learning, Education and Training (GOBLET) identified through surveys that the skills required for 'basic data stewardship' are taught in only ~25% of education programs, creating a gulf between theory and practice 6-8 .
Many courses have been designed and implemented to address the gaps faced in the field. They are project based, problem based or a combination of both, and study one or more 'next-generation' datasets 9-12 . The courses have been designed as workshops 9 or as semester long courses using analyses from a single next-generation technology 10 . The authors haven't come across a course that incorporates multi-omics data analyses in a single semester. There have been studies that address a single problem using multi-omics approaches 11 and there have been pipeline designs that help integrate these data under a single platform 12 .
In response to this need, we designed a single semester course on bioinformatics in the Department of Biological Sciences at the University of South Carolina. The course was targeted towards undergraduate seniors and graduate students who were mainly bench scientists working on experiments that generated data across different 'omic technologies' using different living systems.

Challenges in design of bioinformatics curriculum
The curriculum task force of the 'International Society for Computational Biology', a scholarly society for both bioinformatics and computational biology research scientists across the world, identified a set of 16 core competencies established through surveys and an iterative process of inputs from people associated with the fields of bioinformatics and computational biology 13 . However, one of the biggest challenges is the heterogeneity of the backgrounds of the course participants. There is 'no one size fits all' while designing a bioinformatics course. In fact, there are three different types of user groups that employ bioinformatics in their research (Table 1), and each of these user groups requires different competencies 14,15 .
Thus, there was considerable diversity in the backgrounds of the students registered for our course. In response, we chose to follow a 'learner adaptable' style of design of the curriculum. This approach allowed us to design the course based on the students' knowledge of the subject and their expectations of the course.

Methods
Course design
Course conception. This course was designed to provide structured bioinformatics training geared towards the needs of students working on different "omics" experiments. The general premise of the course was to critically examine and analyze published or in-preparation datasets across different biological systems in a hands-on fashion. In addition, we wanted to introduce the students to the R programming language.
Course participants. Nine participants registered for the course. Four of the students were undergraduate seniors, four were first- or second-year graduate students and one was an emergency medical technician (EMT) with a Bachelor of Science degree who was taking additional classes for credit and is now in medical school.
Learning objectives and outcomes of the course. We sent a three-question survey (Table 2) to all the participants to understand their reasoning for registering in the course.
The primary learning objective of the course was to introduce the students to the breadth and depth of the field of Bioinformatics for 'omics' data analyses. We also identified the following three course outcomes for the students.
I. At the end of the course, students should be able to identify and implement alternate strategies to answer genomics-based research questions.
II. Students should be comfortable with the use of open-source genomic software and command line programming, and be able to use R statistical packages.
III. Students should be able to design and troubleshoot analyses of nucleotide sequence data and elicit biological information from the data.

Course structure
The course was divided into seven modules spread across the semester: Genome assembly and annotation, Comparative genomics, Introduction to Statistics, Metagenomics, Transcriptomics, Proteomics and Cancer data analysis. Each module ended with a graded research problem either in a prokaryotic system or a eukaryotic system (Table 3 and Supplementary File 1).

Table 1. User groups that employ bioinformatics in their research.
These users utilize computational methods to analyze data and advance the scientific understanding of living systems.
Bioinformatics Engineers (BE): These users create, develop and manage novel computational methods needed for novel scientific discoveries.

Table 2. Survey questions sent out to the students.
Q1) Rationale: We wanted to gauge the level of expertise of the students and identify the level of programming to be introduced in class. Responses: (i) 4 participants had taken a course on R.* (ii) 5 participants had no previous experience using any bioinformatics software or programming languages.
Q2) Motivation for registering in the course? Rationale: We wanted to understand the rationale of the students participating in the course. Responses: Unanimous response of the participants was that they were working on some type of benchwork that would generate "omic" data.
Q3) Take away from the course? Rationale: We wanted to ensure our learning outcomes matched the expectations of the course participants. Responses: Understand types of sequencing technologies; learn how to analyze data; learn better practices of biological data management.
*Since we did not have this information in the pre-class survey answers, we asked students their experience with programming languages in class. We got 7 responses in total to the pre-lab survey.

Results
Based on the responses of the students, at the start of the class we assigned each student to a potential user group (Table 1), together with the competency level expected at the end of the class. Seven students replied to the pre-course survey and two did not. We obtained permission from six of the seven respondents to publish their answers online anonymously. Any identifying information, such as names or project details, has been removed from the responses (Table 4).
We judged each student's competency in a course module by their successful completion of the project assigned for that module. In lieu of a final exam, each student designed a research project, conducted appropriate analyses, and summarized their results in the form of a poster or a talk at the end of the semester as part of the ISCB-RSG-SE USA (International Society for Computational Biology, Regional Student Group, Southeast USA) conference held on campus on December 8 and 9, 2017. They also had the opportunity to listen to talks from professors working on bioinformatics projects and to interact with their peers from the University of South Florida and the University of Alabama. In addition, two graduate students wrote papers on their projects with input from their respective research advisors.

Discussion
This course covered a lot of topics in 13 weeks and some degree of mastery was required for each topic. In addition, half of the students had no familiarity with programming. As a result, many of the students were stretched beyond their comfort zone. However, since this was a small class, we were able to work with the students individually to help them be successful, and also tailor projects to the students' backgrounds and expectations. An important outcome of this course design was that the students acquired the basic skills to critically evaluate the reporting and interpretation of data of a problem or a project during the symposium.
Our leading goal was to develop a course that was responsive to the needs and background abilities of the participating students. It is important to recognize that every course will have students at different levels of learning with different goals. Hence, when designing a course that caters to the needs of the students, it may be a good idea to have a small class.
Comparison and analyses of the Global Ocean Sampling Expedition data available at the MG-RAST data repository. Students were also introduced to statistical hypothesis testing within data sets and between data sets.

Introduction to statistics
(i) Descriptive and inferential statistics
(ii) Univariate and bivariate analyses
(iii) ANOVA and PCA
R statistical package: Students were introduced to the R package and were given cheat sheets on how to load, access, and manipulate biological data.
Students were introduced to these concepts and then allowed to work on their comparative metagenomics data analysis projects.

Transcriptomics
Students were introduced to RNA sequencing technologies and analyzed data from an RNAi knock-down experiment of the pasilla splicing factor gene in Drosophila 19 .
R statistical package 20 : Students detected differentially expressed genes using R packages and learned how to take confounding factors into account in differential expression analysis. They were also introduced to different visualization packages in R.

Proteomics
Students were introduced to protein diversity characterization using proteomics. The dataset used for this module was from the Bioconductor Conference held at Stanford in July 2016.
R statistical packages: Students used R/Bioconductor packages to explore, process, visualize, and understand mass spectrometry-based proteomics data.

Cancer data analyses
Students were reintroduced to RNA-seq analysis and its role in the generation of cervical cancer data for Dr. Buckhaults' recent paper 24 . They were also shown the features of the UCSC Cancer Genome Browser. Students analyzed the TCGA database for gene expression association analyses for glioblastoma.
Further data mining was carried out using gene set enrichment analyses of previously identified genes to check for statistical significance.
*All the presentations associated with each module, course assignments and problem assignments are available for access in the supplementary section of the paper. The final projects that were presented as posters and talks are not available for access at this time.
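As an illustration of the ANOVA topic listed under the statistics module, the following is a minimal Python sketch of a one-way ANOVA F statistic. It is not taken from the course materials, which used R; the function name `one_way_anova_f` and the abundance values are hypothetical and chosen only for illustration. Converting the F value into a p-value requires the F distribution and is left out here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists of numeric observations, one list per group.
    """
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    if n_total <= k:
        raise ValueError("need more observations than groups")

    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Made-up relative abundances of one taxon at three sampling sites.
    sites = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    f_stat, df_b, df_w = one_way_anova_f(sites)
    print(f"F = {f_stat:.2f} on {df_b} and {df_w} degrees of freedom")
```

A large F value relative to the F distribution with the reported degrees of freedom suggests that at least one group mean differs from the others. In R, the analogous analysis is typically run with the built-in `aov` function.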
In our class, every student had a different learning curve. We determined the competency of a student per module by their successful completion of the problem set and/or the project. The first objective of the course was to expose the students to not just one living system but many, including bacterial, human and Drosophila systems. The other objective was to introduce the students to the R computational platform 20 . Our initial challenge was to address the problems faced by the students in using the platform for the first time. We wanted the students to understand the intricacies of using R as a programming language, but if we repeat this class, we will provide the code to the students as R Markdown documents. We would also add R assignments at the beginning of the course and out-of-class help sessions to help students get comfortable using R.
A major challenge was to identify ways to map the competencies required to the expectations of the course at both the undergraduate and graduate levels. Since we had a small number of students, we designed and delivered a structured curriculum that integrated both the continuously changing and stable technological platforms using model systems that were used by at least one student for every module.
As the important goal of the course was to address the needs of the students, we designed the current model of 'multi-project' modules of biological data analyses. Due to the small class size, we were able to give personalized attention to every student. In the future, a big change that we would incorporate would be to separate the projects and problems assigned to graduate and undergraduate students. Generally, the undergraduate students do not have their own data while the graduate students usually have or are in the process of obtaining data that they want to analyze. Therefore, we would either have separate sections for the graduate and undergraduate students or we would have a combined lecture but separate recitation section where the students would apply what they have learned in the lecture portion of the class. The graduate students would be encouraged to develop projects that are relevant to their research while the undergraduates would work in groups on projects designed by the instructor.

Ethical considerations
The authors have posted the pre-class survey answers of students who have consented to have their responses published anonymously. All identifying information has been removed from the responses. The post-class survey responses are given as feedback to the instructors, also anonymously, through an online survey carried out by the university.

Grant information
The author(s) declared that no grants were involved in supporting this work.

The paper describes a semester long bioinformatics course targeting graduate seniors and graduate students who were bench scientists in need for learning how to analyse data generated across different 'omic technologies'. I find it weird that "The authors haven't come across a course that incorporates multi-omics data analyses in a single semester." If not in a single course, some curricula offer multi-omics data tools and analyses spread in more than one course. A comparison of the presented course with such curricula would be of great interest, as well as a discussion on the convenience of integrating such large amount of bioinformatics materials in a Biological Sciences curriculum. There is much discussion in the field on what is the best strategy to incorporate Bioinformatics in Life Sciences curricula and I wonder whether an overload of different topics, techniques, approaches, methods would be successful in contexts where instructor could not work individually with students.
Table 3 displays a number of features of the course's modules. However, a well structured program of each module is missing. As for reproducibility, a lesson plan describing how much time was allocated to each classroom activity (lectures, work in group, hands-on, work on individual projects, types and frequency of formative assessments, etc.) would help.
Teaching materials provided in the Supplementary materials are not structured at all. Teaching materials are organised in modules, but navigating modules it is very difficult to understand how to use the various files. There is no homogeneity in file names and a "readme" file describing the content of each folder (and how to use it in reproducing the course) is missing. Slides are not annotated. In summary, materials are not reusable in the current form and the course would not be reproducible based on them and on the information provided in the article.
The teaching techniques/strategies used in the classroom were not described/discussed, apart from mentioning the importance of the individual work with students. I think the article would benefit from more details on the course design and from the description of the pedagogical approaches the instructors adopted to teach programming and computational skills to bench scientists.
I understand that a key point was the small number of students. Nevertheless, most courses with a small number of students and motivated instructors usually produce successful results. One big challenge is when the number is high. It would be interesting if the authors could reason on how their course could be translated into one for a bigger group of students. What should be definitely changed? Which other strategies could be adopted (peer instruction? Helpers?)?
Finally, the authors use a lot of the term "competency/competencies". There is currently quite a lot of debate around the convenience of using competencies to describe the outcomes of courses. Indeed, competencies can hardly be assessed and mapped on a learning trajectory. By completing a single course, students may develop knowledge, skills and abilities (KSAs), which are measurable and accessible objects and the development of which can be followed along a learning trajectory, rather than competencies. Could the authors comment on this?
Here are more specific points:
p.3 - Re the following sentence: "Practioners of bioinformatics now add techniques from statistics, information science and engineering to develop algorithms and build predictive models to understand the dynamics within a biological system." In my experience, practitioners of bioinformatics have always added techniques from statistics, information theory and engineering to develop algorithms to predict the functioning of biological systems. The paradigm shift caused by the rapid advances in sequencing technologies is of different kind in my opinion: in the first place, bioinformatics has become the only approach to make sense of the deluge of biological data the authors refer to. Moreover, the storage, management, sharing, annotation, "fairfication" of the enormous amount of data produced, poses important technological challenges and emphasizes the need for new professions.
p.3 - In the sentence: "Practioners of bioinformatics…", "Practioners" should be changed to "Practitioners". Please, check the whole manuscript for typos/misspellings.
p.3 - The authors put the sentence: "However, one of the biggest challenges is the heterogeneity of the backgrounds of the course participants" in opposition to the previous one on ISCB competencies ("However,…"). In contrast, I believe that Bioinformatics core competencies listed in Mulder et al. indirectly express the high degree of heterogeneity of backgrounds in bioinformatics.
p.3 - Re the sentence: "In fact, there are three different types of user groups that employ bioinformatics in their research", I would not define Bioinformatics Engineers as bioinformatics users, but rather developers and managers/maintainers of computational tools.
p.4, Table 1 - There is another relevant group of bioinformatics practitioners: those who take care of and manage data, bioinformatics resources and their interoperability and develop standards, data quality metrics, ontologies, annotation, etc.
The "big data issue" is especially relevant in the "omics" field and, in my opinion, it would be good if the authors could mention this fourth group, even though none of their students did belong to it.
p.3 - In the sentence: "We sent a three-question survey (Table 2) to all the participants to understand their reasoning for registering in the course." I suggest that the authors replace "reasoning" with "motivations" or "reasons".
p.3 - In the sentence "We also identified the following three course outcomes for the students." The authors say "course outcomes". What is a course outcome? I suspect they mean "learning outcomes". There is quite a lot of confusion in the field around the definition and usage of "learning objectives", "learning outcomes" and "teaching objectives". I suggest that the authors replace "course outcomes" with "learning outcomes".
p.3 - Re Learning outcomes. The literature provides quite precise rules to write learning outcomes. You can use the sentence "by the end of the course, students will (NOT should) be able to" followed by an "actionable verb", namely a verb expressing an action or a behaviour that can be (at least in principle) assessed. The verbs used in learning outcomes I ("identify" and "implement") are of this type, whereas some verbs used in II and III are not ("be comfortable", "elicit"). Moreover, it is a good practice to write learning outcomes that are as much specific as possible in terms of both the cognitive complexity level they express and their content. For example, in learning outcome I, "identify" and "implement" express two different levels of cognitive complexity and learning outcome II includes a large variety of contents.
p.3 - Learning outcome II. What do the authors mean by "command line programming"? Do they mean "Linux shell scripting" or "navigating files and directories using the command line shell"? To be able to use R statistical packages implies to be able to do (at least some) R programming. I suggest that the authors specify this.
p.4 - The footnote of Table 2 is misleading. What does it mean that the authors did not have the information about programming experience in the pre-class survey answers? Did they ask question 1 in the pre-class survey (as stated in the manuscript) or in class (as stated in the footnote)? Were the 7 responses about programming experience? If so, this means that the authors got 2 answers in class and 7 answers in the pre-class survey. Is this correct? Or the pre-lab survey is another thing? Very confusing.
Table 2. Survey questions sent out to the students - As question 1 is about "programming experience", please notice that "using bioinformatics software" is not "programming". For consistency with answers to questions 1 and 2, please specify the distribution of answers to question 3.
p.4 - Re the sentence: "Based on the responses of the students, we assigned potential user groups as explained in Table 1 at the start of the class with their expected competency levels at the end of the class.", I have three main concerns: 1) I don't see where competency levels at the end of the class are listed (unless the authors are now calling "competency levels" what they called "characteristics" in Table 1). Should this be the case, in no way can students acquire the characteristics listed in Table 1 by completing the course described in this paper; 2) Competencies are yes/no objects, which means either an individual has a competency or they don't have it. Therefore, it may be problematic to talk about "competency levels"; it may be perhaps more appropriate to talk about knowledge, skills or abilities (KSAs) levels; 3) If by "class" you mean a series of lectures on a subject, could you specify at the end of which class (a module? The entire course?) you defined "expected competency levels"? As a side note, a single class can possibly increase the level of a KSA, surely not allow students to acquire a competency.
p.4 - In this sentence: "Successful completion of the project assigned to every student by the end of a course module determined their competency of the course." It is not clear what the authors mean by "competency of the course". Do they mean that the competency acquired in a module determined students' competency in the whole course?
p.6 - In the sentence: "We determined the competency of a student per module by their successful completion of the problem set and or the project." what do the authors mean by "successful completion of the problem set and or the project"? There were students who did not successfully complete the project? How did instructors grade them?

Are sufficient details of methods and analysis provided to allow replication by others? No
No more instructive than what went well, their discussion of potential changes in subsequent iterations of the class is very helpful. Finally, the article is clearly written and easy to read.
That said, the manuscript has several issues that should be addressed. First, a number of references are potentially mis-cited. For example, References 6 and 7 cite a Global Organization for Bioinformatics Learning (GOBLET) study that showed that basic data stewardship skills are only taught in 25% of education programs. However, neither of these papers mention the GOBLET survey or the 25% statistic. In addition, References 11 and 12 do not deal with bioinformatics courses and Reference 15 does not discuss the competencies of different bioinformatics users as their use would imply. Similarly, I am concerned about the bioinformatics user groups given in Table 1. Specifically, the descriptions of the three groups are very similar to the three personas described in Reference 14, and the name of one (bioinformatics engineer) is the same (the names of the other two are almost the same). In short, it's not clear if the authors are restating the results of Reference 14 or are proposing a slightly different grouping. Although the posted resources are clearly an important contribution, I found them to be incomplete in one important aspect. In particular, the authors state that every module had a problem set/project associated with it, but this was missing from three of the seven modules. Furthermore, a brief description of the final research projects the students worked on would be helpful as it would indicate what the students were able to do at the end of the semester.
In addition to the above, very little is provided in terms of results. One of the results seems to be the placement of students into the three user groups. However, how the results of the pre-course survey were used to place the students into these groups and if and how they impacted the way in which the course was taught is not clear. Similarly, Table 4 and the corresponding description of it in the narrative, particularly the use of the word "expected," is confusing. Does Column 3 of the table refer to the group a given student was in at the end of the semester or where they were expected to be at some other point in the semester? In any event, how was this determined? Although the course evaluation is helpful in understanding how students felt the course went, I would have liked to have seen more assessment results, particularly if the learning objectives of the course had been met. In general, the paper would be strengthened by the results of another iteration of the course, one in which the proposed changes had been made and the learning gains of the students were assessed.
As previously mentioned, the article is well-written. However, I did notice two small errors. The first sentence of "Course design" should probably be "We had nine students register for the course." Also, "bioinformatics" is incorrectly capitalized in "This course was designed to provide a structured Bioinformatics course. . .".

If applicable, is the statistical analysis and its interpretation appropriate? Not applicable
Are all the source data underlying the results available to ensure full reproducibility? Partly
The course itself covers a nice range of topics in applied bioinformatics, which might be expected to meet the needs of a diverse set of likely users. The course materials provided in the supplement might therefore find a good audience. One general concern, though, is that the supplementary materials contain some third-party resources, for which it might be more appropriate to include a reference or link rather than the material itself. The teaching approach is fairly applied, with a lot of focus on specific data resources and software, although with some attention to principles behind these resources. While some user communities might favor an approach more grounded in the principles and theory, the focus here seems typical of many bioinformatics courses aimed primarily at biology students. The authors might do a bit more to justify the balance of focus on practice versus theory, with reference to efforts at identifying specific bioinformatics competencies needed by their likely user community, several of which the paper cites.
The Results present some interesting material in the form of a pre-class survey and post-class course evaluation material. While the cohort here is a single small sample, some useful lessons can be drawn about the diversity of backgrounds and needs of even a small group like this. The paper would be considerably stronger with some more serious assessment of whether the learning objectives of the course were met. That is a non-trivial undertaking and cannot be done retroactively, but might be worth considering for a future iteration of the class if it is being continued. The materials do include results of a university-run course evaluation, which provide some indication of how students felt about the course, although that is different from showing how successfully they learned the material. This post-class evaluation makes for some interesting reading, although if it is being included with the paper, it might bear some comment in the Results and Discussion.
It would be useful also to see some comparison to other similar course material available in publicly accessible forms. While that is a difficult moving target, comparing to a few alternatives from prominent course repositories or MOOCs, particularly to highlight the unusual or especially innovative features of this course, would be valuable.
The paper does a nice job of presenting some lessons learned in the Discussion. It is commendable that the authors spend some time on what did not work so well in this class and consider how it might be done differently in the future. One would ideally like to see this taken further via a more comprehensive formative assessment process -with problems identified via a formal assessment, solutions proposed, and those solutions demonstrated to be effective in a re-assessment. It is understandable that that may be beyond the scope of a one-off paper like this, though, and it is nonetheless easy to see how others developing a class in this domain might benefit from the advice given here to avoid some of the same pitfalls.
Beyond these more specific technical points, the document is clear and generally well-written. I noted just a couple of minor errors: p. 4: "International society of Computational biology" should be "International Society for Computational Biology". p. 4: "Regional student group -Southeast USA" should be "Regional Student Group -Southeast USA". I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
and statistical training by offering additional learning modules. The authors have also addressed the problems faced by the students and ways to tackle them in the future under the 'Discussion' section.
(iv) Student issue (Interest): As an applied Bioinformatics course, the students had an opportunity to apply their learning to solve problems and projects in their area of interest/background. Active engagement and participation of the students was encouraged throughout the course by timely submission of projects and problem sets.
2. The authors recognize the need to have a better competency assessment of the students pre- and post-course. In future, this can be accomplished in the form of pre-course problem solving and post-course problem solving to ensure that the students meet the set learning objectives. The course in the current format had the students research, design, address and present their learning (with emphasis on critical evaluation and problem solving) in the form of a project presented as a talk/poster in the research symposium held at the end of the semester. To protect the students' data/projects, the final posters and presentations are not included in this paper.
3. As most of the participants were classified as 'Bioinformatics tool users' the authors chose to focus on applied bioinformatics as opposed to Bioinformatics theory. In order to have a bioinformatics focused theory class designed to address every 'omic' problem, the authors believe that it would be prudent to have just one or two modules together and introduce theory and problem/projects pertaining to the same.
4. The authors have cited the third-party resources in the main paper with reference numbers in the supplementary materials. The authors will add the supplementary references in supplementary section and main references in the main paper.
5. The course design and challenges addressed in this paper pertain to the small class size and may not accurately reflect the challenges faced at the level of MOOC learning. However, the authors can add references to MOOC courses that offer a similar style of training in the background section.

Competing Interests: No competing interests were disclosed.