Keywords
Scientific Workflow Management, Bioimage Analysis, Usability Study
In recent years, numerous workflow management systems (WMSs) for the reproducible analysis of high-throughput experimental data in life science research have been developed.1–5 It is well-established that such WMSs are indispensable to warrant transparency, reproducibility and interoperability of increasingly complex data analysis tasks that typically combine a number of different data analysis tools and subtasks. Matching the needs of different user groups and application settings, WMSs have been implemented as either graphical user interfaces1–4 or as script-based tools,5 addressing a broad variety of user groups and an equally broad range of needs for user interaction or operability on large data sets.
Many of the prominent WMSs have been developed and maintained over many years and are now mature software systems whose relevance and popularity are underlined by large citation numbers. However, apparently little attention has been paid to systematically and formally exploring workflow management from the viewpoint of usability. Yet, usability is of interest in many respects. First, a detailed investigation of software usability can guide future developments of specific tools, and more generally may help to summarize experiences to formulate guidelines for future developments.6,7 Second, the ergonomics of workflows affect the outcome of data-centric research studies. Since many studies and corresponding data analysis workflows involve a human-in-the-loop,8 usability aspects inevitably determine the inductive bias of obtaining insights from data. A thorough investigation of usability is thus not just a matter of assessing convenience and efficiency, but also a matter of elucidating the role of software in the scientific method.
In this contribution, the results of a formative usability study in the context of label-free digital pathology9 are presented. The goal of this study is to provide a first case study on how usability studies for scientific workflow management can be conducted, and how they can systematically be used for improving WMSs. Specifically, a study was conducted that investigates and compares two distinct groups of users following two different workflows to solve one and the same image analysis task.
This study deals with a setting in label-free digital pathology based on hyperspectral infrared microscopic imaging data and it is based on a workflow originally established by Kallenbach-Thieltges et al. 2013,9 which was the basis for a number of applications of infrared microscopy in different clinical settings.10–12 In label-free digital pathology, tissue samples are measured using an infrared microscope. The infrared microscope yields a hyperspectral microscopic image with a spatial resolution of around 5 μm, where each pixel is represented by an infrared spectrum covering a wavelength bandwidth of several hundred optical channels. To identify disease-associated tumor regions in tissue samples, the images need to be pre-segmented using unsupervised learning approaches, which allows the extraction of training data for a supervised classifier, as illustrated in Figure 1.
Panel A) shows the workflow as an abstract scheme, while panel B) shows the same workflow implemented in the Orange framework. Panel C) gives an impression of the script-based implementation of this workflow in OpenVibSpec.
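To make the data representation concrete, the following sketch (in Python, using NumPy, which both tools build on as noted in the methods below) illustrates how a hyperspectral image cube of roughly 250×250 pixel spectra can be flattened into a matrix of pixel spectra for unsupervised pre-segmentation. The channel count of 427 and all variable names are illustrative assumptions, not values taken from the study data.

```python
import numpy as np

# Illustrative shapes: one TMA spot is roughly 250 x 250 pixel spectra, each
# spectrum covering several hundred optical channels (427 is an assumed value).
height, width, n_channels = 250, 250, 427
cube = np.random.rand(height, width, n_channels)   # stand-in for a measured image

# For unsupervised pre-segmentation, the spatial dimensions are flattened so
# that each row is one pixel spectrum that can be passed to a clustering step.
pixel_spectra = cube.reshape(-1, n_channels)        # shape: (62500, 427)

# After clustering, the label vector is folded back into image shape so that
# the segmentation can be inspected visually and annotated.
cluster_labels = np.zeros(pixel_spectra.shape[0], dtype=int)  # placeholder labels
label_image = cluster_labels.reshape(height, width)
```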
This study involves two different implementations of this workflow. First, a script-based implementation from the OpenVibSpec tools was used,13 which constitutes a Python re-implementation of the workflow used in previous studies.9–12 The second implementation was based on the graphical user interface (GUI) of the Orange WMS.1 Furthermore, two distinct user groups with different levels of programming skills and life science background were recruited to investigate their usage patterns across the two different implementations. In addition, this study addresses the question of how to evaluate a given workflow for biomedical image analysis.
As documented by the International Organization for Standardization (ISO) in ISO 9241, the usability of a product or a system is defined as the extent to which a user can use the system to achieve his/her goals “effectively, efficiently and with satisfaction”.14 At its core, usability is not one single factor of a system or product; rather, multiple factors of the system or product contribute to or reduce its usability. These factors or concepts are: learnability, efficiency, memorability, errors and satisfaction.6 Usability is seen as relevant during the whole lifetime of a product or system, from development to procurement, and is an important aspect of the perceived user experience (UX) of a product or system.
To evaluate the usability of a system, a variety of methods and types of usability tests or inspections can be used. Such evaluations can differ, for example, in the participants’ knowledge (expert vs. novice user), in the test design (task-based test vs. cognitive walkthrough), or in whether they are conducted during development or with a final version of the system (formative vs. summative evaluation).6,15,16 Besides test-based evaluations, heuristic usability inspections are also commonly used.17
Formative studies are of a rather exploratory nature and can be conducted very early in the development phase of systems, even on the basis of simple wireframes or prototypes, to define requirements and reflect on the interaction design at an early design stage.16 In these studies, mainly qualitative data is collected, for example comments and impressions by users regarding the interaction design of the system, whether and how successfully a task could be completed, or how they perceived the system’s feedback (cf. Ref. 15). Common methods for formative usability studies are, for instance, thinking-aloud tests, in which users are encouraged to articulate their thoughts during use, or task-based tests with follow-up interviews. In summative studies, a more formal approach is taken and attempts are made to demonstrate the effectiveness of a newly developed system. To conduct a summative usability test, at least functioning prototypes are needed. A typical goal of a summative test would be to highlight the benefits of a new system or interface over other existing ones. However, the boundaries between formative and summative testing are not rigid, as a summative test could also be applied after several development stages of a system.
To assess the usability of a system, a combination of different metrics and methods can be applied to collect the data of interest. However, the type of test can affect the number of participants in a study. For example, an expert-based usability evaluation can possibly be conducted with fewer participants than user-based usability tests. Yet, the number of participants needed in usability tests is under dispute.18–22
In this study, a formative approach for the usability evaluation was used. The primary aim of this study is not to compare the efficiency of both tools, but rather to explore possibilities and constraints in designing usability studies for scientific workflow management tools, and how such usability studies contribute to their improvement.
In order to generate a broad spectrum of insights and perspectives, a formative approach was chosen for the usability evaluation of the two WMSs. A task-based study was conducted, in which participants were asked to use a combination of supervised and unsupervised machine learning algorithms to classify microscopic data of tissue images, as shown in Figure 1. During use, observations were noted by the study facilitator and participants were asked to express their thoughts aloud. After completing the tasks, a semi-structured interview was conducted with each participant to verify the observations made and to solicit further perspectives. The interview incorporated questions based on Nielsen’s usability heuristics.17
Two alternative implementations of the workflow were investigated, one that could be used via script and one that offered a GUI. For the script-based implementation of the workflow, the Python-based implementation provided in the OpenVibSpec tools13 was utilized with a Jupyter notebook23 as an interface (See source data).24 The implementation provides structured access to perform clustering of image spectra, to extract training spectra from annotation masks, and to train and validate supervised classifiers. Following common practice,9 k-means clustering25 was employed for unsupervised segmentation, combined with random forests26 for supervised classification.
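The listing below is a minimal sketch of this pipeline using scikit-learn, on which both platforms are based (see below); it is not the actual OpenVibSpec API, and the annotation mask, array shapes and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Stand-in pixel spectra (n_pixels x n_channels); in the study these would be
# derived from an infrared microscopic image as described above.
pixel_spectra = np.random.rand(5000, 200)

# 1) Unsupervised pre-segmentation with k-means (cluster size chosen by the user).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(pixel_spectra)

# 2) Extract training spectra from an annotation mask; here the cluster labels
#    simply stand in for expert-curated class annotations.
annotation_mask = cluster_labels
X_train, X_test, y_train, y_test = train_test_split(
    pixel_spectra, annotation_mask, test_size=0.25, random_state=0)

# 3) Train and validate a random forest classifier on the extracted spectra.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```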
The GUI-based implementation used in the study relies on the functions provided by the Orange framework1 (See source data).24 In Orange, a workflow can be created by connecting different widgets. These widgets represent certain data operations on a data set and usually allow some kind of configuration. As illustrated in Figure 1, the data for the workflow was pre-established in the interactive Orange GUI, as the data import and export operations had already been added. Thus, users were only required to create the workflow with suitable data operations and adjust parameters. This adaptation of the workflow saved considerable time during the study, as the import and export operations demanded a noticeable amount of time. The final workflow is illustrated in Figure 1A.
In order to facilitate at least basic comparison between the two implementations, this study was limited to those methods that are implemented in both platforms, noting that both Orange and OpenVibSpec are based on the same standard libraries, most notably Scikit-learn27 and Numpy.28
For both implementations, participants were briefed with a targeted training video to explain the technical realization of the workflow, in order to warrant the necessary level of prior knowledge for the study (See source data).24 Participants had the chance to watch the video or parts of the video again at any time during the study.
The data set was derived from previously published studies investigating colorectal carcinoma using infrared microscopy.11,29 In brief, the complete data set comprises infrared microscopic images of roughly 200 tissue-microarray (TMA) spots of more than 100 different patients, where each spot is roughly 1 mm in diameter and represented by an infrared microscopic image covering roughly 250×250 pixel spectra. TMA slides were purchased from US Biomax Inc., MD, USA, and the infrared microscopic images are accompanied by conventional hematoxylin and eosin (H&E) stained images which were acquired subsequent to infrared microscopy. For the script-based implementation, images of complete TMA spots were used as a basis for the study. For the GUI-based implementation, images were reduced in size to 224×224 pixels for the training data and 195×210 pixels for the test data in order to meet constraints in terms of computational resources and time window per participant.
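The exact reduction procedure is not specified here; a simple central crop with NumPy, as sketched below, is one plausible way to obtain images of the stated sizes (the function and variable names are hypothetical).

```python
import numpy as np

# Hypothetical full-size TMA spot image of roughly 250 x 250 pixel spectra.
full_cube = np.random.rand(250, 250, 427)

def central_crop(cube, target_h, target_w):
    """Cut a centered spatial window out of a hyperspectral image cube."""
    h, w, _ = cube.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return cube[top:top + target_h, left:left + target_w, :]

train_cube = central_crop(full_cube, 224, 224)   # training data: 224 x 224 pixels
test_cube = central_crop(full_cube, 195, 210)    # test data: 195 x 210 pixels
```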
For the study, 22 participants (7 females, 15 males) were recruited in total, among students of two different study programs at the Ruhr-University Bochum with ages ranging from 23 to 39 years (M = 26.05, SD = 3.7) (see Underlying data30).
In order to take account of the study’s objectives in the selection of test users, two target groups were deliberately sought that correspond to the intended users of the tools compared, namely Orange and OpenVibSpec.
The first group was recruited among students with at least advanced undergraduate knowledge in a life science study program and no or limited programming skills (6 females, 5 males, mean age = 25.27, SD = 2.83). The second group was recruited among computer science students at comparable level, possessing good programming skills, but no or limited background in the life sciences (1 female, 10 males, mean age = 26.82, SD = 4.26).
To counteract a bias of the study through specifically selected test persons, all participants were deliberately selected such that they had no special expertise in the actual application at hand, i.e. microspectroscopic tissue characterisation. Based on this, all participants were provided with the same training material, which was intended to shed light on the background of the data and the methods used. Care was taken to keep this introduction lightweight so that all participants could concentrate fully on the task during the actual test series.
The implementations were made available to participants on a desktop computer equipped with an Intel i7-2600 CPU at 3.40 GHz and 8 GB of memory. The actual implementations were installed on a remote Linux server with 64 CPUs and 512 GB of main memory, and made available on the desktop computer, which also ran a Linux OS, through a remote desktop connection.
The task had to be adapted in such a way that it could be carried out with both tools, but at the same time produce a result that could be interpreted in the sense of the introductory material provided to the participants. This last requirement and the general orientation of the two tools caused a discrepancy in the execution speed of the Orange tool, which is described in more detail in the results section. Since the data reading and saving processes also took several minutes, they were skipped in the task for Orange and the participants were given a minimally prepared workflow in which the training and test data widgets were already provided.
The procedure for each participant was as follows: First, each participant completed a demographic questionnaire and gave their consent to participate in this study (see Underlying data30). Then, participants watched an introduction video with educational material on workflow-based hyperspectral infrared microscopic imaging data analysis. After watching the video, participants started the task with either the GUI tool or the script-based implementation. The instructions to complete the script-based workflow were given in the Jupyter environment.31 For the GUI workflow, participants received the instructions on how to complete the task on paper and as a PDF. During the study, participants had the chance to ask questions, e.g. if they got stuck or had any other kind of problem completing the task. While using the tools, the participants were also asked to think out loud. During the task, their screen was recorded and notable usability issues were documented by the study facilitator. After completing both workflows, a semi-structured interview was conducted with each participant, which focused on the usability of both tools.
This semi-structured interview consisted of general questions about the participants and their prior knowledge, as well as questions integrating Nielsen’s usability heuristics17 (see Underlying data30). The interviews made it possible to specifically question observations that we had documented with the respective participants and to judge their comments and wishes more accurately.
The results are structured into three subsections. First, the specific task-related observations and results gained from participants’ use of both tools are reported. Then, several usability problems are described together with suggestions for improvement. Not all usability issues that occurred are listed; instead, the examples that had a major negative impact on task completion in this study are highlighted. Lastly, task completion and task understanding in each of the two participant groups are examined, and it is analysed how user studies on real data can be conducted to help make WMSs available to a wider range of users.
Task completion
Regarding the task completion time, the group of life science students needed slightly longer than the computer science group to finish the task in OpenVibSpec and, on average, around 5 min longer to finish the task in Orange (see Table 1) (see Underlying data).30 The task completion time did not differ notably between Orange and OpenVibSpec for the computer scientists. For the life science group, the difference between the two tools was on average around 3 min. These results suggest two aspects. First, the task duration was reasonably evenly distributed between the two systems, as both groups needed around the same time to finish the task with each tool. Second, the results also show that the computer scientists finished the task faster with both tools in this study. It should be noted, however, that participants were not asked to finish the task as quickly as possible, but could work at their own pace. Therefore, the faster completion time of the computer scientist group can probably be attributed to a more practiced use of computers in general.
| Group | Orange M (mm:ss) | Orange SD | OpenVibSpec M (mm:ss) | OpenVibSpec SD |
|---|---|---|---|---|
| Life Science | 39:44 | 08:23 | 36:33 | 07:56 |
| Computer Science | 34:05* | 06:43 | 33:44 | 06:05 |
Task complexity
The task was rated as quite complex by the life scientist group. Using the scripted approach with OpenVibSpec, all participants managed to complete the task successfully. However, 10 out of 11 participants of the life sciences group needed very close supervision and a fair amount of assistance, as the learning curve seemed too steep without prior programming knowledge. For example, exchanging a simple placeholder variable with a concrete integer value for the segmentation of the image was not obvious to many, although it was clearly described in the instructions. Without the intervention of the study facilitator, some participants would have given up on the task at this point. Questioned afterwards, the participants also stated that the instructions actually clearly described the exchange of the placeholder variable for an integer value. Some said they were simply overwhelmed by the task and by the code. As for the computer science group, still 5 out of 11 participants needed some sort of support during the script-based task.

Overall, interviews indicated that the task was easier to solve with the GUI. Especially participants who found the script-based variant somewhat discouraging emphasized the GUI positively, as it enabled them to follow the workflow at least visually and, despite the instructions, gave them the feeling of being in control. However, even with step-by-step instructions for the GUI, the task was highly challenging for novice users, as most of them relied on support by the study facilitator at one point or the other.

In conclusion, task complexity can be interpreted in several ways. The different perception of the two approaches may have been due to the differently prepared instructions (text instructions in Jupyter versus screenshots for Orange). However, this is contradicted by the fact that at least one participant in the life science group who had neither Python nor programming experience was able to solve the task. Instead, the perception of complexity seems to be based more on programming skills in general. Many participants, especially in the life science group, had to learn how to run programming code in the Jupyter environment at the beginning of the task. Some participants even had difficulties recognizing on their own which sections belonged to the instructions and which belonged to the script, until they received support from the study facilitator. Such difficulties raised the barrier to solving the task even higher for these participants.
Comprehension of results
Participants’ understanding of the workflow outcomes was quite heterogeneous. Although all participants were given a basic introductory video to watch before the first task, four life scientists and four computer scientists were unable to give any real conclusion about the outcome images, such as what the classification of the images could mean and how one would proceed further. In total, eight participants (two computer scientists, six life scientists) expressed that they had followed the steps of the instructions but could not elaborate on the context of the subject matter. Overall, it reportedly had been easier to understand the workflow in Orange, since the widget names at least made it understandable that a data set was manipulated with several operations. The majority of participants (nine computer scientists, five life scientists) were able to explain that the workflow applied artificial intelligence methods to classify tissue data. Furthermore, they were also aware that they could not interpret the results in more detail and would have to present the result to an expert to repeat the clusterings with other parameters if necessary. This shows that although the participants did not know the dataset before participating in the study, nor had they used the tools before, they were able to perform the task and also to comprehend it well in terms of its subject matter. In addition, one participant explicitly pointed out that the resulting images lacked a unique assignment of index colors. It was impossible to understand how the method decided which areas of the image were clustered together. Here, the participant suggested the use of explanatory labels to understand the clustering.
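One way to address the missing assignment of index colors, suggested here only as a sketch, is to fix the color per cluster index and add an explanatory legend to the result image; the colors, cluster count and use of Matplotlib below are arbitrary examples, not part of either tool.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch

# label_image: 2D array of cluster indices, e.g. from the k-means step above.
label_image = np.random.randint(0, 5, size=(224, 224))

# Fixing one color per cluster index and attaching a legend gives every result
# image an unambiguous, explainable assignment of index colors.
colors = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"]
cmap = ListedColormap(colors)

plt.imshow(label_image, cmap=cmap, vmin=0, vmax=len(colors) - 1)
plt.legend(handles=[Patch(color=c, label=f"Cluster {i}") for i, c in enumerate(colors)],
           loc="center left", bbox_to_anchor=(1.0, 0.5))
plt.axis("off")
plt.tight_layout()
plt.show()
```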
Severe usability issues
The usability issues presented here refer primarily to those that made designing the task for the study challenging and those that hindered or negatively impacted participants’ task completion.
Learnability
As mentioned in the previous section, participants of the life science group had severe problems in completing the task on their own with the script-based approach of OpenVibSpec. The learning curve seemed too steep for the majority of these participants despite detailed instructions. However, the learnability of OpenVibSpec was hampered by the fact that many participants first had to learn how to use the Jupyter notebook, e.g. how to run scripts or even how to understand the controls at all. The task became considerably easier if someone had previous experience in programming. Participants with prior experience in Python and Jupyter notebooks had little to no problems focusing on and solving the task. In comparison, all participants were able to replicate the workflow independently with the GUI of Orange, even though none of the participants knew the application beforehand.
Efficiency
Prior to the usability test, during the preparation of the study, longer computing times for Orange were recognized. Therefore, the data set used in this study was reduced for Orange. Since the data reading and saving processes also took noticeably longer, they were skipped in the task for Orange, and the participants were given a minimally prepared workflow beforehand. During the interviews, several participants from both groups presumed that the tools share the same code base, except that one offers a GUI while the other is only script-based. These assumptions suggest that the changes to the workflow (smaller data set for Orange) and to the task (skipping the import and export operations in Orange) resulted in both tools being considered equivalent in terms of efficiency and comparable.
In addition, it can be noted that participants who had no programming experience rated the efficiency of Orange as better. The reason given was learnability: in OpenVibSpec, a longer training period would be necessary and programming experience would be required, whereas Orange could be operated directly via the GUI.
Error messages and error prevention
An important usability criterion is the handling of error messages, which should present errors in a human-readable form. During the study, error messages occurred in both tools that were essentially raw Python error messages and could only be interpreted by participants with programming experience. For all other participants, the study facilitator had to intervene and provide assistance. In addition, in the Orange GUI it was not always clear to the participants whether an error message or only a warning message was being displayed. The occurrence of such error and warning messages led to uncertainty in many participants’ completion of the task, which is why the study facilitator briefly explained the messages so that participants could continue with the task.
Another usability issue concerns error prevention in general. An example of missing error prevention that hindered participants from achieving the expected result image occurred during the classification with the random forest algorithm in Orange. Only 7 out of 11 participants from the computer science group were able to successfully complete the task; three arrived at a different result image via incorrectly connected widgets. In the life sciences group, 6 out of 11 participants were able to correctly solve the task with Orange; for 5 out of 11, the same connection error of two widgets occurred. This error was the result of a rather simple usability issue while creating the workflow. However, users did not realize they had done something wrong until they saw the resulting image, as the feedback given by the GUI was not explicit and clear enough. These participants did not notice that the order in which they connected the training and test data to a specific widget led to an automatic classification of the data by the tool as training or test data. Due to the incorrect order of the connections, training data was labeled as test data and vice versa (see Figure 2). Here, the feedback to the users was not clear enough, as it was not noticed by those to whom it happened. On the other hand, this error could have been prevented if the order of the connection had not mattered. In OpenVibSpec such an error did not occur because an automatic interpretation of the dataset is not offered; instead, the training dataset or test dataset had to be specified explicitly in the method call, as sketched after the figure below.
If afterwards the training data (‘Train’) was connected with ‘Test and Score’, the tool labeled the training data as test data. This issue occurred for 8 participants in total.
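The following sketch illustrates this design principle in Python; the helper function is hypothetical and not part of OpenVibSpec, but it shows how keyword-only arguments force the caller to name the training and test sets explicitly, so that an accidental swap like the one observed with the widget connections cannot happen silently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = np.random.rand(100, 20), np.random.randint(0, 2, 100)
X_test, y_test = np.random.rand(40, 20), np.random.randint(0, 2, 40)

# Hypothetical helper: the '*' makes all arguments keyword-only, so training and
# test data must be named explicitly and cannot be swapped by positional order.
def train_and_score(*, train_data, train_labels, test_data, test_labels):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train_data, train_labels)
    return clf.score(test_data, test_labels)

accuracy = train_and_score(train_data=X_train, train_labels=y_train,
                           test_data=X_test, test_labels=y_test)
print(f"Accuracy on the test data: {accuracy:.2f}")
```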
The data revealed that the participation of two different user groups resulted in a broader range of views, task-related observations, and usability problems found.
For example, according to the statements in the interviews and the observations during the study, it was apparent that most participants had a more pleasant user experience with the GUI tool, in particular in the life scientist group. Although 6 out of 11 participants in this group had difficulties understanding the workflow and its results, the GUI at least gave them the possibility to try to comprehend the data operations and the feeling of being in control, whereas the script-based approach seemed too complex for many to comprehend. Participants of both groups believed that, as non-programmers, they would prefer the GUI tool, as the script-based approach with OpenVibSpec would require too much prior knowledge. Instead, the GUI used familiar interaction techniques, for example drag-and-drop or context menus.
With regard to the exploratory behavior of the participants, no real differences were found between the groups. However, in the life science group there were three participants who performed the same k-means clustering three times in OpenVibSpec, while in the computer science group there was only one participant who repeated the clustering with the same cluster size three times. All other participants took a more exploratory approach to the task and chose three different cluster sizes (Table 2). In addition, there were also two participants in the life science group who chose a clustering of c=1, which ultimately resulted in a monochromatic image, suggesting that they were not entirely sure, at least beforehand, what the clustering was doing. In Orange, no participant attempted exploratory testing of cluster sizes. Instead, the k-means widget was used without modifying the settings, resulting in every participant using the same cluster size (c=5). To get to the settings for the widget, one would have had to open it. However, this was neither requested nor suggested in the task description for Orange. Participants would probably have noticed the cluster size if it had been visually reflected in the widget, and perhaps it could then have been changed directly in the GUI without having to open the settings. Such visual feedback, directly visible in the workflow, could promote exploration of different clusterings. However, due to the two-step procedure (first open the widget settings, then change the cluster size), the participants neither knew nor questioned the number of clusters used in the workflow in Orange.
| Group | Orange (runs) | Orange (c) | OpenVibSpec (runs) | OpenVibSpec (c) |
|---|---|---|---|---|
| Life Science | 1 | 5 | 2.3 | 4.4 |
| Computer Science | 1 | 5 | 3 | 4.5 |
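A minimal sketch of such an exploratory strategy in the script-based setting is shown below; the cluster sizes and data are illustrative assumptions, and the loop simply collects one segmentation per cluster size for visual comparison.

```python
import numpy as np
from sklearn.cluster import KMeans

pixel_spectra = np.random.rand(5000, 200)   # stand-in for image pixel spectra

# Run k-means with several cluster sizes, as most OpenVibSpec participants did,
# and keep one label vector per cluster size for later visual comparison.
segmentations = {}
for c in (3, 5, 7):                          # hypothetical cluster sizes
    kmeans = KMeans(n_clusters=c, n_init=10, random_state=0)
    segmentations[c] = kmeans.fit_predict(pixel_spectra)
```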
Another difference between the groups was that some participants in the life science group were negative about the script-based approach. During task processing with OpenVibSpec, there were at least two participants who clearly verbalized their frustration with the tool, especially when errors occurred. Another participant expressed a strong preference not to have to go through the workflow, but expected the tool to simply analyze the image data and provide him with the results without having to start multiple clusterings or algorithms, so that he would hardly have to interact with the tool at all.
Regarding the perceived efficiency, six participants (four life scientists and two computer scientists) rated Orange as more efficient than OpenVibSpec, and two participants (one from each group) rated the efficiency of both tools as equal, although the computing times of Orange were noticeably longer. One possible explanation for this perception is that in OpenVibSpec three runs for k-means were prepared and the majority of participants explored differences between cluster sizes (see Table 2). In addition, the calculations took longer when participants used a higher cluster size. Another possible explanation for the perceived efficiency is that participants (especially from the life science group) felt more comfortable using the graphical widgets.
As planned, the study revealed many usability problems in both tools and at different levels of severity. Several issues were discovered as a result of inviting two different user groups to the study. For example, the high entry barrier of script-based workflow management systems for non-programmers was particularly evident for the life science group. Here we could show that even the simplest changes in the scripts were challenging for novice users and could only be made with the assistance of the study facilitator. Accordingly, training or tutorials would be necessary for users if they are to apply such script-based tools for their purposes. In addition, some usability problems (including the long loading times for Orange) could already be noticed during the preparation of the study. Ideally, these problems should have been fixed prior to conducting the study, so that the data import could have been performed in both tools.
Another challenge that frequently occurred was that participants had difficulty understanding or interpreting the results. The problem of interpretability could not be traced back exclusively to one of the two approaches; rather, it seems likely that either more feedback on the methods used needs to be provided, or that the bioimaging question needs to be linked more closely to the workflow rather than to the tool.
Considering the amount of effort for the study and the usability problems found, it seems likely that a smaller number of participants would have been sufficient for the results. This illustrates quite well the disputed question of how many participants are needed, as discussed above. Probably, even with small-scale user tests, severe usability problems would already have been noticed. In part, we had already encountered this during the development of the tasks for the study (e.g., efficiency problems). However, such studies reveal numerous usability issues and requirements for changes, as well as many views and ideas to be interpreted, which, if implemented well, can ultimately lead to better tools and larger user bases.
In order to clarify the importance and benefit of usability tests in the field of WMSs, this study’s methodical approach requires reflection and discussion. One challenge was to design a task that covered roughly the same aspects in two different tools (GUI vs. script). To achieve this, several more modules/plugins had to be integrated into Orange that were not included in the main application. This, in turn, only became apparent with the need to create comparable tasks in both tools. The development of the tasks for evaluation therefore already revealed certain issues (e.g. missing modules, efficiency problems). Such usability issues could probably have been found systematically for both tools with a heuristic analysis before the study. Thus, depending on the scope of the analysis and the available resources, even a heuristic inspection of the tools could find many problems and opportunities for improvement before conducting a larger-scale user study.
Due to the participation of novice users and the exploratory nature of the study, the results are limited with respect to the use of bioimage processing. Here, in comparison, it would be interesting to see how experienced users working in the field would handle the two tools. In particular, an exploratory strategy, including various clusterings, and the interpretation of results could vary with experienced users. However, a positive aspect of our approach was that different user groups were selected for the study, as both groups highlighted different requirements for the tools.
From a methodological perspective, the formative design of the study was quite appropriate because the tools take very different approaches and the participants had relatively little prior experience in the subject matter. However, it would have been useful to give all participants a brief introduction to Python and Jupyter notebooks beforehand, so that handling the notebooks would not have interfered with the actual execution of the workflow in OpenVibSpec. Additionally, instructions on how to perform different k-means clusterings in Orange would have been helpful, since none of the participants had tried this on their own. On the other hand, multiple clusterings would have further increased the processing time in Orange.
The study benefited from the fact that not only one tool was tested by participants, but that participants used two different tools (GUI vs. script-based) and thus also compared advantages and disadvantages of both approaches to implementing workflows. Furthermore, clear preferences between user groups could be identified and, in part, specific needs could be revealed. For example, the need to train potential users in programming became clear, even though the task in the study could also be solved without programming knowledge.
Another aspect that could not be considered in this study are learning effects and memorability. Thus, it would be interesting to see how the participants would handle the applications after several weeks, how well they would solve the tasks, and whether the same usability problems would occur again. In addition, one could examine how much learning effort participants without programming expertise would have to put in before they would be able to independently recreate workflows with OpenVibSpec and Orange, and how the perception towards the two tools would change.
This study presented findings from a formative usability study that involved two different user groups solving a specific biomedical image analysis task based on two different implementations, one GUI based and one script-based. With the exploratory nature of this study, specific points of improvements for the different workflow implementations regarding different user groups could be identified. One example for such an improvement is the explanation of workflow results. Whereas research directions such as Explainable AI32 focus on making the internal decisions of AI-powered systems more transparent, this study shows that in the field of bioimaging, explaining the output of the systems could be improved by providing useful information to support the interpretation of the results.
A major limitation of this study certainly lies within the limited prior knowledge of the participants. For future studies, more insight will be gained by involving user groups with more prior knowledge. Also, the study setup only allowed a basic comparison between the different user groups and the two implementations, again mainly due to the limited prior knowledge of the participants. Extended future studies may overcome this limitation by conducting a community-wide data challenge accompanied by a usability study of selected participating groups. Besides providing access to significant user groups and allowing a more systematic comparison between user groups and tools, such a study may also elucidate the inductive bias of different users and tools for workflows which require a human-in-the-loop.
Zenodo: A formative usability study of workflow management systems in label-free digital pathology - Data and Code24
This repository contains the following underlying data:
• Tutorial-Video.mp4: a tutorial video presented to participants at the beginning of the study
• code/openvibspec/: code for the script-based workflow, including a GitHub snapshot of the OpenVibSpec repository
• orange/: GUI-based workflow
– workflow-instructions-orange.pdf: instructions for Orange given to participants
– CompleteWorkflowOrange.ows: Orange file that contains the full workflow
• data/openvibspec: training and test data for use in OpenVibSpec
• data/orange: training and test data for use in Orange
Zenodo: A formative usability study of workflow management systems in label-free digital pathology - Questionaires30
This repository contains the following underlying data:
• data.xlsx: demographic data of participants, time needed for the task, additional notes
• demographics_questionnaire_de.docx: demographic questionnaire in German
• demographics_questionnaire_en.docx: demographic questionnaire in English
• interview_de.docx: interview questions in German
• interview_en.docx: interview questions in English
• maxqda_export_de.xlsx: MaxQDA export file with codes and interview statements in German
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
AM and TH conceived the project. MJ and EF conducted the usability study. APR and JB prepared software and data for the study. MJ, APR, TH and AM analyzed the outcome of the study. MJ, APR, TH and AM wrote and revised the manuscript.