As with the training cases, the number of test cases varies across the tasks (median: 20; IQR: 12–33; min: 1; max: 30). The median ratio of training cases to test cases was 0. The test data differ considerably from the training data, not only in quantity but also in quality. For the test cases, an image was annotated by a median of 3 observers (IQR: 3–4; max: 9). We identified the relevant parameters that characterize a biomedical challenge following an ontological approach (see Methods). The list of parameters that are generally not reported includes some that are crucial for the interpretation of results.
Eighty-five percent of the tasks did not state whether the training data provided by the challenge organizers could be supplemented with other publicly available or private data, although the training data used are key to the success of any machine learning algorithm. Forty-five percent of tasks with multiple annotators did not describe how the annotations were aggregated. The supplementary material further shows how the reporting of individual parameters has evolved over time.
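To illustrate what "aggregating annotations" from multiple observers can mean in practice, the following is a minimal sketch of pixel-wise majority voting over binary segmentation masks. The data are made up, and majority voting is only one of several common label-fusion schemes (STAPLE-like weighted fusion is a frequent alternative); nothing here reflects the scheme of any particular challenge.

```python
# Majority-vote fusion of binary segmentation masks from several observers.
# Masks are flat lists of 0/1 labels here for brevity; real masks are 2-D/3-D.

def majority_vote(masks):
    """Fuse per-observer binary masks into one reference mask.

    A pixel is labeled foreground if more than half of the observers
    marked it as foreground.
    """
    n_observers = len(masks)
    fused = []
    for pixel_votes in zip(*masks):
        fused.append(1 if sum(pixel_votes) * 2 > n_observers else 0)
    return fused

observers = [
    [0, 1, 1, 0],  # observer 1
    [0, 1, 0, 0],  # observer 2
    [1, 1, 1, 0],  # observer 3
]
print(majority_vote(observers))  # -> [0, 1, 1, 0]
```

A challenge report would ideally state not only that such a fusion was performed, but also how ties and ambiguous regions were resolved.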
One data point represents the robustness of one task, quantified by the percentage of simulations in bootstrapping experiments in which the winner remains the winner. Pairwise comparisons would have to be adjusted for multiplicity, and the adjustment depends on the number of algorithms in the task. Metric-based aggregation begins with aggregating metric values over all test cases (e.g., using the mean or the median).
In total, 97 different metrics have been used for performance assessment (three per task on average). The fact that different names may refer to the same metric was compensated for in these computations. Thirty-nine percent of all tasks provided a final ranking of the participants and thus determined a challenge winner.
Fifty-seven percent of all tasks that provide a ranking do so on the basis of a single metric. Overall, 10 different methods for determining the final rank (the last step in the computation) of an algorithm based on multiple metrics were applied. We determined a single-metric ranking based on both versions of the Hausdorff distance (HD) for all segmentation challenges and found radical differences in the rankings. In one case, the worst-performing algorithm according to the HD (10th place) was ranked first when the other variant of the HD was used.
Robustness of rankings with respect to several challenge design choices.
The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest and highest values still within 1.5 IQR of the box. Key examples (red circles) illustrate that slight changes in challenge design may lead to the worst algorithm (Ai: algorithm i) becoming the winner (a) or to almost all teams changing their ranking position (d). One central result of most challenges is the final ranking they produce.
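The two versions of the Hausdorff distance mentioned above can be sketched as follows, on one-dimensional toy point sets. The point sets and the outlier are invented for illustration; real segmentation evaluations compute these distances over contour or surface points, often with tooling such as SciPy.

```python
# Hausdorff distance (HD) and its 95th-percentile variant, illustrating
# how two versions of the "same" metric can judge a result very differently.

def directed_distances(a, b):
    """For each point in a, the distance to its nearest neighbor in b."""
    return [min(abs(p - q) for q in b) for p in a]

def hausdorff(a, b):
    """Symmetric HD: the worst nearest-neighbor distance in either direction."""
    return max(max(directed_distances(a, b)), max(directed_distances(b, a)))

def hausdorff_95(a, b):
    """95th-percentile variant: largely discounts single gross outliers."""
    dists = sorted(directed_distances(a, b) + directed_distances(b, a))
    idx = int(0.95 * (len(dists) - 1))  # simple rank-based percentile
    return dists[idx]

reference = [0, 1, 2, 3, 4]
prediction = [0, 1, 2, 3, 40]           # one gross outlier point

print(hausdorff(reference, prediction))     # -> 36 (dominated by the outlier)
print(hausdorff_95(reference, prediction))  # -> 1  (outlier discounted)
```

A single outlier point thus flips the verdict between the two variants, which is exactly why a ranking based on one version can contradict a ranking based on the other.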
Winners are considered state of the art, and novel contributions are then benchmarked against them. The key design choices related to a ranking scheme based on one or multiple metric(s) are as follows: whether to perform metric-based aggregation ("aggregate, then rank") or case-based aggregation ("rank, then aggregate"), and whether to take the mean or the median. In some cases, almost all teams change their ranking position when the aggregation method is changed.
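The two aggregation schemes just described can be contrasted on a toy metric table. The algorithm names and scores below are made up purely to show that the winner can flip between the schemes; the per-case last-rank convention for missing submissions noted later in the text is mentioned in a comment.

```python
# "Aggregate, then rank" (metric-based) vs. "rank, then aggregate"
# (case-based) on a toy table: rows are algorithms, columns are test
# cases, and higher values are better (e.g., DSC-like scores).
from statistics import mean

scores = {
    "A1": [0.90, 0.50, 0.91],
    "A2": [0.80, 0.79, 0.81],
    "A3": [0.70, 0.95, 0.60],
}

def rank_positions(values_by_algo):
    """Rank algorithms by value, best (highest) first; returns name -> rank."""
    ordered = sorted(values_by_algo, key=values_by_algo.get, reverse=True)
    return {name: pos + 1 for pos, name in enumerate(ordered)}

# Metric-based: average each algorithm's metric over all cases, then rank.
aggregate_then_rank = rank_positions({a: mean(v) for a, v in scores.items()})

# Case-based: rank per test case, then average each algorithm's ranks.
# (If an algorithm misses a case, a challenge may assign it the last rank
# for that case; no values are missing in this toy table.)
n_cases = len(next(iter(scores.values())))
per_case = [rank_positions({a: scores[a][c] for a in scores})
            for c in range(n_cases)]
rank_then_aggregate = rank_positions(
    {a: -mean(r[a] for r in per_case) for a in scores})  # lower mean rank wins

print(aggregate_then_rank)   # A2 wins on mean score
print(rank_then_aggregate)   # A1 wins on mean rank
```

With identical raw scores, the metric-based scheme crowns A2 (best average score) while the case-based scheme crowns A1 (best average rank), mirroring the sensitivity reported above.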
According to bootstrapping experiments, the ranking scheme is a deciding factor for ranking robustness. In experiments with segmentation challenge data, single-metric rankings (those shown here are for the DSC) are significantly more robust when the mean rather than the median is used for aggregation (left) and when the ranking is performed after aggregation rather than before (right). One data point represents the robustness of one task, quantified by the percentage of simulations in bootstrapping experiments in which the winner remains the winner.
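The robustness quantification used in these bootstrapping experiments can be sketched as follows. The scores, algorithm names, and number of resamples are invented for illustration; the point is only the mechanics of resampling test cases with replacement and checking whether the original winner persists.

```python
# Bootstrap check of winner stability for one task: resample the test
# cases with replacement, recompute the winner each time, and report the
# percentage of resamples in which the original winner remains the winner.
import random
from statistics import mean

random.seed(0)  # deterministic sketch

scores = {  # rows: algorithms, columns: test cases (higher is better)
    "A1": [0.91, 0.62, 0.88, 0.70, 0.95],
    "A2": [0.90, 0.65, 0.86, 0.72, 0.91],
}

def winner(case_ids):
    # metric-based aggregation with the mean over the chosen cases
    return max(scores, key=lambda a: mean(scores[a][c] for c in case_ids))

n_cases = len(scores["A1"])
original = winner(range(n_cases))

n_boot = 1000
stable = sum(
    winner(random.choices(range(n_cases), k=n_cases)) == original
    for _ in range(n_boot)
)
print(f"{original} remains the winner in {100 * stable / n_boot:.1f}% of resamples")
```

When two algorithms are as close as in this toy table, the winner flips in a sizable fraction of resamples, which is precisely the fragility the figure quantifies per task.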
Robustness of rankings with respect to the data used. Metric-based aggregation with mean was performed in all experiments.
Top: percentage of simulations in bootstrapping experiments in which the winner according to the respective metric remains the winner. Bottom: percentage of other participating teams that were ranked first in the simulations. Statistical analysis of the segmentation challenges, however, revealed that different observers may produce substantially different rankings, as illustrated in the corresponding figure.
Furthermore, a re-evaluation of all segmentation challenges revealed that rankings are highly sensitive to the test data used. While missing data are straightforward to handle in case-based aggregation (the algorithms for which no results were submitted receive the last rank for that test case), handling them is more challenging in metric-based aggregation, especially when no worst possible value can be defined for a metric.
For this reason, several challenge designs simply ignore missing values when aggregating. Our experimental analysis of challenges was complemented by a questionnaire (see Methods), which was submitted by participants from 23 countries. A variety of issues were identified for the categories data, annotation, evaluation, and documentation. Many concerns involved the representativeness of the data, the quality of the annotated reference data, the choice of metrics and ranking schemes, and the lack of completeness and transparency in reporting challenge results.
Main results of the international questionnaire on biomedical challenges. Issues raised by the participants related to the challenge data, the data annotation, the evaluation (including the choice of metrics and ranking schemes), and the documentation of challenge results. The establishment of common standards and clear guidelines is currently hampered by open research questions that still need addressing. One practice that can be universally recommended, however, is comprehensive reporting of the challenge design and results. The MICCAI satellite event team used the parameter list in the challenge proposal submission system to test its applicability.
This paper shows that challenges play an increasingly important role in the field of biomedical image analysis, covering a huge range of problems, algorithm classes, and imaging modalities. Finally, challenge rankings are sensitive to a range of design parameters, such as the metric variant applied, the type of test case aggregation performed, and the observer annotating the data. One of the key implications of our findings is the discrepancy between the potential impact of challenges and their current practice.
Our study shows that the specific (according to our questionnaire, sometimes arbitrarily taken) challenge design choices (e.g., the aggregation scheme) can have a major effect on the final ranking. Hence, the challenge design, and not only the merit of the methods competing in a challenge, may determine the attention that a particular algorithm receives from the research community and from companies interested in translating biomedical research results. As a consequence, one may wonder which conclusions can actually be drawn from a challenge.
It seems only natural, then, to ask whether we should announce a winner at all. A similar concern was recently raised in the related field of machine learning by Sculley et al. Collaborative challenges without winners, which have been successfully applied in mathematics, for example 55,56, could potentially solve this issue to some extent, but they require a modular challenge design, which may not be straightforward to implement. The concept of combining competitive with collaborative elements, as pursued in the DREAM challenges 57, should be further investigated in this context.
Even if a specific challenge design resulted in a robust ranking, the winning method would not necessarily be the method of choice in practice; this may differ crucially from application to application. A related and increasingly relevant problem is that it is often hard to understand which specific design choice of an algorithm actually makes it better than its competitors.
It is now well known, for example, that the method for data augmentation (i.e., the artificial expansion of the training data) can have a substantial influence on a learning algorithm's performance.
Along these lines, Lipton and Steinhardt 58 point out that the way in which machine learning results are reported can sometimes be misleading, for example, by failing to identify the sources of empirical gains and through speculation disguised as explanation. The authors thus argue for a structured description not only of the challenge itself but also of the competing algorithms.
Ideally, competing methods would be released as open source (admittedly a potential problem for participants from industry), and a structured description of the method would be generated automatically from the source code. Owing to the lack of common software frameworks and terminology, however, this is far from straightforward at this stage.
An overarching question related to this paper is whether not only the design but also the selection of challenges should be actively steered. Today, the topics pursued in the scope of challenges are not necessarily the actual grand challenges that the communities face. Instead, they reflect who is willing, and allowed, to release data and to dedicate resources to organizing a competition.
Given that the mere existence of benchmarking data sets for a particular problem clearly leads more people to invest resources in that topic, mechanisms should be put in place to channel the resources of the scientific community toward the most important unsolved problems.
Overall, the demand for improvement along with the complexity of the problem raises the question of responsibility. The authors encourage the different stakeholders involved in challenge design, organization, and reporting to help overcome systemic hurdles. Societies in the field of biomedical image processing should make strategic investments to increase challenge quality.
One practical recommendation would be to establish the concept of challenge certification.
Analogously to the way clinical studies can be classified into categories reflecting their level of evidence, challenges could be certified according to defined quality criteria. Ideally, the certification would include a control process for the reference annotations. Similarly, the authors see it as the role of the societies to release best practice recommendations for challenge organization in the different fields that require dedicated treatment.
In turn, platforms hosting challenges should perform much more rigorous quality control. To improve challenge quality, for example, it should be possible to give open feedback on the data and design of challenges. Furthermore, a more rigorous review of challenge proposals should be put in place by conferences. While instantiating the parameter list can be regarded as cumbersome, the authors believe that this manner of quality control is essential to ensure the reproducibility and interpretability of results. This initiative, however, can only be regarded as a first step, also because control mechanisms to ensure that proposed challenge designs are implemented as suggested are resource-intensive and still lacking.
Furthermore, the parameter list still lacks external instantiation in some domains, especially in the field of biological image analysis. Funding bodies should further identify open problems in the field of biomedical image analysis that should be tackled in the scope of either collaborative or competitive challenges, and provide funding for the design, organization, and certification of these challenges. This is in contrast to common practice, where funding is typically provided for solving specific problems.
Journal editors and reviewers should provide extrinsic motivation to raise challenge quality by establishing a rigorous review process. Several high-impact journals have already taken important measures to ensure reproducibility of results in general. These should be complemented by concepts for quality control regarding comprehensiveness of reporting, generation of reference annotations and choice of metrics and ranking schemes.
Furthermore, journal editors are encouraged to work with the respective societies to establish best practice recommendations for all the different subfields of a domain, e.
Organizers of challenges are highly encouraged to follow the recommendations summarized in this paper and to contribute to the establishment of further guidelines dedicated to specific subfields of biomedical image analysis. They should put a particular focus on the generation of high-quality reference data and the development and deployment of an infrastructure that prevents cheating and overfitting to the challenge data. While this paper concentrates on the field of biomedical image analysis challenges, its impact can be expected to go beyond this field. Importantly, many findings of this paper apply not only to challenges but to the topic of validation in general.
It may be expected that more effort is typically invested when designing and executing challenges which, by nature, have a high level of visibility and go hand in hand with publication of the data compared to the effort invested in performing in-house studies dedicated to validation of an individual algorithm. Therefore, concerns involving the meaningfulness of research results in general may be raised.
This may also hold true for other research fields, both inside and outside the life sciences, as supported by related literature 59,60,61,62. Clearly, it will not be possible to solve all the issues mentioned in a single large step. The challenge framework proposed could be a good environment in which to start improving the common practice of benchmarking.
Implementing a possibly domain-specific checklist of parameters to be instantiated in order to describe the data used in a challenge can safely be recommended across scientific disciplines. In the long run, this could encourage further improvements in the documentation of the algorithms themselves. In conclusion, challenges are an essential component in the field of biomedical image analysis, but major research challenges and systemic hurdles need to be overcome to fully exploit their potential to move the field forward.
Challenge: open competition on a dedicated scientific problem in the field of biomedical image analysis.