
Researchers typically evaluate AI model performance using benchmarks—“test” sets of data on which they run their algorithms to assess the results. However, good performance on standard benchmarks doesn’t always translate to real-world success, particularly in medical AI, where standard benchmarks encounter problems with biased data, small datasets, and limited testing time.
The Johns Hopkins Computational Cognition, Vision, and Learning group has developed a new, large-scale, collaborative medical image dataset called Touchstone to address these issues and enable fairer—and more realistic—medical AI evaluation. Led by Bloomberg Distinguished Professor Alan Yuille, the group presented its benchmark at the 38th Annual Conference on Neural Information Processing Systems in December.
“We aim to promote improved evaluation standards for medical AI—standards that better guide progress toward models with genuine clinical usefulness,” says co-lead author Pedro R.A.S. Bassi, a visiting PhD student from the University of Bologna in Italy. “Our standard involves testing AI on images from additional hospitals; using large test datasets; providing a large, multi-institutional training dataset; evaluating AI bias; inviting AI creators to train their own algorithms; and creating a benchmark with a long-term commitment.”
To build Touchstone, the CCVL team assembled 11,098 CT scans collected from 87 hospitals and annotated with nine types of abdominal organs. They split the scans into a training set called AbdomenAtlas 1.0, which is publicly available for researchers, and a private test set reserved for rigorous evaluation. Algorithms trained and evaluated on Touchstone can help with tasks like surgery planning, robotic surgery guidance, and organ volume measurement, which is important for detecting diseases; for example, diabetes can be linked to changes in the volume of the pancreas, while cirrhosis causes changes in the volume of the liver.
The group then invited 14 inventors of 19 new AI techniques for analyzing medical images to train their algorithms on Touchstone, after which the CCVL researchers independently evaluated those algorithms on a private dataset from the Johns Hopkins Hospital. Their evaluation assessed the algorithms’ performance on nine anatomical structures, comparing average results, analyzing metadata groups, ranking them by class, visualizing worst-case performance, and assessing inference time and computational cost—key factors for the clinical deployment of AI algorithms.
The researchers were careful to address six major problems with standard medical segmentation benchmarks in their implementation of Touchstone:
1. There are often too many similarities between training and testing data.
In many existing benchmarks, the CT scans in the test set often share source hospitals, scanner hardware, and even patient populations with scans in the training set. As a result, AI algorithms may perform well on the test set but poorly in real-world scenarios.
That’s why the CCVL group used a completely separate, private dataset to evaluate the algorithms trained on AbdomenAtlas 1.0. They also encourage other researchers to test their medical AI models across multiple hospitals never used in training to truly gauge the models’ real-world performance.
2. Other benchmarks use too-small test sets.
Because the process of annotating medical data is so expensive and time-consuming, most of the annotated medical data that researchers can secure is used for training, leaving very little for testing their AI algorithms. In contrast, Touchstone’s test set is much larger than the test sets of all current public CT benchmarks combined, thus enhancing the statistical significance of its results: A 1% average accuracy increment across 5,000 CT scans is more indicative of a genuine algorithmic improvement than a 1% variation across 50 CT scans, the researchers say.
To confirm this assertion, they tested the 19 submitted algorithms on both large and small test sets and found that AI model rankings were far more stable when using large test sets. In other words, only using small test sets made it difficult to claim that one model consistently outperforms another, as any observed advantage could be entirely due to chance.
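The researchers’ point about ranking stability can be illustrated with a small simulation. This is a hypothetical sketch, not Touchstone’s actual evaluation code; the mean Dice scores, per-scan noise level, and trial counts are assumed for illustration. With a true one-percentage-point gap between two models, a small test set often flips the ranking by chance, while a large one almost never does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: two models whose true mean Dice scores differ by one
# percentage point (0.90 vs. 0.89), with realistic per-scan variability.
def simulate_scores(n_scans, mean, sd=0.08):
    """Draw per-scan Dice scores, clipped to the valid [0, 1] range."""
    return np.clip(rng.normal(mean, sd, n_scans), 0.0, 1.0)

def rank_flip_rate(n_scans, n_trials=2000):
    """Fraction of repeated evaluations in which the truly worse model
    happens to score a higher average than the truly better one."""
    flips = 0
    for _ in range(n_trials):
        better = simulate_scores(n_scans, 0.90)
        worse = simulate_scores(n_scans, 0.89)
        if worse.mean() > better.mean():
            flips += 1
    return flips / n_trials

for n in (50, 5000):
    print(f"test set of {n:>4} scans: ranking flips in "
          f"{rank_flip_rate(n):.1%} of trials")
```

Under these assumptions, the 50-scan test set reverses the true ranking in roughly a quarter of trials, while the 5,000-scan test set essentially never does, which is the intuition behind the team’s preference for large private test sets.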
3. Average performance isn’t enough.
Most standard benchmarks only compare average performance and don’t take the time to identify each algorithm’s strengths and weaknesses in different scenarios.
“For instance, one algorithm might excel at segmenting small, circular structures—like the gall bladder—while another performs better on long, tubular ones, such as the aorta,” the team writes in its paper. “Only comparing the average performance across many classes can hide these nuances.”
To avoid this pitfall, the researchers evaluated submitted algorithms on different types of abdominal organs and reported which performed best for each class of segmentation task, making it more likely that the appropriate algorithm will be chosen to complete the real-world task for which it’s best suited.
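The per-class reporting described above can be sketched in a few lines of Python. The toy labels below are hypothetical and are not the Touchstone evaluation pipeline; the point is simply that computing a Dice score per organ class, rather than one overall average, keeps each model’s per-structure strengths and weaknesses visible:

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total else 1.0

def per_class_dice(pred_labels, gt_labels, classes):
    """One Dice score per organ class for integer-labeled segmentation maps."""
    return {c: dice(pred_labels == c, gt_labels == c) for c in classes}

# Toy 1-D "segmentation maps" with two assumed organ labels
# (1 = liver, 2 = aorta); real CT segmentations are 3-D volumes.
gt   = np.array([0, 1, 1, 1, 2, 2, 0, 0])
pred = np.array([0, 1, 1, 0, 2, 2, 2, 0])

scores = per_class_dice(pred, gt, classes=[1, 2])
print(scores)  # a Dice value per organ, not just one overall average
```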
4. Benchmark results can rely on unfair comparisons.
“If one researcher writes a paper comparing multiple AI models, he may put more time and effort into tuning his own model—that is, ensuring it performs well—than the others,” Bassi notes. “This creates unfair comparisons, as small changes in training parameters can significantly influence AI performance.”
To promote fairness in its Touchstone benchmark, the Hopkins team had each submitted AI model trained by its own creators, ensuring that results were as comparable as possible among the 14 teams that submitted algorithms for evaluation.
5. Researchers can’t take their time.
Most existing benchmarks are available for only a limited time—sometimes as short as three months—pressuring researchers into quickly tuning and training their models and sometimes excluding teams with fewer resources.
In contrast, the CCVL team has made a long-term commitment to its benchmark, promising to organize recurring challenges for at least five years, continually curate larger datasets, and devote significant time and resources to improving the benchmark’s label quality and task diversity.
6. Biased data leads to biased results.
Finally, the researchers performed bias analyses on the 19 submitted algorithms. They found that the algorithms performed significantly worse on scans of older patients and African Americans, highlighting the need to increase the presence of these demographic groups in public CT scan datasets.
They also found that certain diseases can affect models’ abilities to identify and segment organs accurately; for example, cancer or physical trauma may deform an organ’s shape, making it harder for an AI model to determine its boundaries.
“These findings are important because they show that evaluating AI models on unseen hospitals with large test datasets, and analyzing their biases, is essential to gauging their clinical usefulness and finding potential limitations,” Bassi says.
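The kind of stratified bias analysis the team performed can be sketched as follows. The scores and metadata here are synthetic, invented for illustration, and are not the study’s data; the sketch simply groups per-scan scores by a demographic attribute to surface performance gaps that a single overall average would hide:

```python
import numpy as np

# Synthetic example: per-scan Dice scores paired with an assumed
# demographic attribute (here, a coarse age grouping).
scores = np.array([0.91, 0.88, 0.93, 0.79, 0.81, 0.90, 0.77, 0.92])
age_group = np.array(["<65", "<65", "<65", "65+", "65+", "<65", "65+", "<65"])

# Report mean performance per subgroup rather than one pooled average.
for group in np.unique(age_group):
    subset = scores[age_group == group]
    print(f"{group}: mean Dice = {subset.mean():.3f} (n = {len(subset)})")
```

In this toy example, the pooled average looks healthy while the older subgroup lags noticeably, mirroring the kind of gap the researchers found for older patients and African Americans in real datasets.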
—
While the Touchstone benchmark and dataset directly support the development of accurate medical image segmentation models, the principles they promote are important for evaluating any medical AI task, the researchers say.
“By promoting more realistic evaluation standards, our work helps ensure that medical AI models are truly reliable and beneficial in real-world clinical practice, leading to more accurate diagnoses,” Bassi adds.
With the success of the first edition of Touchstone, its creators are actively pursuing multi-center, out-of-distribution datasets to further enhance the benchmark and hope their initiative will inspire more institutions to contribute their private datasets for third-party evaluation.
The next version of the group’s benchmark, Touchstone 2.0, is now open for participation.