National Alliance for Medical Image Computing

The Publication Database hosted by SPL


On Evaluating Brain Tissue Classifiers without a Ground Truth

1Psychiatry Neuroimaging Laboratory, Department of Psychiatry, Brigham and Women’s Hospital, Boston, MA, USA.
2Clinical Neuroscience Division, Laboratory of Neuroscience, Boston VA Healthcare System, Brockton Division, Department of Psychiatry, Harvard Medical School, Boston, MA, USA.
3Laboratory of Mathematics in Imaging, Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA.
Publication Date: July 15, 2007
Volume Number: 36
Issue Number: 4
Neuroimage. 2007 Jul 15;36(4):1207-24.
PubMed ID: 17532646
Keywords: Evaluation, Validation, Image segmentation, Agreement, Gold standard
Sponsors:
K02 MH001110/MH/NIMH NIH HHS/United States
P41 RR013218/RR/NCRR NIH HHS/United States
R01 MH040799/MH/NIMH NIH HHS/United States
R01 MH050740/MH/NIMH NIH HHS/United States
U54 EB005149/EB/NIBIB NIH HHS/United States
Generated Citation:
Bouix S., Martin-Fernandez M., Ungar L., Nakamura M., Koo M.S., McCarley R.W., Shenton M.E. On Evaluating Brain Tissue Classifiers without a Ground Truth. Neuroimage. 2007 Jul 15;36(4):1207-24. PMID: 17532646. PMCID: PMC2702211.
Downloaded: 2926 times.

Abstract:
In this paper, we present a set of techniques for the evaluation of brain tissue classifiers on a large data set of MR images of the head. Because it is difficult to establish a gold standard for this type of data, we focus on methods that do not require a ground truth but instead rely on a common-agreement principle. Three techniques are presented: the Williams' index, a measure of common agreement; STAPLE, an Expectation-Maximization algorithm that simultaneously estimates performance parameters and constructs an estimated reference standard; and Multidimensional Scaling, a visualization technique for exploring similarity data. We apply these evaluation methodologies to eleven segmentation algorithms on forty MR images. We then validate our evaluation pipeline by building a ground truth based on human expert tracings, and compare the evaluations with and without a ground truth. Our findings show that comparing classifiers without a gold standard can provide substantial useful information: outliers can be easily detected, strongly consistent or highly variable techniques can be readily discriminated, and the overall similarity between different techniques can be assessed. On the other hand, we also find that some information present in the expert segmentations is not captured by the automatic classifiers, suggesting that common agreement alone may not be sufficient for a precise performance evaluation of brain tissue classifiers.
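The Williams' index mentioned in the abstract compares a classifier's agreement with a group of raters to the agreement within the group itself, which is what allows evaluation without a ground truth. A minimal sketch in Python, using the Dice coefficient as the pairwise agreement measure (the function and variable names here are illustrative, not taken from the paper's implementation):

```python
def dice(a, b):
    """Dice overlap between two binary label vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    return 2.0 * inter / (sum(a) + sum(b))

def williams_index(test, raters):
    """Williams' index of `test` against a group of `raters`.

    Ratio of the tested classifier's mean agreement with the group
    to the mean pairwise agreement within the group itself.
    """
    n = len(raters)
    # Mean agreement of the tested classifier with each group member.
    num = sum(dice(test, r) for r in raters) / n
    # Mean pairwise agreement over all group-member pairs.
    den = (2.0 / (n * (n - 1))) * sum(
        dice(raters[j], raters[k])
        for j in range(n) for k in range(j + 1, n))
    return num / den
```

An index near (or above) 1 suggests the tested classifier agrees with the reference group at least as well as the group members agree with one another.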

Additional Material
1 File (285.764kB)
Bouix-NeuroImage2007-fig3.jpg (285.764kB)