Workshop on 3D Segmentation in the Clinic: A Grand Challenge

29 October 2007
Brisbane, Australia
10th International Conference on Medical Image Computing and Computer Assisted Intervention  

Evaluation measures

For each test case, a reference segmentation is available, called the 'reference' here. The segmentation results, called the 'segmentation' here, are evaluated by assigning a score to each test case. The maximum score is 100 and is obtained only when the segmentation is exactly the same as the reference. The total score of a method is obtained by averaging the scores of all test cases.

The score of each test case itself is the average of five scores, each also scaled from 0 to 100. The five scores are obtained from five different evaluation measures:

  • Volumetric overlap. This is the number of voxels in the intersection of segmentation and reference, divided by the number of voxels in their union. This value is 1 for a perfect segmentation and 0 when there is no overlap at all between segmentation and reference. For scoring, this measure is expressed as the volumetric overlap error in percent, i.e. 100 times one minus the overlap.
  • Relative absolute volume difference, in percent. The total volume of the segmentation is divided by the total volume of the reference. From this number 1 is subtracted, the absolute value is taken and the result is multiplied by 100. This value is 0 for a perfect segmentation and larger than zero otherwise. Note that the perfect value of 0 can also be obtained for a non-perfect segmentation, as long as the volume of that segmentation is equal to the volume of the reference.
  • Average symmetric absolute surface distance, in millimeters. The border voxels of segmentation and reference are determined; these are defined as the voxels in the object that have at least one of their 26 nearest neighbours outside the object. For each border voxel of either set, the closest border voxel of the other set is determined, using Euclidean distance in real-world coordinates, so that the generally different resolutions in the different scan directions are taken into account. All these distances are stored, for border voxels from both reference and segmentation, and their average gives the average symmetric absolute surface distance. This value is 0 for a perfect segmentation.
  • Symmetric RMS surface distance, in millimeters. This measure is similar to the previous one, but the squared distances between the two sets of border voxels are stored. After averaging the squared values, the square root is taken, which gives the symmetric RMS surface distance. This value is 0 for a perfect segmentation.
  • Maximum symmetric absolute surface distance, in millimeters. This measure is similar to the previous two, but only the maximum of all voxel distances is taken instead of the average. This value is 0 for a perfect segmentation.
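The five measures above can be sketched in code. The following is an illustrative implementation, assuming the reference and the segmentation are given as binary 3D NumPy arrays together with the voxel spacing in millimeters; it is not the official evaluation software of the challenge.

```python
# Illustrative sketch of the five evaluation measures, assuming `seg` and
# `ref` are binary 3-D NumPy arrays and `spacing` is the voxel size in mm
# along each axis. Not the official challenge evaluation code.
import numpy as np
from scipy import ndimage

def volumetric_overlap(seg, ref):
    """Intersection over union (1.0 for a perfect segmentation)."""
    inter = np.logical_and(seg, ref).sum()
    union = np.logical_or(seg, ref).sum()
    return inter / union

def relative_volume_difference(seg, ref):
    """Relative absolute volume difference, in percent (0 when volumes match)."""
    return abs(seg.sum() / ref.sum() - 1.0) * 100.0

def _border_voxels(mask):
    # Border voxels: object voxels with at least one of the 26 nearest
    # neighbours outside the object (erosion with a full 3x3x3 structure).
    eroded = ndimage.binary_erosion(mask, structure=np.ones((3, 3, 3)))
    return np.logical_and(mask, np.logical_not(eroded))

def surface_distances(seg, ref, spacing):
    """All symmetric border-to-border distances, in millimeters."""
    seg_border = _border_voxels(seg)
    ref_border = _border_voxels(ref)
    # Distance from every voxel to the nearest border voxel of the other set,
    # taking the (generally anisotropic) voxel spacing into account.
    dist_to_ref = ndimage.distance_transform_edt(~ref_border, sampling=spacing)
    dist_to_seg = ndimage.distance_transform_edt(~seg_border, sampling=spacing)
    return np.concatenate([dist_to_ref[seg_border], dist_to_seg[ref_border]])

def surface_distance_measures(seg, ref, spacing):
    """Average, RMS, and maximum symmetric surface distance."""
    d = surface_distances(seg, ref, spacing)
    return d.mean(), np.sqrt((d ** 2).mean()), d.max()
```

The distance transform computes, for every voxel, the distance to the nearest border voxel of the other object, so indexing it with one object's border mask yields exactly the per-voxel closest distances described above.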

For more background information about these measures we refer to Gerig, G., Jomier, M. & Chakos, M., Valmet: a new validation tool for assessing and improving 3D object segmentation, MICCAI 2001, Springer, Berlin, 2001, 516-523.

Scoring system

To convert these five measures to scores, a linear scaling with a cut-off is applied. A perfect result (an overlap error of 0% and 0 for the other measures) yields 100 points. We determined average values of the five measures from an independent human segmentation of several test cases, which resulted in the following values:

Liver, volumetric overlap error: 6.4%
Liver, relative absolute volume difference: 4.7%
Liver, average symmetric absolute surface distance: 1.0mm
Liver, symmetric RMS surface distance: 1.8mm
Liver, maximum symmetric absolute surface distance: 19mm

Caudate, volumetric overlap error: 15.8%
Caudate, relative absolute volume difference: 5.6%
Caudate, average symmetric absolute surface distance: 0.27mm
Caudate, symmetric RMS surface distance: 0.56mm
Caudate, maximum symmetric absolute surface distance: 3.4mm

A segmentation whose measure equals this human-average value obtains 75 points. Scores are scaled linearly between the two fixed values that yield 100 and 75 points, and negative scores are truncated to 0. This truncation avoids an excessive negative influence on the total score from cases in which the segmentation is a complete failure.
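The linear scaling with cut-off can be sketched as follows. The function and its parameter names are illustrative, not part of the official scoring software; it assumes each measure is expressed so that its perfect value is known (0 for the error and distance measures).

```python
# Sketch of the linear scoring described above: the perfect measure value
# maps to 100 points, the human-average value maps to `human_points`
# (75 for the liver, 90 for the caudate), and negative scores are
# truncated to 0. Function and parameter names are illustrative.
def measure_to_score(value, perfect, human, human_points=75.0):
    """Linearly interpolate between (perfect -> 100) and (human -> human_points)."""
    score = 100.0 - (100.0 - human_points) * (value - perfect) / (human - perfect)
    return max(score, 0.0)

# Liver, average symmetric absolute surface distance
# (perfect = 0 mm, human average = 1.0 mm):
print(measure_to_score(0.0, perfect=0.0, human=1.0))  # perfect result: 100.0
print(measure_to_score(1.0, perfect=0.0, human=1.0))  # human-level result: 75.0
print(measure_to_score(5.0, perfect=0.0, human=1.0))  # poor result, truncated: 0.0
```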

In this scoring system, a method scoring 75 points performs roughly as well as a human. Note that this is only an approximation: only a few human segmentations were performed to gauge the scores. Moreover, the human observer who performed the liver segmentation was a medical student with little experience in liver segmentation; an accurate (interactive) segmentation may therefore well achieve scores above 75 points.

Notes for the caudate segmentation task:

  • Due to the extremely high quality of the second-rater segmentations and the resulting low reference errors, a segmentation reaching these values obtains 90 points for the caudate task (not 75 as for the liver). We decided on this modification to prevent many reasonable-looking submissions from receiving zero scores.
  • For the caudate segmentation task, the test cases consist of multiple groups. One group consists of scans of the same subject acquired on different scanners. For these cases no reference segmentations are available, so they do not contribute to the total score; they are included to test whether a method is reproducible, and their results are reported separately.
  • Left and right caudate segmentations were merged to produce a single segmentation.
  • In addition to the measures listed above, the Pearson correlation between reference and segmentation volumes is reported, as this is a commonly used measure for this particular segmentation task.
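As an illustration, the volume correlation mentioned in the last point could be computed as follows; the volume lists are made-up numbers for demonstration, not challenge data.

```python
# Pearson correlation between reference and segmentation volumes across
# test cases. The volume values below are hypothetical, not challenge data.
import numpy as np

def volume_correlation(seg_volumes, ref_volumes):
    """Pearson correlation coefficient between two lists of volumes."""
    return np.corrcoef(seg_volumes, ref_volumes)[0, 1]

ref_volumes = [3.1, 3.5, 2.9, 4.0, 3.3]  # hypothetical caudate volumes in ml
seg_volumes = [3.0, 3.6, 2.8, 4.1, 3.2]
print(round(volume_correlation(seg_volumes, ref_volumes), 3))
```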