Workshop on 3D Segmentation in the Clinic: A Grand Challenge

29 October 2007
Brisbane, Australia
10th International Conference on Medical Image Computing and Computer Assisted Intervention  

Evaluation measures

For each test case, a reference segmentation is available, called the 'reference' here. The segmentation results, called the 'segmentation' here, are evaluated by assigning a score to each test case. The maximum score is 100 and is obtained only when the segmentation is exactly the same as the reference. The total score of a method is obtained by averaging the scores of all test cases.

The score of each test case itself is the average of five scores, each also scaled from 0 to 100. The five scores are obtained from five different evaluation measures:

  • Volumetric overlap. This is the number of voxels in the intersection of segmentation and reference, divided by the number of voxels in their union. This value is 1 for a perfect segmentation and 0 when there is no overlap at all between segmentation and reference. For scoring, this measure is expressed as the volumetric overlap error in percent, i.e. 100 times one minus the overlap.
  • Relative absolute volume difference, in percent. The total volume of the segmentation is divided by the total volume of the reference. From this number 1 is subtracted, the absolute value is taken and the result is multiplied by 100. This value is 0 for a perfect segmentation and larger than zero otherwise. Note that the perfect value of 0 can also be obtained for a non-perfect segmentation, as long as the volume of that segmentation is equal to the volume of the reference.
  • Average symmetric absolute surface distance, in millimeters. The border voxels of segmentation and reference are determined; these are defined as the voxels in the object that have at least one of their 26 nearest neighbours outside the object. For each border voxel of either set, the closest border voxel of the other set is determined, using Euclidean distance in real-world coordinates, so that the generally different resolutions in the different scan directions are taken into account. All these distances are stored, for border voxels from both reference and segmentation, and their average gives the average symmetric absolute surface distance. This value is 0 for a perfect segmentation.
  • Symmetric RMS surface distance, in millimeters. This measure is similar to the previous one, but the squared distances between the two sets of border voxels are stored. After averaging the squared values, the square root is taken, which gives the symmetric RMS surface distance. This value is 0 for a perfect segmentation.
  • Maximum symmetric absolute surface distance, in millimeters. This measure is similar to the previous two, but only the maximum of all voxel distances is taken instead of the average. This value is 0 for a perfect segmentation.
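The five measures above can be sketched in code. The following is an illustrative implementation, assuming the reference and the segmentation are given as binary 3D NumPy arrays together with the voxel spacing in millimeters; it is not the official evaluation software of the challenge.

```python
# Illustrative sketch of the five evaluation measures, assuming `seg` and
# `ref` are binary 3-D NumPy arrays and `spacing` is the voxel size in mm
# along each axis. Not the official challenge evaluation code.
import numpy as np
from scipy import ndimage

def volumetric_overlap(seg, ref):
    """Intersection over union (1.0 for a perfect segmentation)."""
    inter = np.logical_and(seg, ref).sum()
    union = np.logical_or(seg, ref).sum()
    return inter / union

def relative_volume_difference(seg, ref):
    """Relative absolute volume difference, in percent (0 when volumes match)."""
    return abs(seg.sum() / ref.sum() - 1.0) * 100.0

def _border_voxels(mask):
    # Border voxels: object voxels with at least one of the 26 nearest
    # neighbours outside the object (erosion with a full 3x3x3 structure).
    eroded = ndimage.binary_erosion(mask, structure=np.ones((3, 3, 3)))
    return np.logical_and(mask, np.logical_not(eroded))

def surface_distances(seg, ref, spacing):
    """All symmetric border-to-border distances, in millimeters."""
    seg_border = _border_voxels(seg)
    ref_border = _border_voxels(ref)
    # Distance from every voxel to the nearest border voxel of the other set,
    # taking the (generally anisotropic) voxel spacing into account.
    dist_to_ref = ndimage.distance_transform_edt(~ref_border, sampling=spacing)
    dist_to_seg = ndimage.distance_transform_edt(~seg_border, sampling=spacing)
    return np.concatenate([dist_to_ref[seg_border], dist_to_seg[ref_border]])

def surface_distance_measures(seg, ref, spacing):
    """Average, RMS, and maximum symmetric surface distance."""
    d = surface_distances(seg, ref, spacing)
    return d.mean(), np.sqrt((d ** 2).mean()), d.max()
```

The distance transform computes, for every voxel, the distance to the nearest border voxel of the other object, so indexing it with one object's border mask yields exactly the per-voxel closest distances described above.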

For more background information about these measures we refer to Gerig, G., Jomier, M. & Chakos, M., Valmet: a new validation tool for assessing and improving 3D object segmentation, MICCAI 2001, Springer, Berlin, 2001, 516-523.

Scoring system

To convert these five measures to scores, a linear scaling with a cut-off is applied. A perfect result (an overlap error of 0% and 0 for the other measures) yields 100 points. We determined average values of the five measures from an independent human segmentation of several test cases, which resulted in the following values:

Liver, volumetric overlap error: 6.4%
Liver, relative absolute volume difference: 4.7%
Liver, average symmetric absolute surface distance: 1.0mm
Liver, symmetric RMS surface distance: 1.8mm
Liver, maximum symmetric absolute surface distance: 19mm

Caudate, volumetric overlap error: 15.8%
Caudate, relative absolute volume difference: 5.6%
Caudate, average symmetric absolute surface distance: 0.27mm
Caudate, symmetric RMS surface distance: 0.56mm
Caudate, maximum symmetric absolute surface distance: 3.4mm

A segmentation whose measure equals this human-average value obtains 75 points. Scores are scaled linearly between the two fixed values that yield 100 and 75 points, and negative scores are truncated to 0. This truncation avoids an excessive negative influence on the total score from cases in which the segmentation is a complete failure.
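The linear scaling with cut-off can be sketched as follows. The function and its parameter names are illustrative, not part of the official scoring software; it assumes each measure is expressed so that its perfect value is known (0 for the error and distance measures).

```python
# Sketch of the linear scoring described above: the perfect measure value
# maps to 100 points, the human-average value maps to `human_points`
# (75 for the liver, 90 for the caudate), and negative scores are
# truncated to 0. Function and parameter names are illustrative.
def measure_to_score(value, perfect, human, human_points=75.0):
    """Linearly interpolate between (perfect -> 100) and (human -> human_points)."""
    score = 100.0 - (100.0 - human_points) * (value - perfect) / (human - perfect)
    return max(score, 0.0)

# Liver, average symmetric absolute surface distance
# (perfect = 0 mm, human average = 1.0 mm):
print(measure_to_score(0.0, perfect=0.0, human=1.0))  # perfect result: 100.0
print(measure_to_score(1.0, perfect=0.0, human=1.0))  # human-level result: 75.0
print(measure_to_score(5.0, perfect=0.0, human=1.0))  # poor result, truncated: 0.0
```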

In this scoring system, a method scoring 75 points performs roughly as well as a human. Note that this is only an approximation: only a few human segmentations were performed to gauge the scores. Moreover, the human observer who performed the liver segmentation was a medical student with little experience in liver segmentation; an accurate (interactive) segmentation may therefore well achieve scores above 75 points.

Notes for the caudate segmentation task:

  • Due to the extremely high quality of the second-rater segmentations and the resulting low reference errors, a segmentation reaching these values obtains 90 points for the caudate task (not 75 as for the liver). We decided on this modification to prevent many reasonable-looking submissions from receiving zero scores.
  • For the caudate segmentation task, the test cases consist of multiple groups. One group consists of scans of the same subject acquired on different scanners. For these cases no reference segmentations are available, so they do not contribute to the total score; they are included to test whether a method is reproducible, and their results are reported separately.
  • Left and right caudate segmentations were merged to produce a single segmentation.
  • In addition to the measures listed above, the Pearson correlation between reference and segmentation volumes is reported, as this is a commonly used measure for this particular segmentation task.
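As an illustration, the volume correlation mentioned in the last point could be computed as follows; the volume lists are made-up numbers for demonstration, not challenge data.

```python
# Pearson correlation between reference and segmentation volumes across
# test cases. The volume values below are hypothetical, not challenge data.
import numpy as np

def volume_correlation(seg_volumes, ref_volumes):
    """Pearson correlation coefficient between two lists of volumes."""
    return np.corrcoef(seg_volumes, ref_volumes)[0, 1]

ref_volumes = [3.1, 3.5, 2.9, 4.0, 3.3]  # hypothetical caudate volumes in ml
seg_volumes = [3.0, 3.6, 2.8, 4.1, 3.2]
print(round(volume_correlation(seg_volumes, ref_volumes), 3))
```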