We ran two kinds of analysis on the user study data to test the hypothesis. For the first part, we wanted to see whether the two methods produced the same measurements, so this first graph shows the mean lightness, averaged over all subjects, for each color and each task. As you can see, the measurements are basically identical for all colors except red, and even that difference was not statistically significant, given the sample size and the variance of the values within the sample.
To test the second part of the hypothesis, we calculated the standard deviation of the lightness values (for each color and task) within each subject, and then averaged the standard deviations across subjects. The standard deviation is higher when there is greater variability in the values, so the lower the standard deviation, the more precise the measurement method. As you can see, the face-based method has much better precision for all colors, some more so than others: in general, about three times better.
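For readers who want to reproduce the precision measure, here is a minimal sketch of the calculation as described: the standard deviation within each subject for each color and task, then the average of those values across subjects. The function name, the tuple layout of the trial records, and the use of the sample (n - 1) standard deviation are assumptions made for illustration; the talk does not specify them.

```python
from statistics import mean, stdev


def precision_by_condition(trials):
    """Average within-subject standard deviation per (color, task).

    `trials` is an iterable of (subject, color, task, lightness) tuples.
    This record layout is hypothetical, not the study's actual data format.
    """
    # Group the repeated lightness matches by condition, then by subject.
    groups = {}
    for subject, color, task, lightness in trials:
        by_subject = groups.setdefault((color, task), {})
        by_subject.setdefault(subject, []).append(lightness)

    result = {}
    for condition, by_subject in groups.items():
        # A sample standard deviation needs at least two matches per subject,
        # so subjects with a single match are skipped (an assumption).
        per_subject_sd = [
            stdev(values) for values in by_subject.values() if len(values) >= 2
        ]
        if per_subject_sd:
            # Lower values mean the method gives more repeatable matches.
            result[condition] = mean(per_subject_sd)
    return result
```

Comparing the returned value for the same color under the two tasks gives the kind of precision ratio quoted above.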
So, the user study indicated that the face-based method measures the same quantity as minimally distinct boundary (MDB), the previous luminance matching method, and does so with better precision. Plus, we didn't tell the subjects they were being timed, but we recorded the time taken for each match, and it turns out that the face-based method took basically just as long as MDB, so we considered that a big win.