The cue-conflict benchmark has become a de facto standard for evaluating which visual cues neural networks rely on in their decision-making. But does it truly reflect the underlying biases?
While cue-conflict offers a principled way to disentangle visual features, we argue that its current instantiation introduces artifacts and ambiguities that hinder meaningful interpretation:
The cues generated by stylization are inherently ambiguous, so they are difficult to recognize not only for humans but also for models.
Stylization cannot control the relative contributions of shape and texture cues to form a balanced 50:50 mixture, which is necessary to measure true cue preference rather than cue dominance arising from unequal perceptibility.
Restricting evaluation to a preselected set of classes prevents the benchmark from capturing true cue utilization: predictions outside that set are simply ignored, which distorts the interpretation of model behavior.
The cue-conflict metric fails to distinguish models that differ substantially in how strongly they rely on shape and texture cues: it normalizes bias by their relative proportions while ignoring large differences in overall cue sensitivity (see the formula below).
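For reference, the conventional cue-conflict shape-bias score counts only the images on which the model predicts either the correct shape class or the correct texture class and reports the shape fraction among them; two models with very different absolute sensitivity to both cues can therefore receive the same score:
\[
\text{shape bias} \;=\; \frac{\#\,\text{shape-consistent decisions}}{\#\,\text{shape-consistent decisions} \;+\; \#\,\text{texture-consistent decisions}}
\]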
Beyond Cue-conflict — REFINED-BIAS Benchmark
Instead of only diagnosing the problem, we design both a dataset and metrics that make cue analysis faithful and comparable.
We define shape and texture based on human perception rather than model heuristics, and generate cues that faithfully capture these characteristics.
Shape: A non-repeating geometric structure that encompasses both the global outline of an object and its distinctive local substructures.
Texture: A visual pattern that repeats consistently within image patches of various sizes.
We propose a novel metric that evaluates how prominently the correct shape and texture labels appear in the model’s full prediction ranking.
We compute the reciprocal ranks of the ground-truth shape and texture labels and average them over the dataset, yielding the scores \( \mathrm{RB}_S \) and \( \mathrm{RB}_T \), respectively. Unlike conventional MRR, the ranking is computed directly from the classification logits:
\[
\mathrm{RB}_S = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_{\text{shape}, i}}, \qquad
\mathrm{RB}_T = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_{\text{texture}, i}}.
\]
Here, \( N \) denotes the total number of samples, and \( r_{\text{shape}, i} \) and \( r_{\text{texture}, i} \) represent the ranks of the correct shape and texture labels for the \( i \)-th sample in the model’s prediction ranking, respectively.
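As a minimal sketch of how these scores can be computed (assuming logits of shape \( N \times C \) and integer label tensors; the function name refined_bias_scores is ours, not part of any released code):

```python
import torch

def refined_bias_scores(logits: torch.Tensor,
                        shape_labels: torch.Tensor,
                        texture_labels: torch.Tensor) -> tuple[float, float]:
    """Compute RB_S and RB_T as mean reciprocal ranks over the full logit ranking.

    logits:         (N, C) classification logits for N samples and C classes
    shape_labels:   (N,) ground-truth shape class indices
    texture_labels: (N,) ground-truth texture class indices
    """
    # Sort classes by logit (descending), then invert the permutation to get,
    # for every class, its 1-indexed rank in the model's prediction ranking.
    order = logits.argsort(dim=1, descending=True)   # (N, C) class indices ordered by rank
    ranks = order.argsort(dim=1) + 1                  # (N, C) rank of each class, 1 = top

    idx = torch.arange(logits.size(0))
    r_shape = ranks[idx, shape_labels].float()        # r_{shape, i}
    r_texture = ranks[idx, texture_labels].float()    # r_{texture, i}

    rb_s = (1.0 / r_shape).mean().item()              # RB_S
    rb_t = (1.0 / r_texture).mean().item()            # RB_T
    return rb_s, rb_t
```

Because the reciprocal rank is taken over the full ranking of all classes, a model that places the correct shape label near the top still receives credit even when that label falls outside a restricted evaluation set, unlike the cue-conflict protocol, which would discard such a prediction entirely.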
We examine whether the outcomes are consistent with our intuition and remain plausible given existing knowledge. To this end, we first evaluate the dataset itself, comparing it against the cue-conflict benchmark across a diverse set of pre-trained models covering extensive training strategies. We then turn to assessing the correctness of the revised metric.
To ensure that shape–texture bias can be reliably assessed, we evaluate both our dataset and metric using models with a fixed ResNet-50 architecture trained under diverse strategies. These strategies naturally encourage different levels of shape or texture reliance, allowing us to test whether our benchmark can consistently detect such variations.
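A hypothetical sketch of this protocol, reusing refined_bias_scores from above; the checkpoint paths, the training-strategy names, and refined_bias_loader (which would yield image, shape-label, texture-label triplets) are placeholders, not actual benchmark assets:

```python
import torch
from torchvision.models import resnet50

# Placeholder checkpoints: one ResNet-50 architecture, different training strategies.
CHECKPOINTS = {
    "baseline_in1k":   "ckpts/resnet50_in1k.pth",
    "stylized_in1k":   "ckpts/resnet50_stylized.pth",
    "self_supervised": "ckpts/resnet50_ssl.pth",
}

@torch.no_grad()
def evaluate(checkpoint_path, loader, device="cuda"):
    # Same architecture every time; only the trained weights differ.
    model = resnet50(weights=None)
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    model.eval().to(device)

    all_logits, all_shape, all_texture = [], [], []
    for images, shape_labels, texture_labels in loader:
        all_logits.append(model(images.to(device)).cpu())
        all_shape.append(shape_labels)
        all_texture.append(texture_labels)

    return refined_bias_scores(torch.cat(all_logits),
                               torch.cat(all_shape),
                               torch.cat(all_texture))

for name, path in CHECKPOINTS.items():
    rb_s, rb_t = evaluate(path, refined_bias_loader)
    print(f"{name}: RB_S = {rb_s:.3f}, RB_T = {rb_t:.3f}")
```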