The cue-conflict benchmark has become a de facto standard for evaluating the visual cues on which neural networks rely. However, its current stylization-based instantiation often produces ambiguous shape and texture signals, leading to empirically unstable evaluations. This raises a fundamental question: “Does the benchmark reflect genuine perceptual bias, or is it confounded by cue-construction artifacts?”
While cue-conflict offers a principled way to disentangle visual features, we argue that its current instantiation introduces artifacts and ambiguities that hinder meaningful interpretation:
Stylization does not reliably produce cleanly separated cues: the generated images exhibit imperfect disentanglement and spurious correlations between texture and shape signals.
Stylization offers no explicit control over the relative contribution of shape and texture, leading to a cue imbalance that prevents fair preference measurement.
Relative preference cannot serve as a standalone measure, as it conflates directional preference with absolute cue sensitivity.
Evaluation in a restricted label space can distort the model's prediction, overestimating cue usage and obscuring its true perceptual behavior.
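The restricted-label-space issue can be illustrated with a toy example (the numbers below are hypothetical, chosen only for illustration): forcing the argmax over a small shape/texture label subset yields a confident-looking cue "preference" even when the model's actual top prediction lies outside that subset.

```python
import numpy as np

# Toy logits over a full label space of 10 classes (hypothetical values).
logits = np.array([3.0, 1.0, 0.5, 2.9, 0.2, 0.1, 0.0, 2.5, 0.3, 0.4])

# Suppose the cue-conflict protocol only admits the shape label (3) and
# the texture label (7) as possible answers.
restricted = [3, 7]

full_pred = int(np.argmax(logits))                          # unrestricted prediction
restricted_pred = restricted[int(np.argmax(logits[restricted]))]

# full_pred is class 0, which is neither cue label: the restricted argmax
# still reports a "shape preference" (class 3) that the model never chose.
```

The same mechanism applies at benchmark scale: every sample is counted toward shape or texture, even when neither cue dominates the model's full ranking.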
Beyond Cue-conflict: the REFINED-BIAS Benchmark
Instead of only diagnosing the problem, we design both a dataset and metrics that make cue analysis faithful and comparable.
We define shape and texture based on human perception rather than model heuristics, and generate cues that faithfully capture these characteristics.
Shape: a non-repeating geometric structure that encompasses both the global outline of an object and its distinctive local substructures.
Texture: a visual pattern that repeats consistently within patches of various sizes.
We propose a novel metric that evaluates how prominently the correct shape and texture labels appear in the model’s full prediction ranking.
We compute the reciprocal ranks of the ground-truth shape and texture labels and average them over the dataset:
\[ \mathrm{Shape\text{-}Sens} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{r_{\text{shape},i}}, \qquad \mathrm{Texture\text{-}Sens} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{r_{\text{texture},i}}. \]
Unlike conventional MRR, the ranking is computed directly from the classification logits. Based on these sensitivities, we then define the relative bias for shape and texture.
Here, \( N \) denotes the total number of samples, and \( r_{\text{shape}, i} \) and \( r_{\text{texture}, i} \) represent the ranks of the correct shape and texture labels for the \( i \)-th sample in the model’s prediction ranking, respectively.
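The computation above can be sketched in a few lines of NumPy. The function name and the final normalization into a relative bias are our own illustrative choices, not an official implementation:

```python
import numpy as np

def cue_sensitivities(logits, shape_labels, texture_labels):
    """Mean reciprocal rank of the ground-truth shape and texture labels,
    ranked directly by classification logits (a sketch of Shape-Sens and
    Texture-Sens as described in the text)."""
    logits = np.asarray(logits)                 # shape (N, C)
    n, c = logits.shape
    order = np.argsort(-logits, axis=1)         # classes in descending-logit order
    # ranks[i, k] = 1-based rank of class k in sample i's prediction ranking
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, c + 1)
    shape_sens = np.mean(1.0 / ranks[np.arange(n), shape_labels])
    texture_sens = np.mean(1.0 / ranks[np.arange(n), texture_labels])
    return shape_sens, texture_sens

# One natural way to turn the sensitivities into a relative bias
# (our assumption; the paper's exact normalization may differ):
def shape_bias(shape_sens, texture_sens):
    return shape_sens / (shape_sens + texture_sens)
```

Because the ranks come from the full logit vector, a model whose top prediction is neither cue label is penalized on both sensitivities instead of being forced into a spurious preference.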
We assess whether the outcomes agree with intuition and remain plausible given existing knowledge. To this end, we first evaluate the dataset itself, comparing against the cue-conflict benchmark across diverse pre-trained models and training strategies. We then turn to assessing the correctness of the revised metric.
To ensure that shape–texture bias can be reliably assessed, we evaluate both our dataset and metric using models with a fixed ResNet-50 architecture trained under diverse strategies. These strategies naturally encourage different levels of shape or texture reliance, allowing us to test whether our benchmark can consistently detect such variations.
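A minimal sanity check of this protocol can be simulated with synthetic "models" whose shape reliance we control directly (everything below is a toy construction, not the actual ResNet-50 evaluation): a benchmark that detects cue reliance should score the more shape-reliant model higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_reciprocal_rank(logits, labels):
    # 1-based rank of each sample's ground-truth label, by descending logit
    order = np.argsort(-logits, axis=1)
    ranks = (order == np.asarray(labels)[:, None]).argmax(axis=1) + 1
    return float(np.mean(1.0 / ranks))

def dummy_model(shape_strength, shape_labels, n_classes=16):
    """Toy stand-in for a checkpoint: random logits plus a controllable
    boost on the true shape label (hypothetical, for illustration only)."""
    n = len(shape_labels)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), shape_labels] += shape_strength
    return logits

shape_labels = rng.integers(0, 16, size=500)
weak = mean_reciprocal_rank(dummy_model(0.5, shape_labels), shape_labels)
strong = mean_reciprocal_rank(dummy_model(3.0, shape_labels), shape_labels)
# A sound benchmark should rank the strongly shape-reliant model above the weak one.
```

In the real evaluation, the dummy models would be replaced by ResNet-50 checkpoints trained under the different strategies, with logits computed on the benchmark images.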