Intermediate values defend against Goodharting
*Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness* shows:
- Adversarial inputs to image classifiers are typically specialized to fool only the last layer of the model, so you can ~entirely defeat them by having the model predict classifications at each layer and checking whether they agree (see the first sketch after this list).
- I.e. Goodharted examples differ notably from normal examples in that the more observations you make of them, the less they look like the class. If we just take the classifier output of an image model, we're only making one observation of its world model. If we make many more observations of that world model, we can fully 'pin down' the properties of the class.
- A stretch: If we view an image model's layers as a reasoning trace, and imagine using the model as a value model over images, then having intermediate values over the reasoning robustly defends against Goodharting.
- Applying similar logic to agentic models: To make a robust value model over action trajectories, we should have the model care about both intermediate actions and their consequences, not just the final outcome (see the second sketch after this list). This lets us make many more observations about the nature of the action trajectory, which should greatly constrain the space of successful Goodharts. Still, this isn't a perfect solution - I see no reason why it won't also fail when the value model is under too much optimization pressure relative to its intelligence.
- My intuition is that the human brain does this pervasively, but I'm not sure.
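A minimal sketch of the per-layer agreement check from the first bullet, assuming a generic PyTorch backbone. The per-layer linear heads, the pooling, and the agreement threshold are my own illustrative choices, not necessarily the paper's exact method:

```python
import torch
import torch.nn as nn


class MultiLayerClassifier(nn.Module):
    """Backbone with a lightweight classification head after each block."""

    def __init__(self, blocks: nn.ModuleList, feature_dims: list[int], num_classes: int):
        super().__init__()
        self.blocks = blocks
        # One linear head per intermediate representation (hypothetical design).
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in feature_dims])

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        logits_per_layer = []
        h = x
        for block, head in zip(self.blocks, self.heads):
            h = block(h)
            # Pool spatial dims for conv features so each head sees a flat vector.
            pooled = h.mean(dim=(-2, -1)) if h.dim() == 4 else h
            logits_per_layer.append(head(pooled))
        return logits_per_layer


def layers_agree(logits_per_layer: list[torch.Tensor], min_fraction: float = 0.9) -> torch.Tensor:
    """Flag inputs where most per-layer predictions match the final prediction.

    Inputs adversarially tuned against only the last layer tend to fail this,
    because earlier layers still 'see' the original class.
    """
    preds = torch.stack([l.argmax(dim=-1) for l in logits_per_layer])  # [num_layers, batch]
    agreement = (preds == preds[-1]).float().mean(dim=0)               # per-input agreement fraction
    return agreement >= min_fraction
```

Each extra head is another observation of the model's internal world model, so an adversarial input has to fool all of them at once rather than just the final output.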
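And a hedged sketch of the agentic version from the later bullet: a trajectory-level value that scores every intermediate action and consequence, not just the outcome. The `Step` structure, the scorer signatures, and the min-aggregation are assumptions for illustration, not a worked-out proposal:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Step:
    action: str        # what the agent did
    consequence: str   # what was observed to happen as a result


def trajectory_value(
    steps: Sequence[Step],
    step_scorer: Callable[[Step], float],
    outcome_scorer: Callable[[Sequence[Step]], float],
) -> float:
    """Combine per-step scores with the final-outcome score.

    Using min rather than a mean means one suspicious intermediate step drags
    the whole trajectory down, so a Goodharted trajectory has to look good
    under every observation at once, not just under the final outcome.
    """
    step_scores = [step_scorer(step) for step in steps]
    return min(step_scores + [outcome_scorer(steps)])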