I don't think this is a good test. When I prefixed the prompt with "a riddle", GPT-4 got it right for me:

"Yellow"

I think the "temperature" (sampling randomness) of an LLM means you'd need to run the prompt many times to know whether it's actually getting the answer right or just getting lucky and picking the right color at random.
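To put rough numbers on the "lucky" part, here's a minimal sketch. It assumes each run is independent and that a pure guess lands on the right color 1 time in 4; that 25% figure and the function name `prob_at_least` are my own assumptions for illustration, not anything measured from a model.

```python
from math import comb


def prob_at_least(successes: int, trials: int, chance: float) -> float:
    """Probability of at least `successes` correct answers in `trials`
    independent runs, if each run is correct with probability `chance`
    (binomial upper tail)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical: assume a pure guess is right 1 time in 4.
CHANCE = 0.25

if __name__ == "__main__":
    # (correct answers, total runs) pairs chosen only as examples.
    for correct, runs in [(1, 1), (8, 10), (18, 20)]:
        p = prob_at_least(correct, runs, CHANCE)
        print(f"{correct}/{runs} correct by luck alone: p = {p:.2g}")
```

Under that assumption a single correct answer happens by luck a quarter of the time, so one run tells you very little; the probability only gets small once the model is right in most of a larger batch of runs.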