I don't think this is a good test. When I prefixed the prompt with "a riddle", GPT-4 got it right for me:
"Yellow"
I think the "temperature" (randomness) of an LLM means you'd need to run the prompt many times to know whether it's actually getting it right or just getting lucky and picking the right color at random.
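To put a rough number on "just being lucky", here's a minimal sketch using a plain binomial calculation. It assumes each run is independent and that a pure guess picks the right color with some fixed probability. The figure of 5 candidate colors and the name `prob_at_least` are my own assumptions for illustration, not anything taken from the test itself.

```python
from math import comb


def prob_at_least(successes: int, trials: int, p_guess: float) -> float:
    """Chance of getting at least `successes` right answers in `trials`
    independent runs, if every run is a random guess that is right with
    probability `p_guess` (binomial upper tail)."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption: 5 plausible colors, so a blind guess is right 1 time in 5.
p_guess = 1 / 5

# A single correct answer is easy to get by luck.
print(prob_at_least(1, 1, p_guess))

# Many correct answers across many runs are very hard to get by luck.
print(prob_at_least(18, 20, p_guess))
```

With these assumed numbers, one correct answer out of one run happens by chance 20% of the time, while 18 out of 20 is vanishingly unlikely under pure guessing. That's why a single "Yellow" doesn't tell you much either way.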