I re-read your Big-O comment, and it still does not make sense. The fact that mapping all n^i to arctan(i) and all i^n to arctan(i) + pi is somehow "misleading" does not make it theoretically impossible for RL algorithms to find an optimal solution under that metric. If for a specific instance you're saying n^1e300 is the wrong answer because it is impractical, and 2^n would be preferred, then you simply posed the original task incorrectly by asking for the wrong metric.
It's not about whether the agent will or will not find an optimal solution (no agent can possibly find good solutions in every environment). It's about whether the agent could even understand the true environment from your real-valued-reward description of it.
Suppose the true environment has various tasks the agent can do, each giving a big-O-complexity-valued reward: one task gives a reward of O(2^n), and others give rewards of O(n^i) for various i. For concreteness, say Task A rewards O(2^n), Task B rewards O(n^10000), and Task C rewards O(n^20000).
Now suppose you present this to the agent using real-valued rewards, where, as you suggest, O(n^i) is replaced by arctan(i) and O(2^n) is replaced by arctan(2)+pi. Then the agent will be deluded into thinking, e.g., that Task B and Task C give almost identical rewards (Task B gives reward 1.57069633 and Task C gives reward 1.57074633, which barely differ from each other at all). This is misleading because in the true environment, Task C gives much more reward than Task B. Yes, the agent understands Task C gives a bigger reward, but the agent totally misunderstands how much bigger :)
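The closeness of those two numbers is easy to verify (a quick sketch of the proposed encoding, using Python's `math.atan`):

```python
import math

# Rewards under the proposed real-valued encoding O(n^i) -> arctan(i)
task_b = math.atan(10_000)   # encodes O(n^10000)
task_c = math.atan(20_000)   # encodes O(n^20000)

print(f"{task_b:.8f}")            # 1.57069633
print(f"{task_c:.8f}")            # 1.57074633
print(f"{task_c - task_b:.1e}")   # 5.0e-05
```

Both encoded rewards are crowding up against the asymptote pi/2, so no matter how large the polynomial exponents, the encoded gap between them is vanishingly small.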
Your paper does not prove anything regarding what the agent "misunderstands". "Understanding" is not a mathematical concept in this case. The agent would still progressively find more and more optimal solutions to this task, so I see no problem there.
Suppose I claimed it's impossible to understand Shakespeare with all the consonants deleted, leaving only vowels. Most people would accept that claim without pedantically asking me to define "understand". Same story here. I'm arguing that certain environments, when shoehorned into the traditional RL model, are like Shakespeare with all the consonants removed.
The agent might (or might not) find progressively better solutions to the interpreted environment, but not to the true environment (except by dumb luck), because it cannot see what the true environment actually is. For example, in the concrete environment described above, suppose completing Task C takes more steps than Task B. Then the agent is likely to conclude, on the basis of the misleading arctan rewards, that Task C is not worth the extra steps. Depending on how many extra steps, this conclusion could be badly false.
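To make the step-count worry concrete, here is a toy calculation. The step counts are hypothetical (nothing above fixes them); assume Task B takes 10 steps, Task C takes 20, and the agent values tasks by reward per step:

```python
import math

# Hypothetical step counts, chosen only for illustration
steps_b, steps_c = 10, 20
reward_b = math.atan(10_000)   # encoded reward for Task B, O(n^10000)
reward_c = math.atan(20_000)   # encoded reward for Task C, O(n^20000)

# Under the arctan encoding, B looks like the better deal per step:
print(reward_b / steps_b > reward_c / steps_c)   # True

# The encoding says C is scarcely better than B at all...
print(reward_c / reward_b)                        # ~1.0000318

# ...but in the true environment, at say n = 10, the reward ratio is
# 10^20000 / 10^10000 = 10^10000, an astronomical factor, so C may
# easily be worth far more than its extra steps.
print(len(str(10**10_000)))                       # 10001 digits
```

An agent maximizing the encoded per-step reward would pick B here, while the true big-O rewards make C overwhelmingly better at any appreciable problem size.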