
Multi-armed bandits are also well known in game AI. They became popular with the introduction of Monte Carlo Tree Search, where a MAB policy is used to select which subtrees to search for the largest expected payoff, e.g.:

https://www.aaai.org/ocs/index.php/AIIDE/AIIDE13/paper/view/...

https://courses.cs.washington.edu/courses/cse599i/18wi/resou...

etc.

For what it's worth, the MAB algorithm in the original post looks like epsilon-greedy. It's probably better to look into Upper Confidence Bound variants like UCB1, which dynamically adjust how much to explore vs. exploit, e.g.:

https://towardsdatascience.com/comparing-multi-armed-bandit-...
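To make the comparison concrete, here is a minimal sketch of UCB1 (not the code from the original post; names like `counts` and `rewards` are illustrative). Instead of exploring at a fixed epsilon rate, each arm gets a confidence bonus that shrinks as it is pulled more often:

```python
import math
import random

def ucb1_select(counts, rewards, c=math.sqrt(2)):
    """Pick the arm maximizing mean reward + c*sqrt(ln(total)/n_arm)."""
    # Pull each arm once before trusting the confidence bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    def bound(i):
        mean = rewards[i] / counts[i]
        return mean + c * math.sqrt(math.log(total) / counts[i])
    return max(range(len(counts)), key=bound)

# Simple simulation against Bernoulli arms with (unknown) win rates.
def run(true_probs, steps=10000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    rewards = [0.0] * len(true_probs)
    for _ in range(steps):
        arm = ucb1_select(counts, rewards)
        counts[arm] += 1
        rewards[arm] += 1.0 if rng.random() < true_probs[arm] else 0.0
    return counts

counts = run([0.2, 0.5, 0.8])
# The best arm (index 2) ends up pulled far more often than the others.
```

The key property: exploration falls out of the bound itself, so there is no epsilon hyperparameter to tune, and pulls of clearly inferior arms decay logarithmically rather than at a constant rate.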



UCB on game trees (MCTS) was the first breakthrough that produced decently strong Go programs, if I remember correctly.


You are correct. MCTS + UCB (UCT) and other variants were state of the art leading up to AlphaGo, and AlphaGo itself still used MCTS.

The main change in AlphaGo was adding deep networks: a policy network to guide move selection (rather than a pure UCB rule) and a value network to evaluate positions alongside fast rollouts. They later removed the rollouts entirely (AlphaGo Zero merged the policy and value heads into a single network), but even AlphaZero still uses MCTS.
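The selection rule AlphaGo-style MCTS uses in place of plain UCB is often called PUCT: the exploration term is scaled by the policy network's prior for each move. A minimal sketch (field names like `prior` and the constant `c_puct` are illustrative, not from the post):

```python
import math

def puct_select(children, c_puct=1.5):
    """Select a child by the PUCT rule used in AlphaGo-style MCTS:

        Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))

    `children` maps each move to a dict with 'visits', 'value_sum',
    and 'prior' (the policy network's probability for that move).
    """
    total_visits = sum(ch["visits"] for ch in children.values())
    def score(item):
        _, ch = item
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c_puct * ch["prior"] * math.sqrt(total_visits) / (1 + ch["visits"])
        return q + u
    return max(children.items(), key=score)[0]

# A rarely visited move with a high prior gets explored despite a low Q.
children = {
    "a": {"visits": 50, "value_sum": 30.0, "prior": 0.1},
    "b": {"visits": 1, "value_sum": 0.0, "prior": 0.8},
}
move = puct_select(children)  # picks "b": its prior-driven bonus dominates
```

Note the contrast with UCB1: the prior P(s,a) lets the network focus search on promising moves immediately, instead of having to try every move once.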



