ON A UNIFORM TWO-ARMED BANDIT PROBLEM

Toshio Hamada

doi:10.11329/jjss1970.14.179

Abstract

Suppose that there are two experiments e⁰ and e¹. Performing e⁰ yields a random sample X from the uniform distribution on the interval [0, p], and performing e¹ yields a random sample Y from the uniform distribution on the interval [0, q]. The true values of p and q are unknown, but p and q are known a priori to follow Pareto distributions. When an observation x is obtained, the reward is x. The experiments are performed sequentially n times, and at each stage one of the two experiments is selected and performed. The objective is to maximize the total expected reward. The problem is formulated and analyzed by dynamic programming. It is found that there exists a function that describes the optimal strategy, and some properties of this function are derived.
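The setting above relies on the Pareto distribution being a conjugate prior for the upper endpoint of a uniform distribution, which is what keeps the state of a dynamic program small. The following is a minimal sketch of that belief update for a single arm, not the paper's optimal strategy. The class name `ParetoBelief` and the parameter names `scale` and `shape` are hypothetical, and the density convention shape * scale^shape / p^(shape + 1) for p >= scale is an assumption, since the abstract does not state the prior parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParetoBelief:
    """Pareto belief about the unknown upper endpoint p of Uniform[0, p].

    Assumed density: shape * scale**shape / p**(shape + 1) for p >= scale.
    """
    scale: float  # smallest value of p still considered possible
    shape: float  # tail index; larger means a more concentrated belief

    def update(self, x: float) -> "ParetoBelief":
        # Observing x ~ Uniform[0, p] rules out p < x and multiplies the
        # density by 1/p, so the posterior is Pareto(max(scale, x), shape + 1).
        return ParetoBelief(max(self.scale, x), self.shape + 1.0)

    def expected_reward(self) -> float:
        # E[X] = E[p] / 2, and E[p] = shape * scale / (shape - 1) when shape > 1.
        if self.shape <= 1.0:
            raise ValueError("expected reward is infinite unless shape > 1")
        return self.shape * self.scale / (2.0 * (self.shape - 1.0))


if __name__ == "__main__":
    belief = ParetoBelief(scale=1.0, shape=2.0)
    print(belief.expected_reward())
    print(belief.update(3.0))
    print(belief.update(0.5))
```

Under these assumptions, each arm's belief is summarized by two numbers, so the state of an n-stage dynamic program would consist of the two arms' (scale, shape) pairs and the number of stages remaining. An observation above the current scale raises it, while one below only increases the shape.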
