FURTHER RESULTS ON A UNIFORM TWO-ARMED BANDIT PROBLEM WITH ONE ARM KNOWN

Toshio Hamada

doi:10.11329/jjss1970.15.193

Abstract

Suppose that there are two experiments e₀ and e₁, and that performing e_i (i = 0, 1) yields a random sample X_i from the uniform distribution on the interval [p_i, q_i]. The values of p₁ and q₁ are known a priori, but at least one of the two values p₀ and q₀ is unknown. A conjugate prior distribution is assumed for the unknown parameters. Experiments are performed sequentially n times, and at each time one of the two experiments is selected and performed. The objective is to maximize the expected value of the sum of the n observations. Two cases are considered. In the first case, p₀ = p₁ = 0, q₁ = 1, and the reward is discounted by a discount factor. In the second case, both p₀ and q₀ are unknown. For both cases, the problem is formulated using the principle of optimality of dynamic programming, the optimal strategy is derived, and critical values for the strategy are calculated.
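
As an illustration of the conjugate updating in the first case (p₀ = 0 known, q₀ unknown), the following minimal sketch assumes a Pareto prior on q₀, which is the standard conjugate family for the upper endpoint of a uniform distribution on [0, q₀]. The abstract does not state the prior's parametrization, so the class name `ParetoPrior` and the fields `scale` and `shape` are hypothetical. The sketch only tracks the posterior and the one-step predictive mean of arm e₀. It does not implement the dynamic programming recursion or the critical values derived in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParetoPrior:
    """Assumed Pareto(scale, shape) belief about the unknown endpoint q0.

    Hypothetical parametrization: density proportional to q**-(shape + 1)
    for q >= scale. Not taken from the paper.
    """

    scale: float  # current lower bound on q0
    shape: float  # concentration of the belief near the lower bound

    def update(self, x: float) -> "ParetoPrior":
        # Observing x from Uniform[0, q0] rules out q0 < x and
        # increases the shape parameter by one.
        return ParetoPrior(max(self.scale, x), self.shape + 1.0)

    def predictive_mean(self) -> float:
        # E[X] = E[q0] / 2, and E[q0] = shape * scale / (shape - 1)
        # is finite only when shape > 1.
        if self.shape <= 1.0:
            raise ValueError("predictive mean is infinite for shape <= 1")
        return 0.5 * self.shape * self.scale / (self.shape - 1.0)


KNOWN_ARM_MEAN = 0.5  # mean of Uniform[0, 1], the known arm e1

belief = ParetoPrior(scale=1.0, shape=2.0)
belief = belief.update(0.5)  # an observation from the unknown arm e0
prefers_unknown_arm_myopically = belief.predictive_mean() > KNOWN_ARM_MEAN
```

Comparing the predictive mean with 1/2 gives only a myopic rule. The optimal strategy in the paper also accounts for the value of the information gained by sampling e₀, which is why it is expressed through critical values obtained from the dynamic programming formulation.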
