Suppose that a trial, in which one of two experiments $e_0$ and $e_1$ is selected and performed, has to be made sequentially $n$ times. Performing $e_i$ ($i = 0, 1$) yields a continuous random sample $Z_i$ from the distribution with parameter $u_i$, followed by the reward $aZ_i$, where $a = (E[Z_1])^{-1}$, so that each trial of $e_1$ has expected reward 1. The value of $u_1$ is known a priori, but that of $u_0$ is unknown, and $u_0$ has a natural conjugate prior distribution. Switching experiments between two consecutive trials incurs a switching cost. In addition, a final reward is obtained at the end of the last trial. The objective is to maximize the total expected reward. The problem is formulated using the principle of optimality of dynamic programming, and the optimal strategy is derived in terms of a critical value function. Several properties of this function are also derived.