Suppose that there are two experiments, e0 and e1. Performing e0 or e1 yields a random sample X or Y from the uniform distribution on the interval [0, p] or [0, q], respectively. The true values of p and q are unknown, but prior knowledge is available in the form of Pareto prior distributions on p and q. When x is observed, the reward is x. Experiments are performed sequentially n times, and at each time one of the two experiments is selected and performed. The objective is to maximize the total expected reward. This problem is formulated as a dynamic program and analyzed. It is found that there exists a function that describes the optimal strategy, and some properties of this function are derived.
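As a minimal illustration of the setting, and not the formulation used in the paper, the sketch below relies on the standard fact that a Pareto prior is conjugate to the uniform distribution on [0, theta]: after one observation x, a Pareto(scale, shape) belief becomes Pareto(max(scale, x), shape + 1). The names ParetoBelief and myopic_choice are hypothetical, and the one-step rule shown is only a greedy baseline, not the optimal strategy characterized by the dynamic program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParetoBelief:
    """Hypothetical Pareto(scale, shape) belief about the endpoint theta of Uniform[0, theta]."""

    scale: float  # smallest value theta can take under the belief
    shape: float  # tail index; larger values concentrate mass near scale

    def update(self, x: float) -> "ParetoBelief":
        # Conjugate update for one observation x ~ Uniform[0, theta]:
        # the scale becomes max(scale, x) and the shape grows by one.
        return ParetoBelief(max(self.scale, x), self.shape + 1.0)

    def expected_reward(self) -> float:
        # E[x] = E[theta] / 2, and E[theta] = shape * scale / (shape - 1),
        # which is finite only when shape > 1.
        if self.shape <= 1.0:
            raise ValueError("expected reward is infinite for shape <= 1")
        return self.shape * self.scale / (2.0 * (self.shape - 1.0))


def myopic_choice(belief_p: ParetoBelief, belief_q: ParetoBelief) -> int:
    # Greedy one-step rule (an assumption for illustration, not the paper's
    # optimal strategy): return 0 for e0 or 1 for e1, whichever has the
    # larger expected immediate reward.
    if belief_p.expected_reward() >= belief_q.expected_reward():
        return 0
    return 1
```

A full treatment would replace myopic_choice with backward induction over the n remaining trials, using the pair of beliefs as the state.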