Article ID: 2024EAP1078
This work introduces the dueling dice problem, which is a variant of the multi-armed dueling bandit problem. A die is a set of m arms in this problem, and the goal is to find the best set of m arms from n arms (m ≤ n) by an iteration of dueling dice. In a round, the learner arbitrarily chooses two dice α ⊆ [n] and β ⊆ [n] and lets them duel, where she roles dice α and β, observes a pair of arms i ∈ α and j ∈ β, and receives a probabilistic result Xi,j ∈ {0, 1}. This paper investigates the sample complexity of an identification of the Condorcet winner die, and gives an upper bound O(nh-2(log log h-1 + log nm2γ-1)m log m) where h is a gap parameter and γ is an error parameter. Our problem is closely related to the dueling teams problem by Cohen et al. 2021. We assume a total order of the strength over arms similarly to Cohen et al. 2021, which ensures the existence of the Condorcet winner die, but we do not assume a total order of the strength over dice unlike Cohen et al. 2021.