Multi-objective Contextual Multi-armed Bandit With a Dominant Objective - 2018


We propose a new multi-objective contextual multi-armed bandit (MAB) problem with two objectives, where one of the objectives dominates the other. In the proposed problem, the learner obtains a random reward vector, where each element of the reward vector corresponds to one of the objectives, and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. We call this problem contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm given the context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods on synthetic and real-world datasets. The proposed model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.
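The definition of the optimal arm above can be illustrated with a minimal Python sketch for a single fixed context. It is not MOC-MAB: it assumes the expected rewards of every arm are already known, whereas the learner in CMAB-DO has to estimate them from observations. The function names `optimal_arm` and `pareto_front`, the tolerance `tol`, and the example reward values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Each arm is described by (expected dominant reward, expected non-dominant reward).
# These expectations are ASSUMED known here; in CMAB-DO they are unknown to the learner.
Reward = Tuple[float, float]


def optimal_arm(rewards: Sequence[Reward], tol: float = 1e-12) -> int:
    """Return the index of the lexicographically optimal arm.

    Step 1: keep the arms that maximize the dominant objective (up to tol).
    Step 2: among those, pick the arm that maximizes the non-dominant objective.
    """
    best_dominant = max(r[0] for r in rewards)
    candidates = [i for i, r in enumerate(rewards) if r[0] >= best_dominant - tol]
    return max(candidates, key=lambda i: rewards[i][1])


def dominates(a: Reward, b: Reward) -> bool:
    """True if a is at least as good as b in both objectives and strictly better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(rewards: Sequence[Reward]) -> List[int]:
    """Return the indices of arms that no other arm dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Hypothetical expected rewards for four arms under one context.
example_rewards = [(0.9, 0.2), (0.9, 0.6), (0.5, 0.95), (0.4, 0.3)]
best = optimal_arm(example_rewards)
front = pareto_front(example_rewards)
print("optimal arm:", best)
print("Pareto front:", front)
print("optimal arm is in the Pareto front:", best in front)
```

In this example two arms tie on the dominant objective, and the tie is broken by the non-dominant objective. The arm with the highest non-dominant reward is also in the Pareto front but is not optimal, because it is worse in the dominant objective.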

