Minimal Sufficient Explanations for Factored Markov Decision Processes

Minimal Sufficient Explanation

My notes on Omar Zia Khan, Pascal Poupart, and James P Black, ICAPS 2009

Automated planning problems have long been attempted using Markov Decision Processes (MDPs). MDPs are capable of handling the probabilistic and sequential nature of planning problem. They solve the problem by providing a policy which is a mapping from states to actions. However, to use this policy in the real-world, we first need users to trust the policy. The issue of trust can be ameliorated if the policy provides an explanation for its recommeded actions. This is the pretext of khan et al 2009.

Policy in MDP is usually obtained by optimizing Bellman’s equation which considers value of a state as expected discounted total reward. This paper uses an alternate formulation of the value function introduced by and extends it to the factored MDPs. In the alternate formulation, value function is expressed as a dot product of reward and the discounted occupancy frequencies of all the states. With this formulation, an optimal policy for a given state can explain the recommended action by simply showing that the recommended action has highest occupancy frequency of highly rewarded state(s). To this effect the paper propose three explanation templates:

ActionName is the only action that is likely to take you to Var1 = Val1, Var2 = Val2, … about $\lambda$ times, which is higher (or lower) than any other action.
ActionName is likely to take you to Var1 = Val1, Var2 = Val2, … about $\lambda$ times, which is high (or low) as any other action''
ActionName is likely to take you to Var1 = Val1, Var2 = Val2, … about $\lambda$ times.

Var1 = Val1, Var2 = Val2, … in the template represents a single state (in case of MDPs) or a scenario\footnote{In the case of factored MDPs, the reward function is also factored and the scenario used is defined as set of states which have similar rewards.} (in case of Factored MDPs). Multiple templates can be used for explaining a recommended action.

Power of these templates can be realized if they provide minimal-sufficient explanations (MSE). An explanation is sufficient if it can prove that the action recommended is optimal (or no other action has better utility than the recommended action). Sufficient explanation is minimal if the number of templates used are minimum possible.

When the policy ($\pi^{\ast}$) is optimal, the value of the recommended action for state $s_0$~$\left(V^{\pi^\ast}\right.$ or $\left.Q^{\pi^\ast}(s_0, \pi^\ast(s_0))\right)$ will be higher than value of all other actions $\left(Q^{\pi^\ast}(s_0,a), , \forall a \neq \pi^\ast(s_0)\right)$. Khan et al. 2009 define $V_{MSE}$ (eqn. below) which has two components: (a) expected utility from all the states(or scenarios) included in the explanation, and (b) worst case utility of all the states (or scenarios) not included in the explanation. By this definition $V_{MSE}$ cannot exceed the value of optimal action. If the explanation is minimal and sufficient, $V_{MSE}$ will be higher than the value of any other action.

\[ V_{MSE} = \sum_{i \leq k}\mathbb{E}[\operatorname{utility}(s_i)] + \sum_{i > k} \min(\operatorname{utility}(s_i)) \]

\[ V^{\pi^{\star}} \geq V_{MSE} > Q^{\pi^\star}(s_0,a)\ \ \forall a \neq \pi^\star(s_0) \]

With that definition of $V_{MSE}$, Khan et al. propose an algorithm to produce MSE at any state $s_0$ for a given optimal policy in factored MDP setting. They evaluate the MSE on two experimental domains of “course-advising” and “handwashing”. The experimental results show that the number of templates in MSE is high (or low) when the ratio of the expected utility for the optimal and the second best optimal policy is high (or low). User study with advisors in the course-advising domain show that the advisors found the explanations useful and perhaps beneficial for grade-conscious or poor performing students. User study with students show that more than 50% of the students found the explanations easy to understand, accurate and informative. However, most of them needed further information to trust the action recommended.

Critique

The paper addresses a very important topic of explaining the recommendations of a Markov Decision Process. The work is especially commendable as it was done way ahead of the Explainable AI Hype. They provide a very principled and straight forward way to explain the recommendations from the perspective of how the policy is constructed. However, it might not be intuitive enough to the users. As seen in the user-study, most of the users wanted to compare the explanation of the recommended action with their preferred actions or need more information about occupancy frequencies.

In complex domains, Bellman’s optimality equation is not solved exactly and hence, the value function of each state is only discovered approximately by the RL agent. In that scenario, as I understand, the optimality of the policy is only with respect to that approximate value function and hence the explanation of optimality is also only with respect to that approximate value function. To end user, this values might not mean anything and hence the sufficiency of the explanation might be questioned.

The explanations are only useful when it has less number of templates, the user-study in course-advising domain shown in the paper has less than 3 templates for the explanations. But in complex domain, as seen in handwashing experiments the number of templates will eventually increase and I conjecture that with increase in the number of templates the explanations will be less meaningful to the users.

Also, as raised by Dr. Natarajan in the class discussion the template only explains the action with respect to the states it visits but not with respect to the states it helps avoid. For domain where we want to avoid a certain states, we encode high negative rewards for that state. Even though the optimal policy learns to avoid that negative state, the MSE framework will only look at the occupancy frequency of highly rewarding states and try to explain positive reward states. As the occupancy frequency of the highly negative state is $0$ with the recommended action, it will not come up in the explanation. Although the templates proposed does mention “or lower” in brackets, it is not quite clear how the avoided state will be added in the explanation.

Finally, in RL we usually assume the policy is mapping of states to a distribution over actions. Explaining a distribution might be complex than a single action. This paper concentrates on explaining the best action out of that distribution. However, there might be two equally good actions, or there might be more than one optimal policy. One has to extend this framework for that use cases before using it in a real system.

References

Omar Zia Khan, Pascal Poupart, and James P Black, Minimal Sufficient Explanations for Factored Markov Decision Processes, ICAPS 2009
Pascal Poupart, Exploiting structure to efficiently solve large scale partially observable Markov decision processes, Citeseer 2005.

Previous post Advice based relational learning Next post Active Advice Seeking for Inverse Reinforcement Learning