Logical Program Policies
My notes on the paper by Tom Silver, Kelsey R. Allen, Alex K. Lew, Leslie Kaelbling, and Josh Tenenbaum (AAAI 2020).
This paper introduces a Bayesian imitation learning approach to learn policies from a few demonstrations. The authors call these policies Logical Program Policies (LPP), which are essentially policies learned as a combination of logical and programmatic policies: logical because they are relational, and programmatic because the features are automatically learned.
The Bayesian prior used here is the prior probability distribution defined by a probabilistic context-free grammar (PCFG) over feature-detector programs. The paper proposes to generate a dataset ($\mathcal{D}$) where each state-action pair $(s,a)$ is an example. The feature set of each example is obtained by evaluating the PCFG-generated programs on that example. The target variable $y$ for each example is $1$ if $(s,a) \in \mathcal{D}$, and $0$ otherwise.
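A minimal sketch of how I read this dataset construction is below; `programs` stands in for the feature-detector programs enumerated from the PCFG, and calling a program on a state-action pair stands in for the actual program interpreter in the released code.

```python
# Rough sketch of the dataset construction as I read it; the helper names here
# are mine, not the ones used in the released code.
import numpy as np

def build_dataset(demonstrations, all_actions, programs):
    """Turn expert (state, action) pairs into a binary classification dataset."""
    demo_pairs = set(demonstrations)
    X, y = [], []
    for state, _ in demonstrations:
        for action in all_actions:
            # Each PCFG program evaluates to a binary feature on (state, action).
            features = [int(p(state, action)) for p in programs]
            X.append(features)
            y.append(1 if (state, action) in demo_pairs else 0)
    return np.array(X), np.array(y)
```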
Next, all the features are arranged in decreasing order of their prior probabilities, and decision trees ($DT$s) are learned iteratively with an incrementally growing feature set, so at iteration $i$ the features used are $f_0, f_1, \dots, f_i$. Each learned DT is converted to a logical representation (i.e., a disjunction of conjunctions of the PCFG features) and evaluated on the dataset $\mathcal{D}$; finally, the top-$K$ DTs are used as a weighted mixture model at test time.
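The loop, as I understand it, could look roughly like the sketch below. I use scikit-learn's decision tree as a stand-in for the paper's tree learner and leave the scoring of a tree against $\mathcal{D}$ (the likelihood/prior term) abstract as a `score` callable; none of these names come from the actual implementation.

```python
# Rough sketch of the incremental-feature loop; `score` abstracts how a tree is
# evaluated against the dataset (e.g. likelihood weighted by the prior).
from sklearn.tree import DecisionTreeClassifier

def learn_top_k_trees(X, y, score, k=5, max_features=25):
    """Fit DTs on growing feature prefixes f_0..f_i and keep the top-k."""
    scored = []
    for i in range(1, max_features + 1):
        clf = DecisionTreeClassifier()
        clf.fit(X[:, :i], y)   # only the i highest-prior features are visible
        scored.append((score(clf, X[:, :i], y), clf))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]           # these form the weighted mixture used at test time
```

At test time, as I read it, each of the $K$ trees scores candidate actions for a new state and their predictions are combined with weights proportional to the trees' scores.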
Critique
The paper is a difficult read and doesn't convey the actual procedure followed in the code. The algorithm in the paper suggests that the posterior $q$ is iteratively refined; however, digging into the code suggests that the posterior $q$ is computed independently for each tree, with no carry-over from one iteration to the next (see the schematic sketch below).
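To make the distinction concrete, here is a schematic contrast between the two readings; `fit_tree` and `tree_posterior` are trivial stand-ins of my own, not functions from the released code.

```python
# Schematic contrast between the paper's algorithm and what the code appears to do.

def fit_tree(features, data, q=None):
    return ("tree", len(features))        # placeholder for DT learning

def tree_posterior(tree, data, q=None):
    return 1.0                            # placeholder for the posterior score

def paper_reading(feature_prefixes, data, prior=1.0):
    # As the paper's algorithm reads: q is refined and carried forward.
    q, trees = prior, []
    for feats in feature_prefixes:
        tree = fit_tree(feats, data, q)
        q = tree_posterior(tree, data, q)  # refined q feeds the next iteration
        trees.append((tree, q))
    return trees

def code_reading(feature_prefixes, data):
    # What the code appears to do: each tree's posterior is computed on its own.
    trees = []
    for feats in feature_prefixes:
        tree = fit_tree(feats, data)
        trees.append((tree, tree_posterior(tree, data)))
    return trees
```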
The paper introduces a good set of 2D grid domains with a generalizable domain-specific language. The code is very neat and easy to read.
The baselines used in the paper, CNN and FCN, are not meant for few-shot learning, so it is not a surprise that they did not work. Some comparison to meta-learning approaches would have been useful.