## Augmenting Neural Networks with First-order Logic

Declarative knowledge in the form of first-order rules is used extensively in inductive logic programming (ILP) to reduce dependence on data. Since deep neural networks are data-hungry, can we use first-order rules to reduce their data requirements? This post reviews the work by Li and Srikumar (ACL 2019), which attempts to answer this research question.

Li and Srikumar (ACL 2019) address the problem of incorporating declarative knowledge into a neural network. They propose converting the (often readily available) first-order logic representation of the knowledge into a network and provide a framework for augmenting any neural network of choice with it. The main motivation for using declarative knowledge as an inductive bias is to reduce dependence on data, i.e., to achieve comparable performance with fewer examples.

To convert the FOL rules to a network, each predicate in a rule is mapped to a named neuron. For example, given a rule $A_1 \wedge A_2 \rightarrow B_1$, the network will have 3 named neurons, $a_1, a_2,$ and $b_1$, with edges from $a_1$ and $a_2$ to $b_1$. The Łukasiewicz t-norm and t-conorm are used as functions for the logical operators, inspired by the probabilistic soft logic literature. Auxiliary variables and auxiliary named neurons are included as needed to compute logical operations. For example, $(\lnot A \vee B) \wedge (C \vee D)$ is converted to $P \wedge Q$ with $(\lnot A \vee B) \leftrightarrow P$ and $(C \vee D) \leftrightarrow Q$. The benefit of using the Łukasiewicz functions is that they are differentiable. This network has no parameters and hence requires no learning.
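As a minimal sketch (not the authors' code), the Łukasiewicz t-norm and t-conorm assign soft truth values in $[0, 1]$ to the logical connectives; the function names here are hypothetical:

```python
def l_and(a, b):
    """Łukasiewicz t-norm (soft conjunction)."""
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    """Łukasiewicz t-conorm (soft disjunction)."""
    return min(1.0, a + b)

def l_not(a):
    """Soft negation."""
    return 1.0 - a

# The rule A1 ∧ A2 → B1 maps to named neurons a1, a2, b1; the soft truth
# of the rule body is the conjunction of the antecedents.
a1, a2 = 0.9, 0.8
body = l_and(a1, a2)  # ≈ 0.7
```

Since these operators are piecewise-linear in their inputs, gradients can flow through the rule network during backpropagation.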

To ensure that the network is acyclic, the authors recommend using contrapositive statements when needed. For example, if the rule $B_1 \rightarrow A_1$ introduces a cycle in the network, its contrapositive equivalent $\lnot A_1 \rightarrow \lnot B_1$ is used instead.

This rule network is attached as a constraint to some layer $y = g(\mathbf{Wx})$ of the original neural network. The constrained neural layer is defined as follows, with the hyperparameter $\rho$ controlling the importance of the knowledge term.

$$y = g(\mathbf{Wx} + \rho \underbrace{d(\mathbf{z})}_{\text{knowledge}})$$
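A sketch of the constrained layer under stated assumptions: the knowledge term $d(\mathbf{z})$ is assumed to be precomputed by the parameter-free rule network, and the function name and example values are hypothetical:

```python
import numpy as np

def constrained_layer(W, x, d_z, rho, g=np.tanh):
    # y = g(Wx + rho * d(z)): the rule network's output d(z) biases the
    # pre-activation; rho controls how strongly the knowledge is enforced.
    return g(W @ x + rho * d_z)

# Toy usage: with rho = 0 the layer reduces to the unconstrained y = g(Wx).
W = np.array([[0.5, -0.2], [0.1, 0.3]])
x = np.array([1.0, 2.0])
d_z = np.array([0.7, 0.0])  # toy knowledge term for this layer's neurons
y = constrained_layer(W, x, d_z, rho=1.0)
```

Note that $\rho$ is a hyperparameter, not a learned weight: the only trainable parameters remain those of the base network ($\mathbf{W}$ here).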

The authors empirically evaluate their proposed augmented neural networks on three tasks: machine comprehension, natural language inference, and text chunking. In each task, the augmentation is performed at a different layer. For machine comprehension, where they use BiDAF as the base neural network, the constraint is applied to the attention nodes. For natural language inference, they use L-DAtt as the base model and augment both the attention nodes and the label nodes. For text chunking, they augment the label layer. These experiments confirm their hypothesis that using the knowledge improves performance, but only when data is scarce. With more data, the augmented knowledge does not improve performance significantly.

### Critique

- The proposed framework for augmenting neural networks is very general and hence can potentially be used in any task where deep neural networks are applied.
- I haven't quite understood the emphasis on the differentiability of the augmented network, since there are no parameters to be learnt there; the hyperparameter $\rho$ is tuned, not trained.
- The right-hand side of a rule looks quite limited, and the rules used in the experiments are also very simple.
- For the text chunking task, one would expect the bidirectional LSTM itself to be able to learn rules $C_{1:4}$. It is not clear from the experiments which rule improves the results on this task.