Applying Formal Methods to Reinforcement Learning


    We report our research on formal-methods-guided testing of autonomous systems. In particular, we looked at closed-loop control systems that incorporate neural-network-based reinforcement learning components. Reinforcement learning approaches typically train good control policies only after millions of experiments, due to the sparsity of high-reward actions; this may be unacceptable in an online learning setting. Our solution to this problem is inspired by imitation learning, a learning-from-demonstrations framework in which an agent learns a control policy by directly mimicking demonstrations provided by an expert.
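    The core idea of imitation learning is that demonstrated (state, action) pairs turn policy learning into ordinary supervised learning. A minimal sketch (the demonstrations, feature vectors, and the nearest-neighbor choice below are illustrative assumptions, not the report's actual setup):

```python
import math

# Hypothetical expert demonstrations: (state, action) pairs.
# States are 2-D feature vectors; actions are discrete labels.
demonstrations = [
    ((0.0, 0.1), "left"),
    ((0.2, 0.0), "left"),
    ((0.9, 1.0), "right"),
    ((1.0, 0.8), "right"),
]

def imitation_policy(state, demos=demonstrations):
    """Behavioral cloning in its simplest form: act as the expert
    did in the most similar demonstrated state (1-nearest neighbor)."""
    _, action = min(demos, key=lambda d: math.dist(state, d[0]))
    return action

print(imitation_policy((0.1, 0.05)))  # mimics the expert: "left"
```

    Any supervised learner (here a 1-nearest-neighbor lookup for brevity) can stand in for the neural network policy; the learning signal comes entirely from the expert's demonstrations rather than from sparse environment rewards.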

    Our approach automatically generates expert trajectories using a variant of the Monte Carlo tree search (MCTS) method. Unlike traditional methods, which eagerly expand a Monte Carlo search tree and draw samples at tree nodes, our technique is lazy and counterexample-driven, which lets us obtain highly weighted states and actions directly and more efficiently. For each violation of a safety property found by the simulator, we traverse the Monte Carlo search tree induced by the failed trajectory. This search procedure collects expert trajectories that satisfy our safety properties by forcing the agent to take certain actions in key states to avoid the property violation.

    Our technique proceeds by aggregating an expert trajectory dataset at each training iteration. The intuition is that, over iterations, it builds up the set of inputs that the policy is likely to encounter during its execution based on previous experience, which allows us to use supervised learning algorithms for efficient reinforcement learning.

    Because our solution is counterexample-driven, the available expert trajectories may still be sparse. To combat this challenge, we use a constraint encoding of the neural network policy model and an SMT solver to compute states that are near those visited by an expert trajectory but for which the policy chooses significantly different actions. A robust policy should behave similarly on such nearby states, so these “edge cases” are likely to be interesting from a safety perspective. We then use the above tree-based search procedure again to test and improve the policy model on these newly generated trajectories.
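    The counterexample-driven correction step can be illustrated on a toy one-dimensional environment: run the policy until a safety violation, then walk the failed trajectory backwards, swapping in an alternative action at a key state so the corrected trajectory stays safe. Everything below (the environment, `SAFE_LIMIT`, the backtracking strategy) is an illustrative assumption, not the report's actual algorithm:

```python
SAFE_LIMIT = 3          # safety property: position must stay below SAFE_LIMIT
ACTIONS = (+1, -1)      # move right / move left
HORIZON = 5

def step(pos, action):
    return pos + action

def unsafe(pos):
    return pos >= SAFE_LIMIT

def rollout(policy, start=0):
    """Run the policy; return the trajectory and whether it violated safety."""
    traj, pos = [], start
    for _ in range(HORIZON):
        a = policy(pos)
        traj.append((pos, a))
        pos = step(pos, a)
        if unsafe(pos):
            return traj, True
    return traj, False

def repair(traj):
    """Walk the failed trajectory backwards, forcing an alternative action
    at the first state where one avoids the violation. The corrected
    (state, action) pairs become expert demonstrations."""
    expert = dict(traj)
    for i in range(len(traj) - 1, -1, -1):
        pos, bad_action = traj[i]
        for a in ACTIONS:
            if a != bad_action and not unsafe(step(pos, a)):
                expert[pos] = a
                return list(expert.items())
    return []

always_right = lambda pos: +1          # an unsafe policy
failed_traj, violated = rollout(always_right)
demos = repair(failed_traj) if violated else failed_traj
```

    Each corrected trajectory is then aggregated into the expert dataset, so subsequent supervised training iterations steer the policy away from the states where violations were found.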

    In our experiments, we found that our algorithm enabled policy training with several orders of magnitude less data. We show the effectiveness of our approach via a case study on the game Pong.



      author      = {He Zhu and
                     Stephen Magill},
      institution = {Galois, Inc.},
      title       = {Applying Formal Methods to Reinforcement Learning},
      year        = {2017},
      note        = {Available at \url{}}}