Why is learning so effective in software testing?

1 Mar 2017
Joshua Moerman

The Communications of ACM published an article on automata learning in software engineering last February. The techniques described in the article are used to obtain models for the (input/output) behaviour of software. Even without access to source code, one can now use model checking or other bug finding tools on these models. The article shows many successful applications. Why is this possible at all? (This is a copy of a post I wrote on automatonlearning.net.)

Many people I talk to are very surprised that a black box learning technique is even able to find any bugs at all. When explaining the learning algorithm, it seems at first that no guarantees can be given. The algorithm keeps refining some hypothesis and many people ask: “When do you stop learning?” Luckily, the field of computational learning theory is rather mature, and we can give very precise theoretical guarantees. In this post I will highlight some of these results and why they matter to us. Before I do so, let’s recap the theoretical learning framework which is used in the CACM article. (I strongly suggest, you also read the CACM paper at some point! Or at least watch the accompanying video.)

The type of learning we will investigate is called active learning (also called exact learning, or query learning). We suppose there is some alphabet A and a unknown language L ⊆ A* (where A* is the set of words). It is the task of a learning algorithm to infer an automaton accepting the language L. The learner has access to an oracle which can answer two types of queries:

In the context of software, acceptance is generalized to arbitrary output (in particular the membership queries reply with actual output of the software). So instead of working with DFAs, one works with Moore or Mealy type automata.

Dana Angluin showed that any regular language can be learned in this model, using only polynomially many queries. As a consequence we can learn the behaviour of unknown software as long as it has some regular behaviour. However, there are some objections one might have.


For a black box system we don’t have such an oracle.


The great thing about software is that we can in fact run it. It is interactive. This means that membership queries can be answered immediately by the software itself!

For equivalence queries, however, this is a reasonable objection. How can we know whether the hypothesis is correct w.r.t. a unknown system? Luckily any efficient learning algorithm with membership queries and equivalence queries can be transformed into an efficient probably approximately correct (PAC) algorithm. (This is a fun exercise in the book by Kearns & Vazirani. Their book provides a very good introduction in PAC theory.) In practice it means that we can randomly sample test cases (with whatever distribution we seem fit) and obtain a good approximation. The bounds on accuracy are quantitative and can be set arbitrarily high (at polynomial cost only).


Since we are doing machine learning we need big data. It takes ages to gather so many samples.


Indeed, learning neural networks, for example, is very hard in typical learning settings. In fact, in the PAC framework it is proven to be intractable (unless some cryptographic assumptions fail). As a consequence, if you want to increase accuracy you’ll need more than polynomially many more samples. That is, you need big data.

Our setting, however, is slightly different from the typical machine learning. We are saved by the membership queries. The learner does not rely on just any sample, it relies on precisely the samples it chooses to query. This makes all the difference and gives the polynomial bounds mentioned earlier.

I once asked Falk Howar (one of the developers of LearnLib) whether he considers himself as a machine learning scientist. Indeed he does so, “but”, he added, “we do small data, not big data.”


All this talk on polynomial bounds doesn’t apply in practice.


The celebrated survey paper P versus NP by Scott Aaronson already answers this objection. Somehow the gap between P and NP seems to be the gap between feasible and non-feasible. Partly, because algorithms in P often have a low- order polynomial as running time.

I experience that learning algorithms are no exception. The algorithm by Dana Angluin is polynomial (counting number of queries here) with a low-order polynomial. These asymptotics and also the constants have been improved in recent years, as mentioned in the CACM article. It’s just very efficient. All of this is witnessed by the many applications mentioned in the CACM.

A while ago, I posted about the RERS challenge, a challenge where software had to be analysed. This was a white-box challenge, but we used exclusively black- box techniques and got very good results. In particular we used a fuzzing and a learning algorithm. We noticed that their results were very comparable (the fuzzer was slightly better). However, the fuzzer afl-lop executed a couple orders of magnitude more traces than the learner. I have no doubts any more that learning is efficient in practice!


So… When do you stop learning?


One way is to use the PAC framework. Specify the accuracy, compute the number of samples needed, and run the algorithm. This might not be completely satisfactory, though. What does it mean when a hypothesis is epsilon-close to the real software?

There is another way. In testing literature there are many conformance testing algorithms. Instead of giving probabilistic bounds, these algorithms give more concrete bounds on the hypothesis. A typical guarantee is: The black box system is equivalent to the hypothesis, unless it has at least k more states. The parameter k can be set as high as you want (this time at exponential cost).

In our work we often combine the two approaches. We have probabilistic guarantees and additionally, have equivalence up to k states, for a small k. For many experiments we do, we actually obtain the equivalent automaton.


Software is often Turing complete, and cannot be captured by regular behaviour.


This is a deeper problem. The CACM article touches on this and already gives some solutions. There is (at least) two ways to generalize from automata to richer classes of behaviour.

One is to consider richer automata, still having a regular structure as a core. Examples are register automata (which add data-flow and predicates on transitions) and weighted automata (which add some arithmetic to the transitions). Learning algorithms have been described for such automata.

Another direction is going up to Chomsky hierarchy. It is known that visibly pushdown automata can be learned in a similar fashion. Currently, we’re not able to go further up the hierarchy.

However, in the many applications we see that the regular universe is big enough. This is partly because one can choose an alphabet small enough (or abstract enough) to exhibit regular behaviour. By choosing such a alphabet we might get a smaller, more abstract view on the actual behaviour. But it’s often enough for bug finding, and additionally it gives a model for which we have very good tooling, theory and interpretability (see Chris’ post).


Using black box techniques is silly if you actually have source code!


I totally agree! Despite the effectiveness of learning, we can imagine it being much more successful if we incorporate information from the source code. Frits Vaandrager likes to call this “grey box methods”. A good example is found in work by Andreas Zeller. He learns grammars more effectively by instrumenting the source code, tracking certain values through the execution. I hope we will see more of this combined techniques, utilizing best of both worlds.

I hope I gave a nice overview of some of the theoretical guarantees, making clear that computational learning has a good foundation. Maybe you have some ideas of how to apply learning in your own application. Or you have some ideas on how to combine it with white box methods. We are very interested in hearing these ideas! People contacting us with new applications of the theory always inspires us!

I used to have a Disqus comments section here. But everybody seems to simply send me emails instead. I have disabled Disqus, in order to avoid third-party trackers.