Gradient Descent on pWFAs

24 Jul 2019
Joshua Moerman

Given a parametric weighted automaton, we would like to choose good values for the parameters. For example, we want to satisfy a property like “the system eventually reaches state $x$ with a probability of 0.75 or more.” Can we find values for our parameter $p$ such that this property holds? This problem is called parameter synthesis.

As shown in previous posts, the semantics of parametric weighted automata are polynomials in $p$. In order to compute a reachability probability, we need to consider fixpoints (as automata have loops) and consequently we work with rational functions. So in principle, we can readily solve the parameter synthesis problem by solving an equation of the form $\frac{f(p)}{g(p)} \geq 0.75$.
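To make this concrete, here is a toy instance of my own (not from the original posts): from a state $s$ we reach the target with probability $p$, stay in $s$ with probability $(1-p)/2$, and get stuck otherwise. The reachability probability satisfies $v = p + \frac{1-p}{2} v$, giving the rational function $\frac{2p}{1+p}$. A small sympy sketch carries out the exact method end to end:

```python
# Toy example (my own): v = p + (1-p)/2 * v is the fixpoint equation for
# the reachability probability of the automaton described above.
import sympy as sp

p = sp.symbols('p', positive=True)
v = sp.symbols('v')

# Solve the fixpoint equation for the reachability probability.
sol = sp.solve(sp.Eq(v, p + (1 - p) / 2 * v), v)[0]
print(sp.simplify(sol))  # 2*p/(p + 1), a rational function in p

# Exact parameter synthesis: where is the probability at least 0.75?
print(sp.solve(sp.Ge(sol, sp.Rational(3, 4)), p))  # p >= 3/5
```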

However, this exact method is often infeasible. The rational function can be a huge formula, and constructing it can already take a lot of time. And once we have the equation, actually solving it is an even harder problem.

Today, I want to show how we can solve this problem with a gradient descent method, trying to avoid the difficulties of the exact method. I am not aware of this being done in the literature, and I am pretty convinced that this is currently not implemented in any of the probabilistic model checkers.

Derivatives

Recall (from a previous post) that the rational function we are interested in is $v(i) \in \mathbb{R}(p)$, where $i$ is the initial state and $v$ a linear function assigning to each state the reachability probability. For our gradient descent, we want to consider the derivative with respect to $p$:

$$\frac{\partial}{\partial p} v(i).$$

With several parameters $p_1, \ldots, p_k$, we can consider the partial derivative in each direction. This would give us a gradient for a gradient descent solver. I stick to one parameter for the moment, and so I simply write $\partial$ instead of $\frac{\partial}{\partial p}$. Note that the above expression makes sense, as we are working in a derivation ring (meaning a ring with an $\mathbb{R}$-linear map $\partial$ such that the Leibniz rule holds: $\partial(a \cdot b) = (\partial a) \cdot b + a \cdot (\partial b)$).

Of course, simply writing down that quantity won’t help us. We want to avoid calculating $v(i)$ in the first place. In the end, we only want to compute $\partial v(i)$ at a specific point $p \in \mathbb{R}$. Can we do that efficiently?

Let’s unfold the (coinductive) equation for $v$ on an arbitrary state $q$:

$$v(q) = o(q) + \sum_a \sum_{q'} v(q') \cdot \delta_a(q, q'),$$

take the derivative (using linearity):

$$\partial v(q) = \partial o(q) + \sum_a \sum_{q'} \partial \big(v(q') \cdot \delta_a(q, q')\big),$$

and apply the Leibniz rule:

$$\partial v(q) = \partial o(q) + \sum_a \sum_{q'} (\partial v(q')) \cdot \delta_a(q, q') + v(q') \cdot (\partial \delta_a(q, q')).$$

So we see that the derivative depends on the derivative of the output, and in a particular way on the derivatives of the successor states. But it also depends on the actual (non-derived) value, by the Leibniz rule. As it currently stands, this is not an easy equation to solve.

Side note: In the previous post I defined the output to be of type $o \colon Q \to \mathbb{R}$, so its derivative will always be zero. I keep it here in the equation, because we can actually generalise the definition to $o \colon Q \to \mathbb{R}(p)$.

Solving at a point

Let’s plug in a specific point $p \in \mathbb{R}$. Normally, I would denote the valuation of a polynomial as function application, for example $v(q)(p)$ is the value of the rational function $v(q)$ evaluated at $p$. But that would require lots of parentheses, so I denote the evaluation of polynomials with a subscript $v_p(q)$. By evaluating the above expression at a point $p$, we get:


$$\begin{aligned} \partial v_p(q) &= \partial o_p(q) + \sum_a \sum_{q'} (\partial v_p(q')) \cdot \delta_{a p}(q, q') + v_p(q') \cdot (\partial \delta_{a p}(q, q')) \\ &= \partial o_p(q) + \sum_{q'} \partial v_p(q') \cdot \left(\sum_a \delta_{a p}(q, q')\right) + \sum_{q'} v_p(q') \cdot \left(\sum_a \partial \delta_{a p}(q, q')\right) \end{aligned}$$

In the second line, I have regrouped the sums. This is a big messy expression, but if we squint, we can see the following structure:

$$\partial v_p = \partial o_p + \partial v_p \cdot M + v_p \cdot M'$$

for some matrices $M$ and $M'$ (given by the transition matrix and the derivative of the transition matrix w.r.t. $p$). Now, this is a system of linear equations over $\mathbb{R}$, instead of the field of fractions. Don’t let the symbols confuse you, “$\partial v_p$” is simply a variable in this system of equations! Further, the symbols “$\partial o_p$”, “$M$” and “$M'$” are just constants which depend on the given automaton. Instead of solving this equation exactly, we can do a quick value iteration to get an approximate value. (Note that we also need to approximate the vector $v_p$ at the same time.)
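Here is a minimal numpy sketch of that joint value iteration (my own code, not taken from any model checker), instantiated on the toy three-state example from above. I use the column-vector convention $v = o + Mv$ rather than the row-vector convention of the formula; `M` is the transition matrix evaluated at $p$ and `dM` its entrywise derivative:

```python
# Joint value iteration for v_p and its derivative dv_p (a sketch).
import numpy as np

def value_and_gradient(M, dM, o, do, iters=1000):
    """Approximate v_p and dv_p by iterating the two coupled fixpoint
    equations  v = o + M v  and  dv = do + M dv + dM v."""
    v = np.zeros_like(o)
    dv = np.zeros_like(o)
    for _ in range(iters):
        v, dv = o + M @ v, do + M @ dv + dM @ v
    return v, dv

p = 0.5
# Toy automaton from above, states ordered [s, target, dead]; the target
# and dead states are absorbing, so their rows are zero.
M = np.array([[(1 - p) / 2, p, (1 - p) / 2],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
dM = np.array([[-0.5, 1.0, -0.5],    # entrywise d/dp of M
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])
o = np.array([0.0, 1.0, 0.0])        # output 1 on the target state
do = np.zeros(3)                     # the output does not depend on p

v, dv = value_and_gradient(M, dM, o, do)
print(v[0], 2 * p / (1 + p))         # both ~0.6667
print(dv[0], 2 / (1 + p) ** 2)       # both ~0.8889 (analytic derivative)
```

The last two lines compare the iterated values against the closed-form rational function $\frac{2p}{1+p}$ and its derivative $\frac{2}{(1+p)^2}$, which agree.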

The above system of linear equations has a dimension of $2n$, where $n$ is the number of states. I expect that solving the system (approximately) can be very efficient, especially if the automaton structure is sparse. Of special interest is the value $\partial v_p(i)$, which tells us how $v_p(i)$ increases depending on an increase of $p$. This way, we can gradually move toward a point $p$ where the value $v_p(i)$ is high enough for our property to be satisfied. (Obviously, this comes with the disclaimer that a gradient descent may only find local optima.) This will also work with multiple parameters, although we will have to do the above work for each parameter.
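A hypothetical outer loop could then look as follows; `transition_matrices` is an assumed helper (not defined here) that builds $M$ and its derivative at the current point $p$:

```python
# Hypothetical gradient loop on top of value_and_gradient: nudge p in the
# direction that increases v_p(i) until the threshold is met.  The helper
# transition_matrices(p) is assumed; it builds M and dM at the point p.
def synthesize(p, o, do, threshold=0.75, lr=0.5, steps=100):
    for _ in range(steps):
        M, dM = transition_matrices(p)
        v, dv = value_and_gradient(M, dM, o, do)
        if v[0] >= threshold:          # state 0 is taken to be the initial state i
            return p
        p = min(max(p + lr * dv[0], 0.0), 1.0)  # step uphill, keep p a probability
    return None                        # give up after too many steps
```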

I haven’t actually done the gradient descent; I have only shown how to obtain the gradient. It would be a nice project to implement such a thing in a probabilistic model checker.

Other iterative methods for finding optimal parameters, such as Newton’s method, can also be used. Such methods use second derivatives (or even higher ones), which can be computed in a similar way. The system of equations will be bigger and more costly to solve or approximate, but on the other hand, such methods require fewer iterations of improving the parameter.
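For instance, differentiating the linear system above once more (a quick derivation of my own, applying the Leibniz rule again) gives a system for the second derivative, with $M''$ the entrywise second derivative of the transition matrix:

$$\partial^2 v_p = \partial^2 o_p + \partial^2 v_p \cdot M + 2\,(\partial v_p) \cdot M' + v_p \cdot M''.$$

Solving for $v_p$, $\partial v_p$ and $\partial^2 v_p$ together would then give a system of dimension $3n$.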

Automaton construction

The above manipulations can also be carried out to construct the “derivative of the automaton.” I haven’t worked out the details yet, but it gives us a construction

$$\mathcal{A} \mapsto \partial \mathcal{A}$$

with the property that $[\![ \partial \mathcal{A} ]\!](w) = \partial [\![ \mathcal{A} ]\!](w)$. This would increase the size of the automaton by a factor of 2 (I guess), just like the system of equations above was of dimension $2n$. Whether this derived automaton is interesting in its own right, I don’t know.
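One guess at such a construction (my own sketch, since the details are left open above) mirrors dual numbers from automatic differentiation: double the state space and take block-triangular transition matrices,

$$\delta^{\partial}_a = \begin{pmatrix} \delta_a & \partial \delta_a \\ 0 & \delta_a \end{pmatrix}, \qquad i^{\partial} = \begin{pmatrix} i & \partial i \end{pmatrix}, \qquad o^{\partial} = \begin{pmatrix} \partial o \\ o \end{pmatrix}.$$

The map $\delta_a \mapsto \delta^{\partial}_a$ is multiplicative precisely because of the Leibniz rule, so the semantics of a word $w$ computes $i \, \delta_w \, \partial o + \partial(i \, \delta_w) \, o = \partial(i \, \delta_w \, o)$, as desired.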

