## Selsam on formal verification of machine learning

Here is the first result out of the project Verifying Deep Mathematical Properties of AI Systems, funded through the Future of Life Institute: *Developing Bug-Free Machine Learning Systems With Formal Mathematics*, by Daniel Selsam, Percy Liang, and David L. Dill. (A technical abstract of the project is available here; note that David Dill has taken over as PI from Alex Aiken.)

Noisy data, non-convex objectives, model misspecification, and numerical instability can all cause undesired behaviors in machine learning systems. As a result, detecting actual implementation errors can be extremely difficult. We demonstrate a methodology in which developers use an interactive proof assistant to both implement their system and to state a formal theorem defining what it means for their system to be correct. The process of proving this theorem interactively in the proof assistant exposes all implementation errors since any error in the program would cause the proof to fail. As a case study, we implement a new system, Certigrad, for optimizing over stochastic computation graphs, and we generate a formal (i.e. machine-checkable) proof that the gradients sampled by the system are unbiased estimates of the true mathematical gradients. We train a variational autoencoder using Certigrad and find the performance comparable to training the same model in TensorFlow.
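To give a flavor of what such a correctness theorem looks like, here is a hypothetical Lean-style rendering of the unbiasedness property. This is illustrative only, not Certigrad's actual source: the names `graph`, `params`, `well_formed`, `run_backprop`, and `loss` are stand-ins for the real definitions.

```lean
-- Illustrative sketch, not Certigrad's source. The claim: the expected
-- value of the sampled gradient equals the gradient of the expected
-- loss, i.e. the stochastic backpropagation estimator is unbiased.
theorem backprop_unbiased (g : graph) (θ : params)
    (wf : well_formed g θ) :
  E (run_backprop g θ) = ∇ (λ θ', E (loss g θ')) θ :=
sorry
```

In the actual development the `sorry` placeholder is replaced by a full machine-checked proof, and the preconditions bundle properties such as differentiability and integrability of the graph's operators.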

You can find discussion on HackerNews. The lead author was kind enough to answer some questions about this work.

**Q:** Is the correctness specification usually a fairly singular statement? Or will it often be of the form “The program satisfies properties A, B, C, D, and E”? (And then maybe you add “F” later.)

**Daniel Selsam:** There are a few related issues: how singular is a specification, how much of the functionality of the system is certified (coverage), and how close the specification comes to proving that the system actually does what you want (validation).

*Singular vs plural*. If you want to certify a red-black tree, then you will probably want to prove many different properties about how the core methods (e.g. finding, inserting, deleting) interact, and so the specification will be rather plural. But a different system may *use* the certified red-black tree to do one very specific thing and may have a singular specification. Thus how singular or plural a specification is depends heavily on where we draw the boundaries between systems and is somewhat arbitrary. One way or another, the internals of any proof of correctness will need to make use of many different lemmas; sometimes you can tie them all up in a bow for a particular project and sometimes you cannot.
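A plural specification of this sort might look like a bundle of lemmas about how the operations interact. A hypothetical Lean-style sketch (illustrative names, not taken from any particular library):

```lean
-- Hypothetical property bundle for a certified red-black tree.
-- Each lemma constrains a different interaction between operations.
theorem insert_mem (t : rbtree α) (x y : α) :
  mem y (insert t x) ↔ (y = x ∨ mem y t) := sorry

theorem insert_balanced (t : rbtree α) (x : α) :
  balanced t → balanced (insert t x) := sorry

theorem erase_mem (t : rbtree α) (x y : α) :
  mem y (erase t x) ↔ (mem y t ∧ y ≠ x) := sorry
```

No single one of these statements captures "the tree is correct"; it is the conjunction of many such lemmas that does.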

*Coverage*. In Certigrad, we prove that the sampled gradients are unbiased estimates of the true gradients, which arguably constitutes total functional correctness for the stochastic backpropagation algorithm. We also prove a few other properties (e.g. that two program optimizations are sound). However, there are still some parts of the system about which we prove nothing. For example, a user could enter their model and compute the correct gradients, but then do gradient ascent instead of gradient descent and get nonsense when trying to train it. In principle we could prevent this kind of error as well by also proving that the learning algorithms converge to local optima on models that satisfy certain properties, but we do not do this yet in Certigrad.
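The ascent-vs-descent slip is easy to demonstrate: even with perfectly correct gradients, flipping one sign makes training diverge. A minimal sketch in plain Python, minimizing f(x) = x² (not Certigrad code, just an illustration of the failure mode):

```python
def grad(x):
    """Exact gradient of f(x) = x**2: the gradient itself is correct."""
    return 2 * x

def train(x, lr, sign, steps=100):
    """sign = -1 is gradient descent; sign = +1 is the buggy 'ascent'."""
    for _ in range(steps):
        x = x + sign * lr * grad(x)
    return x

descent = train(1.0, 0.1, -1)  # shrinks toward the minimum at 0
ascent = train(1.0, 0.1, +1)   # blows up, despite correct gradients
```

A gradient-correctness theorem says nothing about the update rule, which is why coverage is a separate question from the correctness of any one component.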

*Validation*. In Certigrad, even if we approach full coverage of the system by proving convergence of the learning algorithms, we still would not have any theorems that even come close to saying that Certigrad does what we really want it to, which might be something like “after training a model with Certigrad on a dataset, the model will be extremely accurate on unseen data”. Traditionally, *verification* refers to proving that you implemented a system correctly (e.g. that your gradients are correct), whereas *validation* refers to proving that the system you chose to implement really does what you want (e.g. that your model will be accurate on unseen data). The line is blurry between verification and validation but I find the distinction useful. The limiting factor in proving validation properties for Certigrad is that nobody knows that much yet about the assumptions under which most useful models (such as neural networks) are guaranteed to perform well. Also, the set of properties we might want to prove is open-ended. There may even be multiple interesting properties we could prove about individual Certigrad models.

**Q:** Why concentrate on machine learning?

**D.S.:** Formal verification may be especially useful in machine learning (ML) for several reasons. First, ML systems are notoriously difficult to debug because developers don’t know what the system is supposed to do on a given input (we argue this in the introduction). Second, ML systems involve advanced mathematics that most developers will not be masters of, and formal math lets the ML experts communicate to the developers what the system needs to do in a precise and self-contained way that does not require years of training to be able to make sense of. Third, (math-aware) developers often need to do tedious and error-prone calculations by hand, and many of these calculations can be automated using tools from computer algebra. Fourth, because performance is dominated by matrix multiplication, machine learning algorithms do not require low-level optimizations to be competitive and so it is easier to formally verify real systems.
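Short of computer algebra or a full proof, practitioners often catch errors in hand-derived gradients with a finite-difference check. This is a weaker, numerical stand-in for the third point above, sketched here with the sigmoid function as an example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    # Hand-derived gradient: sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid(x)
    return s * (1 - s)

def numeric_grad(f, x, h=1e-6):
    # Central finite difference: approximates f'(x) with no derivation.
    return (f(x + h) - f(x - h)) / (2 * h)

# Spot-check the hand derivation at a few points.
for x in (-2.0, 0.0, 1.5):
    assert abs(dsigmoid(x) - numeric_grad(sigmoid, x)) < 1e-6
```

Unlike a formal proof, such a check only samples a few inputs at floating-point precision; it can miss errors that a machine-checked derivation would rule out entirely.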

**Q:** It would seem that theorem provers are only useful insofar as the mathematical statement that specifies what it means to be correct is easier to understand than the code itself. What sort of “compression ratios” does Certigrad achieve? Will this depend on machine learning vs. another application?

**D.S.:** Theorem provers are also useful when the dumbest possible implementation is easier to understand than the most optimized one, which is almost always the case. The specification can simply be “it does the same thing as the naive implementation”. Similarly, a compiler may have millions of lines of code implementing program transformations, and yet the specification for each transformation may only be “it preserves semantics”. Compression ratios can vary though, and we did not get much actual compression in Certigrad. The full specification for backpropagation (including the definitions of all the preconditions) is about as many characters as the code that implements it. But it is much easier to provide than the implementation, and much easier to inspect and confirm that it is correct. I think a better metric may be how much shorter the specification is than the code *plus* the proof that the code is correct, since you would need to understand the proof to really understand the code in the first place.
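The "it does the same thing as the naive implementation" style of specification can be sketched concretely. Here the slow, obvious implementation serves as the spec for an optimized one; a proof assistant would prove the equality for all inputs, whereas this Python sketch only spot-checks it on random inputs:

```python
import random

def power_naive(b, n):
    """The 'dumbest possible implementation': this IS the specification."""
    r = 1
    for _ in range(n):
        r *= b
    return r

def power_fast(b, n):
    """Optimized square-and-multiply; much less obvious at a glance."""
    r = 1
    while n > 0:
        if n & 1:
            r *= b
        b *= b
        n >>= 1
    return r

# A proof assistant would establish: for all b, n,
#   power_fast b n = power_naive b n.
# Here we can only sample the same specification.
for _ in range(1000):
    b, n = random.randint(0, 50), random.randint(0, 30)
    assert power_fast(b, n) == power_naive(b, n)
```

The specification stays one line ("equals the naive version") no matter how intricate the optimized implementation becomes, which is where the compression comes from.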
