Stochastic Grammars

Next: Graph Rewriting Up: Formula Parsers Previous: Procedurally Coded Math Syntax

Stochastic Grammars

Stochastic grammars are reported to yield good results with both typeset and handwritten input. A stochastic approach can be added to any type of grammar, and two examples of systems that use this type of approach are Chou's , and Miller and Viola's . Chou's work processes typeset input, while Miller and Viola are now moving from typeset to handwritten input.

A stochastic grammar has associated with every production a probability that the production is used. Thus, for any given sequence of productions in a given parse, the overall probability of this sequence can be calculated. The correct parsing of a set of symbols is the parsing that has the highest probability.

To use this approach, each production in the grammar needs to be assigned a probability. There are a number of algorithms for assigning probabilities, typically working from a set of example strings which are known to be in the language that the grammar describes. Chou describes how to adapt the ``inside/outside algorithm'', originally designed for linear one dimensional input, to a two dimensional grammar that uses vertical and horizontal concatenation operators. The inside/outside algorithm makes a number of passes over the grammar and examples from that grammar's language, determining the probabilities for each rule in the grammar.

Stochastic grammars cope well, and in a pleasing way, with geometric tests as symbols can be given a probability that they have a particular geometric relation to other symbols or subexpressions. This is in contrast to other approaches which judge arrangements to be either valid or invalid. Miller and Viola model the positions of symbols as Gaussian variables, the probability that two elements are in a particular relationship relative to each other is defined by a two-dimensional Gaussian distribution around the expected position of the second expression. This helps cope with ambiguities, such as those described in Section .

Another advantage of a stochastic approach is that it can take as its input the output of the character recogniser in the form of symbols and possible alternatives, along with confidence values. As a result, the stochastic parser itself can choose from the alternatives in ambiguous cases, or when an error occurs, to get the most likely parse.

To simplify matters, the Miller and Viola sub-class symbols into sets of equivalent symbols:

ascender letters (b,d,h,i,k,l,t,A-Z (except Q), $\delta$ ).
descender letters (g,p,q,y, $\gamma$ ).
small letters (a,c,e,m,n,o,r,s,u,v,w,x,z, $\alpha$ ).
ascender/descenders (f,j,Q, $\beta$ ).
binary operators (+,-,=).
zero (0).
non-zero digits (1-9).
other symbols, each in a class of their own (round, curly and square brackets, fraction symbol).

Chou maintains all possible recognitions of symbols with a probability over a given threshold. Miller and Viola note that all symbols in each subclass are syntactically equivalent, so only keep the best from each subclass.

Miller and Viola use an A^* heuristic to guide their search with the probabilities of each production being calculated during the parsing process. The heuristic provides an estimate of the number of steps between the current state and the solution. For a heuristic to be A^*, it is required to never overestimate the number of steps. Thus, if such a heuristic is used to guide a search, then the shortest path to the solution will be found.

Their paper reports the success their system has, and how much faster their system is with the techniques that they use. They report that the heuristics reduce the parsing times for formulae from minutes to seconds. The paper by Miller and Viola has some very good results: it currently works very well on typeset text and their preliminary results with handwritten input are also very encouraging.

The use of stochastic grammars seems to be one of the two best approaches available for the processing of handwritten input, the other being graph rewriting.

Next: Graph Rewriting Up: Formula Parsers Previous: Procedurally Coded Math Syntax

Steve Smithies
1999-11-13