The Formula Processor

The formula processor implemented is a graph rewriting system, similar to that described by Lavirotte and Pottier . Using an input of symbols and their bounding boxes, a graph is constructed with nodes representing the symbols in the formula. Each node has data associated with it, holding the identity of the symbol, the location of its bounding box and a unique ID. As parsing proceeds, nodes will come to represent subexpressions within the formula. Directed arcs are built in the graph, within predetermined restrictions, signifying the geometric relationships between the symbols in the formula. Each arc has a label that stores what geometric relationship it represents between the two nodes it connects.

By making it possible to build more than one arc between any two given nodes, uncertain relationships can be represented. For example, to represent the possibility a symbol may be either to the ``top-right'' or ``right'' of another, two arcs are built between their respective nodes.

The restrictions on the building of arcs limit the arcs to ``sensible'' ones, based on their types and positions relative to each other and to other symbols. This helps the later parsing process by simplifying the graph built.

After being constructed from the symbols in the input formula, the graph is parsed by a graph grammar parser. The grammar used by the parser describes the syntax and layout of mathematical formulae through a number of small, typically one to five node, graphs that are templates for subexpressions that are found in mathematical expressions. Figure

shows some rules from a graph grammar.

**Figure 3.1:** Sample rules from a graph grammar.
$\includegraphics[scale=0.6]{figures/plusrule.eps}$ (a) Addition. $\includegraphics[scale=0.6]{figures/powerrule.eps}$ (b) Superscript. $\includegraphics[scale=0.6]{figures/integralrule.eps}$ (c) Integral with limits.

In Figure 3.1, we can see boxes representing nodes in the graph. Each box has two parts. The lower part is the ``data'' component, and the upper part is the ``type'' component. When rules are being matched to a graph representing a formula, the ``type'' values are considered. When the rule is being applied, the ``data'' components dictate the data component of the resulting node after application. As nodes are combined, a new bounding box is calculated based on the nodes being combined.

The superscript rule in Figure 3.1(b) shows how a superscript operation is defined. Two nodes of type ``element'', with a ``top-right'' arc between them are found. If the rule is applied, these nodes are then collapsed to a single node. The type of the new node is ``element'' and the data value for the new node is constructed from the data value of the original two nodes as shown, the ``A'' and ``B'' being the original data values of the original two nodes.

It can be seen in these rules that not all the data values of the original nodes are always used. The discarded data values are typically those of the operator symbol. For example, in the integral rule in Figure 3.1(c), the integral is node ``A'', however the integration is represented by the L^ATEX command \int in the final string.

It is possible to build either an abstract representation of the formula, such as a parse tree, or a L^ATEX string as is being done here. The parse tree is a much more versatile output format and can be linearised after parsing to generate L^ATEX or some other notation. Here, the direct generation of L^ATEX is sufficient for this system's purposes. A LISP-like prefix notation is also commonly used , and is more suited as an intermediate representation if you do not want to use L^ATEX or an internal data structure for the parse tree.

As described in Section

, Anderson and Bernstein recommend having the formula parser initially produce a layout description of the formula, then having a later stage of processing to determine the actual formula based on this. As the system aims to generate L^ATEX that represents the formula that the user has entered, it is valid to bypass the intermediate layout representation and generate L^ATEX directly. However, should it prove necessary to generate the positional notation, changing the grammar to produce a LISP-like layout description would be simple, as it is a matter of changing the grammar from producing L^ATEX to producing the LISP-like layout description code.

recommend having the formula parser initially produce a layout description of the formula, then having a later stage of processing to determine the actual formula based on this. As the system aims to generate L^ATEX that represents the formula that the user has entered, it is valid to bypass the intermediate layout representation and generate L^ATEX directly. However, should it prove necessary to generate the positional notation, changing the grammar to produce a LISP-like layout description would be simple, as it is a matter of changing the grammar from producing L^ATEX to producing the LISP-like layout description code.