Input


Designing a parser requires a supply of input. Many systems that process typeset input use input generated by LaTeX, described in Section 1.1, for several reasons. It is a convenient way to generate input. The process is easily automated: an input string can be passed into the system and, after formula processing, the result can be compared with that input string to check that the two match. The existence of an input string guarantees that there is a "solution" to the parsing process: if something is known to have been generated by LaTeX, then it should be possible to regenerate the LaTeX for it. Generating input from screenshots gives clean input data, free from the noise and other artifacts that a printing and scanning process would introduce.
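The automated comparison described above can be sketched as a small round-trip check. This is only an illustration under stated assumptions: the names normalise, round_trip_ok, fake_render and fake_recognise are hypothetical, and the two stand-in functions take the place of a real typesetting step and a real formula processor, neither of which is specified here.

```python
from typing import Callable


def normalise(latex: str) -> str:
    """Collapse runs of whitespace so purely cosmetic differences are ignored."""
    return " ".join(latex.split())


def round_trip_ok(source: str,
                  render: Callable[[str], object],
                  recognise: Callable[[object], str]) -> bool:
    """Render the source, recognise the result, and compare with the source.

    `render` and `recognise` are assumptions: any pair of callables that
    turn a LaTeX string into some image-like object and back again.
    """
    image = render(source)
    recovered = recognise(image)
    return normalise(recovered) == normalise(source)


# Hypothetical stand-ins so the sketch runs on its own; a real system
# would typeset the string and run its formula processor instead.
def fake_render(latex: str) -> dict:
    return {"pixels": latex}


def fake_recognise(image: dict) -> str:
    return image["pixels"]


formulae = [r"\frac{a}{b}", r"x^{2} + y^{2}"]
failures = [f for f in formulae
            if not round_trip_ok(f, fake_render, fake_recognise)]
```

Plain string comparison is stricter than necessary, since different LaTeX strings (for example x^2 and x^{2}) typeset identically; a practical harness would probably need a more forgiving notion of equality.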

For handwritten input, data is gathered as the user writes with a pen and tablet. In contrast to typeset input, some input that is perfectly reasonable to the person who wrote it may be impossible to convert to LaTeX, no matter how good the underlying formula processor is. This may be because LaTeX is not powerful enough to represent the user's input or, more likely, because the formula processor was not programmed to anticipate a particular user's style of laying out formulae.

Each individual author will have a fairly consistent style, allowing for variations in the positions and sizes of symbols. Mathematicians also invent their own notations to improve the brevity and readability of their formulae. To accommodate this, an online handwriting-based entry tool would ideally be easily extensible by the user, possibly through some sort of GUI tool.

To simplify the problem of recognising mathematical formulae, the consistency of input can be improved by restricting it to a certain style of mathematical notation. For typeset input, for example, this can be achieved by taking all input from a single source, such as LaTeX or an individual publication. Within this single source, the style (i.e. fonts, sizes, spacing, etc.) will be relatively consistent.

Both typeset and handwriting-based systems have to recognise the characters that are input to the system. This raises issues of segmentation and recognition, and of dealing with the errors that arise from these steps.

With handwritten input, we must face the fact that the user may give sloppy input, erroneous input, or an incomplete formula. While books can contain mistakes as well, the likelihood of a user giving erroneous input is much higher.


Steve Smithies
1999-11-13