Neural networks provide robust character recognition for Newton PDAs

Larry Yaeger, Apple Computer

While on-line handwriting recognition is an area of long-standing and
ongoing research, the recent emergence of portable, pen-based
computers (personal digital assistants, or PDAs) has focused urgent
attention on usable, practical solutions. Pen-based PDAs depend wholly
on fast and accurate handwriting recognition, because the pen serves
as the primary means for inputting data to the devices. To meet this
need, we have combined an artificial neural network (ANN) character
classifier with a context-driven search over character segmentation,
word segmentation, and word recognition hypotheses to provide robust
recognition of hand-printed English text in new models of Apple
Computer's Newton MessagePad.

Earlier attempts at handwriting recognition used strong, limited
language models to maximize accuracy. However, this approach failed in
real-world applications, generating disturbing and seemingly random
word substitutions known colloquially within Apple as "The Doonesbury
Effect" (due to Garry Trudeau's biting satire based on first-generation
Newton recognition performance). We have taken an alternative
approach, using bottom-up classification techniques based on trainable
ANNs, in combination with comprehensive but weakly applied language
models. By simultaneously providing accurate character-level
recognition, via the ANN, with dictionaries exhibiting very wide
coverage of the language (as well as special constructs such as dates,
times, and phone numbers), plus the ability to write entirely outside
those dictionaries (at a low probability), we have produced a
hand-print recognizer that some have called the first usable
handwriting recognition system.

The core of Apple's print recognizer is the ANN character classifier.
We chose ANN technology at the outset for a number of key attributes.
First, it is inherently data-driven--it learns directly from examples
of the kind of data it must ultimately classify. Second, ANNs can
carve up the sample space effectively, with nonlinear decision
boundaries that yield excellent generalization, given sufficient
training data. This results in an ability to accurately classify
similar but novel patterns, and avoids certain classic, subtle data
dependencies exhibited by hidden Markov models (HMMs),
template-matching, and other schemes, such as over-sensitivity to
hooks on tails, pen-skips, and the like. In addition, there is a rich
literature demonstrating the applicability of ANNs to producing
accurate estimates of a posteriori probabilities for each class, given
the inputs.

In some respects, our ANN classifier is quite generic, being trained
with standard error backpropagation (BP). Our network's architecture
takes advantage of previous work indicating that combinations of
multiple classifiers can be much more accurate than any single classifier.
However, we combine those parallel classifiers in a unique fashion,
tying them together into a single, integrated multiple-representations
architecture, with the last hidden layer for each otherwise
independent classifier connected to a final, shared output layer. We
take one classifier that sees primarily stroke features (tangent slope
resampled to a fixed number of points), and another classifier that
sees primarily an anti-aliased image, and combine them only at the
final output layer. This architecture allows standard BP to learn the
best way to combine the multiple classifiers, which is both powerful
and convenient.
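The shape of this architecture can be sketched in a few lines of NumPy. The layer sizes, activations, and random inputs below are illustrative assumptions, not the production network; the point is that each representation keeps its own hidden layer, and the two meet only at a single shared output layer whose weights standard BP can train.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden(x, W, b):
    # one tanh hidden layer
    return np.tanh(x @ W + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 40 stroke-feature inputs, 196 image pixels,
# 72 hidden units per branch, 95 output character classes.
n_stroke, n_image, n_hid, n_classes = 40, 196, 72, 95

# Independent hidden layers, one per representation.
W_s = rng.normal(scale=0.1, size=(n_stroke, n_hid)); b_s = np.zeros(n_hid)
W_i = rng.normal(scale=0.1, size=(n_image, n_hid));  b_i = np.zeros(n_hid)

# One shared output layer sees both last hidden layers, so backprop
# learns how best to combine the two classifiers.
W_out = rng.normal(scale=0.1, size=(2 * n_hid, n_classes))
b_out = np.zeros(n_classes)

def forward(stroke_feats, image_pixels):
    h_s = hidden(stroke_feats, W_s, b_s)   # stroke-feature branch
    h_i = hidden(image_pixels, W_i, b_i)   # anti-aliased-image branch
    h = np.concatenate([h_s, h_i])         # combined only here
    return softmax(h @ W_out + b_out)

p = forward(rng.normal(size=n_stroke), rng.normal(size=n_image))
```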

Training an ANN character classifier for use in a maximum-likelihood
word recognition system has different constraints than would training
such a network for stand-alone character recognition. In particular,
we have devised several innovative network training techniques, all of
which modestly degrade the accuracy of the network as a pure character
classifier, yet dramatically improve the accuracy of the word
recognition system as a whole.

The first of these techniques we refer to as NormOutErr, short for
"normalized output error." Training an ANN to classify 1-of-N targets
with standard BP produces a classifier that does a fine job of
estimating p(class|input) for the top-choice class. However, BP's
least mean-squared error solution, together with typical
classification vectors--that consist of all 0s except for a single 1
corresponding to the target class--results in a classifier that does
not estimate second- and third-choice probabilities well. Rather, such
classifiers tend to make unambiguous single-choice classifications of
patterns that are, in fact, inherently ambiguous. The result is a
class of recognition errors involving a single misclassified letter
(where the correct interpretation is assigned a zero or near-zero
probability) that causes the search to reject the entire, correct word.

We speculated that this effect might be due to the preponderance of 0s
relative to 1s in the target vectors, as seen at any given output
unit. Lacking any method for accurately reflecting target ambiguity in
the training vectors, we tried partially normalizing this "pressure
towards 0" relative to the "pressure towards 1." We did this by
modulating the error seen at nontarget output units by a scale factor,
while leaving the error at the target output unit unmodified. This
generally increased the activation levels of the output units, and
forced the network to allocate more of its resources to the modeling
of low probability samples and classes. Most significantly, it allowed
the network to model second- and third-choice probabilities, thus
making the ANN classifier a better citizen in the larger recognition
system. While this technique reduced top-choice character accuracy on
the order of a percent, it dramatically increased word-level accuracy,
resulting in approximately a 30% reduction in word-level error rate.
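The error modulation itself is simple to express. The sketch below is a minimal NumPy formulation; the specific scale factor and the exact parameterization used in the production trainer are assumptions for illustration.

```python
import numpy as np

def normouterr_delta(output, target, a=0.3):
    # NormOutErr: the error seen at the target output unit is left
    # unmodified, while the error at every nontarget unit is scaled
    # down by a factor a < 1, reducing the aggregate "pressure
    # towards 0" relative to the "pressure towards 1".
    err = target - output
    scale = np.where(target > 0.5, 1.0, a)
    return scale * err
```

With a = 1 this reduces to ordinary BP output error; smaller values of a let the network raise its output activations and model second- and third-choice probabilities.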

Another of the techniques we apply routinely in our ANN training is
what we call frequency balancing. Training data from natural English
words and phrases exhibit very nonuniform priors for the various
character classes, and ANNs readily model these priors. However, as
with NormOutErr, we find that reducing the effect of these priors on
the net in a controlled way, thus forcing the net to allocate more of
its resources to low-frequency, low-probability classes, significantly
benefits the overall word recognition process. To this
end, we explicitly (partially) balance the frequencies of the classes
during training. We do this by probabilistically skipping and
repeating patterns, based on a precomputed repetition factor. (Each
presentation of a repeated pattern is "warped" uniquely, as discussed
later.) This balancing of class frequencies is conceptually related to
a common method for converting ANN estimates of the posterior
probability p(class|input) to the value needed in an HMM or Viterbi
search, p(input|class): dividing by the p(class) priors. However,
our approach avoids potentially noisy estimates of low-probability
classes resulting from division by small numbers, and eliminates the
need for subsequent renormalization. Again, character-level accuracy
suffers slightly by the application of this technique, but word-level
accuracy improves significantly.
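A sketch of the skip/repeat mechanism follows. The particular repetition-factor formula (a power of the count ratio, where an exponent of 1 fully balances classes and 0 leaves them untouched) is an illustrative assumption, not necessarily the production parameterization.

```python
import random
from collections import Counter

def repetition_factors(labels, power=0.5):
    # Partial frequency balancing: the repetition factor for class c
    # is (max_count / count_c) ** power. power=1 would fully equalize
    # class frequencies; power=0 would leave the natural priors intact.
    counts = Counter(labels)
    m = max(counts.values())
    return {c: (m / n) ** power for c, n in counts.items()}

def balanced_epoch(patterns, labels, factors, rng):
    # Probabilistically skip or repeat each pattern so that, on
    # average, a pattern of class y appears factors[y] times per epoch.
    out = []
    for x, y in zip(patterns, labels):
        f = factors[y]
        reps = int(f) + (1 if rng.random() < f - int(f) else 0)
        out.extend([(x, y)] * reps)  # each copy is uniquely warped in practice
    return out
```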

While frequency balancing corrects for under-represented classes, it
cannot account for under-represented writing styles. We use a
probabilistic skipping of patterns to address this problem as well,
but this time for just those patterns that the net correctly
classifies in its forward/recognition pass, which results in a form of
error emphasis. We define a correct-train probability for use as a
biased coin to determine whether a particular pattern, having been
correctly classified, will also be used for the backward/training
pass. This only applies to correctly segmented, or positive patterns,
and misclassified patterns are never skipped. Especially during early
stages of training, we set this parameter fairly low, thus
concentrating most of the training time and the net's learning
capability on patterns that are more difficult to correctly classify.
This is the only way we were able to get the net to learn to correctly
classify unusual character variants, such as a three-stroke "5" as
written by only one training writer.
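The biased-coin decision reduces to a few lines; the function name and argument layout here are illustrative, but the logic follows the rule just described.

```python
import random

def should_backprop(correctly_classified, is_positive, p_correct_train, rng):
    # Error emphasis: misclassified patterns are never skipped, and the
    # skipping applies only to correctly segmented (positive) patterns.
    # A correctly classified positive pattern gets a backward/training
    # pass only with probability p_correct_train.
    if not correctly_classified or not is_positive:
        return True
    return rng.random() < p_correct_train
```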

Other special training techniques include negative training--presenting
missegmented collections of strokes as training patterns, along with
all-zero target vectors--and stroke warping--deliberate random
variations in stroke data, consisting of small changes in skew,
rotation, and x and y linear and quadratic scalings. During
recognition, the ANN classifier will necessarily encounter both valid
and invalid combinations of strokes, and must attempt to classify each
as a character. Negative training helps by tuning the net to suppress its
output activations for invalid combinations, thus reducing the
likelihood that those missegmentations will find a place in the
optimum search path. Stroke warping effectively extends the data set
to similar, but subtly different writing styles, and enforces certain
useful invariances.
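A minimal sketch of stroke warping follows, covering the rotation, skew, and linear scalings; the quadratic scalings mentioned above are omitted for brevity, and the perturbation magnitudes are illustrative assumptions.

```python
import numpy as np

def warp_stroke(points, rng, max_rot=0.06, max_skew=0.06, max_scale=0.06):
    # points: (N, 2) array of x,y pen samples for one character.
    # Apply a small random rotation, x-skew, and independent x/y
    # linear scalings about the character's centroid.
    theta = rng.uniform(-max_rot, max_rot)           # rotation (radians)
    skew = rng.uniform(-max_skew, max_skew)          # x-skew
    sx = 1.0 + rng.uniform(-max_scale, max_scale)    # linear x scaling
    sy = 1.0 + rng.uniform(-max_scale, max_scale)    # linear y scaling
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    shear_scale = np.array([[sx, skew], [0.0, sy]])
    A = rot @ shear_scale
    center = points.mean(axis=0)
    return (points - center) @ A.T + center
```

Each repeated presentation of a pattern draws fresh random parameters, so the net never sees exactly the same stroke data twice.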

Two practical considerations in building an ANN-based system for a
hand-held device are speed and memory limitations. Especially for the
ARM 610 chip that drives the Newton MessagePad 120 and 130 units,
8-bit integer operations are much faster than either longer-integer or
floating-point operations, and cache coherency benefits from reduced
data sizes. In addition, memory is at a premium in these devices. So,
despite previous work that suggests ANN training requires roughly
16-bit weights, we were highly motivated to make 8-bit weights work.
We took advantage of the fact that the ANN's forward/recognition pass
is significantly less demanding, in terms of precision, than is the
backward/learning pass. It turns out that 1-byte (8-bit) weights are
sufficient if the weights are properly trained. We limit the dynamic
range of floating-point weights during training, and then round to the
desired precision after convergence. If the weight limit is enforced
during high-precision training, the net's resources will adapt to
compensate for the limit. Because bias weights are few in number,
however, and very important, we let them use two bytes with
essentially unlimited range. Performing our forward/recognition pass
with low-precision, one-byte weights (a q3.4 fixed point
representation, ranging from almost -8 to +8 in 1/16 increments), we
find no noticeable degradation relative to floating-point, 4- or
2-byte weights using this scheme. We have also developed a net
training algorithm based on 8-bit weights: during the
backward/training pass only, each weight gains an additional 2 bytes
that accumulate low-order changes, which only occasionally carry over
into the primary 8-bit portion that the forward/recognition pass sees.
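The q3.4 rounding and the training-time range limit can be sketched as follows (a simplified illustration; the production code operates on integer weight arrays rather than individual Python floats):

```python
def quantize_q3_4(w):
    # Round a float weight to the q3.4 fixed-point grid used at
    # recognition time: a signed 8-bit value with 4 fraction bits,
    # i.e., steps of 1/16 over [-8, +8 - 1/16].
    q = int(round(w * 16))
    q = max(-128, min(127, q))
    return q / 16.0

def clamp_during_training(w, limit=8.0 - 1.0 / 16.0):
    # Enforce the dynamic-range limit while still training in floating
    # point, so the net's resources adapt to compensate for the limit
    # before the final rounding to 8 bits.
    return max(-8.0, min(limit, w))
```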

So, in summary, we have devised several techniques for using and
training an ANN classifier that is to be embedded in a higher-level
recognition system. Some, such as limited precision weights, are a
direct result of physical limitations of the device. Others derive
from the fact that an ANN classifier providing class probability
estimates to a search engine necessarily has different constraints
than such a classifier operating alone. Despite the seemingly
disparate nature of the various techniques we've described, there does
seem to be a unifying theme, which is that reducing the effect of a
priori biases in the data on network learning significantly improves
the system's overall accuracy. Normalization of output error prevents
over-represented nontarget classes from biasing the net against
under-represented target classes. Frequency balancing prevents
over-represented target classes from biasing the net against
under-represented target classes. And error emphasis prevents
over-represented writing styles from biasing the net against
under-represented writing styles.

One could even argue that negative training eliminates an absolute
bias towards properly segmented characters, and that stroke warping
reduces the bias towards those writing styles found in the training
data, although these techniques provide wholly new information to the
system as well. The general effect may be related to the technique of
dividing out priors, as is sometimes done to convert from
p(class|input) to p(input|class). In any event, it is clear that
paying attention to such biases and taking steps to modulate them
represent a vital component of effectively training a neural network
serving as a classifier in a maximum-likelihood recognition system. It
is also clear that ANN classifiers in conjunction with optimal search
strategies provide a degree of accuracy and robustness that is
otherwise difficult to obtain.

This work was performed in collaboration with Richard Lyon (Apple),
Brandyn Webb (The Future), Bill Stafford (Apple), and Les Vogel (Angel
Island Technologies). We are also indebted to many supportive and
contributing colleagues at Apple and in the connectionist community. A
more detailed, technical discussion of our recognition system will
appear in an upcoming issue of AAAI's AI Magazine, and is available
through my web page at <>.