Software seeks to read the writing on the wall — and elsewhere
- By Brian Robinson
- Mar 24, 2008
Making timely sense of information contained in printed documents, handwritten letters and even graffiti scrawled on a wall can be of huge value to warfighters, but doing that with English sources is hard enough, let alone with Arabic script.
The Defense Advanced Research Projects Agency is trying to overcome those barriers with a new language technology program called Multilingual Automatic Document Classification Analysis and Translation (MADCAT), whose goal is to develop ways to automatically convert foreign-language text images into English transcripts.
Such a system would reduce the military’s dependence on the linguists and analysts now needed to decide which information is valuable and which is not. Often, the value of information has drastically diminished by the time those experts arrive on the scene and sort through it all.
But researchers face a number of significant technical challenges, according to Prem Natarajan, the principal MADCAT investigator at BBN Technologies, which was recently awarded a $5.7 million DARPA grant for work on the project.
“This is the first organized attempt to go after this kind of hard-copy document processing,” he said. “It’s similar to the problems associated with [optical character recognition] scanning, which works well for well-structured, English-language documents but not at all well for degraded, real-world documents.”
BBN recently showed that the vocabulary training that lets current OCR systems recognize and translate printed English documents can also be applied to handwritten documents, and can probably be extended to similar Arabic and Chinese documents, he said.
However, a big problem is the variability of language and script used by writers, he said.
“We’re talking about handwritten messages here, of various orientations, with certainly less-than-perfect lettering and spelling,” said Howard Bender, chairman of Any Language Communications. “The image software has to recognize individual letters so they can be expressed in Unicode. If the image software can’t do it, no language analysis can be done.”
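Bender's point about Unicode can be illustrated with a minimal sketch: once the image-recognition stage has identified individual Arabic letters, each one corresponds to a standard Unicode code point that downstream language analysis can work with. This is a hypothetical illustration, not MADCAT's actual pipeline; the `recognized` list stands in for output that real image software would have to produce.

```python
# Hypothetical output of an image-recognition stage: the individual
# Arabic letters of the word "salam" (peace), in logical order.
recognized = ["س", "ل", "ا", "م"]

# Each recognized letter maps to a standard Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in recognized]
print(code_points)  # ['U+0633', 'U+0644', 'U+0627', 'U+0645']

# Joined into a string, the text can be handed to any
# downstream language-analysis or translation tool.
text = "".join(recognized)
print(text)  # سلام
```

As Bender notes, this mapping step is the gate for everything else: if the image software cannot resolve the letters, no language analysis is possible.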
DARPA is setting a goal of being able to accurately translate 90 to 95 percent of the content in 95 percent of the material scanned, which is “quite a high bar,” Bender said.
On the other hand, Natarajan said, if the problems that DARPA has set out are solved over the next four or five years, “it will revolutionize the field.”
Brian Robinson is a special contributor to Defense Systems.