BUFFALO, N.Y. (AP) – Computer scientists are at work on software to scan Arabic documents, even handwritten ones, for specific words or phrases, technology its developers say could aid in intelligence gathering.

The development of optical character recognition software for handwritten and machine-printed documents will fill a technology gap that has become evident as study of the language has increased since 9/11, the computer scientists said.

The software also is meant to expand access to modern and ancient Arabic manuscripts.

“The whole Internet is skewed toward people who speak English,” said Venu Govindaraju, director of the Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo, where the software is being developed. “The fear is that if an OCR is not developed for a particular language, then all the classic texts in that language will disappear into oblivion.”

Spoken by roughly 240 million people, Arabic is among the most widely spoken languages in the world and, for millions of Muslims, is the language of religious texts.

“Suppose you have several thousand Arabic documents and you want them scanned for specific keywords so that you can narrow down the number of documents that must be reviewed manually,” Govindaraju said. “Right now, this cannot be done.”

The Buffalo researchers have received $240,000 in funding from the federal Director of Central Intelligence Postdoctoral Research Fellowship Program for the project, which also will allow Arabic documents to be digitized and posted on the Web.

OCR software trains the computer to interpret the images of an alphabet based on scanned images of characters or words recorded by humans who have examined the original images. OCR systems can read text in a large variety of fonts, but handwritten text has proved challenging.

Arabic presents its own challenges, Govindaraju said, because characters may take different forms depending on where within a word they appear, and Arabic vowels are pronounced but often not written.

Govindaraju helped develop OCR software for handwritten addresses in English.

AP-ES-01-20-05 1545EST

