Making Scanned Content Accessible – Building a Digital Newspaper Archive Using OCR Scanning
Posted by Ritesh Sanghani | Posted on: September 5th, 2014
Digitization has come of age and gained pace as a structured, continuous process. With demand for large-scale newspaper conversion projects increasing, mass digitization is gaining popularity.
OCR (optical character recognition) is a widely used technique that gives digitized newspapers better search and retrieval functionality.
We live in an age where scanning has become much easier, and large-scale digitization is routine for many organizations. But this poses a new challenge: generating the metadata needed to make these items discoverable. Nearly all of us nowadays use search engines to find relevant information on topics of our choice. However, carefully produced and meticulously preserved images in TIFF format are not visible to text-based searches.
OCR scanning comes to the rescue by mining machine-searchable text from an image, while the original images can still be displayed to human readers. Because OCR output contains recognition errors, the search hit rate will be lower than with perfect text, but that is still a significant improvement over completely unsearchable images. The current trend is to put information and images in the public domain, online.
This is one of the reasons why newspaper digitization has become so popular. Another is that newspapers appeal to a large audience. But archiving and preserving hard copies is a mammoth task. Digitization allows reams of newspaper print to be stored and made accessible to a worldwide audience. Moreover, retrieving data from digital copies is much easier than sifting through stacks of newspapers; it’s just a click away!
Digital Newspaper Archive Using OCR Scanning:
Here, it is important to understand that newspaper digitization is different from, and more challenging than, digitizing a book. Books have a simple format with a consistent font type and size. Newspapers have more complex formatting: multiple columns, images, text wrapping, headings, advertisements and more negative space than books. Hence producing excellent OCR output for a newspaper conversion is challenging, and the difficulty increases manifold for historic newspapers.
The quality of a digitized newspaper largely depends on the condition of the copy being digitized. If the newspaper is stained, worn from years of handling, or badly printed, or if the letters have faded, OCR quality suffers. Gutter shadow on newspapers that are bundled up can also result in a poor-quality digitized copy.
Making Newspaper Images Searchable:
Newspapers contain a lot of images and advertisements. These also need to be indexed and made searchable. OCR text can be generated from pictures using commercial software packages and custom-made software modules, individually or in combination. Raw OCR data can also be post-processed and improved by integrating lexicons, terminology lists or dictionaries.
Because OCR does not yield the best results for historic newspapers that are frayed and worn, a proofreading and manual correction process can be carried out to ensure highly accurate, error-free digital copies. This is a straightforward approach; however, it requires continuous human intervention by experienced, expert proofreaders, is laborious and, as a result, expensive.
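To make the lexicon idea concrete, here is a minimal Python sketch of dictionary-based post-correction. It is only an illustration under stated assumptions: the function name `correct_ocr_text`, the tiny lexicon and the 0.8 similarity cutoff are all hypothetical, and production systems typically use much larger word lists and more context-aware methods.

```python
import difflib
import re


def correct_ocr_text(text, lexicon, cutoff=0.8):
    """Replace words missing from the lexicon with their closest lexicon entry.

    `lexicon` is any iterable of known words; `cutoff` (an assumed default)
    is the minimum similarity ratio required to accept a replacement.
    """
    vocabulary = sorted({word.lower() for word in lexicon})
    known = set(vocabulary)

    def fix(match):
        word = match.group(0)
        lower = word.lower()
        if lower in known:
            return word  # already a known word, leave untouched
        candidates = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
        if not candidates:
            return word  # nothing similar enough, keep the raw OCR token
        best = candidates[0]
        # Preserve a leading capital letter from the OCR token.
        return best.capitalize() if word[0].isupper() else best

    # Only alphabetic runs are treated as words in this sketch.
    return re.sub(r"[A-Za-z]+", fix, text)


if __name__ == "__main__":
    sample_lexicon = ["the", "newspaper", "archive"]  # hypothetical word list
    print(correct_ocr_text("The newspapcr archive", sample_lexicon))
```

Tokens with no sufficiently similar lexicon entry are left unchanged, so names and rare words that the lexicon does not cover are not silently rewritten.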
OCR for Foreign Languages:
When newspapers are scanned using OCR technology, the text is recognized using pattern recognition software. This software compares scanned characters with the character shapes in a built-in dictionary. Choose OCR software that supports the regional language of the newspaper you want digitized. In this case, however, it becomes much harder to get your digitized copies proofread, as you need to find regional-language experts to do the job.
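The comparison of scanned characters against stored shapes can be pictured with a toy Python sketch. Everything here is an assumption made for illustration: the 3x5 bitmaps, the three-letter `TEMPLATES` dictionary and the pixel-agreement score are hypothetical, and real OCR engines rely on far richer features and models. The point it shows is why language support matters: a character can only be recognized if its shape is in the dictionary.

```python
# Hypothetical 3x5 bitmaps standing in for a built-in dictionary of shapes.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two same-sized bitmaps."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += g == t
    return same / total


def recognize(glyph, templates=TEMPLATES):
    """Return the character whose stored shape best matches the scanned glyph."""
    return max(templates, key=lambda ch: similarity(glyph, templates[ch]))


if __name__ == "__main__":
    # A "T" whose bottom pixel has faded, as might happen on a worn page.
    faded_t = ("111", "010", "010", "010", "000")
    print(recognize(faded_t))
```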
Benefits of OCR Enabled Scanning For Newspapers:
OCR (optical character recognition) is a technology that converts a newspaper copy into text by recognizing the shape of individual characters. This means there is no need to invest in manual transcription. With an OCR scan of newspaper pages, you get well-formatted digital newspapers that can be indexed and searched, and from which the required data can be easily retrieved. Newspapers, especially historic ones, are a rich source of information that is otherwise hard to retrieve. Scanning them with OCR technology and archiving them makes this rich information source available to numerous readers and enthusiasts.
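As a rough sketch of what "indexed and searched" can mean in practice, the Python example below builds a simple inverted index from OCR text to page images, so a text query returns the scanned pages to display. The function names and the sample file names are hypothetical, and a real archive would normally use a dedicated search engine rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page images whose OCR text contains it.

    `pages` maps an image file name to the OCR text extracted from that image.
    """
    index = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page images whose OCR text contains every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical file names and OCR text, for illustration only.
    pages = {
        "1914-08-05_p1.tif": "War declared in Europe",
        "1914-08-05_p2.tif": "Local market prices; war bonds",
    }
    index = build_index(pages)
    print(search(index, "war europe"))
```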
Image Credit: http://www.erecordsusa.com/wp-content/uploads/2012/05/Newspaper-Scanning.jpg