The whole conference was blogged and photographed with presentations uploaded to Slideshare and videos to Vimeo.
These are also embedded within the blogs on this site.
This post contains direct links to all posts made at the Final Conference. Please do feel free to add comments or thoughts below the posts.
BLOCK 2: FRAMEWORK AND EVALUATION
BLOCK 3: TOOLS FOR IMPROVED TEXT RECOGNITION
As you might guess from the slides below, the information packed into these presentations could fill a whole new two-day conference! But for now a brief summary will have to suffice, and I implore you to visit the tools section of the freshly launched Impact: Centre of Competence website for more details.
A video of the session is available here:
Impact Tools Developed by NCSR (Basilis Gatos)
The folks at the Computational Intelligence Laboratory over at the National Centre of Scientific Research (DEMOKRITOS) in Athens focus their activity on “research & development methods, techniques and prototypes in the areas of telecommunication systems, networks, and informatics”. Involved with IMPACT since 2008, they have partnered in the production of nine software tools supporting binarisation, border removal, page splitting, page curl correction, OCR results, character segmentation and word spotting.
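To give a flavour of what a binarisation step actually does, here is a minimal sketch of Otsu-style global thresholding. This is my own illustration in Python, not NCSR's implementation; the function name and array conventions are assumptions.

```python
import numpy as np

def otsu_binarise(gray):
    """Binarise a grayscale page image (2-D uint8 array) by picking the
    threshold that maximises between-class variance (Otsu's method).
    Illustrative only -- not the NCSR/IMPACT tool."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]                      # background pixel count so far
        if w_bg == 0 or w_bg == total:
            continue
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        mu_bg = sum_bg / w_bg                # background mean intensity
        mu_fg = (sum_all - sum_bg) / w_fg    # foreground mean intensity
        var_between = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    # White background, black text -- the usual input for OCR engines.
    return (gray > best_t).astype(np.uint8) * 255
```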
OCR for Typewritten Documents (Stefan Pletschacher)
Stefan explained that typewritten documents, from roughly the 1870s to the 1970s, pose a unique challenge to OCR. He points out that each character is actually produced on the page independently of the rest, and characters can appear with different weights, even within the same word, due to the mechanical nature of the process. Typical typewritten documents in archives are actually carbon copies with blurred type and a textured background, and administrative documents at that, rife with names, abbreviations and numbers, which render typical lexicon-based recognition approaches less useful. A system was developed in IMPACT to tackle these unique issues by incorporating background knowledge of typewritten documents, and through improved segmentation and enhancement of glyph images, while “performing language independent character recognition using specifically trained classifiers”.
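As a toy illustration of the “specifically trained classifiers” idea (and emphatically not the IMPACT system itself), a nearest-neighbour glyph classifier might look like this; the function name and the raw pixel-difference feature are my own assumptions:

```python
import numpy as np

def classify_glyph(glyph, train_glyphs, train_labels):
    """Toy nearest-neighbour glyph classifier, sketching language-independent
    character recognition from labelled training glyphs.
    glyph: 2-D binary array, already segmented and size-normalised.
    train_glyphs: array of shape (n, h, w); train_labels: list of n chars."""
    diffs = (train_glyphs.astype(int) - glyph.astype(int)) ** 2
    distances = diffs.reshape(len(train_glyphs), -1).sum(axis=1)
    return train_labels[int(np.argmin(distances))]
```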
Image Enhancement, Segmentation and Experimental OCR (A. Antonacopoulos)
Representing the work of PRImA (Pattern Recognition and Image Analysis Research) at the University of Salford, Apostolos demonstrated their approach to the digitisation workflow and the tools developed for image enhancement (border removal, page curl removal, correction of arbitrary warping) as well as segmentation (recognition-based and stand-alone).
Named Entity Work in IMPACT: Frank Landsbergen
Frank began by defining named entities (NEs): a word or string referring to a proper name, such as a location, person or organisation (or date, time, etc.). Within IMPACT the term is limited to locations, persons and organisations. The extra focus on these words is primarily because they may be of particular interest to end users and because they are usually not in dictionaries, so including them yields greater improvement in the lexicon and ultimately in the OCR. Note that here a lexicon is a list of entities in the database whose variant forms are linked, e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.
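A minimal sketch of what such a variant-linked lexicon might look like in code, following the Amsterdam example above; the data structure and function names are my own illustration, not IMPACT's actual schema:

```python
# Toy variant-linked NE lexicon: historical spellings map to a modern lemma.
NE_LEXICON = {
    "Amsterdams":  "Amsterdam",
    "Amsteldam":   "Amsterdam",
    "Amsteldamme": "Amsterdam",
}

def canonical_form(token):
    """Map a historical spelling (e.g. from OCR output) to its modern lemma;
    unknown tokens are returned unchanged."""
    return NE_LEXICON.get(token, token)

print(canonical_form("Amsteldamme"))  # -> Amsterdam
```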
He then spent the majority of his talk walking us through the four-step process of building an NE lexicon.
So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.
Special Resources to Access 16th Century Germany: Annette Gotscharek
The 16th-century German printed book collection was a special case because the resources were so old, and therefore very challenging. The historical language had a number of special features at the word level, and the historical variants were all added to the modern lemma.
The diachronic ground-truth corpus consisted of text files of exactly what appeared on the scans, collected from different resources on the Web and from non-public electronic corpora. Their work included the creation of a hypothetical lexicon and manual verification of the IR lexicon.
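One way to picture the hypothetical lexicon idea is as rewrite rules that turn modern lemmas into candidate historical spellings. The sketch below uses a few well-known early modern German patterns (t→th as in Teil/Theil, ei→ey as in sein/seyn, u→v as in und/vnd), but the rule set and code are my own illustration, not the project's tool:

```python
# Illustrative rewrite rules (modern -> historical) for early modern German.
RULES = [("t", "th"), ("ei", "ey"), ("u", "v")]

def hypothetical_variants(modern_word):
    """Apply each rewrite rule everywhere it matches and collect the
    resulting candidate historical spellings."""
    variants = set()
    for old, new in RULES:
        if old in modern_word:
            variants.add(modern_word.replace(old, new))
    return variants

print(hypothetical_variants("sein"))  # -> {'seyn'}
print(hypothetical_variants("und"))   # -> {'vnd'}
```

Candidates generated this way can then be checked against the ground-truth corpus, which is where the manual verification step comes in.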
Polish Language Resources in IMPACT: Janusz S. Bień
Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary, but focused on later historical dictionaries and the sources those dictionaries drew on; the earliest dictionaries were rejected because, although they held relevant information, it was too difficult to extract. They also struggled to use other, later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts, including the Benedykt Chmielowski encyclopedia, which is famous for its memorable “definitions”: “Horse is as everyone can see.”
They looked at a number of tools, including the lemmatization in the INL lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl
Slovene Language Resources in IMPACT: Tomaž Erjavec
Tomaž previously worked on the AHLib project, looking at transcription correction and markup. Alongside their work on the IMPACT project, his team also won a Google award, so they have been able to develop language models for historical Slovene.
Their methodology has been to develop three resources: transcribed texts, a hand-annotated corpus and a lexicon of historical words. They have also developed an annotation tool, ToTrTaLe, which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in the historical language), variability and extinct words. During the project they have transcribed over 10 million words, comprising the AHLib corpus/DL, NUK GTD, Google Books and Wikisource material – all freely available.
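To make the tokenisation issue concrete: historical text may write as two tokens what modern Slovene spells as one word, and a lexicon-driven rejoining pass is one way to handle this. The sketch below is my own illustration of the problem, not ToTrTaLe; the example words and lexicon are assumptions.

```python
# Historical text may split words ("za to") that modern Slovene writes as
# one word ("zato"). Merge adjacent tokens when the concatenation is a
# known modern form. Illustrative only -- not the ToTrTaLe tool.
MODERN_FORMS = {"zato"}  # toy modern lexicon

def rejoin_split_tokens(tokens):
    """Merge adjacent tokens whose concatenation is a known modern word."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in MODERN_FORMS:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(rejoin_split_tokens(["za", "to", "gre"]))  # -> ['zato', 'gre']
```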
A video of the session is here:
Here are some of the questions and issues addressed: