The whole conference was blogged and photographed with presentations uploaded to Slideshare and videos to Vimeo.
These are also embedded within the blogs on this site.
This post contains direct links to all posts made at the Final Conference. Please do feel free to add comments or thoughts below the posts.
BLOCK 2: FRAMEWORK AND EVALUATION
BLOCK 3: TOOLS FOR IMPROVED TEXT RECOGNITION
As you might guess from the slides below, the information packed into these presentations could fill a whole new two-day conference! But for now a brief summary will have to suffice, and I implore you to visit the tools section of the freshly launched Impact: Centre of Competence website for more details.
A video of the session is available here:
Impact Tools Developed by NCSR (Basilis Gatos)
The folks at the Computational Intelligence Laboratory over at the National Centre of Scientific Research (DEMOKRITOS) in Athens focus their activity on “research & development methods, techniques and prototypes in the areas of telecommunication systems, networks, and informatics”. Involved with IMPACT since 2008, they have partnered in the production of nine software tools supporting binarisation, border removal, page splitting, page curl correction, OCR results, character segmentation and word spotting.
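To give a flavour of what a binarisation step actually does, here is a minimal sketch of Otsu-style global thresholding. This is my own illustration in Python, not NCSR's implementation; the function name and array conventions are assumptions.

```python
import numpy as np

def otsu_binarise(gray):
    """Binarise a grayscale page image (2-D uint8 array) by picking the
    threshold that maximises between-class variance (Otsu's method).
    Illustrative only -- not the NCSR/IMPACT tool."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]                      # background pixel count so far
        if w_bg == 0 or w_bg == total:
            continue
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        mu_bg = sum_bg / w_bg                # background mean intensity
        mu_fg = (sum_all - sum_bg) / w_fg    # foreground mean intensity
        var_between = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    # White background, black text -- the usual input for OCR engines.
    return (gray > best_t).astype(np.uint8) * 255
```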
OCR for Typewritten Documents (Stefan Pletschacher)
Stefan explained that typewritten documents, from roughly the 1870s to the 1970s, pose a unique challenge to OCR. He points out that each character is actually produced on the page independently of the rest, and characters can appear with different weights, even within the same word, due to the mechanical nature of the process. Typical typewritten documents in archives are actually carbon copies with blurred type and a textured background, and administrative documents at that, rife with names, abbreviations and numbers, which render typical lexicon-based recognition approaches less useful. A system was developed in IMPACT to tackle these unique issues by incorporating background knowledge of typewritten documents, and through improved segmentation and enhancement of glyph images, while “performing language independent character recognition using specifically trained classifiers”.
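As a toy illustration of the “specifically trained classifiers” idea (and emphatically not the IMPACT system itself), a nearest-neighbour glyph classifier might look like this; the function name and the raw pixel-difference feature are my own assumptions:

```python
import numpy as np

def classify_glyph(glyph, train_glyphs, train_labels):
    """Toy nearest-neighbour glyph classifier, sketching language-independent
    character recognition from labelled training glyphs.
    glyph: 2-D binary array, already segmented and size-normalised.
    train_glyphs: array of shape (n, h, w); train_labels: list of n chars."""
    diffs = (train_glyphs.astype(int) - glyph.astype(int)) ** 2
    distances = diffs.reshape(len(train_glyphs), -1).sum(axis=1)
    return train_labels[int(np.argmin(distances))]
```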
Image Enhancement, Segmentation and Experimental OCR (A. Antonacopoulos)
Representing the work of PRImA (Pattern Recognition and Image Analysis Research) at the University of Salford, Apostolos demonstrated their approach to the digitisation workflow and the tools developed for image enhancement (border removal, page curl removal, correction of arbitrary warping) as well as segmentation (recognition-based and stand-alone).
Named Entity Work in IMPACT: Frank Landsbergen
Frank began by defining named entities (NEs): a word or string referring to a proper name, such as a location, person or organisation (or date, time, etc.). Within IMPACT the term is limited to locations, persons and organisations. The extra focus on these words is primarily because they may be of particular interest to end users and because they are usually not in dictionaries, so including them yields greater improvement in the lexicon and ultimately in the OCR. Note that here a lexicon is a list of entities in the database whose variant forms are linked, e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.
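A minimal sketch of what such a variant-linked lexicon might look like in code, following the Amsterdam example above; the data structure and function names are my own illustration, not IMPACT's actual schema:

```python
# Toy variant-linked NE lexicon: historical spellings map to a modern lemma.
NE_LEXICON = {
    "Amsterdams":  "Amsterdam",
    "Amsteldam":   "Amsterdam",
    "Amsteldamme": "Amsterdam",
}

def canonical_form(token):
    """Map a historical spelling (e.g. from OCR output) to its modern lemma;
    unknown tokens are returned unchanged."""
    return NE_LEXICON.get(token, token)

print(canonical_form("Amsteldamme"))  # -> Amsterdam
```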
He then spent the majority of his talk walking us through the four-step process of building an NE lexicon.
So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.
Special Resources to Access 16th Century Germany: Annette Gotscharek
The 16th-century German printed book collection was a special case because the resources were so old, and therefore very challenging. The historical language had a number of special features at the word level, and the historical variants were all added to the modern lemma.
The diachronic ground-truth corpus consisted of text files of exactly what appeared on the scans, collected from different resources on the Web and from non-public electronic corpora. Their work included the creation of a hypothetical lexicon and manual verification of the IR lexicon.
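One way to picture the hypothetical lexicon idea is as rewrite rules that turn modern lemmas into candidate historical spellings. The sketch below uses a few well-known early modern German patterns (t→th as in Teil/Theil, ei→ey as in sein/seyn, u→v as in und/vnd), but the rule set and code are my own illustration, not the project's tool:

```python
# Illustrative rewrite rules (modern -> historical) for early modern German.
RULES = [("t", "th"), ("ei", "ey"), ("u", "v")]

def hypothetical_variants(modern_word):
    """Apply each rewrite rule everywhere it matches and collect the
    resulting candidate historical spellings."""
    variants = set()
    for old, new in RULES:
        if old in modern_word:
            variants.add(modern_word.replace(old, new))
    return variants

print(hypothetical_variants("sein"))  # -> {'seyn'}
print(hypothetical_variants("und"))   # -> {'vnd'}
```

Candidates generated this way can then be checked against the ground-truth corpus, which is where the manual verification step comes in.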
Polish Language Resources in IMPACT: Janusz S. Bień
Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary, but focused on later historical dictionaries and the sources those dictionaries drew on; the earliest dictionaries were rejected because, although they held relevant information, it was too difficult to extract. They also struggled to use other, later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts, including the Benedykt Chmielowski encyclopedia, which is famous for its memorable “definitions”: “Horse is as everyone can see.”
They looked at a number of tools, including the lemmatization in the INL lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl
Slovene Language Resources in IMPACT: Tomaž Erjavec
Tomaž previously worked on the AHLib project, looking at transcription correction and markup. Alongside their work on the IMPACT project, his team also won a Google award, so they have been able to develop language models for historical Slovene.
Their methodology has been to develop three resources: transcribed texts, a hand-annotated corpus and a lexicon of historical words. They have also developed an annotation tool, ToTrTaLe, which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in the historical language), variability and extinct words. During the project they have transcribed over 10 million words, comprising the AHLib corpus/DL, NUK GTD, Google Books and Wikisource material – all freely available.
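To make the tokenisation issue concrete: historical text may write as two tokens what modern Slovene spells as one word, and a lexicon-driven rejoining pass is one way to handle this. The sketch below is my own illustration of the problem, not ToTrTaLe; the example words and lexicon are assumptions.

```python
# Historical text may split words ("za to") that modern Slovene writes as
# one word ("zato"). Merge adjacent tokens when the concatenation is a
# known modern form. Illustrative only -- not the ToTrTaLe tool.
MODERN_FORMS = {"zato"}  # toy modern lexicon

def rejoin_split_tokens(tokens):
    """Merge adjacent tokens whose concatenation is a known modern word."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in MODERN_FORMS:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(rejoin_split_tokens(["za", "to", "gre"]))  # -> ['zato', 'gre']
```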
A video of the session is here:
Here are some of the questions and issues addressed: