UKOLN Informatics Research Group » British Library http://irg.ukoln.ac.uk Expertise in digital information management Mon, 09 Dec 2013 15:09:09 +0000 en-US hourly 1 http://wordpress.org/?v=3.5.2 Working with commercial partners http://feedproxy.google.com/~r/Research360/~3/TIWkeX7nuzY/?utm_source=rss&utm_medium=rss&utm_campaign=working-with-commercial-partners http://feedproxy.google.com/~r/Research360/~3/TIWkeX7nuzY/#comments Thu, 08 Nov 2012 16:11:33 +0000 Jez Cope http://blogs.bath.ac.uk/research360/?p=352 On Monday 29 October, Cathy Pink gave an invited presentation on research data management with commercial partners at the latest DataCite workshop. The workshop was jointly run by JISC and the British Library, and focused on issues around citing sensitive data.

Cathy’s slides are now available for download from our institutional repository:

http://blogs.bath.ac.uk/research360/2012/11/working-with-commercial-partners/feed/ 0
Reflections on data sharing http://www.dcc.ac.uk/blog/reflections-data-sharing?utm_source=rss&utm_medium=rss&utm_campaign=reflections-on-data-sharing http://www.dcc.ac.uk/blog/reflections-data-sharing#comments Tue, 03 Jul 2012 10:41:43 +0000 Michael Day http://irg.ukoln.ac.uk/?guid=20231be8b793d7f69bf446b0c93da8b3 In its outreach and training activities, the DCC has had a long-term interest in reaching new generations of research staff, including doctoral students and those classed as early career researchers. 


http://irg.ukoln.ac.uk/2012/07/03/reflections-on-data-sharing/feed/ 0
IMPACT Final Conference – Blog-index http://impactocr.wordpress.com/2011/10/26/impact-final-conference-links-to-presentations/?utm_source=rss&utm_medium=rss&utm_campaign=impact-final-conference-blog-index http://impactocr.wordpress.com/2011/10/26/impact-final-conference-links-to-presentations/#comments Wed, 26 Oct 2011 08:00:10 +0000 impacteib http://impactocr.wordpress.com/?p=1221

The whole conference was blogged and photographed, with presentations uploaded to Slideshare and videos to Vimeo.

These are also embedded within the blogs on this site.

This post contains direct links to all posts made at the Final Conference.  Please do feel free to add comments or thoughts below the posts.

Monday 24 October 2011

BLOCK 1: OPERATIONAL CONTEXT

BLOCK 2: FRAMEWORK AND EVALUATION

BLOCK 3: TOOLS FOR IMPROVED TEXT RECOGNITION

Tuesday 25 October 2011

BLOCK 4: LANGUAGE TOOLS AND RESOURCES

BLOCK 5: IMPACT CENTRE OF COMPETENCE

PARALLEL SESSIONS

  • Research Session: Presentation and discussion of state-of-the-art research tools for document analysis and OCR, hosted by Apostolos Antonacopoulos (University of Salford).
  • Language Session: Presentation and demonstration of the IMPACT language tools & resources in further detail, hosted by Katrien Depuydt (INL).
  • Digitisation Tips Session: Meet the expert: questions & answers on digitisation issues, hosted by Aly Conteh (The British Library)

http://impactocr.wordpress.com/2011/10/26/impact-final-conference-links-to-presentations/feed/ 0
IMPACT Final Conference – Research Parallel Sessions Brief Summary http://impactocr.wordpress.com/2011/10/25/impact-final-conference-%E2%80%93-research-parallel-sessions-brief-summary/?utm_source=rss&utm_medium=rss&utm_campaign=impact-final-conference-research-parallel-sessions-brief-summary http://impactocr.wordpress.com/2011/10/25/impact-final-conference-%E2%80%93-research-parallel-sessions-brief-summary/#comments Tue, 25 Oct 2011 18:27:52 +0000 Nora Daly http://impactocr.wordpress.com/?p=1119 Parallel Session 1: Research was dedicated to presentations and discussions around the state-of-the-art research tools for document analysis developed via the IMPACT Project.

As you might guess from the slides below, the information packed into these presentations could fill a whole new two-day conference!  But for now, a brief summary will have to suffice and I will implore you to visit the tools section of the freshly launched Impact: Centre of Competence Website for more details.

A video of the session is available here:

Impact Tools Developed by NCSR (Basilis Gatos)

The folks at the Computational Intelligence Laboratory over at the National Centre of Scientific Research (DEMOKRITOS) in Athens focus their activity around “research & development methods, techniques and prototypes in the areas of telecommunication systems, networks, and informatics”. Involved with IMPACT since 2008, they have partnered in the production of nine software tools supporting binarisation, border removal, page splitting, page curl correction, OCR results, character segmentation and word spotting.

OCR for Typewritten Documents (Stefan Pletschacher)

Stefan explained that typewritten documents from roughly the 1870s to the 1970s pose a unique challenge to OCR. He pointed out that each character is produced on the page independently of the rest, so characters can appear with different weights, even within the same word, due to the mechanical nature of the process. Typical typewritten documents in archives are actually carbon copies with blurred type and a textured background, and administrative documents at that, rife with names, abbreviations and numbers, which render typical lexicon-based recognition approaches less useful. A system was developed in IMPACT to tackle these unique issues by incorporating background knowledge of typewritten documents, and through improved segmentation and enhancement of glyph images, while “performing language independent character recognition using specifically trained classifiers”.

Image Enhancement, Segmentation and Experimental OCR (A. Antonacopoulos)

Representing the work of PRImA (Pattern Recognition & Image Analysis Research) at the University of Salford, Apostolos demonstrated their approach to the digitisation workflow and the tools developed for image enhancement (border removal, page curl removal, correction of arbitrary warping) as well as segmentation (recognition-based and stand-alone).


http://impactocr.wordpress.com/2011/10/25/impact-final-conference-%e2%80%93-research-parallel-sessions-brief-summary/feed/ 0
IMPACT Final Conference – Language Parallel Session http://impactocr.wordpress.com/2011/10/25/language-parallel/?utm_source=rss&utm_medium=rss&utm_campaign=impact-final-conference-language-parallel-session http://impactocr.wordpress.com/2011/10/25/language-parallel/#comments Tue, 25 Oct 2011 14:23:54 +0000 Marieke Guy http://impactocr.wordpress.com/?p=1026 The language parallel session consisted of a series of presentations and demonstrations of the IMPACT language tools and was hosted by Katrien Depuydt (INL).

Named Entity Work in IMPACT: Frank Landsbergen

Frank began by defining named entities (NEs): a word or string referring to a proper location, person or organisation (or date, time, etc.). Within IMPACT the term is limited to locations, persons and organisations. The extra focus on these words is primarily because they may be of particular interest to end users and because they are usually not in dictionaries, so adding them offers a greater improvement in the lexicon and ultimately the OCR. Note that a lexicon here is a list of related entities in the database that are linked, e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.
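
To make the idea of linked variants concrete, here is a minimal sketch in Python. It is not the IMPACT lexicon format or any IMPACT tool: the dictionary layout and the function names (`build_variant_index`, `normalise`) are assumptions made purely for illustration, and only the Amsterdam variants come from the talk.

```python
from typing import Dict, List, Optional

# Hypothetical miniature NE lexicon: modern lemma -> attested historical variants.
# Only the Amsterdam example is taken from the talk; the layout is an assumption.
NE_LEXICON: Dict[str, List[str]] = {
    "Amsterdam": ["Amsterdams", "Amsteldam", "Amsteldamme"],
}


def build_variant_index(lexicon: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert lemma -> variants into a case-insensitive variant -> lemma lookup."""
    index: Dict[str, str] = {}
    for lemma, variants in lexicon.items():
        index[lemma.lower()] = lemma  # the modern form maps to itself
        for variant in variants:
            index[variant.lower()] = lemma
    return index


def normalise(token: str, index: Dict[str, str]) -> Optional[str]:
    """Return the modern lemma for a token, or None if it is not in the lexicon."""
    return index.get(token.lower())


VARIANT_INDEX = build_variant_index(NE_LEXICON)
print(normalise("Amsteldamme", VARIANT_INDEX))
```

A lookup like this is one way a search for the modern name could also retrieve pages where only a historical spelling appears.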

He then spent the majority of his talk walking us through the four-step process of building a NE Lexicon:

  1. Data collection
  2. NE tagging – Possibilities include NE extraction software, use of the Stanford University module, statistical NE recognition (the software ‘trains’ itself) or manual tagging. Many of these tools currently work best with contemporary data.
  3. Enrichment (POS tagging, lemmatizing, adding the person name structure, variants)
  4. Database creation

So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.

Special Resources to Access 16th Century Germany: Annette Gotscharek

The 16th Century German printed book collection was a special case because the resources were so old, and therefore very challenging. There were a number of special features of the historical language on a word-level. The historical variants were all added to the modern lemma.

The diachronic ground truth corpus consisted of text files transcribing what appeared on the scans. It was collected from different resources on the Web and from non-public electronic corpora. They worked on areas including the creation of a hypothetical lexicon and manual verification of the IR-lexicon.


Polish Language Resources in IMPACT: Janusz S. Bień

Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary but focused on later historical dictionaries and the sources used by these dictionaries. The earlier dictionaries were rejected because, although they contained relevant information, it was too difficult to extract. They also struggled to use other later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts including the Benedykt Chmielowski encyclopedia, which is famous for its memorable “definitions”: “Horse is as everyone can see.”

They looked at a number of tools including the lemmatization in INL Lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl

Slovene Language Resources in IMPACT: Tomaž Erjavec

Tomaž previously worked on the AHLib project looking at transcription correction and markup. At the same time as their work on the IMPACT project they also won a Google award so have been able to develop language models for historical Slovene.

Their methodology has been to develop three resources: transcribed texts, a hand-annotated corpus and a lexicon of historical words. They have also developed the annotation tool ToTrTaLe, which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in historical language), variability and extinct words. During the project they have transcribed over 10 million words, comprising the AHLib corpus/DL, NUK GTD, Google Books and Wikisource; all are freely available.

A video of the session is here:


http://impactocr.wordpress.com/2011/10/25/language-parallel/feed/ 0
IMPACT Final Conference – Digitisation Tips Parallel Session http://impactocr.wordpress.com/2011/10/25/digitisation-tips-parallel-session/?utm_source=rss&utm_medium=rss&utm_campaign=impact-final-conference-digitisation-tips-parallel-session http://impactocr.wordpress.com/2011/10/25/digitisation-tips-parallel-session/#comments Tue, 25 Oct 2011 14:14:17 +0000 Greta http://impactocr.wordpress.com/?p=1029 Aly Conteh (BL) hosted the parallel Q&A session on Digitisation tips. Panel members included Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects Department, KB National Library of the Netherlands), Geneviève Cron (OCR expert, Bibliothèque nationale de France), Christa Müller (Director Digital Services Department, Austrian National Library), Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland) and Alena Kavčič – Colic (Head of the Research and Development Unit, National and University Library of Slovenia).

Here are some of the questions and issues addressed:

  • Q: How do Digital Libraries exist without METS/ALTO and how do you support retrospective production of OCR? A: You always have to try to find the best solution in terms of time, cost, scope and user needs. Currently, some libraries use only ALTO as it better suits the project’s needs. Standards like ALTO, however, don’t always support certain features. While new standards releases are being reviewed and will soon be published, it is paramount that libraries evaluate their data and resources and adopt the necessary measures accordingly. The problem of continuous migration due to updated standards will remain for as long as we digitise. If, however, OCR is in raw, plain text, retrofitting it into METS/ALTO is encouraged as it transforms the users’ experience of working with that information. It is relatively straightforward and not highly technical, but it does, of course, need some financial support.
  • Q: Many libraries digitising newspaper collections clean the head titles of their documents. Will this still happen in the future? Why insist on cleaning head titles rather than invest in digitising more pages? A: Good point! Some libraries have already stopped cleaning headings in favour of digitising a larger number of pages. However, the higher the accuracy of article titles, the more relevant the articles returned for a search term. On the other hand, correcting the OCR of headings does cost more and it limits the number of pages you can digitise. It comes down to choices. A possible solution: invest the money in good scans, let the software do the OCR and live with the automated results. Do not spend money and time on manual correction. And remember to always consult your user community.
  • Q: How do you measure capture quality when you lack ground truth? A: It is impossible to ground truth everything in a digitisation project, but what you could do is sample some of the pages rather than check every single one. In addition, OCR engines do report a certain level of confidence in their results.
  • Q: What are library priorities for the next 10 years? A: To obtain copyright exceptions and extended licensing so that we can publish material currently protected. As regards the British Library, current and near future projects include theatre playbills, newspaper collections, maps and medieval manuscripts.
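
The sampling approach suggested in the answer on capture quality can be sketched roughly as follows. This is an illustration under stated assumptions, not a method presented by the panel: the function names, the use of character-level edit distance as the accuracy measure, and the `transcribe` callback (standing in for a person manually keying a sampled page) are all hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """1 minus the character error rate, clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def sample_page_ids(page_ids: Iterable[str], sample_size: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample of pages to ground truth by hand."""
    pages = list(page_ids)
    return random.Random(seed).sample(pages, min(sample_size, len(pages)))


def estimate_accuracy(ocr_pages: Dict[str, str],
                      transcribe: Callable[[str], str],
                      sample_size: int,
                      seed: int = 0) -> float:
    """Mean character accuracy over a random sample of pages.

    `transcribe` is a hypothetical callback returning the manual
    transcription of a page; only the sampled pages are ever passed to it.
    """
    sampled = sample_page_ids(sorted(ocr_pages), sample_size, seed)
    if not sampled:
        raise ValueError("no pages to sample")
    scores = [character_accuracy(ocr_pages[p], transcribe(p)) for p in sampled]
    return sum(scores) / len(scores)
```

The point of the design is that manual transcription effort scales with the sample size rather than with the size of the collection; how large a sample is needed depends on how much the estimate is allowed to vary.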

http://impactocr.wordpress.com/2011/10/25/digitisation-tips-parallel-session/feed/ 0
Introducing the partners: The British Library http://blogs.ukoln.ac.uk/patientsparticipate/2011/03/30/introducing-the-partners-the-british-library/?utm_source=rss&utm_medium=rss&utm_campaign=introducing-the-partners-the-british-library&utm_source=rss&utm_medium=rss&utm_campaign=introducing-the-partners-the-british-library http://blogs.ukoln.ac.uk/patientsparticipate/2011/03/30/introducing-the-partners-the-british-library/?utm_source=rss&utm_medium=rss&utm_campaign=introducing-the-partners-the-british-library#comments Wed, 30 Mar 2011 09:24:52 +0000 Monica Duke http://blogs.ukoln.ac.uk/patientsparticipate/?p=103 The stakeholders for Patients Participate! are spread across different communities: patients, charities, funders of research, researchers and policy makers (among others). The partnership that has been put together to carry out the project reflects the diverse interests that are represented in this project. Although all the organisations involved are household names in their own community, we recognise that they may not be so well known in other disciplines. The first posts in this blog represent a mini-series that we have put together to introduce the different parties, and their interest in the project. Lee-Ann Coleman starts us off by introducing The British Library.

One of the British Library’s strategic priorities is to ‘Enable access to everyone who wants to do research’; this project adds another dimension to the role of an information provider in achieving that aim. The Library has been the lead partner in a consortium delivering UK PubMed Central – the free-to-access archive of biomedical research papers – for the past five years. This service is supported by the eight major funders of biomedical research in the UK, with the intention that the research that they support (either through the taxpayer or the donating public) should be available to anyone without the barrier of a further charge to access the information. There are over 2 million journal articles in UKPMC and it provides access to a further 24 million abstracts; all anyone needs is a computer and internet access – and a degree in biomedical science! Even those working in the field struggle to keep up with the volume of new information, and if the research area is outside one’s expertise, it can often take significant time and effort to understand the context and the key messages contained within a research article.

So the question is: can we involve the public and researchers in making sense of scientific concepts and terminologies to help them to understand the current state of research in specific areas? Stem cell research is an area that has a high profile in the media, with promises of cures for a range of diseases and the possibility of regenerating or replacing tissues and even whole organs. We are looking forward to working with patients, charities and researchers to better understand the barriers to extracting information that is useful to them from scientific papers, and to assessing the feasibility of building a corpus of quality content that is truly open to all.

Lee-Ann Coleman is Head of Scientific, Technical and Medical Information at the British Library.

http://blogs.ukoln.ac.uk/patientsparticipate/2011/03/30/introducing-the-partners-the-british-library/feed/atom/ 0