Creating a historical newspaper archive

The digitisation of archives is helping democratise the study of the past. But the process is more lengthy and complex than you might think.


Transforming historical research

Digital archives have transformed the landscape of historical research, opening up access to rare material which in physical form can be locked away in the special collections of a single library.

But digitising a newspaper archive is not simply a case of scanning the pages and putting them on the internet. Behind each digital newspaper archive is a mammoth project involving editorial selection, content processing conundrums and a wide variety of bespoke technical decisions.

The Daily Mail Historical Archive – key challenges

Sourcing the material

To begin with, simply deciding which edition to digitise can be complex. Not only do newspapers publish several hard copy editions every day, but they publish regional issues too, including, in the case of The Daily Mail, an Atlantic Edition for ocean-going vessels! Which edition do you decide is ‘authoritative’?

Other editorial and technical questions you have to resolve include:

  • how do you plug content gaps (e.g. the newspaper was not printed in England during the General Strike)?
  • do you use microfilm or hard copy originals?
  • what’s the best resolution for image capture?
  • what is your content cut-off point?

Automated processes and human intervention

For the next stages, you need large teams of people to assist with the creation of digital data (OCR, metadata, XML) as well as quality assurance processes. Scanned pages of a newspaper are simply a form of photograph – a picture of the text – and consequently of limited use in and of themselves. Without data that supports those images, the scanned pages are not searchable or discoverable in a digital environment. The creation of such data is the key component of any digital archive project. It powers the functionality that allows users to search, retrieve and browse the hundreds of thousands of pages.
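
As a rough illustration of what this kind of supporting data might look like, the Python sketch below builds a small XML record for a single article. The element names (article, title, date, page, text), the function name build_article_record and all the sample values are made up for the example; they are not the schema or data used for the Daily Mail Historical Archive.

```python
import xml.etree.ElementTree as ET


def build_article_record(article_id, title, date, page, ocr_text):
    """Return an XML string describing one article on a scanned page.

    The element names here are hypothetical, chosen only for illustration.
    """
    article = ET.Element("article", attrib={"id": article_id})
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "date").text = date
    ET.SubElement(article, "page").text = str(page)
    ET.SubElement(article, "text").text = ocr_text
    return ET.tostring(article, encoding="unicode")


# Placeholder values, not taken from any real newspaper issue.
record = build_article_record(
    "example-0001",
    "Example headline",
    "1900-01-01",
    3,
    "Example OCR text for this article.",
)
print(record)
```

With a record like this attached to each scanned image, a search system could in principle filter by date or page and then display the matching picture of the page.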

To render the text on a scanned newspaper page searchable, we put it through a process known as optical character recognition (OCR). The text produced by the OCR process is what is actually being checked when a user enters a search term. OCR software analyses the light and dark areas of the scanned image in order to identify each alphabetic letter and numeric digit. When it recognises a character, it converts it into regular text.
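
To make the idea of searching the OCR text rather than the image a little more concrete, here is a minimal Python sketch. It assumes the OCR output is already available as plain text per page, and it uses a simple word-to-page lookup table; the page identifiers, sample sentences and function names are all invented for the example, and a real archive's search engine would be far more sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word found in the OCR text to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, ocr_text in pages.items():
        for word in tokenize(ocr_text):
            index[word].add(page_id)
    return index


def search(index, term):
    """Return the sorted page ids whose OCR text contains every word in term."""
    words = tokenize(term)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Invented OCR output for three hypothetical pages.
pages = {
    "page-001": "The strike began on Monday.",
    "page-002": "Shipping news from the Atlantic.",
    "page-003": "The str1ke continued.",  # simulated OCR misreading of "strike"
}
index = build_index(pages)
print(search(index, "strike"))  # prints ['page-001']
```

In this sketch the third page is not found for "strike", because the simulated OCR output contains "str1ke" instead. The search only ever sees the text the OCR process produced, never the picture of the page.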

No small undertaking!

Creating a digital newspaper archive is no small undertaking. Creating an archive with the careful content selection, appropriate imaging quality and richness of data required for the modern researcher is a taller order still.

Such projects are consequently a huge investment in time and resources, but by making our newspaper heritage more widely accessible and discoverable, they rescue the thoughts, words and deeds of past generations from crumbling, unnoticed, to dust. That seems a price worth paying.

Seth shared the story of the creation of the Daily Mail Historical Archive at Internet Librarian International. You can read a more detailed account on the Cengage website.