Back in 1938, H G Wells made a prediction:
“The time is close at hand when any student, in any part of the world, will be able to sit with his projector in his own study at his or her convenience to examine any book, any document, in an exact replica.”
It’s odd to think that his ideas on space travel were actually adopted far more quickly, but, 75 years later, we do finally have something resembling his vision. In this post I’m going to discuss some of the most significant digitisation projects, what they offer researchers, and also explain a few of their limitations. This is based partly on a talk I gave at the Digital Dickens Conference last year.
Given its might and scope (over 30 million texts), Google Books is perhaps closest to Wells’s vision. It offers full-text searching, and collaboration with specialist libraries ensures that niche areas are represented, too. There’s no question that Google has the funds to pursue such an ambitious digitisation project, but there have been notable problems.
Firstly, publishers aren’t happy. There was, of course, the high-profile court case a couple of years ago, and Google’s potential copyright liability has been estimated at $3.6 trillion. Surely even their seemingly limitless coffers wouldn’t support a more robust class action?
Secondly, quality issues are well-publicised. In the Victorian novels I enjoy reading, key plot twists are often obscured by sloppy scanning or sinister condom-clad fingers. If it’s worth doing, it’s worth doing properly – especially when the physical artefact is sometimes destroyed or discarded after digitisation.
Run by librarians through partner institutions, the HathiTrust generally offers good-quality scans and metadata for over 11m texts. Although there is some overlap with Google Books and the Internet Archive, they do hold some unique material, too. The main problem is their ultra-conservative approach to copyright. Following a major court case in 2011, HathiTrust restricts access to most of the content. While US visitors can view the full text of anything published before 1923, those of us outside the US can mostly see only content from before 1873. It would simply cost too much to check the copyright status of every item, so they’re (understandably) erring on the side of caution.
The not-for-profit Internet Archive offers unrestricted access to over 6m books. They work with both libraries and individuals to scan a wide range of material and make it available to all. Given that anybody can upload a book, the quality does vary, but the range is impressive – particularly if you’re seeking different editions of the same book. The Internet Archive has faced a few problems with copyright infringement, but these haven’t as yet affected accessibility. I’ve found it by far the best online repository, and the built-in viewer makes it easy to browse and search the texts. Long may it continue.
Remarkably, Project Gutenberg was founded back in 1971, which makes it a Methuselah by web standards. Their collection is relatively small at around 45,000 texts, and it does tend to be weighted towards more famous authors. However, all the content is scanned using OCR (Optical Character Recognition) and cleaned up, giving readers clean text that is easily transferred to Kindles and other devices. This cleaning-up process includes proofreading by a huge team of human volunteers, ensuring good quality throughout. Problems do persist, though, such as modified spelling (e.g. British spelling ‘corrected’) and random insertions/deletions. Project Gutenberg offers the most flexible format, and users can easily send texts to Dropbox, Google Drive, or SkyDrive using the icons provided.
My favourite scanning error was a sentence that should have read “Mrs Henry Chetwynd has a great fondness for Burns”; after OCR, it became: “Mrs Henry Chetwynd has a great fondness for bums.” And the poor woman is no longer here to defend herself.
Finally, I’d like to draw attention to Book Traces, a repository very different from those described above, but just as important. Founder Andrew Stauffer is on a mission to collect unique copies of nineteenth- and twentieth-century library books (published before 1923). Some copies conceal unique material – such as annotations, letters, and pictures – and we’re in danger of losing them as libraries opt to go digital. Individuals are invited to take photos and upload them to the website, along with details of the source. A poignant example is a collection of poetry by Felicia Hemans. A grieving mother has added her own poem, in the style of Hemans, to mark the loss of her daughter.
Obviously, the range is quite small on Book Traces, but this material is invaluable for those of us interested in how books were read, and all the other uses to which they were put.
I’m sure H G Wells would be pleased that his vision has become reality. But he’d probably be disappointed that we haven’t yet travelled back in time to tell him what has been achieved.