Archiving the UK web
Every time a book is published in Britain, a copy is kept for posterity in the British Library. Now the library wants to do the same with websites, but is that possible?
The British Library has asked for the law to be changed so that the UK's five copyright libraries can store and make available copies of UK websites, in the same way that they can currently store printed materials. It says that at first it would do an annual sweep of UK websites, with some important sites such as the Government's Number 10 website being archived more frequently, and reckons that would give it about 220 TB of data which would cost it about £4,000 in storage. On the face of it, that looks like a feasible and cost-effective plan.
But what does archiving the web really mean? Does it mean archiving the data inside those websites, or does it mean archiving the look and feel of the websites as they were at the time? Both computers and browsers have changed a great deal over the last few years, and a site which looked great in the year 2000 will look strange and dated today. It will probably have minuscule graphics because screen sizes were so much smaller back then, and the wider screens today will completely change the layout of the sites.
Too often sites have been "optimised" for particular browsers, and some of the sites from 1999 would be a garbled mess if viewed with the browsers of 2010. Whilst websites today tend to be better at adhering to standards, we still have to ask if contemporary websites will be usable on the computers of ten or twenty years into the future, or will the Library also have to archive copies of each era's browsers to view them with? And if they do that, will they also need to mothball some of today's computers with today's operating systems, to ensure the archived browsers will still work?
But websites are more than the things you see on the screen. These days, many websites are dynamic. Perhaps your website contains a searchable directory, and produces indexes in various formats according to what you are searching for. There could be thousands of permutations of data, but if you archive snapshots of web pages you can never reflect that complexity or be sure you are archiving all the information, and you cannot archive a working website unless you archive the server software and databases which drive it.
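To see how quickly those permutations mount up, here is a minimal sketch (the site, categories and formats are entirely made up for illustration) of a dynamic directory where every combination of filter, sort order and output format produces a distinct generated page:

```python
from itertools import product

# Hypothetical directory site: pages are generated on demand,
# one for every combination of the options below.
categories = ["books", "maps", "journals", "newspapers"]
sort_keys = ["title", "date", "author"]
formats = ["html", "rss", "csv"]

def render(category, sort_key, fmt):
    # Stand-in for the server-side code: this page only exists
    # at the moment somebody asks for it.
    return f"<{fmt}>listing of {category} sorted by {sort_key}</{fmt}>"

# A snapshot crawler would have to request each page separately.
pages = [render(*combo) for combo in product(categories, sort_keys, formats)]
print(len(pages))  # 4 * 3 * 3 = 36 pages from just three small option lists
```

Even this toy site yields 36 distinct pages; add a free-text search box and the number of possible pages becomes effectively unbounded, which is why a crawl of snapshots can never be known to be complete.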
For archivists these are thorny questions. More and more of our information now appears only in electronic form; it is transient, intangible, and much of it can disappear and be lost forever. Equally, the web is awash with pages such as tweets, blogs, Facebook profiles and flame wars in forums, not to mention the pornography and scams. Should those also be archived for ever and a day? Who is to say what the next generation will want to find out about us? And if such an archive is going to be truly useful to future historians, it is also going to need some sophisticated search indexing comparable to Google.
The more I think about this problem, the more I think the British Library is badly underestimating the cost, and the more I worry that, in years to come, it will grow into an expensive white elephant that is hard to maintain but scarcely used.
26th March 2010
This article comes from the SKILLZONE email newsletter, published monthly since January 2008, and covering topics related to technology and the internet. All articles and artwork in the SKILLZONE newsletter are original content.