Saving the Internet

Since the early 2000s, the Bentley Historical Library has collected 2,255 websites comprising more than 15 million individual web pages. Dallas Pillen, the Bentley Archivist for Metadata and Digital Projects, explains this important aspect of archiving in the modern world.

by Rob Havey

Q: The Bentley is mostly known for its photos, letters, and the pieces of history you can touch. Why is the Library now archiving websites?

A: Many of the things we have traditionally collected are simply no longer produced in paper form. We do get electronic materials from network transfers or from people giving us hard drives, CDs, and DVDs, but a lot of content is only published on the web. Archiving websites also lets us supplement our traditional collections and fill some gaps in our collecting; the work of student organizations, for example, is more readily documented on the web.

Q: Can you explain more about what web archiving is?

Dallas Pillen, Bentley Library website archivist.

A: Web archiving is collecting, preserving, and providing access to content that is, or once was, on the web. The process involves saving embedded images, files, style sheets, JavaScript, and the other pieces that make up the look, feel, and functionality of a page, and then providing a way for people to view the saved website as it was when we captured it.
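
To make that concrete, here is a minimal sketch in Python of the first half of that process: downloading a page’s HTML and finding the embedded resources that would also need saving. This is illustrative only, not the Bentley’s actual tooling; the AssetFinder and capture names are made up for the example, and the umich.edu address is borrowed from later in the interview.

    # A minimal sketch (not the Bentley's actual tooling) of the capture idea
    # described above: download a page's HTML, then find the embedded images,
    # style sheets, and scripts it references so they could be saved too.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class AssetFinder(HTMLParser):
        """Collect the URLs of resources embedded in a page."""
        def __init__(self):
            super().__init__()
            self.assets = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("img", "script") and attrs.get("src"):
                self.assets.append(attrs["src"])
            elif tag == "link" and attrs.get("href"):  # style sheets, icons
                self.assets.append(attrs["href"])

    def capture(url):
        html = urlopen(url).read().decode("utf-8", errors="replace")
        finder = AssetFinder()
        finder.feed(html)
        # Resolve relative references against the page's own URL.
        return html, [urljoin(url, a) for a in finder.assets]

    page, assets = capture("https://www.umich.edu")
    print(f"captured {len(page):,} characters of HTML; found {len(assets)} embedded assets")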

Q: How does the Bentley pick websites to archive?

A: That’s the job of our Field Archivists, who make the decisions about the material the Bentley collects. They identify websites that are important to preserve and that fit the Bentley’s collection development priorities. A University of Michigan student organization website or a Michigan congressional campaign page are good examples of what they look for. Once they decide that we should preserve a site, they send it over to me.

Q: What do you do after the Field Archivists hand you the site?

A: First I enter what is called a “seed URL” into Archive-It, the service we use for web archiving. The seed URL is normally the homepage of a website (www.umich.edu, for example), and from there a web crawler traverses the web of interrelated content that makes up the site. The goal is to capture and display an archived version of the site that is as true to the original as possible. After the crawl is done, I write the metadata for the collection, which might include a description of the website, what group it was for, when we archived the site, and so on.
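
For readers curious how a crawler “traverses” a site, the sketch below shows the general idea in Python: start at the seed URL, collect the links on each page, and follow the ones that stay on the same site, breadth-first. This is a toy, not Archive-It; a real archival crawler also respects robots.txt and scope rules, throttles its requests, and saves everything it fetches. The crawl function and its limit parameter are invented for the illustration.

    # A toy illustration (not Archive-It itself) of how a crawler traverses a
    # site outward from a seed URL, breadth-first, following only links that
    # stay on the same host. The "limit" parameter is made up for this sketch.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(seed, limit=25):
        host = urlparse(seed).netloc
        seen, queue = {seed}, deque([seed])
        while queue and len(seen) < limit:
            url = queue.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that fail to load
            finder = LinkFinder()
            finder.feed(html)
            for link in finder.links:
                absolute = urljoin(url, link)
                # Stay on the seed's site; a real crawler also obeys
                # robots.txt, scope rules, and politeness delays.
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen

    for page in sorted(crawl("https://www.umich.edu", limit=10)):
        print(page)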

Q: What is something that has surprised you in doing this job?

A: When you do web archiving, you learn the internet isn’t very “neat.” There are a lot of messy things behind the scenes that make it all work. It’s impressive that the internet functions so well every day. Impressive and a bit terrifying.

A full list of the sites the Bentley has archived is available online through Archive-It.