Digital Humanities :: Final Paper :: Ben Goldman
The Preservation of Born-Digital Literary Archives
by Ben Goldman
Introduction
In March 2008, the International Data Corporation (IDC) released a report detailing the size and growth of digital information. The report found that in 2007 the size of the digital universe – all information that exists in digital form – was 281 exabytes, or 45 gigabytes for every person on earth. i This number was ten percent greater than the IDC had previously projected, indicating unprecedented growth. Using this data as a guide, the IDC predicted that in 2011, less than three years from now, the size of the digital universe would grow to 1.8 zettabytes. The report also found that the amount of data created in 2007 exceeded storage capabilities, and projected that half of the digital universe in 2011 will not find permanent storage. ii
The IDC attributed much of this growth to the proliferation of personal digital imaging and video equipment. Digital text was not explicitly mentioned, but the ways in which technology has transformed personal written communication should by now be clear to everyone. The first popular WYSIWYG word processing software was released to the public on the Macintosh 128K in 1984 iii, and was soon followed by the ubiquitous Microsoft Office suite of text production software. Email has largely replaced handwritten letters, and blogs were first introduced to mainstream culture as a personal online journaling tool.
Matthew Kirschenbaum, a digital humanities researcher, pointed out in an article for The Chronicle of Higher Education, "nearly all [modern] literature is born digital in the sense that at some point in its composition … the text is entered with a word processor, saved on a hard drive, and takes its place as part of a computer operating system." iv Archives and special collections departments of libraries are beginning to work with the artifacts of the authoring process described by Kirschenbaum, as more authors begin to donate, as part of their literary papers, not just the digital files that contain their works, but also the computers and storage devices used to produce and store them. Libraries and archives have over the decades developed sound processes for preserving the material artifacts of print culture, but preservation of born-digital works requires new skills and processes, many of which have been developed and subsequently described by technologists in the corpus on digital preservation, and may be of use to institutions working with born-digital works. There is an increasing emphasis, however, being placed on digital preservation processes that are harmonious with archival processes. This has inevitably led many researchers to the conclusion that preservation needs to begin much earlier in the life cycle of digital documents, even before they are transferred to archives for preservation.
This paper will examine the work that has been done with existing digital and hybrid (print and digital) archives. One of the more prominent examples is that of the Michael Joyce Papers at the Harry Ransom Center in Texas. Much has been written about this hybrid archive, by archivists and researchers alike. The Salman Rushdie Papers at Emory University also provide insight into the methods for accessioning a large digital archive and making it available for use to the public. In both cases migration was chosen as the preservation strategy, but this paper will touch on the future prospects of other strategies proposed, including emulation and persistent object preservation.
Long-term research projects have explored different approaches to the preservation of personal digital archives. Two in particular, the Personal Archives Accessible in Digital Media (PARADIGM) project and the International Research on Permanent Authentic Records in Electronic Systems (InterPARES) project, have focused on developing a set of best practices for both creators and preservers to follow in managing digital documents. This paper will also examine these recommendations.
What becomes apparent after reviewing the literature on digital preservation and personal archives is not that we lack the technical processes or expertise to maintain the growing universe of digital documents, but that doing so: a) requires far more advanced skills than are currently possessed by many practicing archivists, and b) necessarily has an impact on the authenticity of a document.
That our creation of digital information far outstrips our ability to store it should neither surprise nor alarm anyone. Archivists and curators are trained in the processes of selection and retention. Moore's Law continues to advance the abilities of storage, while technologists and theorists continue to develop technical solutions for digital preservation. The questions that will have to be addressed are: how reliable will the digital universe that future generations inherit be? What can archivists do to ensure that their preservation activities enhance – not compromise – authenticity? Will digital archives require unconventional archival processes? And finally, are digital records really less trustworthy than their analog counterparts?
Using the Born-Digital Archive
In his book, Mechanisms: New Media and the Forensic Imagination, Matthew Kirschenbaum describes his experience accessing the digital archive of hypertext fiction author Michael Joyce at the Harry Ransom Center in Austin, Texas, which does not deviate much from the experience of most scholars using print archives. v For starters, he was required to visit the archive in person and access the digital files from a dedicated laptop computer in one of the archive's reading rooms. All of the files were stored in DSpace, a digital repository for electronic documents, along with metadata about the files and the activities undertaken to make them accessible to researchers. Kirschenbaum describes how, in order to use the files in the repository, he had to download local copies to the Center's laptop computer. He describes wanting to know more about the history of these files, and to see them in their native software environments. Despite the files being electronic, he is not allowed to freely make copies of them; even screenshots must be paid for. vi
The Michael Joyce collection is what archivists have come to call a hybrid archive: it consists of both digital and print materials. In some cases, the print and digital materials duplicate one another. The digital part of the archive was donated to the Harry Ransom Center by Joyce in the form of 371 floppy discs. vii It is considered an important collection partly because of the author's stature as one of the original and preeminent hypertext fiction authors, but it should be noted that even more traditional – and in some cases, less technically-inclined – authors have considerable portions of their personal archives that are digital or born-digital. Another noteworthy archive held by the Ransom Center is the Norman Mailer Papers, which included three laptop computers and nearly 400 disks of correspondence. All digital files contained in these media were created by an assistant to Mailer. viii The Woodruff Library at Emory University has received one of the most significant digital archives yet. The Salman Rushdie computers – three laptops, one desktop computer, and one external hard drive – contained approximately eighteen gigabytes of data from over 40,000 electronic files. ix
These three literary archives are some of the most notable, but by no means the only, examples. What is important to recognize here is the ubiquity of electronic text production by today's authors. Kirschenbaum captures the scope of the issue in an article on born-digital literature in the Chronicle of Higher Education:
Today nearly all literature is "born digital" in the sense that at some point in its composition, probably very early, the text is entered with a word processor, saved on a hard drive, and takes its place as part of a computer operating system. Often the text is also sent by e-mail to an editor, along with ancillary correspondence. Editors edit electronically, inserting suggestions and revisions and e-mailing the file back to the author to approve. Publishers use electronic typesetting and layout tools, and only at the very end of this process – almost arbitrarily and incidentally, one might say – is the electronic text of the manuscript (by now the object of countless transmissions and transformations) made into the static material artifact that is a printed book. x
Elsewhere in the article, Kirschenbaum ponders the extent of an author's online creative output – what the IDC refers to as one's digital shadow. xi Possible activities include updating one's blog or Facebook page, editing an entry in Wikipedia, or creating an avatar for Second Life. The breadth of personal and creative information that might be captured on a computer donated to an archive is potentially endless, and realizes in the extreme, as Kirschenbaum notes, Foucault's exploration of what constitutes an author's complete works. xii
A 2005 article in the New York Times confirms what any reasonable person might already suspect: today's best-known, critically acclaimed authors – Annie Proulx, T. C. Boyle, Margaret Atwood, Dave Eggers, to name a few – use technology to communicate. xiii The author of the article, Rachel Donadio, surveyed several writers and literary agents of today and found that all correspond with friends, family, and other writers a great deal using email. Author Amy Tan, in a recent Flair Symposium on literature and archives at the Harry Ransom Center, claimed that her personal email output, if ever compiled, would amount to 300 books. xiv A writer's correspondence is considered by scholars to be important material evidence of authorial activity. Most literary archives likely contain correspondence of a traditional variety – notes, postcards, letters. Donadio's article ponders the potential loss of this important literary evidence in the electronic age.
The widespread use of technology to support authorial activities is not surprising, but what this means for archives – and the researchers that rely on them – remains to be determined. These first forays into preservation of born-digital literary archives have been tentative. That Kirschenbaum did not access the Joyce digital archive over the Internet is telling. Despite the affordances of technology, the ways that researchers are using these first digital archives are very much based on procedures developed to support the use of print archives. Partly, this can be attributed to copyright – Michael Joyce happily enjoys continued ownership rights over his creative works, thereby limiting the ability of the Ransom Center to freely replicate his digital archive – but partly this is also due to the cautious and practical choices made by archivists in the processing and preservation of the Joyce archive.
Processing the Born-Digital Archive
Much of what is known about the processing of the Michael Joyce Papers comes from an unpublished case study written by Thomas Kiehne, Vivian Spoliansky, and Catherine Stollar, who collaborated on the project as students in the information studies program at the University of Texas at Austin. xv Kirschenbaum called their work “the most comprehensive and systematic act of digital preservation in the field of born-digital literature to date, and the insights gleaned from the process will be relevant for some time to come.” xvi This section will examine the case study on processing the Michael Joyce floppy disks and compare it to the choices made by Emory University in processing the Salman Rushdie computers.
The first challenge the Joyce archivists faced was the storage media donated by the author. An initial attempt at reading the data on the 3.5 inch floppy disks was unsuccessful because they attempted to use modern Macintosh computers with USB-connected floppy drives, which resulted in data corruption. xvii Hardware obsolescence presents many of the significant obstacles to preserving digital information, but in the case of the Michael Joyce Papers, hardware obsolescence also dictated many of their processing decisions, including, apparently, the decision to migrate all of the material off of the old hardware and onto newer media, which they were able to accomplish by using older Macintosh computers that contained built-in floppy drives. xviii
While it is tempting to think that digital archives are entirely electronic and not in any way physical, this would be a mistake. In fact, digital archives are both physical and electronic. In addition to processing the files contained on the disks, the archivists at the Ransom Center also processed and preserved the original, unaltered disks, which later were stored, according to the arrangement of the physical collection, in Hollinger boxes with their corresponding print materials. xix According to Erika Farr at Emory University, who led their library's project on the Salman Rushdie computers, this strategy allows the archive to process the digital files without compromising integrity of the digital originals. xx In digital preservation parlance, this is known as bitstream preservation.
So with the bitstream preserved on the original media, the archivists at Ransom began processing the duplicated files. They describe this process as two-fold: checking for viruses, and file recovery (as necessary). xxi Emory labeled this “triage,” and notably started the overall collection processing with this step (Rushdie donated computers rather than disks). xxii File recovery becomes important because of the possibility of file or disk errors, especially when dealing with obsolete media. The authors note that, “errors and crashes must be met with persistence as they are often surmountable.” xxiii
Emory's approach perhaps differed due to the nature of the media. Instead of merely copying and preserving files, Emory was in a position to copy and preserve an entire desktop environment – several, actually. Their triage process included assessing the health of the hardware, performing system diagnostics, identifying file formats, and checking for encryption. Instead of copying the files, as Ransom did, Emory opted to image the entire machines, creating “an exact duplicate of the original computers and their files.” xxiv
The Michael Joyce floppy disks probably represented a fraction of the data found on his computers over the years. In all likelihood, he carefully chose which files to retain on the disks and which to leave unpreserved, perhaps even using the disks as backups to his primary copies. Salman Rushdie's computers provide a much larger store of data for the archive to manage, including information that he did not want made public, such as personal addresses or bank account numbers (Rushdie at one time had a fatwa placed on him by Muslim clerics angry with his written depiction of Islam; privacy is likely a great concern of his) xxv. One of Emory's major tasks with the processing of the Rushdie computers was to scrub such personal information, a process known as “embargo” in archival circles. xxvi
The available information on Emory's experience with the Rushdie computers is largely focused on the technical challenges of copying the data and choosing a method for making it accessible. The Joyce archivists, by contrast, have firmly placed their case study within an archival context. Great detail is provided about the steps undertaken to document their work – for example, the description, appraisal, and preservation activities involved.
Some of the description activities are served by the steps already discussed – what the authors call the digital archaeology process. After copying and checking all the files on the floppy disks, the authors created an inventory of all the copied files, with the goal of having item-level metadata, including file names, file sizes, and formats. As much as possible, they sought to capture any system-created metadata about the files, including creation and modification dates. One significant problem they encountered was the different provision of creation dates found between Macintosh – the native operating system of the files – and Windows – the operating system of the digital repository that would provide public access to the digital archive: “when a Macintosh file is copied, the creation date of the original is maintained while a copy performed in Windows resets the file creation date to the date of copy.” xxvii They also noted that some creation dates were quite obviously wrong (e.g. 1/26/1904). Attributing this to either file corruption or an incorrect system clock, the authors noted that all system-generated metadata cannot simply be assumed to be accurate. xxviii
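The item-level inventory described above can be illustrated with a minimal Python sketch. It is not the tooling used by the Joyce archivists; the function names, the dictionary fields, and the 1980 cutoff are hypothetical choices made only for illustration. The sketch records each file's name, size, extension, and modification date, and flags dates that fall outside a plausible range. A date such as 1/26/1904 would be flagged; classic Macintosh systems counted time from 1 January 1904, so dates near that year often point to an unset or corrupted value rather than a real event.

```python
import os
from datetime import datetime, timezone

# Hypothetical cutoff: anything dated earlier is treated as suspect.
EARLIEST_PLAUSIBLE = datetime(1980, 1, 1, tzinfo=timezone.utc)


def is_plausible(timestamp, now=None):
    """Return True if a timestamp falls between the cutoff and the present."""
    now = now or datetime.now(timezone.utc)
    return EARLIEST_PLAUSIBLE <= timestamp <= now


def build_inventory(root):
    """Walk a directory of copied files and return one metadata row per file."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                "path": os.path.relpath(path, root),
                "size_bytes": info.st_size,
                "extension": os.path.splitext(name)[1].lower(),
                "modified": modified.isoformat(),
                # System-generated dates are recorded but not trusted blindly.
                "date_plausible": is_plausible(modified),
            })
    return sorted(rows, key=lambda row: row["path"])
```

The sketch deliberately reads only the modification date. As the case study observes, creation dates behave differently across operating systems and can be reset by the act of copying, so an inventory should record which system produced each date.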
Appraisal in an archival context is typically understood to mean the process of assessing the archival value of records and eliminating those that lack value. Bitstream preservation prevented the Joyce archivists from disposing of any records found on floppies, but they did evaluate the replicated files and disposed of ones that were not the work of Joyce (e.g. student work, software created by others). They also note the difficulty in determining the relevance of electronic files, where the absence of familiar cues makes it difficult to ascertain whether Joyce in fact was the author:
[w]hen the author did not explicitly insert his name in the text and the document was not clearly perceived as being his, we could not count on handwriting analysis, letterhead, type of paper, ink color, ink type, smell, type of copy, and other clues that are generally applied in the paper contexts. We were completely dependent on the language that was used and the content of the file, which sometimes could be quite misleading. xxix
Technology does, however, provide affordances not found in the print domain. The Joyce archivists used MD5 file hashes to compare files and dispose of any redundancies. Hashes and other similar functions such as checksums can be used to compare files, and even verify a file's integrity as it is transmitted between systems or applications.
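The comparison of files by hash can be sketched in a few lines of Python. This is an illustration of the general technique, not a reconstruction of the Joyce archivists' procedure; the function names and the chunk size are assumptions. Files with identical content produce identical MD5 digests, so grouping files by digest reveals redundant copies regardless of file name or location.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Group files under root by digest and return only groups with copies."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in by_digest.items()
        if len(paths) > 1
    }
```

MD5 suits this kind of duplicate detection among files from a trusted source. Where deliberate tampering is a concern, a stronger algorithm such as SHA-256 can be substituted by passing a different `algorithm` argument.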
The ultimate goal of the Joyce archivists was to ingest the files into a DSpace repository, where they would later be accessible to researchers, like Kirschenbaum, visiting the reading room. Some of Joyce's work was also written in a software application known as Storyspace, which Joyce helped to create. Ransom's original goal was to provide an installation of this software in the reading room to render Joyce's Storyspace works. They found, however, that newer versions of Storyspace were not backwards-compatible with the older files found on the floppies. xxx
Storyspace is a software application for creating hypertext fiction. While it provides a method for exporting a text as a website, DSpace, according to the authors, was unable to properly render webpages. The result seems to be that the collection's Storyspace hypertext manuscripts are insufficiently accessible to users. The authors identify collection access as being “medium-level,” a phrase that is intended as much to describe the persistence of the files as the level of service provided by the archive. xxxi In the case of Joyce's digital files, partial persistence means the content of the files is preserved, but not the format. The collection's submission information package agreement provides additional information on what this medium-level of persistence and access means to the end-user: “Partially persistent materials that enable medium confidence. Preserves the content of the materials with degradation of form allowed … Checks will be made to verify that the intellectual content is the same.” xxxii
The authors admit to some discomfort with this policy, especially since Joyce's hypertext fiction relies so heavily on the aesthetics of presentation. Emory University also touches on this concern. In formulating researcher access to the Rushdie digital archive, they considered the value of context to the materials. Copying the files out of their native environments allowed Emory to maintain the intellectual integrity of the files regardless of technology advances, but they recognized that “an object-oriented view,” as they called it, may be somewhat out-of-line with traditional archival practice. xxxiii With computers from four different eras and over 40,000 files to consider, Emory could not simply choose to preserve one context; many would have to be considered.
The deliberations over preservation techniques detailed in these case studies mirror the digital preservation trends found in all types of information institutions. The technique chosen by both the Ransom Center and Emory University is one commonly known as migration, an option that provides some relief from the constraints of specific hardware and software, while sacrificing the context and format. Preserving the context and format of digital information is typically accomplished using a technique known as emulation. Kenneth Thibodeau of the National Archives notes that neither of these strategies is altogether satisfying: “the closer one stays to the original technology and original digital format of the records, the less the problem of authenticity; however it is also obvious that the closer one stays to original technology, the more complex and more impractical the approach becomes over time.” xxxiv This is one of the biggest dilemmas in the field of digital preservation, especially for institutions faced with preserving literary records of cultural or historical significance. Context and medium are important aspects of archival materials to researchers. The processes undertaken by Emory University and the Harry Ransom Center are trying to walk this fine line: securing simultaneously the bits, their many formats and contexts, and the record of their credibility.
Digital Preservation Theory
What are the established digital preservation processes and theories from which the decisions of institutions like Emory University and the Harry Ransom Center are derived? The literature on the subject is vast, comprising technical approaches, frameworks and models, best practices, and metadata standards. This section will explore some of the theories and processes that might be relevant to institutions seeking to preserve born-digital archives.
Migration and emulation, previously discussed, are the two most common approaches to digital preservation, but by no means the only options. Migration and emulation embody opposite ends of a range of preservation strategies described by Kenneth Thibodeau in a seminal paper on digital preservation as “preserving technology” versus “preserving objects.” xxxv Migration as a strategy for object preservation is often chosen because of its feasibility and practicality. Migration is technically feasible because the infrastructure required is often within the means of preserving institutions; it is practical because it is less taxing on human and financial resources. The clear drawback to migration as a preservation strategy is its sustainability. The most common form of migration is version migration, which entails porting a document from an obsolete format to a current one. As Thibodeau points out, formats always expire and are eventually replaced by new ones. xxxvi It should be understood, then, that Emory University and the Harry Ransom Center, in choosing migration, have ensured that their preservation responsibilities with the Michael Joyce and Salman Rushdie archives will be an ongoing concern well into the future.
The concurrent strategy of bitstream preservation enables both institutions to respond to future technology preservation (emulation) developments, though Thibodeau is quick to point out that even emulation is a form of migration: “while proponents of emulation argue that it is better than migration because at every data migration there is a risk of change, emulation entails a form of migration. Emulators themselves become obsolete; therefore, it becomes necessary either to replace the old emulator with a new one or to create a new emulator that allows the old emulator to work on new platforms.” xxxvii
Still, there is little doubt that electronic files such as Joyce's Storyspace fiction would benefit from the use of an emulator, despite the risks of future obsolescence. Thibodeau makes the point that choosing a preservation strategy must be done “on the basis of a specific concept or definition of the essential characteristics of the object to be preserved. The intended use of the preserved objects is enabled by the articulation of the essential characteristics of those objects.” xxxviii While Joyce's interactive fiction may require some form of emulator so that researchers can effectively experience a Storyspace work as intended, it is probably not necessary for Emory University to emulate Salman Rushdie's circa-1995 email program for researchers wishing to view his electronic correspondence.
A 2005 study by Michele V. Cloonan and Shelby Sanett determined that emulation and migration were the two most common strategies being pursued by thirteen prominent archival institutions. xxxix It's interesting that emulation strategies were as common as migration, despite the perception that they are more difficult and resource-intensive. Emulation, in fact, is making great strides at the moment. A joint project between the National Library of the Netherlands and the National Archives of the Netherlands highlights the potential of emulation technologies. The focus of the Netherlands project, known as Dioscuri, was the creation of a modular emulator: instead of one emulator that attempts to recreate a whole environment, Dioscuri recreates the individual components that together construct the hardware environment. The strength of modularity, as with actual hardware, is the ability to create various configurations and customizations. The prototype emulator they built was based on Intel x86 architecture, the processor family that began with the 16-bit Intel 8086 in 1978 and came to dominate personal computing in the 1980s. The prototype emulates several versions of MS-DOS, which enables, in turn, a variety of software applications. xl
The emulator works by virtualizing the hardware environment on another, more modern piece of hardware, but what happens when, as Thibodeau suggested, the emulator becomes obsolete? The authors of the study chose the Java Virtual Machine (JVM) as the interface between their emulator software and the hardware it runs on. But it is not inconceivable that one day the JVM may be incompatible with future operating systems, or that Java may ultimately be replaced as the object-oriented programming language of choice by something new and more powerful.
The emulation of a configurable hardware installation clearly affords more opportunities than emulating a particular software version. The extensibility of the solution makes more sense for complex archives like the Rushdie computers, while developing an emulator for an obscure application like Storyspace might not be economically feasible. But one wonders if an emulator like Dioscuri can do enough for archives. It took two years of persistent research and development to create Dioscuri, an emulator that still only recreates a very small portion of the hardware and software technology created over the past couple of decades. Can it do enough for the Rushdie digital archive, which comprises over 40,000 files from four different computer models that likely have technologies and versions of an intimidating variety?
Thibodeau clearly is skeptical of technology preservation tactics, even noting that the market tends to exacerbate the obsolescence of software and hardware. xli Instead, he proposes a more conceptually robust approach to object preservation: disassembling a document's essential parts – its content, context, structure, and appearance – and separately storing these parts as stripped down text-based files. xlii Structure can be preserved using XML document type definitions (DTD); context can likewise be preserved in the form of a DTD for collections. The appearance can be stored using eXtensible Stylesheet Language (XSL). Thibodeau even proposes using multivalent documents to store different aspects or renderings of documents as remitted layers of a single document, a complex approach cobbled together in the computer science laboratory that has been around for many years with few examples of practical application. The problems with this approach are obvious: how to maintain the relational integrity of these document parts that form a whole, and whether it is ideal to take a single document requiring preservation and turn it into several that require preservation, turning Rushdie's 40,000 files into hundreds of thousands. This abstraction, however, of the constituent document parts may actually provide an extensible, agile method for preserving electronic documents regardless of the technical evolution of software or hardware advances, which is clearly a benefit.
What all digital preservation approaches have in common, Thibodeau says, is “the objective of solving technological problems related to the passage of time.” xliii This is also what makes them so undesirable. None of the established methods, he says, actually preserve digital records. Nor are they entirely relevant to the goals of archival professionals. “The ultimate criterion for success in the preservation of electronic records,” says Thibodeau, “is not whether they remain true to some given technological materialization, but whether they continue to provide authentic evidence of the activities in which they were created.” xliv This focus on authenticity has led to a greater emphasis on the development of processes and standards, which arguably can do as much or more for the preservation of digital archives than technical advances.
Authenticating Born-Digital Archives
A document is authentic, according to InterPARES, when it is what it purports to be. xlv InterPARES, mentioned at the beginning of this essay, has been instrumental in exploring the conditions for authenticity in the electronic age, and has been doing so in a manner that is firmly grounded in the theories and practices of archival science. The InterPARES website describes their work in greater detail:
[InterPARES] aims at developing the knowledge essential to the long-term preservation of authentic records created and/or maintained in digital form and providing the basis for standards, policies, strategies and plans of action capable of ensuring the longevity of such material and the ability of its users to trust its authenticity. xlvi
Two of the three project phases have been completed. The first focused on creative responsibility and long-term preservation, while the second phase highlighted the role of preservation professionals. One of the more useful tools that came out of the first two phases was a set of guidelines for both creators and preservers to follow to ensure authenticity.
The guidelines are formulated based on the life cycle of a record. The delineation of responsibility is best described by Hackett in an article summarizing the work of InterPARES: “the reliability and authenticity of active and semi-active records are best ensured by the creator, while the authenticity of inactive records is best ensured by the 'preserver'.” xlvii Under the InterPARES guidelines, both Salman Rushdie and Michael Joyce would have significant responsibilities for maintaining the integrity of their records while they are working with them. PARADIGM, a parallel research project out of the United Kingdom that reached many of the same conclusions, is somewhat more direct in its assessment: “the nature of the digital environment requires creators to become curators in their own right.” xlviii This is not a revolutionary thought; plenty of writers, concerned with their own legacy, have meticulously maintained personal archives, but probably not at the level of technical expertise now required by modern technology. Authors may also not be used to thinking of the documents they preside over as being potentially inauthentic. As Hackett says, “If records are relied on by their creator in the usual and ordinary course of business, they are presumed to be authentic. But with electronic systems, the presumption of authenticity must be supported by evidence.” xlix
The evidence required to demonstrate authenticity is laid out in the aforementioned InterPARES guidelines. l The guidelines are broken down into ten main categories:
- Select hardware, software, and file formats that offer the best hope for ensuring that digital materials will remain easily accessible over time. [Choosing open source software, keeping product documentation, commenting code, documenting system configurations, adhering to standards.]
- Ensure that digital materials maintained as records are stable and fixed both in their content and in their form. [Converting to PDF, for example.]
- Ensure that digital materials are properly identified. [Use of “identity metadata”: creators, dates, title/subject, documentary form, context, rights information. The proposed elements are similar to simple Dublin Core.]
- Ensure that digital materials carry information that will help verify their integrity. [InterPARES calls this “integrity metadata,” but it is essentially what the metadata community would call administrative metadata.]
- Organize digital materials into logical groupings. [Not only doing so, but keeping a record of the classification scheme somewhere.]
- Use authentication techniques that foster the maintenance and preservation of digital materials. [Password protecting files, for example.]
- Protect digital materials from unauthorized action.
- Protect digital materials from accidental loss and corruption. [Seek redundancy, in other words.]
- Take steps against hardware and software obsolescence. [Keep software updated and migrate documents as needed.]
- Consider issues surrounding long-term preservation.
Recommendations for this final best practice include identifying a trusted custodian – someone who can manage and preserve inactive documents, someone like an archivist.
One possibility this practice suggests is the immediate transfer of finished documents to a preserver or preserving institution. Thinking of this life cycle and these transfer requirements, it becomes clear that for all their efforts, both Joyce and Rushdie may have compromised authenticity (however slightly) by retaining their records for so long after completing them. The life cycle of technology and the digital records created with it would seem to suggest a move away from the major accession events traditionally associated with archives, and instead call for a process of continuous, ongoing accession.
Would writers be willing to take on this extraordinary responsibility? As the events of the recent Flair Symposium suggest, writers are loath to maintain their personal archives (or admit to it) at this level of detail. It's interesting to consider, however, that in their case study on the Michael Joyce Papers, the Ransom archivists actually considered establishing a DSpace ingest process for Michael Joyce himself to contribute files to the digital repository. These plans were scrapped when the authors realized they could not adequately control the level of metadata description contributed by the author. li
Still, it is difficult to escape the fact that digital archives may require a greater level of collaboration between the writer and archivist. A recent Society of American Archivists (SAA) Conference session exploring hybrid archives touched on this issue repeatedly. lii One presenter from the University of Connecticut suggested the idea of an “embedded archivist.” Another presenter explained that his institution, Yale University, distributed a modified version of the InterPARES Creator Guidelines to potential donors. Both PARADIGM and InterPARES suggest that more active collaboration would allow writers to learn more about how they can affect preservation and protect the authenticity of their records before donating their personal archives.
The Rushdie and Joyce case studies (and even the SAA Conference session) also demonstrate how writer-archivist collaboration can inform the archival institutions' processing and arrangement activities. Erika Farr notes Emory's work with Rushdie to prepare for the donation of his archive, work that was no doubt instrumental in the digital archaeology process that occurred later. liii The Joyce archivists describe their difficulty in arranging Joyce's files; they ended up relying heavily on a copy of Joyce's Curriculum Vitae downloaded from the World Wide Web, and perhaps could have benefited from more direct interaction. liv The Connecticut archivist at the SAA Conference described her first pass at accessioning a digital archive as being an utter failure: she could identify nothing that looked like a record. She could only proceed after meeting with the author in person and seeing how he worked with his own files. lv
The InterPARES Creator Guidelines have much in common with the Preserving, Archiving and Disseminating (PAD) initiative from the Electronic Literature Organization (ELO). The organization promotes electronic literature in the vein of Michael Joyce's work. Hypertext, interactive, programmed or programmable fiction and poetry are some examples of electronic literature. But the organization also identifies itself as centered on “born-digital literature,” and the “current generation of readers for whom the printed book is no longer an exclusive medium of education or aesthetic practice.” lvi
The Electronic Literature Organization's PAD initiative developed best practices for creators of electronic literature and published them in a paper called “Acid-Free Bits.” It is notable that although the creators of electronic literature are expected to be, perhaps, technically more fluent than traditional writers, the best practices themselves almost mirror those elucidated by InterPARES for use by all creators of digital documents. They include: keeping software updated, migrating files, commenting code, preferring open source software and open formats, and documenting one's practices. lvii
The Acid-Free Bits recommendations also include some practices that are not mentioned by InterPARES but may nevertheless be relevant – such activities as retaining source files, validating code, or encouraging duplication and republication. lviii This last notion is one that traditional writers or archivists may not readily consider, but there is, as we will discuss later, an essential nature of digital information that encourages duplication, which may be antithetical to traditional notions of preservation and protection.
A second ELO project looked at the prospects for preserving electronic literature through migration, and even proposed interpreters (emulators) for common electronic literature software applications lix, which would certainly be useful to the Harry Ransom Center and users of the Michael Joyce Papers. But the crux of this second initiative, titled “Born-Again Bits,” is the development of an XML-based metadata schema for describing and transforming works of electronic literature. Because much of established electronic literature is interactive, the metadata schema, known as X-Lit, identifies “behavior metadata” to go along with descriptive and administrative metadata. ELO aims to develop a tool that will migrate works of electronic literature to the X-Lit format, preserving them while making them accessible beyond the limitations of proprietary software formats such as Storyspace. lx The proposal reminds one of Thibodeau's persistent object preservation, which also relied on XML formats for preserving and rendering text.
It's important to recognize that the Electronic Literature Organization is staking the long-term preservation of electronic works on the development of a metadata scheme, especially in light of the InterPARES guidelines for preservers, which essentially assert that all preserver activities related to digital archives must be documented. The key word in these guidelines is evidence. In addition to documenting evidence of the creator's responsibilities in preserving the records, the archival institution must also demonstrate that it has effective controls for the transfer of the archive to the preserving institution, document all reproduction activities, and record any changes to electronic records in the archival description. The Joyce archivists at the Harry Ransom Center attended to this requirement by maintaining an Excel spreadsheet with entries for every file as well as documentation for every activity associated with each file. This included the initial migration of files from the floppies to the Macintosh computer, and the transfer of documents to the DSpace repository. Ideally, this information – the digital provenance – should be inextricably linked to the file itself, but it is not clear whether the Harry Ransom Center makes this information accessible with the Michael Joyce Papers in their digital repository.
This information cannot be arbitrarily recorded and stored. To be effective and reliable, it must be captured by a formal method. Preservation metadata is the term used to describe this formal information model, and the most robust preservation metadata scheme available is Preservation Metadata: Implementation Strategies (PREMIS) lxi, which is based on the Reference Model for an Open Archival Information System (OAIS). lxii The OAIS Reference Model identifies basic functions like ingest, storage, access, and preservation planning, describes how these functions correspond to one another, and establishes archival terms and concepts. lxiii
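One way to picture provenance that travels with a file is a small record holding a checksum taken at intake plus a dated list of actions. The sketch below is illustrative only: the ProvenanceRecord class, its field names, and the sample file name are assumptions made for this example, not the layout of the Ransom Center's spreadsheet.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    # Hypothetical structure: one record per file, kept alongside the file.
    filename: str
    checksum: str
    actions: list = field(default_factory=list)

    def log(self, action: str, agent: str) -> None:
        """Append a dated entry describing what was done and by whom."""
        self.actions.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "agent": agent,
        })

    def verify(self, data: bytes) -> bool:
        """True if the bytes still match the checksum recorded at intake."""
        return sha256_of(data) == self.checksum


# Placeholder content standing in for a file read from a donor's disk.
original = b"draft text of a chapter"
record = ProvenanceRecord("draft_chapter.txt", sha256_of(original))
record.log("copied from floppy disk to workstation", "student archivist")
record.log("ingested into repository", "repository software")

print(json.dumps(asdict(record), indent=2))
print(record.verify(original))          # bytes unchanged since intake
print(record.verify(original + b" "))   # bytes altered after intake
```

Because the checksum is computed from the file's bytes, any later alteration, intentional or otherwise, causes verification to fail, which is the kind of evidence the preserver guidelines ask for.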
PREMIS is one implementation of the OAIS framework. The metadata elements of this schema “record information that supports and documents the digital preservation process, and the information that supports the viability, renderability, understandability, identity, and authenticity of digital objects over time.” lxiv Those elements are divided into five entities: objects, intellectual entities, events, agents, and rights. In the Michael Joyce Papers, an object would be identified as one of the individual files found on the floppy disks. An example of an intellectual entity would be one of Joyce's hypertext works, comprising a number of distinct objects. An event is any activity associated with the preservation of the object, such as making copies of Joyce's digital files or ingesting the files into DSpace. Examples of agents would be the Harry Ransom Center, its student archivists, DSpace: any entity associated with events in the object's life cycle.
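The relationships among these entities can be sketched as a handful of linked records. This is a simplified illustration of the idea, not the PREMIS schema itself: the class names, fields, and sample identifiers below are assumptions chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str                      # a person, an institution, or software


@dataclass
class DigitalObject:
    identifier: str                # e.g. a single file from a floppy disk


@dataclass
class IntellectualEntity:
    title: str                     # a work made up of one or more objects
    objects: list = field(default_factory=list)


@dataclass
class Event:
    kind: str                      # e.g. "copy" or "ingest"
    target: DigitalObject          # the object the event acted upon
    agents: list = field(default_factory=list)


@dataclass
class Rights:
    statement: str
    applies_to: IntellectualEntity


def events_for(obj, events):
    """Return the kinds of all events recorded against one object."""
    return [e.kind for e in events if e.target is obj]


# Hypothetical sample data.
archive = Agent("archival institution")
repository = Agent("repository software")
file_a = DigitalObject("disk01/file_a")
file_b = DigitalObject("disk01/file_b")
work = IntellectualEntity("a hypertext work", [file_a, file_b])
rights = Rights("donor agreement on file", work)
events = [
    Event("copy", file_a, [archive]),
    Event("copy", file_b, [archive]),
    Event("ingest", file_a, [archive, repository]),
]

print(events_for(file_a, events))   # ['copy', 'ingest']
```

Keeping events as separate records that point at objects and agents means the history of any single file can be reconstructed without touching the file itself.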
The problem with preservation metadata is that capturing all of the objects, events, entities, and agents associated with a digital archive can be immensely time-consuming. Those who advocate the merits of preservation metadata and the need for verification of authenticity also recognize that in order for this to be feasible, metadata creation must be simplified. The National Library of New Zealand, an organization active in the digital preservation field, has said, “the more digital preservation activities can be undertaken by means of automation, the more achievable our objectives will become.” lxv Their contribution to automation is a tool that extracts metadata from a number of different file types and exports the information in XML. The output can also be written directly to a repository such as DSpace.
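To give a sense of what automated extraction involves, the following minimal sketch reads a few technical properties from a file and writes them out as XML. It is not the National Library of New Zealand's tool; the extract_metadata function and the element names are invented for illustration, and the type guess relies only on the file extension.

```python
import mimetypes
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def extract_metadata(path: str) -> ET.Element:
    """Collect a few technical properties of a file as an XML element."""
    info = os.stat(path)
    guessed_type, _ = mimetypes.guess_type(path)   # extension-based guess only
    root = ET.Element("file")                      # element names are made up
    ET.SubElement(root, "name").text = os.path.basename(path)
    ET.SubElement(root, "sizeBytes").text = str(info.st_size)
    ET.SubElement(root, "modified").text = datetime.fromtimestamp(
        info.st_mtime, tz=timezone.utc).isoformat()
    ET.SubElement(root, "guessedType").text = guessed_type or "unknown"
    return root


# Create a throwaway sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "letter.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("Dear reader,")
    element = extract_metadata(sample)
    xml_text = ET.tostring(element, encoding="unicode")

print(xml_text)
```

Run over a whole directory, a routine like this produces in seconds the kind of per-file description that would take an archivist hours to record by hand, which is the argument for automation.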
The National Library of New Zealand considers the extraction tool to be just one part of an overall community of software and tools that will ultimately be needed to support digital preservation lxvi (they recently released an open source program for archiving websites and blogs, similar to the Internet Archive's Wayback Machine). Another relevant tool is JHOVE (JSTOR/Harvard Object Validation Environment), which identifies object formats, validates them, and identifies relevant properties associated with certain file types. lxvii
Assuming the creator of a born-digital archive takes appropriate preservation steps and follows the correct steps in transferring the digital objects to an archive, and assuming the archive conducts the appropriate preservation techniques and documents its work in the form of an OAIS-modeled implementation such as PREMIS, can it then be guaranteed that the digital archive is a true, reliable, authentic representation of the creator's work, and of the activities that led to its creation? Perhaps. Textual bibliographers concern themselves primarily with identifying the alterations that are considered unavoidable in the transmission of text. Part of the distress over authenticity is born out of the concern by many preservationists that electronic documents are more easily transmittable, and hence more susceptible to alteration, intentional or otherwise.
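Format identification of this kind usually starts from a file's opening bytes rather than its name. The sketch below shows the basic idea with a few well-known signatures; it is not how JHOVE is implemented, and the identify function and its labels are assumptions for this example. Real tools go further and validate the whole file against the format specification.

```python
# Leading byte sequences ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data: bytes) -> str:
    """Return a format label based on the first bytes of the content."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unidentified"


print(identify(b"%PDF-1.4 sample"))      # PDF document
print(identify(b"GIF89a sample"))        # GIF image
print(identify(b"plain text content"))   # unidentified
```

A file whose extension says one thing and whose signature says another is exactly the sort of discrepancy an archivist would want flagged before ingest.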
Creators and archivists, the key agents in the preservation of digital archives, are not without their biases and misrepresentations. The recent Flair Symposium featured one author, Denis Johnson, who has admitted to lying in personal journals and letters to mislead his first wife. It is also tempting to perceive archivists as beholden to the authenticity of the archives they maintain and free from bias in making preservation decisions, but D.C. Greetham, a noted textual scholar, has argued that “all conservational decisions are contingent, temporary and culturally self-referential, even self-laudatory: we want to preserve the best of ourselves for those who follow.” lxviii
Heather MacNeil and Bonnie Mak, in an article on constructions of authenticity, note that “digital resources are comparable to traditional cultural resources such as art works, literary texts, and business records; they are in a continuous state of becoming and their authenticity is contingent and changeable.” lxix These conditions, in other words, are always negotiable. They go on to describe three types of authenticity: a philosophical one, as in being true to oneself; originality, a concept that is explored through the lens of conservation and textual theory; and actuality, the definition that digital preservation emphasizes, that a document is authentic by virtue of being what it purports to be. The authors point out that all definitions of authenticity are social constructs, and that a better approach for libraries and archives trying to preserve authentic digital materials is to document the work done in support of digital provenance – in other words, to keep meticulous preservation metadata. lxx
MacNeil and Mak also point out the fundamental tension between conservation and authenticity that exists in archives and libraries: “to restore an art object to its original condition requires that the conservator destroy the evidence of the passage of time on the object; to preserve that evidence, on the other hand, is to obscure the object's origins and, therefore, the artist's intentions.” lxxi While this is a reasonable concern with regard to paper, electronic files do not naturally degrade in the same way: alterations to digital documents are initiated, to borrow the phrase from PREMIS, by agents, either purposefully or unintentionally. But as discussed earlier, efforts at digital preservation do introduce alterations.
The authors further assert that librarians and archivists can exert little control over the enabling technologies of electronic text – the constantly evolving software and hardware specifications. These conditions suggest that:
the criteria for assessing the authenticity of digital materials must tolerate a range of variability befitting the situation. In the same way that the criteria for assessing the authenticity of traditional materials account for the transformative effects of time, such as the natural decomposition of paper, so too should the criteria for assessing digital materials acknowledge the inevitability of change. lxxii
Object preservation strategies are not considered desirable because of the loss of context and format. As noted, both Emory University and the Harry Ransom Center voiced concern with implementing migration solutions. But perhaps these are the natural transformations and degradations we must come to accept with digital documents, though this “inevitability of change” is not likely to sit well with most archivists. The publishing of digital information, the authors argue, further enables the ready duplication and alteration of the work, a distinct break from the tradition of print, where publication is considered the final instantiation of the work, one that can be idealized as stable and official – so long as it can be preserved – for purposes of reference or preservation. The nature of digital resources, by contrast, is allographic: “it is part of the character of these resources to be copied and interpreted in different contexts.” lxxiii The Electronic Literature Organization is clearly in tune with these characteristics of electronic text, as evidenced by their advice to freely share digital works as a method of preservation. These assertions also recall fluid text theories, and even some electronic textual editions, such as the Walt Whitman Archive, which does not privilege any particular edition of Leaves of Grass, but instead, by providing all editions, “emphasizes variability over fixity of meaning, open-ended representation over closed representation, and the process of editing over its product.” lxxiv
Conclusion
It's important to recognize the problems with authenticity while maintaining policies that support it. The notion of authenticity may be negotiable, but it is still a relevant goal of archives and their missions as trusted repositories, as keepers of cultural record. The most important policy that archives can implement is one that requires an organized accounting of all work undertaken to preserve digital documents. Archivists should also be actively pursuing a policy of engagement and collaboration with creators, not only to ensure a fluid preservation process down the road, but also to aid creators in the self-curation of archives. The work of InterPARES and PARADIGM can be adopted and disseminated by archival institutions in service of this policy. Strategies that in the past may have seemed unusual to archives might need to be considered, such as creating processes for active and persistent accessions of digital archives, even at the item level. Archives should be fluent in the technical advances and solutions being proposed, but understand that technology will not fundamentally address the problem of preserving authentic digital archives. What archival institutions do will not fundamentally change, but the tools and processes they use to accomplish their institutional missions will.
References
i John F. Gantz, et al., The Diverse and Exploding Digital Universe, Executive Summary, International Data Corporation, March 2008, http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
ii Ibid.
iii Apple MacWrite. See http://en.wikipedia.org/wiki/MacWrite for the history.
iv Matthew Kirschenbaum, “Hamlet.doc,” The Chronicle of Higher Education, August 17, 2007, B8.
v Matthew Kirschenbaum, Mechanisms: New Media and the Forensic Imagination (London: MIT Press, 2008), 207-210.
vi Ibid.
vii Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
viii Jeanne Kramer-Smyth, "SAA2008: Preservation and Experimentation with Analog/Digital Hybrid Literary Collections (Session 203)," September 6, 2008, http://www.spellboundblog.com/ (accessed October 19, 2008).
ix Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
x Matthew Kirschenbaum, “Hamlet.doc,” The Chronicle of Higher Education, August 17, 2007, B8.
xi Ibid.
xii Ibid.
xiii Rachel Donadio, “Literary Letters, Lost in Cyberspace.” The New York Times, September 5, 2005, http://www.nytimes.com/2005/09/04/books/review/04DONADIO.html.
xiv Kevin Endres, Leigh Patterson & Jesse Cordes Selbin, “A Flair for Archives. Undergraduate Comments on the Flair Symposium at the Harry Ransom Center,” November 2008, http://flairforarchives.wordpress.com/ (accessed December 2, 2008).
xv Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
xvi Matthew Kirschenbaum, Mechanisms: New Media and the Forensic Imagination (London: MIT Press, 2008), 207-210.
xvii Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
xviii Ibid.
xix Ibid.
xx Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
xxi Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
xxii Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
xxiii Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
xxiv Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
xxv See The Satanic Verses: http://en.wikipedia.org/wiki/The_Satanic_Verses
xxvi Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
xxvii Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
xxviii Ibid.
xxix Ibid.
xxx Ibid.
xxxi Ibid.
xxxii Ibid.
xxxiii Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
xxxiv Kenneth Thibodeau, “Preservation and Migration of Electronic Records: The State of the Issue,” U.S. National Archives & Records Administration, http://www.archives.gov/era/papers/preservation.html (accessed December 2, 2008).
xxxv Kenneth Thibodeau, “Overview of Technological Approaches to Digital Preservation and Challenges in the Coming Years,” The State of Digital Preservation: An International Perspective, Council on Library and Information Resources, Conference proceedings, 2002, http://www.clir.org/pubs/reports/pub107/thibodeau.html (accessed December 1, 2009).
xxxvi Kenneth Thibodeau, “Preservation and Migration of Electronic Records: The State of the Issue,” U.S. National Archives & Records Administration, http://www.archives.gov/era/papers/preservation.html (accessed December 2, 2008).
xxxvii Kenneth Thibodeau, “Overview of Technological Approaches to Digital Preservation and Challenges in the Coming Years,” The State of Digital Preservation: An International Perspective, Council on Library and Information Resources, Conference proceedings, 2002, http://www.clir.org/pubs/reports/pub107/thibodeau.html (accessed December 1, 2009).
xxxviii Ibid.
xxxix Michele V. Cloonan & Shelby Sanett, “The Preservation of Digital Content,” Libraries and the Academy (5, no. 2, 213-237), April 2005 (accessed November 17, 2008 from ProjectMUSE database).
xl Jeffrey van der Hoeven, Bram Lohman & Remco Verdegem, “Emulation for Digital Preservation in Practice: The Results,” The International Journal of Digital Curation (2, no. 2, 123-132), December 2007 (accessed November 17, 2008 from ProjectMUSE database).
xli Kenneth Thibodeau, “Preservation and Migration of Electronic Records: The State of the Issue,” U.S. National Archives & Records Administration, http://www.archives.gov/era/papers/preservation.html (accessed December 2, 2008).
xlii Ibid.
xliii Ibid.
xliv Ibid.
xlv InterPARES 2 Project, “Creator Guidelines. Making and Maintaining Digital Materials: Guidelines for Individuals,” International Research on Permanent Authentic Records in Electronic Systems, http://www.interpares.org/display_file.cfm?doc=ip2(pub)creator_guidelines_booklet.pdf (accessed November 17, 2008).
xlvi See The InterPARES Project at http://www.interpares.org/.
xlvii Yvonne Hackett, “The Search for Authenticity in Electronic Records,” The Moving Image (3, no. 2, 100-107), Fall 2003 (accessed November 17, 2008 from ProjectMUSE database).
xlviii Susan Thomas, “A Practical Approach to the Preservation of Personal Digital Archives,” PARADIGM, March 2007, http://www.paradigm.ac.uk/projectdocs/jiscreports/ParadigmFinalReportv1.pdf (accessed October 12, 2008).
xlix Yvonne Hackett, “The Search for Authenticity in Electronic Records,” The Moving Image (3, no. 2, 100-107), Fall 2003 (accessed November 17, 2008 from ProjectMUSE database).
l InterPARES 2 Project, “Creator Guidelines. Making and Maintaining Digital Materials: Guidelines for Individuals,” International Research on Permanent Authentic Records in Electronic Systems, http://www.interpares.org/display_file.cfm?doc=ip2(pub)creator_guidelines_booklet.pdf (accessed November 17, 2008).
li Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
lii Jeanne Kramer-Smyth, "SAA2008: Preservation and Experimentation with Analog/Digital Hybrid Literary Collections (Session 203)," September 6, 2008, http://www.spellboundblog.com/ (accessed October 19, 2008).
liii Erika Farr, “Rushdie’s Born Digital Archive: Updates and Prospects,” PowerPoint Presentation.
liv Thomas Kiehne, Vivian Spoliansky & Catherine Stollar, "From Floppies to Repository: A Transition of Bits. A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center," May 2005, https://pacer.ischool.utexas.edu/bitstream/2081/941/1/Joyce_project-paper-final-draft.doc (accessed October 18, 2008).
lv Jeanne Kramer-Smyth, "SAA2008: Preservation and Experimentation with Analog/Digital Hybrid Literary Collections (Session 203)," September 6, 2008, http://www.spellboundblog.com/ (accessed October 19, 2008).
lvi Nick Montfort & Noah Wardrip-Fruin, "Acid-Free Bits. Recommendations for Long-Lasting Electronic Literature," The Electronic Literature Organization, June 14, 2004, http://eliterature.org/pad/afb.html (accessed October 10, 2008).
lvii Ibid.
lviii Ibid.
lix Alan Liu, et al., "Born-Again Bits. A Framework for Migrating Electronic Literature," The Electronic Literature Organization, August 5, 2005, http://eliterature.org/pad/bab.html (accessed October 10, 2008).
lx Ibid.
lxi See Preservation Metadata: Implementation Strategies (PREMIS) at http://www.oclc.org/research/pmwg/
lxii See Open Archival Information Systems (OAIS) Framework at http://public.ccsds.org/publications/archive/650x0b1.pdf
lxiii Marcia Lei Zeng & Jian Qin, Metadata (New York: Neal Schuman Publishers, 2008), 60-64.
lxiv Ibid.
lxv Steve Knight, “Preservation Metadata: National Library of New Zealand Experience,” Library Trends (17, no. 2, 91-110), September 2007 (accessed November 17, 2008 from ProjectMUSE database).
lxvi Ibid.
lxvii See JSTOR/Harvard Object Validation Environment at http://hul.harvard.edu/jhove/
lxviii Marlene Manoff, “Theories of the Archive from Across the Disciplines,” Libraries and the Academy (4, no. 1, 9-25), 2004 (accessed November 17, 2008 from ProjectMUSE database).
lxix Heather MacNeil & Bonnie Mak, “Constructions of Authenticity,” Library Trends (56, no. 1, 26-52), eds. Michele V. Cloonan and Ross Harvey, Summer 2007 (accessed November 17, 2008 from ProjectMUSE database).
lxx Ibid.
lxxi Ibid.
lxxii Ibid.
lxxiii Ibid.
lxxiv Ibid.