The Walt Whitman Archive – Whitley 4
By Edward Whitley
January 2011
4. What other issues or questions relating to The Walt Whitman Archive most intrigue you?
Edward Whitley
Associate Professor of English and Director of American Studies – Lehigh University
Ed Folsom has written that one of the goals of the Whitman Archive is to “grow the database so that the surprises of searching and juxtaposing will become richer and more frequent.” Folsom’s vision is that the Whitman Archive will become not only a centralized repository for all of Whitman’s texts, but also a discovery tool for exploring what he calls “the wild and unpredictable intersections of the data that the interface allows us to generate.”[ref]Ed Folsom, “Database as Genre: The Epic Transformation of Archives,” PMLA 112.5 (2007): 1609-10.[/ref] The prospect of having the Archive uncover meaningful relationships between texts is, indeed, an exciting one. As Jonathan Freedman notes, the “ability to link heterogeneous subjects and find once-occulted connections and interconnections makes scholarship invigoratingly fun.”[ref]Jonathan Freedman, “Whitman, Database, Information Culture,” PMLA 112.5 (2007): 1597.[/ref] As a frequent user of the Whitman Archive, I’ve experienced this “invigorating fun” first-hand, and I look forward to the Archive including more texts by and about Whitman, and incorporating more discovery tools like Brian Pytlik Zillig’s TokenX.
As someone working on digital humanities projects of my own, however, I’m as interested in understanding what it takes to create such tools for discovery as I am in the end-user experience of having digital tools make “unpredictable” and “invigorating” discoveries on my behalf. For the remainder of this essay, I’m going to share some thoughts on what has to happen on the back end of a digital project in order to make an otherwise rigorously structured digital artifact feel “wild and unpredictable” to its users. I’ll start with Kenneth M. Price’s description of the necessary labor that goes into structuring data such that it will yield surprising results for end users: “A database is not an undifferentiated sea of information out of which structure emerges,” Price writes. “Argument is always there from the beginning in how those constructing a database choose to categorize information.”[ref]Price, “Edition.”[/ref] To an end user who searches a database and finds an interesting juxtaposition between a handful of the hundreds (or thousands) of total documents in that database, it can seem as if those documents have emerged out of “an undifferentiated sea of information.” The reality, however, is far different.
When Price alludes to the process that the creators of a database go through as they decide how to “categorize information,” he makes it clear that whenever a user generates a surprising set of juxtapositions between documents it is because someone, somewhere, at some point, has meticulously categorized those documents in such a way that a search engine (or other text analysis tool) will be able to process those documents in a meaningful way. Jerome McGann, who has also commented on the rigorous process of structuring data that is required to create a digital archive, writes that “databases and all digital instruments require the most severe kinds of categorical forms. The power of database—of digital instruments in general—rests in its ability to draw sharp, disambiguated distinctions.”[ref]Jerome McGann, “Database, Interface, and Archival Fever,” PMLA 112.5 (2007): 1590.[/ref] As both Price and McGann indicate, what can feel like a surprising juxtaposition for the end user has, to a certain extent, already been determined in advance through a time- and labor-intensive process of marking-up, encoding, or otherwise categorizing data.
This is not to say that those who structure the data for a digital archive have already anticipated every possible search string that users could ever imagine. On the contrary, what interests me as both the user and the creator of digital archives is the effect by which this “sharp, disambiguated” structuring of data can, indeed, produce juxtapositions that archivists could never have predicted. The fundamental preconditions for cultivating such serendipitous discoveries, it seems to me, are a large enough data set and a sufficiently intelligent schema for categorizing that data (not to mention the time, money, software, hardware, human resources, and expertise required to process and categorize the data in the first place).
Allow me to cite an example from outside the realm of digital archiving to illustrate what’s involved in structuring data that yields serendipitous juxtapositions. Pandora Radio is an Internet music service whose goal is to create customized, online radio stations for its users. This customization is based on the work of the Music Genome Project, which has categorized the musical attributes of thousands of different songs. Taking into consideration “everything from melody, harmony and rhythm, to instrumentation, orchestration, arrangement, lyrics, and . . . singing and vocal harmony,” the Music Genome Project assigns anywhere between 150 and 400 different descriptors to a single song. These descriptors are based on both objective criteria such as the types of instruments used in the song or whether the song is in a major or a minor key, as well as more subjective criteria such as whether or not a song features “Acoustic Sonority,” “Prevalent Use of Groove,” or a “Wet Recording Sound.”[ref]“The Music Genome Project,” Pandora Radio, accessed December 2, 2010 http://www.pandora.com/mgp.shtml.[/ref] (These descriptors are referred to as musical “genes,” but they could just as easily be called metadata or tags.) Visitors to Pandora Radio type in the name of a song that they like, and then based on the attributes that have been pre-assigned to that song, Pandora will play other songs with similar attributes. The promise of Pandora is that once a database of songs is categorized according to a controlled vocabulary capable of being processed by a computer, listeners will be pleasantly surprised to discover songs that they never knew they already liked. (These songs are, of course, available for purchase so that no serendipitous discovery goes un-capitalized.)
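For readers who think more readily in code, the general idea of matching by pre-assigned descriptors can be sketched in a few lines of Python. This is only a rough illustration and not a description of how Pandora actually works: the song titles and tag assignments are invented, the function names are hypothetical, and the overlap measure (Jaccard similarity between sets of tags) is simply one common way of comparing items that share a controlled vocabulary.

```python
# Illustrative sketch only. The titles and tag assignments are invented,
# and Jaccard overlap is an assumed stand-in, not Pandora's actual method.

def jaccard(a, b):
    """Return the share of descriptors two tag sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# A tiny hand-tagged "archive": each song carries a set of descriptors
# drawn from one controlled vocabulary.
CATALOG = {
    "Song A": {"acoustic sonority", "major key", "vocal harmony"},
    "Song B": {"acoustic sonority", "major key", "prevalent use of groove"},
    "Song C": {"wet recording sound", "minor key", "prevalent use of groove"},
    "Song D": {"acoustic sonority", "major key", "vocal harmony",
               "wet recording sound"},
}

def similar_songs(seed, catalog, limit=3):
    """Rank the other songs by descriptor overlap with the seed song."""
    seed_tags = catalog[seed]
    scored = [
        (jaccard(seed_tags, tags), title)
        for title, tags in catalog.items()
        if title != seed
    ]
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Songs sharing no descriptors with the seed are never suggested.
    return [title for score, title in scored[:limit] if score > 0]

print(similar_songs("Song A", CATALOG))
```

Nothing in the function knows in advance which songs will surface for a given seed; the results depend entirely on how carefully each song was tagged beforehand, which is the point about structure that the Pandora example is meant to make.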
I cite this example within the context of the Whitman Archive not necessarily because Pandora Radio provides a model for digital scholarship of nineteenth-century literature, but because the process of categorizing songs for the Music Genome Project is remarkably similar to the archivist’s effort of structuring data in such a way that it can produce surprising results for the end user. For both Pandora and the Whitman Archive, then, the goal is to reach the tipping point where methodical structuring becomes unexpected discovery, where the left hand doesn’t know what the right hand is doing and structured data takes on forms that the structure-ers never quite imagined. I like to think that this tipping point where structure yields serendipity is similar to the moment in a fractal pattern where geometry becomes art, the moment where a rigorous, mathematically defined structure takes on an aesthetic quality that was never explicitly intended by the mathematical code that undergirds it, but that is nevertheless absolutely dependent upon that code for its very existence.
My hope as both the user and creator of digital archives is that we can come to a better understanding of the processes of classification and naming that are involved in these efforts to structure data in meaningful ways. Scholars of literature are already comfortable thinking about the power of language to represent, and even create, the world around us. We already have at our disposal a variety of theoretical apparatuses for thinking about the power of language to name and define the world, from the formalism of the New Critics to the semiotics of the Post-Structuralists and even the focus on the materiality of language in Book History scholarship. What we need to add to this list of theories and methodologies is a more sophisticated understanding of taxonomy and categorization. This will not only require an interdisciplinary rendezvous with librarians and scholars of information architecture, but it will also demand that we do the cultural history of such seemingly mundane phenomena as filing cabinets and accounting ledgers. Compared to the work being done by digital humanities scholars in the areas of data visualization, text mining, social networking, and geo-spatial analysis, studying the history of cataloging, sorting, and organizing seems to be the intellectual equivalent of a trip to Target to buy hard-plastic storage bins when you’d rather be at the Apple Store picking up an iPad. Nevertheless, everything that we want digital technology to do to help us better understand literary texts depends upon our ability to properly name, sort, and categorize the information that we put into a computer. We do ourselves a disservice to ignore or even minimize this crucial stage of the process.
One of the most frequently quoted statements about the future of digital scholarship in the humanities comes from Jerome McGann, who wrote that “the digital technology used by humanities scholars has focused almost exclusively on methods of sorting, accessing, and disseminating large bodies of materials. In this respect the work has not engaged the central questions and concerns of the [humanities, which is] . . . . the exploration and explanation of aesthetic works.”[ref]Jerome McGann, Radiant Textuality: Literature after the World Wide Web (New York: Palgrave, 2001), xi-xii.[/ref] I don’t want to make a straw man out of McGann’s powerful insight, but I can’t help but take issue with his claim that the “sorting” of texts will not aid us in our goal of using digital technology to explore and explain aesthetic works. Before a digital tool of any kind can work with a text in any way, the elements of that text must be sorted, classified, and categorized according to a schema that the computer will understand. This is not to say that scholars in the digital humanities should put aside their text visualization projects and their geo-spatial analyses in order to do the Dewey Decimal system one better, but rather that the foundational tasks of organizing and classifying should be more completely theorized and more intimately woven into the fabric of digital scholarship.
By attending to the tedium of taxonomy we may begin to feel like nineteenth-century scriveners rather than twenty-first century techies. Like those scriveners of old, we may prefer not to spend our time with such mundane things. Nevertheless, if we wish to generate the kinds of “wild and unpredictable intersections of . . . data” that “makes scholarship invigoratingly fun” we should embrace what actually promises to be a fascinating conversation about the history of taxonomy, structure, classification, and form. This conversation will take us through the Library of Alexandria, the Great Chain of Being, medieval bestiaries, Wunderkammer, the Mundaneum, the Six Degrees of Kevin Bacon, cryptography, tarot cards, and Vannevar Bush’s Memex. Every time we tag, code, or classify a text for a digital archive, we are participating in the long history of human classification systems. The better able we are to understand that history and to situate ourselves within it, the better understanding we will have of our digital archives and what they can do.


The information professions, as libraries and archives are sometimes called, have engaged in a great deal of this work already, although it has not been particularly in favor in the US for the better part of 30 years (the professional literature still holds the record of the earlier work, of course). In addition to taxonomy and the study of how “technology” (such as letter-press books and filing cabinets) shaped the way content was created and organized, there is also the entire field of diplomatics, which examines the structural semiotics of individual documents, such as letters, in terms of how the form shapes the function and content. In my field diplomatics enjoyed a very brief resurgence in the 1990s in examining the structure of electronic communications, though for the most part it had long been relegated to the realm of studying ancient manuscripts.
“listeners will be pleasantly surprised to discover songs that they never knew they already liked.”
This point is worth thinking about: the irony is that by matching songs based on such “genes,” Pandora allows people to make serendipitous discoveries that are primarily based on similarity. The site encourages exploration within a certain range, but maybe it also creates an environment that discourages new musical discoveries outside of the listener’s comfort zone. Does this make the listener’s world smaller, instead of larger? This dual restriction and facilitation of exploration enables the user to tightly control what kinds of “surprises” he/she encounters. Is that discovery at all? I think this kind of tension exists in archives and databases as well. Part of the challenge of designing such a resource seems to be asking oneself how to best manage information in a way that negotiates between chaos and rigid order.
Also, I think that the study of categorisation is very much part of DH, and a demanding and exciting part at that. My colleagues at UCLDIS who study classification theory in the context of libraries would certainly agree. It’s ironic that they are suddenly very much in demand in this digital age: what used to be considered dry and old-fashioned is suddenly cool. What could be more cutting edge than ontologies, linked data, and semantic tagging, after all? What underlies all of this is an understanding of the theory of categorisation that you discuss here.