All in the Family: A Dinner Table Conversation about Libraries, Archives, Data, and Science
Kristen A. Yarmey, Lynn R. Yarmey
Based on our experiences in the academic library, archives, and data curation communities, we present an informal discussion of the relationships between these branches of information science. In this conversational exchange, we decipher concepts of data curation for librarians and archivists while also translating library and archival theory and tradition for the data environment. Beginning with basic questions about the day-to-day realities of managing scientific data in comparison to digital library collections, we discuss roles and practices, seeking both to identify common goals and functions and to explore where and how our cultures diverge. The whole is greater than the sum of its parts, and together our fields support the diverse and ever-changing needs of researchers and information consumers in distinct but complementary ways.
Kristen: In 2006, when I decided to go to library school to be an archivist, Lynn was working at Scripps Institution of Oceanography as an information manager. I wasn’t totally clear on what her job was, but I knew it involved data. The first time one of my library school professors brought up “metadata” in class I got excited: “Lynn always talks about metadata!” Since I took my position as Digital Services Librarian at the University of Scranton in 2008, I have come across more and more data issues in both the archival side of my job (digital collections) and in the librarian side of my job (supporting digital scholarship). I started calling Lynn for help.
Lynn: When I started library school, one of the most exciting aspects for me was learning about this mature, sophisticated ecosystem that already existed for gathering, describing, categorizing, and managing information. There was a system of roles, infrastructure, services, and expertise that conceptually mapped so well to the data world. Kristen would send me archives articles that really resonated with my data issues. It became clear that our respective realms of archives, libraries, and data are fundamentally linked.
K: We quickly noticed, though, that we often have different philosophies about information management and that we tend to use different vocabulary to describe what we do. As we started to untangle the complex relationships between archives, libraries, and data, we found value in each other’s knowledge and expertise.
L: I see data curation as a discipline that cuts across the Library and Information Sciences field. Data curation combines skills from librarianship, archives, information literacy, metadata, technical services, and many other areas. By identifying and articulating similarities and differences between data curation and recognized aspects of LIS, we can plan better for more integrated systems, services, and roles.
K: One of the common (though admittedly simplistic) distinctions drawn between archivists and librarians is a difference in how we prioritize our goals. Both professions are interested in information access and preservation, but librarians prioritize immediate, affordable access while archivists prioritize long-term preservation. Where does data curation fit in? Do your goals as a curator of scientific data differ from the goals of librarians and archivists?
L: The broad goal of data curation is to support scientific research through enabling long-term reuse of data. On the ground, that means data curators work across multiple levels: from defining and supporting strategic metadata creation early in the research cycle, through documentation about research processing, to capturing research “products” and preserving them for reuse. It seems like this is a blend of library and archives goals, with support services, immediate access, and preservation all included in data curation. Given that we share the same ultimate goal of making information accessible now and into the future, I think the data community can learn a lot from both the archives and library communities.
K: Likewise, I think archivists and librarians have a lot to learn from data curators. For starters, I’m not sure I understand what “data” curation really is. When I think “data,” I think of numbers in a spreadsheet. Beyond that, I don’t have a good mental picture of what data looks like.
L: A good working definition of “data” for LIS professionals is “workable/usable chunks of content,” whether values in a printed journal article table, numbers in a spreadsheet, pixels in an image, etc. For example, you can think of a digital humanities scholar viewing the text inside a book as data. But I think your question hints at a broader difference between library and data work: data are the content, while libraries are usually more interested in the container. I think it’s a granularity issue.
K: That’s a good point. At my library, the way we manage our collections, and the way that users access our collections, does center on physical or digital “containers.” Our books are stored in a different place than our microfilm, for example, and our streaming media is stored in a different place than the text and image files in my digital collections. I think some of my confusion about data curation stems from not understanding what kind of “container” your data is in. What formats do data usually come in? What kinds of files do data curators need to manage and preserve?
L: Data come in many containers, from different types of files to databases and paper logbooks. In some ways we need to manage and preserve all of them. From a scientific perspective, though, it doesn’t matter what kind of container is used, so we have a lot more flexibility in moving content around. There are exceptions to this rule, but in general it is the content (the measurements, observations, model predictions, and values, for instance) that is important.
K: Doesn’t format play a role in long-term access, though? For archivists, format is a major concern in digital preservation planning. We worry about how to preserve different kinds of files over time in the face of software and hardware obsolescence.
L: In data, the bigger question is “what functionality do we need to offer?” Long-term format decisions will lean heavily on that answer. While we absolutely need to preserve access and are looking for strategies for avoiding obsolescence, the primary aim of scientific data curation is to enable future use of the data, not necessarily to enable access to the original scientist’s files. Carrying forward the data format and container is not nearly as crucial as capturing the data context (for example, the conditions under which the data were collected).
K: Preserving context is a priority for archivists as well, but it seems like we have different understandings of what “context” is or should be. Archivists want to be able to ensure that future users can access the content of archival documents, but we also want to preserve the “look and feel” of the information environment in which the file was created. To me, file format is a part of the context in which a record was created.
L: There is a trade-off here for scientific data; sometimes context can get in the way. For instance, we often see Excel files with multiple tables per sheet, color-coded cells, and notes mixed with values. While all of that formatting can show the process used by the original researcher and may be valuable to a science historian, other researchers looking to reuse the data need to manually sift through the file to find the content they need.
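As a rough illustration of the sifting Lynn describes, here is a minimal Python sketch that separates numeric values from free-text notes in a single spreadsheet column. The function name `split_values_and_notes` and the sample cells are invented for this example and do not come from any real dataset or tool; it also assumes the column has already been read into a list of strings.

```python
def split_values_and_notes(cells):
    """Separate numeric measurements from free-text notes in one column.

    Hypothetical helper for illustration only. Notes are kept, not
    discarded, so the researcher's context can be stored alongside the data.
    """
    values, notes = [], []
    for cell in cells:
        text = str(cell).strip()
        if not text:
            continue  # skip blank spacer cells
        try:
            values.append(float(text))
        except ValueError:
            notes.append(text)  # anything non-numeric is treated as a note
    return values, notes


# Invented sample column: measurements with a note mixed in.
column = ["12.4", "", "13.1", "sensor recalibrated", "11.9"]
values, notes = split_values_and_notes(column)
```

Keeping the notes in their own list, rather than dropping them, reflects the trade-off in the conversation: the values become easier to reuse while the original researcher's remarks remain available as context.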
K: That’s an interesting difference from traditional archival practice. In order to capture the context in which records were created, archivists typically organize our collections based on provenance (what office or person they came from), and we usually maintain original order, even if that makes individual documents more difficult to find; that is, even if that makes the collection less functional or less accessible to the end user. That said, one of the things I like about working with digitized materials is that I have more freedom with arrangement. I can organize my digital collections in a more accessible way, noting in the metadata how the original, physical documents are arranged. What aspects of context do you try to capture for your data?
L: With scientific data, the scientist and the scientist’s intent are important aspects of the data context. Field scientists (and sometimes only they) know what actually happened out in the field, and what oddities might emerge during data processing (for example, “That reflectance value looks strange. Remember on that cruise, though, the captain was new to the ship and wasn’t very good at keeping the boat steady while we deployed? It might be a shadow signature”). Maybe the researcher switched their sampling plan partway through the study because their experience told them that the really interesting thing was happening slightly outside of the planned area. Their data are a reflection of all of the physical, intellectual, and instinctual decisions made, capturing parts of the resulting experience. All of that context is often essential to interpreting the data, but it can’t be easily included in metadata. Our standards aren’t yet mature enough to be able to communicate to an end user the context in which the data were created, and data creators don’t have the habit of including that information in documentation.
K: So it seems like both data curators and digital archivists face a common challenge of creating and capturing complex, contextual metadata in systematic, effective, and (hopefully!) standardized ways.
L: Absolutely! I would add that metadata need to be useful, sharable, and discoverable, along with machine-, human-, and increasingly browser- and search engine-readable. I would definitely consider metadata to be an area needing more research.
K: While we’re talking about metadata, are there “levels” of metadata in data curation? That is, archivists can describe information at the collection level, series level, item level, and even page level, but librarians don’t usually think this way; it’s more like one object (for example, one book), one bibliographic record.
L: Levels are a hugely important part of data in the sciences. Data need description at multiple levels, though our levels and terminology are not yet well defined. Even the term “dataset” is poorly defined; the content and nature of datasets vary widely. One dataset might contain one type of data from a single field survey, while another might aggregate years of field studies with multiple data types into a dataset. “Collection,” “project,” “study,” “file,” “record,” and “granule” are other important terms relating to levels, though there is still discussion about what these mean. Despite the lack of consensus in language, data curators are ideally collecting metadata at many different levels. There is always a struggle to capture as much metadata as possible given limited resources and an evolving understanding of what metadata we actually need.
K: There’s a similar tension between quantity and quality in digital collections. Given limited resources, is it better to make more content available more quickly, with less metadata? Or is deep and detailed metadata more valuable to potential users, even if it means that it takes us longer to publish new content? A well-known paper by Mark Greene and Dennis Meissner, called “More Product, Less Process,”1 encouraged archivists to decrease the time they spent processing physical collections in order to maximize the quantity of archival resources available to the public, at the expense of in-depth description and intellectual control. The idea of “more product, less process” (MPLP) is often applied to digital collections as well as traditional archives.2 On a practical level, I determine how deeply we’ll describe a collection based on how much value I think that description will add for our users. What kind of metadata do users tend to look for in the data world?
L: We keep metadata about a project (very high level, related to funding awards or research theme) for contextual information and for discovery, since many researchers know about projects in their fields; however, that project-level description doesn’t really help with understanding the data themselves. So we use metadata at the “dataset” level to provide context for the actual data, to help researchers who aren’t familiar with the project find what they are looking for, and to give credit to the data creators. Beyond this discovery level, metadata creation and prioritization get tricky; usability requirements start to depend on what the data (re)user is trying to do, and domain knowledge starts to become more and more helpful. I add the “(re)” here to point out that especially observational data available online have often already been used; acknowledging the original intent and use of the data is important when trying to carry forward data context.
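To make the two levels Lynn mentions a little more concrete, here is a minimal Python sketch of project-level and dataset-level metadata and of how a discovery record might combine them. Every class, field, and value here is hypothetical and chosen only for illustration; it is not drawn from any particular metadata standard or from the authors' own systems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Project:
    """High-level context: funding award and research theme (hypothetical fields)."""
    title: str
    funding_award: str
    research_theme: str


@dataclass
class Dataset:
    """Dataset-level description that gives context and credit (hypothetical fields)."""
    title: str
    creators: List[str]
    description: str
    project: Project


def discovery_record(dataset: Dataset) -> dict:
    """Flatten both levels into one record a search index could use."""
    return {
        "dataset": dataset.title,
        "creators": list(dataset.creators),
        "description": dataset.description,
        "project": dataset.project.title,
        "research_theme": dataset.project.research_theme,
    }


# Placeholder values, invented for this example.
project = Project("Example Coastal Survey", "AWARD-0000", "coastal ecology")
dataset = Dataset(
    title="Example reflectance measurements",
    creators=["A. Researcher"],
    description="Placeholder description of one field season.",
    project=project,
)
record = discovery_record(dataset)
```

The sketch stops at the discovery level on purpose: as the conversation notes, what a (re)user needs beyond that point depends on the task and the domain, so it is not something a fixed set of fields can capture.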
K: That brings us to the idea of “users.” All three fields are service-oriented (we want to help people), but I wonder if we are serving different “customers.” My library is open to the public, but our mission is to support the scholarly needs of the University community; that is, our students and faculty are our most important users. My digital collections, though, have a more nebulous user base. Web analytics indicate that these collections are being used off campus, but unless we get a reference question, I don’t have a concrete idea of who those external users are or what needs they may have.
L: This is true for data as well. Making data openly available online means we have less information about our users. We are more in line with public-library patron-privacy policies. “User” is a particularly loaded term for my project right now. So far, our main system users are data providers as opposed to data (re)users. Once our system and metadata mature a bit, we expect to be working with data (re)users as well, but for now we are mostly working with scientists originally gathering and depositing data.
L: From a field-science perspective, there is often a deeply personal connection between a scientist and their data. These scientists fought tooth and nail to get funding to go into the field; they trained a group of people to help them do the data gathering; they worked long hours, often in less-than-ideal conditions, to collect the data they needed. For many researchers, each dataset is an extension of their career and understanding.
K: I’m just starting to get questions about best practices for information management from faculty members on my campus, primarily humanities scholars seeking more effective ways to take, store, search, and organize their notes, and I suspect they feel similarly, although they may not refer to their notes as “data.” Anecdotally, they are most interested in technology as a way to improve their data collection; they seem far less concerned with (and in some cases strongly against) sharing their aggregated data with other scholars. That said, there’s been an explosion of digital humanities research using public or shared data sets (for example, textual analyses using Google’s Ngram viewer). Is this kind of data reuse happening with scientific data?
L: I think scale has to be part of the reuse discussion. Data are reused all the time within the context of an individual researcher or strongly coordinated lab. Researchers who go out into the field regularly build long-term datasets from observations and reuse those “living” datasets over the course of a career, in some cases. In the field sciences, though, this kind of data is really difficult to analyze once you are outside of that personal or lab context. At a larger scale, some data tend to be more homogeneous and collected from the start with reuse in mind. For example, in the Earth sciences, satellite data are regularly reused in many types of research. One of the challenges in data curation is bridging between scales, between domains with complex and heterogeneous data and those with more homogeneous data. There is often a disconnect between researchers who focus on gathering unique observational data and researchers looking to reuse those observations in other research. Data creators don’t necessarily have the same incentives as data reusers.
K: That’s something I see in my archival work as well. There’s sometimes a disconnect between the information needs of campus departments that create content with archival value and those of the departments that use our archival collections most frequently. For example, our public relations office produces wonderful publications and fantastic photographs of the campus, but in their work, photos or documents more than a few years old are no longer valuable. They need fast access to fresh content, so adding detailed descriptions or captions to those files isn’t a good use of their staff time. On the other hand, the alumni office wants to put together slideshows and exhibits for reunions, and to them, being able to search for and access a twenty-year-old photograph of a certain student (or especially a donor) is incredibly important.
L: We haven’t yet completely defined all of the components of data curation, nor who should manage each. There aren’t yet clear roles. What is clear, though, is that data curation can’t be done in a vacuum. To create a useful and usable data ecosystem including infrastructure, expertise, services, and longevity, we need multiple perspectives at the table. Data gathering scientists, data reuse scientists, technologists, documentation experts, administrators, data managers, data organization gurus, access experts, and preservationists all need to be part of the conversation. For now at least, I see data curators as generalists and facilitators. How does this compare to LIS roles?
K: I think our roles really depend on our institutions and the other resources available to scholars. At major research universities, subject librarians and even entire libraries can be subject specialized, but at my (master’s-level) institution, I have to wear multiple hats and fill multiple roles. I run the library’s digital collections, but I also serve as a subject specialist for chemistry, biology, physics, and environmental science; provide information-literacy instruction; and support library initiatives in emerging technologies and digital scholarship. In the course of a single day, I might be teaching students about Google’s search algorithm, helping a faculty member with Mendeley, describing old photos in our digital collections, and using social media to promote library services and collections.
L: Data Information Literacy is emerging as an extension and adaptation of more traditional Information Literacy programs. Some are saying libraries should engage with this type of instruction, while others suggest that it should come from the departments. What do you think?
K: I’ve been thinking about data literacy a lot lately, especially considering the high visibility of data analysis during the 2012 presidential campaign. There’s clearly a need to educate end users about data, and I think librarians and departments have a role to play. Like most librarians, I emphasize the importance of evaluation in my information-literacy instruction, but at this point, I worry that I don’t know enough about data to help a student or colleague evaluate or manage it. The advantage of working in a small university is that my department can be somewhat of a one-stop shop for scholarship support, but the downside is that I don’t have the level of in-depth expertise or experience in each role that my students and faculty might need. For example, as much as I’d like to help my chemistry faculty with their data, I feel like a librarian specializing in chemistry would have a much better understanding of their data-management needs and options.
L: Many people share your concern, but I think the skills and perspective that you and other library and archives experts would bring to the data table would be so valuable. For instance, from a collection-development point of view, the data community could benefit from lessons learned about crafting collection-development policies and keeping them up to date over time. Evaluating data quality is immensely difficult, but a great place to start might be to look at the factors you use to judge quality for collection materials. Deaccessioning is a relatively new but important conversation in the data community. Understanding the value and mechanics of a catalog, and of multiple levels of metadata, is a really valuable skill in the data world as well. We are working on data search and access, and are running into the issues of systems not playing nicely together, differing metadata, and issues with access policies, just like in the library realm. Conversations with researchers about data end up sounding a whole lot like reference interviews. I don’t at all think that you need to have “data” in your title to be qualified to be part of the data discussion. There is so much you and others can add just based on your knowledge of the conceptual and practical LIS realms. The work of data curation is relatively new, and I hope we can take the best of all of the information fields and create something even better.
K: One of the hardest things about my job is persuading stakeholders to fund and support digital projects with long-term rather than immediate benefits. There are so many competing demands on institutional resources, and the complexity and uncertainty of digital preservation make it hard for me to articulate how investing time and money now will help us down the road. How do we convince people to value what we do?
L: That is certainly the question for the ages! I think there are big cultural, economic, and political issues that need to be addressed, but that doesn’t mean we should wait around for a societal green light. If we had limitless funds, are we all clear on what we would create? I think that getting our own house in order is a good place to start. Let’s define and clearly articulate our evolving roles, and recognize contributions and potential relationships within our communities and in related groups. I think we still have a ways to go to know how to integrate more with our communities and promote our own value, but this is a learning process, a research process.
K: Char Booth wrote a while back that the “outsider” perspective is one of the most important things librarians bring to the scholarly table.3 Our neutrality and distance add an important perspective to specialized research.
L: The LIS community is not really neutral, though. We’re representing future generations of scholars; essentially, we are arguing for the greater good. The whole of scholarly knowledge is greater than the sum of its individual researcher parts, and we’re the ones trying to bring the pieces together.
This conversation on the relationships between libraries, archives, and data curation is still in its early stages. Many of our questions need deeper exploration, and there are many topics (social media, versioning, identifier schemes, repository certification, branding, sustainability, scaling, and more) that we haven’t tackled here. Most importantly, we know there are issues, perspectives, and opportunities outside our areas of expertise (such as data curation in the humanities) that should be brought into the discussion. Far from having the last word, then, we hope instead that we’ve given all of our librarian, archivist, and data colleagues some common ground from which to kick off your own conversations. We’ll save you a seat at the dinner table!
1. Mark A. Greene and Dennis Meissner, “More Product, Less Process: Revamping Traditional Archival Processing,” American Archivist 68 (2005): 208-63.
2. See, for example, Mark A. Greene, “MPLP: It’s Not Just for Processing Anymore,” American Archivist 73 (2010): 175-203.
3. Char Booth, “Librarians as __________: Shapeshifting at the Periphery,” In the Library with the Lead Pipe, July 21, 2010, http://www.inthelibrarywiththeleadpipe.org/2010/librarians-as-__________-shapeshifting-at-the-periphery/.
Kristen A. Yarmey
Digital Services Librarian – University of Scranton
Lynn R. Yarmey
Lead Data Curator – National Snow and Ice Data Center