Digital+Research=Blog

Finding a research question

2013-05-05T13:37:00.001-07:00

Introduction

Students often ask me about finding a research question, since I require one for theses of all sorts, and this blog post is an attempt to provide an answer.

Many students start with a topic that they would like to research. This is natural, but in some ways secondary to the process of scholarly writing. I recommend that students start with: 1) a method based on some well-established discipline, and 2) a source of data. Let me explain.

Method

A method is the tool you use for research. If you were a painter, you might choose a variety of scenes to paint, but the brush and oils would be an essential part of how you approached it. As scholars we build up skills using certain discipline-based methods. In effect, we learn how to paint with a particular set of intellectual tools. If we ignore the tools or never learn how to use them, the chances of painting a satisfying scene are significantly less.

In graduate school I settled on a set of ethnographic tools, which I have used and reused over the decades. This does not mean that my approach has been unchanging, but it has had a consistency based on long practice. I am not a trained ethnographer in the German sense of having a degree in it or even having taken classes, but cultural anthropology was part of the atmosphere of my graduate school environment, and I just kept reading and practicing it in my dissertation and beyond. Having a method means absorbing a way of thinking. This is essential to formulating a research question.

Data

Access to data is the equivalent of providing a painter with a scene to paint. If the painter has no one who will sit for a portrait, it becomes much harder to paint a portrait. People try sometimes with varying success, but without sufficient experience with real subjects, it is hard for a painter to create a portrait in the abstract.

Some students want to rely entirely on existing published results and to comment on them. This is more like copying a painting than creating a new one. It can be a reasonable approach if they can do a new analysis, but for a beginner merely to comment on other people's work without actually analysing the data anew risks superficial results.

Data are hard to get. Many desirable sources are closed to the public, and many public sources are overworked or unreliable. Data do not have to be perfect to be used in a scholarly study, but they do need to be available and the author does need to understand and be able to explain their imperfections.

The question

Once the method and a source of data are clear, the student can then reasonably begin to formulate the research question. It needs a grounding in the scholarly discourse in the field to explain why that particular question is interesting. Many students want a completely new question so that they can do something original, but wise students often take a well-researched existing research question and approach it with new data or a new method. The advantage of an existing research question is that its importance is already clear.

The best research questions for a thesis are ones with a straightforward answer. I generally recommend a yes/no question, or one that has a quantitative answer, or one that is a choice among reasonable alternatives. These are not the only possible research questions, but questions involving complex issues about "why" or even "how" tend to be beyond the scope and experience of even the cleverest doctoral students. The virtue of a yes/no type question is that the student can make a clear choice. A thesis with a vague answer is not a contribution to knowledge, while even a very narrowly stated and highly qualified yes/no answer can be a reasonable step forward.

Choosing a research question is hard, but it is probably the most important step in writing a thesis. The topic matters only in so far as data are available and the research method can reasonably apply. Topics are temporary and can change with the seasons. Good research questions grow ultimately out of the intersection of scholarly methods and quality data.

Prize Selection

2012-12-22T10:13:00.001-08:00

I am involved with a number of paper prize awards and find myself wondering how effectively the selection process works. In the end the selection comes down to a small group of people with both personal and cultural preferences. In some fields the quality of the mathematics makes for a fairly even playing field, but information science today has little clarity about its core topics or methods, and a cultural diversity that makes consensus hard. Do we really know how well the process works? Several questions come to mind that a master's student could answer in a thesis.

Is there evidence that papers that get an award are more influential over time than other papers? Influence might be measured via the number of citations. The population of prizes should be restricted to awards with a long enough history to allow for publication and public reaction. The ASIS&T / Proquest awards, for example, list 15 years of winners. The JCDL student paper award is only 8 years old, but the Vannevar Bush best paper award goes back to 1998. The iConference awards are newer and there is no single list of winners. Nonetheless this data is generally available. Citations could be counted in a number of databases, or tested via Google Scholar, which would then include open source citations.

A related question is whether authors who win awards also get more citations on other papers, regardless of the success of the winning paper, and whether the authors become notable figures in the field. I recognized four of the 15 winners of the ASIS&T award immediately, and they are certainly active in the field.

A number of other research questions revolve around factors that influence reviewers. I see a lot of reviewer comments in my work, and so many reviewers make errors in their comments on statistical analyses that I wonder whether a moderately complex statistical analysis actually hurts a paper's chances of winning prizes. A related issue is the use of popular buzzwords. There are years when certain topics generate intense interest that is not sustained over time. Buzzwords associated with these topics may give an impression of cutting-edge work and give these papers an edge. Finding measurable answers to both of these research questions would be harder than doing a simple citation analysis, but it would give useful information both to applicants and to prize committees.

Do not track...

2012-10-16T00:04:00.000-07:00

According to a New York Times article by Natasha Singer (13 October 2012), 9 members of the US House of Representatives questioned the Federal Trade Commission's "involvement with an international group called the World Wide Web Consortium, or W3C, which is trying to work out global standards for the don’t-track-me features." What apparently has them distressed is that "do not track" may become the default in, for example, Microsoft's new version of its Internet Explorer browser.

Defaults are important, as Richard Thaler and Cass Sunstein explain in their Nudge blog. People tend to accept defaults rather than change them, and this is true for a wide variety of topics including pension plans, health care, and privacy. Those who control the choice of the default arguably determine what a majority will decide. This is not surprising, since we accept cultural defaults all the time in matters as basic as food and clothes.

After years of working on operating systems and on network applications, I am perhaps less concerned about privacy than many of my colleagues for quite contradictory reasons: first, because I realize that anyone with the right technical skills can break ordinary privacy protections in the Internet, and second, because I realize that it is a lot of work and mostly not worth the trouble. Nonetheless some regard for privacy seems basic to how free and democratic societies operate in the HTTP environment. Microsoft seems to have consumer interests more at heart than the nine US congressmen.

Information Technology history

2012-10-07T06:45:00.000-07:00

Some while ago I began to put together an historical timeline on digital library developments. The timeline began relatively informally, but lately I have started to add references to source materials. It is very much a work in progress, but I would be happy to have suggestions for more entries. Anyone may view the timeline, but only I can update it at the moment.

Digital libraries and, in a broader sense, the world of information technology are relatively young, but they have become old enough that some attention to their history seems increasingly warranted. ASIS&T has, for example, a webpage devoted to the history of information science and technology. Professional historians are starting to take an interest as well, including colleagues at Humboldt-Universität zu Berlin.

The social and legal issues are complex and interesting, and increasingly students need enough background in the history of technology to discuss topics like copyright or censorship or even the effect of technology on elections (such as the current US presidential election). We also have an imperfect understanding of the interaction between innovation and users, except that in some cases users quickly adopted new developments (HTML, for example) and in other cases innovations like the mouse lay fallow for years. Questions about the innovation/demand cycle play a key role in discussions about the industrial revolution. Whether the dynamics are similar or not I have too little evidence to judge.

This blog has been quiet for some time, but I plan to use it more regularly to discuss issues about the history of information technology precisely because I hope for comments from readers.

30 Years of Information Technology

2012-10-04T00:49:00.003-07:00

Library Hi Tech has been celebrating its thirtieth year with a number of special issues, and the latest issue looks back on the development of information technology for libraries. Below is the structured abstract for my editorial. I will include a link when the issue is available online.

Purpose: This issue of Library Hi Tech offers a retrospective over the last thirty years of information technology as used in libraries and other memory institutions, particularly archives and museums. This editorial will add the editors’ reflections.

Method: The method uses historical documentation and relies heavily on personal recollection.

Findings: Thirty years ago information technology in libraries largely had to do with ways in which libraries could make their ordinary operations more efficient. Today the information science frontier has broken out of the comfortable institutional paradigm of the past and made libraries aware that they need to redefine themselves in a world where their buildings no longer represent a storehouse of knowledge unavailable elsewhere.

Implications: Information technology advances have not made libraries obsolete, but they have made it imperative that libraries redefine their role to be digital information managers and service providers for their readers.

ReCAPTCHA - a post by Estelle Shumann

2012-07-11T23:27:00.000-07:00

Note: I have removed this post by Estelle Shumann after a number of negative comments and requests. The topic was interesting and it seemed harmless enough. Recently I received the following message:

You currently have a link on your site pointing to our OnlineSchools.org website. We have recently received warning from Google that they are suspicious of link trading schemes surrounding this, and we want to make sure that you are taking the necessary precautionary measures so that your site is not adversely affected.

We are requesting that you remove the link back to our site.

I do not know that her post was part of this effort, but I am removing it as a precaution.

Writing on the iPad

2012-02-01T09:58:00.000-08:00

This blog entry is an experiment, as was my writing a full-scale article (3500 words) on the iPad.

The article was on measuring reliability in long term digital archiving. I based it on talks given in Tallinn, Estonia, and at a workshop here in Berlin, and I copied the text from the slides onto the iPad using Dropbox, though I could easily have mailed them to myself as well. Then I purchased the Apple Pages app and imported the text into Pages so that I had a ready-made outline. Actually I almost never write from an outline, so in some ways this was a bad idea, but not one for which the iPad bears any guilt.

The Pages app is very easy to use once one recognizes where one has to tap to change styles, get fonts, or send backups in the form of email copies. The backups may not have been necessary, since iCloud is sharing copies among my various Apple devices, but it seemed like useful extra protection, since I could not be sure that iCloud would not instantly and automatically change every copy if I accidentally deleted a key portion of text. That is not a theoretical but a real issue. It is easy to tap the UNDO button a bit too often. Only later did I learn that I could REDO an UNDO by holding down the UNDO longer. I found it was also surprisingly easy to highlight far too large a segment of text and then brush a key that deleted it accidentally. UNDO and REDO are really valuable options.

The touchpad keyboard as such gave me no particular problems. I am a decent multiple-finger typist, but not especially fast. Nonetheless I do find that I often hit one of the keys in the bottom row rather than the space bar. The spelling checker and word-suggestion system are unexpectedly good, but the price for corrections in multiple languages is that I must constantly switch keyboards, since the spelling and keyboard choices are linked. With the English and German keyboards this is fairly simple, since only the Y and Z keys shift, but I am very accustomed to the German keyboard (all of my other devices have German keyboards) and sometimes my finger strays. The correction facility is fairly good at catching and fixing this.

One advantage that I had hoped for with the iPad was that I could carry the machine with me easily and write in even quite short blocks of time, such as in the S-Bahn train (six minutes from my home station to the office). I found that that worked fairly well as long as I worked on the article so regularly that I had it mostly in mind and did not have to search back to find the thread of what I wrote. Generally I write in landscape mode because the keyboard is bigger and mistakes are therefore fewer, but portrait mode gives a far better sense of the virtual page. By the end of the article I tended to use portrait more often, especially when I was revising what I wrote.

Only very recently have I returned to using Microsoft Word on my other computers, and I confess that it is really good. Mostly I don't want its features though. My articles require little fancy formatting or inserts. The Pages app is definitely no substitute for Word, but as a basic word processing tool on the iPad, I found it met my needs well.

eReading

2011-11-01T00:40:00.000-07:00

Some years ago Elke Greifeneder and I offered a seminar in which students tested their reading experiences on a Sony eBook reader, a laptop, a desktop computer, printouts, and a bound book. The reading content consisted of German novels and the students measured the experience only by testing reading speed. The result was that there was no apparent difference between the eBook reader and other media. The students subsequently published the research (see Grzsechik, K. et al. (2010), "Reading in 2110 – reading behavior and reading devices: a case study", The Electronic Library, Vol. 29 Iss: 3, pp. 288-302, or online at Emerald).

Now a more extensive study at the University of Mainz has reached similar conclusions: "Almost all participants stated that reading from paper was more comfortable than from an e-ink reader despite the fact that the study actually showed that there was no difference in terms of reading performance between reading from paper and from an e-ink reader." The study also found that "the older participants exhibited faster reading times when using the tablet PC." (Source)

The general assumption in Germany is that a strong cultural preference for print on paper is likely to persist. It may, of course, but if the US experience offers any indication, the resistance may give way to the convenience of having multiple works on a single device. On my iPhone right now I have a dissertation and 4 novels. The iPhone is a bit small for dissertation reading, but I never read scholarly works on paper any longer because I want to be able to search them and to look up references simultaneously. I also buy fewer and fewer novels in paper form because I do not want to have to carry one more object with me.

A few years ago I thought that an interesting study would be to sit in the Berlin S-Bahn (elevated train) and count the number of people using eReader devices. Now such a study would be harder, because such a large number of people spend their transit time doing something on their smart phones, but whether that is reading, playing games, or sending email is hard to say. Or perhaps what they are doing is irrelevant. Whatever they are doing, it seems to involve reading on an electronic device.

CyberAnthropology

2011-10-27T13:51:00.000-07:00

I spent the day at the "1st Berlin Symposium on Internet and Society". While the topics were in general very narrowly focused on policy and regulation, the session I attended this afternoon was on "CyberAnthropology -- Being Human on the Internet". While I am not in a strict sense an anthropologist, I have explicitly used methods from cultural anthropology for close to 40 years and thus found the topic interesting.

The session began with a critique of the presenters' paper in which the speaker noted that the presenters had done no original data collection of their own. In my own part of the academic world that criticism would probably be grounds for rejection from any serious scholarly symposium or conference.

It further became clear that the presenters had almost no background in contemporary scholarly anthropology. Their approach was to throw out the whole empirical basis of contemporary anthropology as too narrow and to replace it with a cloud of philosophers starting with Plato and Aristotle and ending with Paul Ricoeur and Derrida. (Note: I was at the University of Chicago during the years when Ricoeur was there -- his concepts are not entirely new to me.) The presenters believe that "philosophical anthropology" and their understanding of hermeneutics eliminate the need for standards of evidence and rules for persuasion. At least that is what I heard them say over and over again. Nonetheless it is interesting that they welcome data from others. I wonder why?

True to their law faculty roots, the presenters have already acquired the rights to http://www.cyberanthropology.de. Others have already captured the name http://www.cyber-anthro.com/, so the brand is not exclusive.

My objection to this symposium paper is partly that I regard it as an embarrassment to my own Philosophical Faculty 1, which houses the university's departments of philosophy and European Ethnography (cultural anthropology). It does not reflect our standards.

AnthroLib moves

2011-10-20T11:54:00.000-07:00

Nancy Foster wrote yesterday that AnthroLib has moved to a new address at the University of Rochester Library. A feature that I had not noticed before is the link to a bibliography in a Zotero Group. The bibliography is quite new (started apparently in September) and doesn't have much in it yet, but I suspect it will grow fast. Some links are to Proquest and seem to assume that everyone has the same level of Proquest access. From Berlin at least the links do not work. Nonetheless the list of articles is interesting.

The map is a typical Google map with the odd quirk that it can start moving and be difficult to stop without reloading the location. The flaw may lie in how I touch the map screen with my cursor. It is mildly annoying. The map shows that most of the AnthroLib projects are US-based and generally east-coast, but perhaps we can get some started in Berlin.

O'Reilly Media Ebook report

2011-10-18T05:03:00.000-07:00

I am thankful to Jim Campbell (University of Virginia) for sending me a link to the O'Reilly Media report on "The Global eBook Market: Current Conditions & Future Projections" by Rüdiger Wischenbart with additional research by Sabine Kaldonek. The strong growth of eBooks and eBook popularity in the US and UK is not yet reflected in Germany, though acceptance of reading on the screen has grown since 2009. As the report says: "ebooks at this point have a difficult stand against a cultural tradition that places (printed) books and reading high on the scale for defining a person’s cultural identity."

While I read a great deal on the screen and insist that I only read student papers in electronic form, I admit that I take pleasure in Berlin's excellent book stores with their intelligent selections and recommendations. As physical places, they are a delight. I wish they offered eBooks in house the way Barnes & Noble does, though. Then they would be perfect.

Archiving in the Networked World: Preserving Plagiarized Works (Abstract)

2011-09-11T06:44:00.000-07:00

This article will appear in Library Hi Tech, v29, no. 4, which should be available in November 2011 in preprint form. The abstract is below.

Purpose: Plagiarism has become a salient issue for universities and thus for university libraries in recent years. This article discusses three interrelated aspects of preserving plagiarized works: collection development issues, copyright problems, and technological requirements. Too often these three are handled separately even though in fact each has an influence on the other.

Methodology: The article looks first at the ingest process (called the Submission Information Package or SIP), then at storage management in the archive (the AIP or Archival Information Package), and finally at the retrieval process (the DIP or Distribution Information Package).

Findings: The chief argument of this article is that works of plagiarism and the evidence exposing them are complex objects, technically, legally and culturally. Merely treating them like any other work needing preservation runs the risk of encountering problems on one of those three fronts.

Implications: This is a problem, since currently many public preservation strategies focus on ingesting large amounts of self-contained content that resembles print on paper, rather than on online works that need special handling. Archival systems also often deliberately ignore the cultural issues that affect future usability.

Microsoft Research Summit 2011 - day 3

2011-07-23T13:00:00.000-07:00

Dinner Cruise

The dinner cruise turned out not to be especially cold, since the ship had large indoor areas where we ate. Microsoft also provided an open bar at this and in fact at all the dinners. Often the wine at such functions is dubious, but even the wine was good and the quality of the selection of micro-brew beer was equally impressive.

Of course the goal of the dinner was not food or drink or even the scenery along the lake, but the conversation among colleagues. I ate with fellow deans from Illinois, Michigan, and Carnegie Mellon, and even if no great research comes from our discussions, collegial discourse is an important social component in the efficient functioning of organizations and projects.

Day 3

I should remember more of day 3 than I do, but jet lag had not quite lost its hold and the morning presentations, while good, left little permanent impression. The main event of day three was in any case the iSchool meeting with Lee Dirks and Alex Wade from Microsoft Research. Lee and Alex gave some sense of the projects they are working on. Lee especially has an interest in long term digital archiving that includes involvement in projects like PLANETS. While testing is an official component of PLANETS, I find that it puts less emphasis on testing than on planning and organization. Testing is, however, what is really needed and that is what I tried to suggest in the meeting -- not, I think, with great success.

The other research aspect that I tried to sell, without much obvious resonance, was ethnographic research on what digital tools people really use and what they really want. Microsoft builds tools and we saw a lot of them that are oriented toward research, but I wonder how well some of them will do in the academic marketplace in the long run. Ethnographic research gives deeper insights into what people understand and misunderstand than do surveys.

Just before I left we held the oral defense of the thesis for one of my very best MA students [1], whose thesis looked at how a group of literature professors at Humboldt-Universität zu Berlin (which actively supports Open Access) regard Open Access. It was striking how much they misunderstand Open Access and how little they know about it. Most of them would never have filled out a survey. This information would just have slipped away or remained as an anomaly. Microsoft could profit from research like this and had an interest in it in years past. It is less clear that it does today.

[1] Name available on request, with the student's permission.

Microsoft Research Summit 2011 - day 2

2011-07-19T16:48:00.000-07:00

Cosmos: big data and big challenges.

Pat Helland talked about massively parallel processing based on Dryad using a Petabyte store and made the point that these massive systems process information differently. Database processing in this environment involves tools like SCOPE, which is an SQL-like language that has been optimized for execution over Dryad. Saving this data long term is a problem because they are worried about bit rot. Cosmos keeps at least three copies of the data, checks them regularly, and replaces data that is damaged. Interesting how close this is to LOCKSS (which saves 7 copies).
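The keep-three-copies, check, and replace cycle can be illustrated with a minimal sketch. It is not how Cosmos or LOCKSS is actually implemented: the talk did not say how damage is detected, so the use of SHA-256 digests and a majority vote among replicas are assumptions, and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a block of bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Overwrite any replica whose digest disagrees with the majority.

    Returns the repaired replicas and the indexes that were replaced.
    Raises ValueError when no digest has a strict majority (assumed policy).
    """
    digests = [digest(r) for r in replicas]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority among replicas; cannot repair safely")
    good = replicas[digests.index(winner)]
    damaged = [i for i, d in enumerate(digests) if d != winner]
    repaired = [good if i in damaged else r for i, r in enumerate(replicas)]
    return repaired, damaged


# Three copies of the same record; the third has one corrupted byte.
copies = [b"archived record", b"archived record", b"archived recorc"]
repaired, damaged = audit_and_repair(copies)
print(damaged)
```

With only three copies, a single damaged replica can be outvoted, but two damaged replicas cannot, which is one reason a system might keep more copies.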

In talking about how faster processing is not always the solution to processing problems, a speaker quoted Henry Ford as saying: “If I had asked my customers what they wanted, they would have said faster horses...” (source: Eric Haseltine, Long Fuse and Big Bang.)

NUI (Natural User Interfaces)

One of the speakers distinguished between making user interfaces imitate nature and making them feel natural. Non-verbal clues that convey meaning need to be part of the interaction. Another speaker said that we need better feedback systems. An example is a touch screen with a single button. If a user touches it and nothing happens, the user will hit it again and again. When designers changed the button to send out sparkles when touched, the repeated touching stopped, even though sparkles are not a natural result of touching a screen.

In the discussion someone said that we aren't doing science if we can't go back to the data. This suggests a clear separation of data and processing that speakers about very large data sets said is no longer really possible, since the data is usable only with a degree of processing. The speaker was doing fundamentally social science research, which is more human-readable than Big Science data.

Other comments of interest:

Do we need to take the “good” into account in our interpretation of the “natural”?
It's not a machine that we are interfacing to any more, though the speaker is not sure what it is exactly. There is no machine, but a task.

Microsoft clearly has a strong interest in image management, particularly three-dimensional images such as are used in medical imaging (doubtless a good market) or gaming. They are dividing a picture into quadrants and creating mathematical representations of the edges in each square to create a hash to search for similar photos. Photosynth.net was also demoed -- it allows the creation of three-dimensional images from multiple photos.

Evening Cruise

Microsoft has planned a dinner cruise for the evening. It should be pleasant (I will comment tomorrow), but many of us wanted to go back to the hotel to leave computers, etc., and to change clothes because it is fairly chilly out (despite the heat wave in the rest of the US).

Microsoft Research Summit 2011 - day 1

2011-07-18T16:35:00.000-07:00

Microsoft Research invited me to the Summit as part of the iSchool deans group. This blog posting (and several that follow) has my notes and comments.

Plenary

Tony Hey opened the Summit with a talk about changes in scholarly communication, including ways of evaluating output. One of the reasons he left academia had to do with the ranking process at British universities. He emphasized that Microsoft is open (as Steve Jobs recently admitted). Microsoft is now working with the OuterCurve Foundation “to enable the exchange of code and understanding among software companies and open source communities”.

One of the major goals of the conference – a goal that speakers emphasize repeatedly – is to network. This leads to a type of intellectual market in which researchers try to sell their ideas to others who are also trying to sell ideas. Theoretically everyone is a potential idea buyer too, but realistically mostly people want to sell to Microsoft to get research money. This makes Microsoft staff very popular.

As is typical of conferences of computer-oriented people, the wireless network is periodically unable to keep up with the demand. Part of the problem comes from people viewing data-intensive websites related to the presentations (I tried too). Nonetheless it seems like a problem that a corporation like Microsoft should be able to overcome. Happily the problem went away once the plenary session ended. Too many people using too few access points.

Breakout Session: Federal Worlds meet Future Worlds

Howard Schrove from DARPA talked about two models of survivability: the “Fortress” model (which is rigid and doesn't work against an enemy who is already inside) and the “Organism” model (which is adaptable). There is a balance in biology between fixed systems that address known threats, and adaptable systems that address new dangers. The underlying causes of problems in computers come from a few known sources, especially the difference between data and (executable) code. The speaker said that hardware immunity is relatively cheap to develop. Self-adaptive defensive architecture is an adaptive method for software that checks behaviors that compromise it and implements on-the-fly fixes. Instruction sets can be encrypted and randomized. Networking is a vulnerability amplifier, but if the cloud has an operating system that functions essentially as a public health system for the cloud, it may be possible to move the solutions out faster than the attack progresses. A quorum computation can check whether certain systems have been compromised. The result could involve reduced performance and randomization to confuse the attacker. Biosocial concepts are the underpinning of resilient clouds.

Breakout Session: Reinventing Education

Kurt Squire presented some of the educational gaming development that he is working on to get people to have richer experiences with topics like the environment (in particular consequences for a county in Wisconsin) and medicine (in particular identifying breast cancer). Seth Cooper presented a game called FoldIt, where the goal is to fold biochemicals. Problem-solving is fun and that is part of what makes the games interesting. Tracy Fullerton, a professional game designer, spoke about why traditional assessment takes the fun out of game design. She explained the “yes, and...” game (where you must preface each statement with “yes, and...” rather than “but” or “no”), which helps build collaboration.

Closing plenary

The closing plenary included a variety of speakers. One talked about how Microsoft has been trying to enhance the security of its code. Another spoke about a new app that allows people to write programs on their mobile phones. A developer spoke about echo cancellation for enhancing speech recognition (primarily for gaming). Conclusion: computing research has incredible diversity, and rarely is exclusively "basic" or "applied".

Computational Thinking

2011-07-11T12:35:00.000-07:00

I first heard this concept during David De Roure's talk at the Bloomsbury Conference (see Blog entry for 3 July 2011) and want to take this opportunity to define computational thinking for the sake of my students and to apply it to digital archiving (and related projects).

Definition

“Computational thinking” is (at least in my definition) processing information the way a computer processes it with the existing tools and systems. These tools and systems change over time and computational thinking has to shift over time as well. At present it implies some understanding of, for example, how to use regular expressions to identify specific text strings and how to search indexed information to find matches. Computational thinking tends to be literal (this string with this specification at this time) and tends to be unforgiving (there is no accidental recognition of what the author really meant). Computational thinking is what students ideally learn in their first computer programming class. The “born digital” generation has no advantage here and perhaps even a disadvantage, since they did not have to think computationally when they first interacted with computers.
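A small sketch may make the "literal and unforgiving" point concrete. It uses Python's standard re module; the sample sentence and both patterns are my own illustrations and do not come from any particular archive.

```python
import re

# Hypothetical sample text with three spellings a human reads as the same word.
text = "We compare check-sums, checksums and Check Sums."

# A literal, case-sensitive pattern finds exactly one spelling and nothing else.
strict = re.findall(r"\bchecksums\b", text)

# Every variant a human would accept has to be written into the rule explicitly.
tolerant = re.findall(r"\bcheck[- ]?sums\b", text, flags=re.IGNORECASE)

print(strict)
print(tolerant)
```

The first pattern silently misses two of the three spellings. The second finds all three only because the hyphen, the space and the capital letters were anticipated in the rule.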

Computational Thinking and Archiving

In class I talked with students about the differences between the implicit definitions of integrity and authenticity in the analog and digital worlds. One of my favorite examples is a marginal note in a book. While we as librarians tend to discourage readers from marking in library books, we would not throw out a book as irreparably damaged because of a marginal note. A marginal note by a famous author can even add value (the Cornell CLASS project is an example). In the digital archiving world, however, we judge integrity by check-sums and hash-values. We do not look at the content, but at whether two or more check-sums agree with each other. Since a marginal comment changes the check-sums of (for example) a PDF file, we would replace that copy in a LOCKSS archive.
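As a rough sketch of how this judgment works, the following compares check-sums of three byte strings. SHA-256 is my assumption here (archives use a variety of hash functions), and the file contents are hypothetical placeholders rather than a real PDF.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical contents standing in for a stored file and two other copies.
original = b"%PDF-1.4 hypothetical file contents"
replica = bytes(original)                                 # untouched second copy
annotated = original + b" [marginal note: see p. 12]"     # copy with a note added

replica_agrees = checksum(original) == checksum(replica)
annotated_agrees = checksum(original) == checksum(annotated)

print(replica_agrees, annotated_agrees)
```

The comparison never looks at what the note says. Any change to the bytes, valuable or not, makes the annotated copy disagree with the others.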

If readers wanted to add a marginal comment to a file without changing its integrity (that is, its check-sum), then they could add the comment external to the file with a mechanism (search or index) to locate where it belongs in the original file. This is not necessarily trivial, but is certainly doable as long as content is not regarded as a single file, but as a set of interacting resources. Merely thinking about this choice is computational thinking.
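One possible shape for such an external comment is sketched below. The Annotation class, its field names and the simple text search used to locate the comment are all hypothetical choices of mine, not an existing annotation standard.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A comment stored outside the file it refers to (hypothetical structure)."""
    target_sha256: str  # check-sum of the file the comment belongs to
    anchor_text: str    # passage used to find where the comment belongs
    comment: str


def locate(content: str, note: Annotation) -> int:
    """Return the character offset of the anchor passage, or -1 if it is absent."""
    return content.find(note.anchor_text)


content = "Integrity is judged by check-sums. Authenticity is harder."
digest_before = hashlib.sha256(content.encode("utf-8")).hexdigest()

note = Annotation(digest_before, "Authenticity is harder", "Compare the analog case.")
offset = locate(content, note)

# The file itself was never touched, so its check-sum is unchanged.
digest_after = hashlib.sha256(content.encode("utf-8")).hexdigest()
print(offset, digest_before == digest_after)
```

The file and the comment are two separate resources that interact through the stored check-sum and the anchor passage, which is the shift in thinking described above.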

Computational Thinking and digital cultural migration

Computational thinking is needed in order to recognize content that may not be readily comprehensible in future eras. These are words or phrases that will likely be obscure to future (human) readers, but the machine needs specific rules to follow. For example the city of New York is likely to remain familiar in 100 years, but Saigon (now Ho Chi Minh City) may well be hard for ordinary readers to recognize, unless the name changes back, in which case readers may need help with Ho Chi Minh City instead.
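A very small sketch of what "specific rules" could mean in practice follows. The one-entry rule table and the function name annotate_places are hypothetical; a real system would need a maintained gazetteer and a decision about which name counts as current for a given reader.

```python
import re

# Hypothetical rule table: name found in the text -> name assumed familiar to the reader.
PLACE_RULES = {"Saigon": "Ho Chi Minh City"}


def annotate_places(text: str, rules: dict) -> str:
    """Append a bracketed gloss after every place name that has a rule."""
    def add_gloss(match):
        name = match.group(0)
        return f"{name} [now {rules[name]}]"

    # Build one alternation from the rule keys; names are escaped to stay literal.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, rules)) + r")\b")
    return pattern.sub(add_gloss, text)


result = annotate_places("She flew from New York to Saigon.", PLACE_RULES)
print(result)
```

New York passes through untouched because no rule mentions it. If the name changed back, the rule table would have to be reversed, which is the maintenance problem in miniature.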

Of course computational thinking may change substantially when natural speech recognition improves to the point that computer-based comprehension is not fundamentally worse than human comprehension. Or it may require a different type of computational thinking. Any assumptions will likely err in some direction.

ICE Forum and Bloomsbury Conference

2011-07-03T08:03:00.000-07:00

This post reports on two interrelated and back-to-back meetings: the International Curation Education (ICE) Forum (sponsored by JISC) and the Bloomsbury Conference (sponsored by University College London or UCL). Both took place in the Roberts Building at UCL (which is, interestingly enough, next to where I often lived in London in the 1970s at the now-vanished Friends International Centre).

Overlap among the attendees was only partial: I would estimate that about a third of the registered attendees went to both. The University of North Carolina and Pratt had particularly strong representation, the former because of research projects, the latter because of a summer school for students. This post will not discuss all of the presentations, only a few points that seemed important to me.

ICE Forum

My own talk at the beginning of the ICE Forum addressed the question of whether the world needs digital curators. My answer talked about the need for digital cultural migration to make content comprehensible over long periods of time. The first half explained what this meant and the second looked at how we can design software to help migrate content. When I have talked about this to library groups, the audience largely sees the need as obvious. Many archivists in this audience felt outraged. One argued that archivists ought to leave it to future generations to interpret content. Another listener felt that machine-based interpretation and migration was too mechanical and allowed too little scope for human sense-making – though she grew thoughtful when I suggested that writing code to interpret a file was not fundamentally different than other forms of writing about it. I will say more about digital cultural migration in a future post.

Seamus Ross (Toronto) gave the closing talk at the ICE Forum, in which he quoted Doron Swade that “software is a cultural artifact”. His argument followed my own theme closely in saying that information needs to be annotated and reannotated to be useful for the future. He emphasized the need for case studies like those in law or business school, and he recommended accrediting not the schools but the graduates. We talked about whether effective accreditation was possible without legal requirements and agreed that it would help. Some in the audience disliked the idea of individual accreditation as creating an elite. This did not bother either Seamus or me.

Bloomsbury Conference

Carol Tenopir (Tennessee) discussed a research project to test a hypothesis that scholars who use social media read less. (Turns out that that is not true.) Some of her statistics were especially interesting. Among scholars:

Electronic sources

In 2011 88% of scholarly reading in the UK came from an electronic source (94% of those readings from a library).
In 2005 54% of the scholarly reading in the US was from an electronic source.

Screen reading

In 2011 45% of the scholarly reading was done on the computer screen and 55% of scholars printed a copy.
In 2005 19% of the scholarly reading was done on the computer screen.

While the studies were done in different locations (US & UK) at different times, the expectation is that the country makes no significant difference. A substantial decline of personal subscriptions combined with a substantial improvement in the quality of computer screens could be significant factors. Carol's article is online in PLoS ONE.

David De Roure (Oxford eResearch Centre) talked about Tony Hey's book on the “Fourth Paradigm”. Data-centric research is talked about as if it is new, but (David pointed out) the arts and humanities have done it for a long time. One of the challenges is to get people to think computationally. People also need to stop thinking in terms of “semantically enhanced publication” and to shift their thinking toward “shared digital research objects.” As an alternative to thinking in “paper-sized chunks”, Elsevier now offers an “executable paper grand challenge”. Perhaps Library Hi Tech should too.

Carolyn Hank (McGill) gave another notable talk. Her dissertation research was on scholars who blog and the blogs themselves. She did purposeful sampling drawing from the academic blog portal. Of 644 blogs 188 fit her criteria and 153 completed the sample. 80% of the authors considered their blog to be a part of the scholarly record. 68% also said that their blog was subject to critical review. 76% believed that their blog led to invitations to present at a conference. 80% would like to have their blogs preserved for access and use for the “indefinite future”.

The last presentation that I heard was by Claire Ross, a doctoral student in digital humanities at UCL. While talking about the effect of social media, she told how she tweeted about her interests when she arrived at UCL and almost immediately got a response from a person at the British Museum that led to a research project. She uses her blog to show her research activities and argued that Twitter enables a more participatory conference culture. I confess that blogging about conferences makes me listen more closely. Perhaps I should try twittering too. Among her (many) interests is the internet-of-things (especially museum objects), which fits well with the Excellence Cluster (Bild Wissen Gestaltung) that we are developing at Humboldt-Universität zu Berlin.

Archiving in the Networked World: Metrics for Testing (abstract)

2011-06-23T10:48:00.000-07:00

This article will appear in Library Hi Tech, v29, no. 3, which should be available in August 2011 in preprint form. The abstract is below.

Purpose: This column looks at how long term digital archiving systems are tested and what benchmarks and other metrics are necessary for that testing to produce data that the community can use to make decisions.

Methodology: The article reviews recent literature about digital archiving systems involving public and semi-public tests. It then looks specifically at the rules and metrics needed for doing public or semi-public testing for three specific issues: 1) triggering migration; 2) ingest rates; and 3) storage capacity measurement.

Findings: Important literature on testing exists but common metrics do not, and too little data is available at this point to establish them reliably. Metrics are needed to judge the quality and timeliness of an archive’s migration services. Archives should offer benchmarks for the speed of ingest, but that will happen only once they come to agreement about starting and ending points. Storage capacity is another area where librarians are raising questions, but without proxy measures and agreement about data amounts, such testing cannot proceed.

Implications: Testing is necessary to develop useful metrics and benchmarks about performance. At present the archiving community has too little data on which to make decisions about long term digital archiving, and as long as that is the case, the decisions may well be flawed.

German Library Conference in Berlin

2011-06-18T00:33:00.000-07:00

The German Library Conference (Bibliothekartag in German) took place last week in Berlin at the Estrel Conference Center in the (far) south east corner of Berlin. The theme of the conference was “Libraries for the Future; the Future for Libraries” and as the theme implies German libraries are aware that the information world is changing in ways that they cannot simply ignore. A friend describes the conference as a purely incestuous association meeting. The Bibliothekartag is certainly more like the American Library Association meeting than purely scholarly conferences like JCDL or TPDL (formerly ECDL). Nonetheless such meetings are important both to measure the readiness of ordinary libraries to make changes and as an opportunity to educate the profession about topics that they approach with considerable reserve.

I attended only a few sessions because of concurrent meetings at my University. One was by Lynn Connaway from OCLC Research. Lynn was one of relatively few speakers who spoke in English – which the German audience understood without any apparent problems. She spoke about a JISC funded project in which her task was to find common results among a number of user-studies. A point that she passed over quickly in the talk (but which we spoke about in greater detail privately) was the difficulty in finding exactly how some of the studies were done: how the subjects were chosen, how exactly the data were gathered, or how they were analyzed. Among the common conclusions that she reported were:

Virtual Help. Users sometimes prefer online help even in the library because they do not want to get out of their chairs.
Squirreling instead of reading. Many users squirrel away information and spend relatively little time working actively with contents.
Libraries = books. Many people think of libraries primarily as collections of physical books and often do not realize the library's role in providing electronic resources. They also criticize the physical library and its traditional collections.

I also attended a session that was entitled “Networked Libraries: Service providers for networked data.” Many of the talks discussed linked data or linked open data. Jakob Voss gave the initial lecture and used a visual metaphor of bridges to make the point both about the need for connections and their fragility (one of his slides showed a bridge that had collapsed). The final presentations in this session focused on digital archiving. The first looked at research data with the idea that “data is the new oil”. One major step forward is that DFG and NSF both now require data management plans for data from supported projects. A serious issue is the long term costs for archiving research data, which both nestor and MIT are beginning to examine. The second archiving talk was mine on the LuKII Project (LOCKSS und KOPAL: Infrastruktur und Interoperabilität). In my overview I mentioned the need to understand cultural as well as technical migration -- that is, our cultural understanding of information changes over time, just as do the formats. This evoked some interest during the discussion.

AnthroLib map

2011-06-01T12:15:00.000-07:00

Nancy Foster, one of the editors of "Studying Students: the Undergraduate Research Project at the University of Rochester", has published a map of anthropologically based library studies that not only shows the geographic distribution, but also shows an impressive number of projects.

In Berlin's "winter" semester (that begins in October 2011), I plan to offer a seminar with a colleague where some students work with her using psychological experiments to evaluate digital libraries, and I use ethnographic methods. Many of the projects in Nancy's map seem to address physical library spaces, but my interest is in digital space.

In talking about cultural anthropology with students I often quote Clifford Geertz:

The "essential task of theory building here is … not to generalize across cases but to generalize within them. ...The diagnostician doesn't predict measles; he decides that someone has them…" (Geertz, 1973, p. 26)

In looking at a particular digital resource the goal should not be to generalize about all users, but to understand the culture and quirks of those who use (or do not use) that particular source. This is very much like the goal of those studying students at particular libraries, except that the users are not necessarily physically there -- which can be a problem.

For the seminar we may well use some portion of the digital presence of our own university library. One possibility might be to repeat some experiments from Studying Students (for example, the one in which students redesign the library website), but to use culturally different sets of users (humanists and natural scientists? German students and Germanists in the US?) to understand different design preferences.

If anyone has done ethnographic experiments in digital space, I would be interested in hearing about them.

References
Geertz, C. (1973) Thick Description: Toward an Interpretive Theory of Culture, In: The Interpretation of Cultures: Selected Essays. New York, Basic Books, pp.3-32.

Rosetta

2011-05-28T07:34:00.000-07:00

On Friday I heard a presentation by Ex Libris staff about their Rosetta long term digital preservation system. Marketing presentations generally do not interest me, but the presenter was the project manager and could in fact answer questions about technical issues.

Bitstream Integrity

Basically this is not a problem that Rosetta addresses directly, but it also does not deny its importance. They have in fact talked with David Rosenthal about it. The system structure separates the bit-management from other layers and allows multiple solutions, including those that do active integrity checking. When Rosetta must manage the storage directly, it uses checksums and does periodic integrity testing against the stored copy. But if the copy's checksum does not match the stored checksum, then they can only ask for someone else to give them a new copy, which could be troublesome in 100 years or so.
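The general pattern of a periodic check against a recorded checksum, followed by a search for a replacement copy, might look like the sketch below. This is not Rosetta's code or API; the function names audit and repair, the use of SHA-256 and the sample data are all my own assumptions.

```python
import hashlib
from typing import List, Optional


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def audit(stored_copy: bytes, recorded_checksum: str) -> bool:
    """Periodic integrity test: does the stored copy still match its recorded checksum?"""
    return sha256_of(stored_copy) == recorded_checksum


def repair(recorded_checksum: str, other_copies: List[bytes]) -> Optional[bytes]:
    """Return the first copy held elsewhere that matches the recorded checksum, if any."""
    for candidate in other_copies:
        if audit(candidate, recorded_checksum):
            return candidate
    return None


# Hypothetical data: one intact object and one copy damaged by a single character.
intact = b"archival object"
recorded = sha256_of(intact)
damaged = b"archival 0bject"

damaged_passes = audit(damaged, recorded)
recovered = repair(recorded, [damaged, intact])
unrecoverable = repair(recorded, [damaged])
print(damaged_passes, recovered == intact, unrecoverable)
```

The last case is the troublesome one: the audit detects the damage, but with no intact copy available anywhere else there is nothing to repair from.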

We talked about whether LOCKSS might integrate with Rosetta at this level. The general answer seemed to be yes, or at least that it might be worth a try.

Authenticity

Rosetta does maintain provenance information, but has no way to link back to check against the original to make sure that the authenticity remains synchronized. This problem is not unique to Rosetta. The digital preservation field really needs to develop reasonable criteria for authenticity testing.

Access

Here Rosetta seems to do a good job in making various access copies and controlling the access rules.

Risk Manager

This feature appears to function something like the migration manager in koLibRI. It uses a database to keep track of technical metadata about formats and versions. Rosetta has a knowledge base that allows institutions "to share their formats, related risks and applications". Rosetta has a work-group to enhance the knowledge base as well. [Thanks to Ido Peled for this addition.]

We talked about the risk problem generally with format change. It is not really 0 or 1, but more likely a scaled reduction of access to certain formats. Clearly there needs to be more thinking about when to trigger migration and what kind of migration (on-the-fly or preventative) makes sense.

Load Speeds

I was pleased to see that Rosetta has tested its performance loading different sizes of data and that the information is publicly accessible (see figure 3). I have talked with a number of publishers recently that have concerns about the ability of archiving systems to load their contents in a timely manner and I think other systems should test their ingest times.

Conclusion

The session ended with our agreeing to talk more about potential collaboration in the research arena.

ANADP in Tallinn - day 3

2011-05-24T22:51:00.000-07:00

Educational Alignment Panel

At the EU level, there is an effort to get more involved with professional bodies. Internships play no significant role in UK digital preservation education, since the masters there tends to be a one-year degree. Knowledge Exchange is interested in these developments. The Library of Congress has collaborated with a number of US schools to establish internships, which benefit both the Library and the interns, as well as the professions that they later enter.

One of the biggest challenges is how to identify the essential skills needed for digital preservation. What correlates to bookbinding in the digital world? It may be programming. The students need much more technical competence. While adding courses step by step may be insufficient, finding the time for curriculum reform is challenging. Addressing the funding dilemma is a key aspect and George Coulbourne (LC) suggests corporate partnerships to share costs and responsibilities. In the question and answer period, the question of a "new" profession vs mainstreaming the new skills in the old profession arose. We need to remain aware of the difference between education and mere training that focuses only on particular skills and belongs to ongoing professional development.

Economic Alignment Panel

Costs are a vital issue for any digital archive. Sharing tools and collaboration are ways to manage costs. Examples include LOCKSS and NDIIPP. In Italy MiBAC offers a legal deposit service for its small institution partners. We can also learn from failed initiatives. PADI was, for example, discontinued after 10 years (see ACRL), largely for economic reasons because the national library ended up having to do most of the funding. Neil Grindley used an analogy with the computer game Asteroids -- in his version funders like JISC fire money at big problems like digital preservation in the hope of breaking the problem (asteroid) up. But what angle should the funder take when it funds? Looking back at the JISC funding efforts, Neil wonders whether someone should write a "really good" history of digital funding. JISC has been doing some cost-modeling. Archival storage is consistently a small portion (15%) of overall project costs. Repairing problems costs significantly more than initial preservation. The UK is building a higher education cloud infrastructure. PEPRS (Piloting an e-journals preservation registry service) is trying to build similar infrastructure for preservation.

In the Czech Republic a funding problem is that digital preservation is invisible and often ignored in favor of digitizing more documents. Electronic deposit began only in 2011 as a pilot project, but digitization began in the 1990s with historical manuscripts, and with endangered newspapers and monographs. The aim is to digitize 26 million pages by 2014. The budget is 12 million Euros.

Digital preservation is the flipside of collection development. At Auburn University in Alabama they are using distributed digital preservation in a Private LOCKSS Network. 7 institutions have joined the Alabama PLN and it has been self-supporting since 2008. The fee-base varies from $300 to $4800 per year. Governance took longer to establish. The guiding principles: keep it simple, keep it cheap, don't build something new if you don't have to. Recommendation: stop chasing soft money and start making tough choices about local commitment.

Breakout session: Education

A former student from the Royal School in Copenhagen suggested that we consider the Erasmus Mundus program and put together a focused program for that funding source. The students would like more specific job expectations, but the expectations vary widely. Employers look for the right mindset, not the right skill-set. Squeezing in internships is hard. From the employer perspective, an internship is like a year-long interview. Many of the schools have active hands-on programs that emphasize teamwork and practical problem-solving.

Summary Presentations

Benchmarking takes data from content providers and some are ready to make data available. We also need to communicate about benchmarking and other tests.

Cliff Lynch offered an "opinionated synthesis." What does this term "alignment" mean? Making our limited economic and intellectual resources go further through collaboration is obviously beneficial. Another aspect of alignment is that a common case should speak more effectively to governments.

In the tech discussion there were valuable conversations about benchmarking and testing. We need to be clear what we mean by interoperability, what we want to accomplish and what we want to get out of it. Two topics were missing: monoculture and hubris. We will have more confidence that we know what we need to do in 100 years and diversity in the system is a valuable antidote to the mistakes we make. We need to focus on the bit-storage layer and there will be a lot of money flowing in this area. Security and integrity are topics that were mentioned but need more focus. Imagine a WikiLeaks-type leak of embargoed content. It would undermine the trust in cultural preservation institutions.

Strategies inside the national level were not discussed as much as they should have been. The question of the replication of material across organizations also needs more discussion. It is interesting that we see standards in so many roles in digital preservation. The legal issues are becoming more and more dominant and we need to look for more opportunities to collaborate here. The one thing he would note on education is that the discussion needs to feed back into the national discussions. We did not talk much about scale in the discussion about economics. The risk management tradeoff for digitizing needs assessment.

The elephant in the room is e-science and e-scholarship. There is a lot of money involved here and big investments. This is not a place where many national libraries have been involved, though universities often are. This is driving both technology and some educational efforts. Other smaller elephants are audiovisual material and the newborn digital contents.

There are two additional axes that matter. One is making the case outside our community. The second is collecting policy. News has been a fundamental part of the public record and we know that it is fundamentally changing its character. Also software, personal records, social media. Cliff hopes that this is helpful in providing a frame for the conversation we have had in the last two days.

Links

Another blog about the ANADP conference is Inge Angevaare's Long-term Access blog (this portion of the blog is in English -- sometimes it is also in Dutch).

ANADP in Tallinn - day 2

2011-05-23T22:51:00.000-07:00

Keynote

The keynote speaker at the ANADP conference for the second day was Gunnar Sahlin from the National Library of Sweden. One of the National Library's explicit tasks is to support university libraries. Open access and e-publishing are key initiatives together with the other 4 Nordic countries. Linked open data is a more problematic topic because of resistance by publishers, but the National Library strongly supports Europeana's efforts in this area. There is a close cooperation with the public sector, especially Swedish radio and television. The Swedish Parliament is considering a new copyright law that may clarify some issues.

Standards panel

The standards panel began with the idea that we have both too many standards and too few. Standards can be seen as a sign of maturity in a field. Digital preservation has not only its own standards, but many from other areas -- a Chinese menu of choices. Information security standards are for preserving confidentiality, integrity, and the availability of information. Many memory institutions have to comply with these standards. The issue was especially important for Estonia because of internet attacks, especially denial of service attacks. In general information security is well integrated into plans in the Baltic countries, but long term digital preservation is not. Only 12% have an offsite disaster recovery plan.

The UK Data Archive has been an archive for social science and humanities data since 1967. "A standard is an agreed and repeatable way of doing something -- a specification of precise criteria designed to be used consistently and appropriately." In fact many standards are impractical, with unnecessary detail (8 [?] pages to explain options for gender in humans). Cal Lee spoke about 10 fundamental assertions, including that no particular level of preservation is canonically correct. Context is the set of symbolic and social relationships. With best practices and standards, trust is a key issue. PLANETS is concerned about quality standards and such standards begin with testing. Trust consists of audits, peer-reviewing, self-assessment, and certification. The process moves from awareness to evidence to learning. The biggest technology challenge comes from de facto standards from industry, and we have little control there. Good standards have metrics and measurement systems. Within our lifetime everything that we have as a preservation standard now will be superseded, but the principles will remain.

Copyright panel

Digital legal deposit is a key element, but not a form of alignment. In the UK, for example, legal deposit is still just for print material. In the Netherlands there is a voluntary agreement that works well. Territoriality is a problem – how to define the venue in which publishing takes place in the digital world, what is unlawful, what is protected, etc. The variance in legal deposit between countries leads to gaps. The rules for diligent search for orphan works are so complicated that they are too expensive to use. Even within the context of Europeana cross border access to orphan works is a problem. In US law contract law takes precedence over copyright law. Too many licenses could undermine the ability to preserve materials.

To a question about Google a speaker said that Google's original defense of the scanning project was "fair use" (17 USC 107) and they had a good chance there. It changed to a class-action suit, which is more complicated. The breakout session on copyright went into further depth about what problems exist in dealing with copyright across national borders. Apparently a feature of Irish copyright law is that the copyright law takes precedence over private contracts. Generally contracts take priority.

Summary of the sessions

Panel chairs gave a summary of their sessions and breakout sessions. For the technical group I spoke about the need for testing, trust (or distrust) and metrics and argued that we are really just beginning to address these issues.

ANADP in Tallinn - day 1

2011-05-23T00:05:00.000-07:00

Opening

This blog is beginning with a very international conference called Aligning National Approaches to Digital Preservation (ANADP) that is taking place today in Tallinn, Estonia.

The President of Estonia opened the conference. He emphasized how technologically and digitally aware the country is and also the national library. 10 million Estonian books were destroyed during the Soviet occupation as part of an effort to erase Estonian identity. Further destruction took place when Estonia came free and people wanted to cover up their past. Digitization allows the country to preserve materials and make them accessible. The president closed by saying: "Digitizing our national memory is a cornerstone of liberty."

Laura Campbell (Library of Congress) gave the keynote address. NDIIPP (National Digital Information Infrastructure and Preservation Program) has the goal of preserving digital materials. Congress provided $100 million for this effort. LoC has worked on a distributed network to carry out this mission. The program model was to learn by doing. There was no clear pathway forward. She cited WARC development as one of the key technological components and argued that secrecy through proprietary systems does not lead to success in digital archiving. As an example she told the story of Goldcorp, a Canadian gold mining company that put their proprietary software online and offered a prize for the best recommendations on what to do. The company grew significantly as a result of crowdsourced suggestions. She recommended planning broad goals for collaboration for digital preservation and expanding national digital collections into international ones. LoC has a strong outreach program with classroom teachers to push discussion out to younger people.

Technical Alignment

The technical alignment panel looked at two issues: infrastructure and testing. Presentations on infrastructure included kopal, nestor, LuKII, and the UK LOCKSS Alliance. The presentations about testing called for benchmarking, public tests, and metrics that librarians can use when making decisions, rather than just believing vendor claims. A vendor raised questions about this, but admitted that they were not willing to share their test data, except among customers. (Note: I was panel chair and could not make detailed notes during this session.)

The panel on organizational alignment looked at long term commitment, the scale necessary to make the work cost-efficient, and effective interaction with vendors. While the EU funds many projects that promise to continue when the funding ends, most do not. TRAC fostered the Audit and Certification of Trustworthy Digital Repositories, which is now an ISO Standard. Social collaboration is a necessary element of infrastructure and the National Digital Stewardship Alliance is an attempt to address this. Distributed digital preservation is an idea as old as monastic copying. MetaArchive is a distributed digital preservation initiative that began with NDIIPP funding. MetaArchive now also has European members and has been experimenting with cross-deposit with iRODS.

Later

The day ended with a reception.