Wednesday, May 18, 2005


The keyword revolution

Over the last couple of weeks I've been learning how to play with Google Print. Although the Print database is certainly not exhaustive, I've been blown away by how many books that interest me--from both trade and academic publishers--are available for full-text searching. And I've been even more impressed by the interface. You can see full-page images of published material, with your keywords highlighted on the page.

Of course, to have access to this resource, you have to be somewhat savvy, because there is not yet a portal page on Google's site for searching books. If you don't already know it, you can tap the vast resources of Google Print in one of at least two ways:

(1) When searching at Google, begin your search string with the word "book" or "books" and then enter your query as usual. If Google Print has book pages that match your query, you should see about two or three "book results" listed above your search. (Example.) You can either click on the individual results or on the headline that sends you to all of your book results. (Example.)

From there you can search within particular books (check the sidebar of an individual result page), look at the index and table of contents for a book, and even scroll through about two or three pages around your result page. Once you are within Google Print, you can also "search all books" by using the form entry box located either at the top of the page or at the bottom. (Hat-tip: Search Engine Watch.)

(2) Another way to get into Google Print is to use this link and then enter your search at the top of the page. (Hat-tip: NT Gateway.)

The scholarly possibilities here are staggering. Google Print makes it possible, for instance, to search for published books that cite a certain book or article--a task that was difficult before without access to some kind of citation-tracking database. Most of all, Google Print makes it possible to see whether there are books that mention a particular name or word, even in passing--something that was nearly impossible to do before.

For instance, if I want to see books that mention the Kentucky abolitionist "Cassius M. Clay," I just do this and get 76 hits. In the "real" world, as they say, I would have had to determine that those 76 books were relevant to Clay, find them on the shelf, and then hope that the book's author or editor had listed "Clay" in the index. All of this was possible before, of course, for academic journals and other kinds of periodical literature. And it was even possible in digital collections of historical books, like the Making of America site or the Samuel May Anti-Slavery Collection at Cornell. But with Google Print, the digital keyword revolution has truly arrived, and the end is not in sight.

What should we make of this revolution, and how revolutionary is it? In the latest issue of Perspectives, there's an article by Carlo Ginzburg considering that question. (There are also two fantastic articles on history blogging by my fellow Cliopatriarchs, Ralph Luker and Manan Ahmed.) Ginzburg argues persuasively that keyword searching in library catalogs is good for scholarship, primarily because "the computer multiplies the possibilities that an unforeseen fact will take us by surprise." (In the above search, for instance, I was surprised to see Clay mentioned in The Education of Henry Adams as one of the morose young man's diplomatic "masters." It turns out that Clay is listed in the index of my printed copy of Education, but I don't know that I would have looked there intentionally for a mention of Clay.)

Of course, that capacity for surprise is not limitless, because we must have some reason for entering in the keywords that we do, and usually our intuitions here are guided by our prior research or the work of others. But it is significant that keyword searches allow us to navigate through texts largely without the mediation of editors, authors, and publishers.

On the other hand, the excitement of surprise can be misleading to a researcher. The temptation when doing keyword searches is always to think that your results are more representative than they are. (This is something I've mused about before.) If I look in the printed index to a book and see one page listed for Clay out of 450 or 500, I can make a rough and ready judgment about how important he is in the context of that book. But when I look at a Google results page, I depend on Google's relevancy algorithms to make that determination for me, and it's easy to forget that when I'm looking at a long list of hits. (I can still tell in Google Print how many times a word appears in a book, and how many pages the book has, but the linearity and ephemerality of a results list can be seductive. It doesn't have the same weight in your hand that the actual book does, and perhaps, subconsciously, that actual, physical extension of the book in space helps our brains make determinations about proportionality and significance.) For all the virtues of keyword searching, then, this revolution warrants some careful reflection.

You can find such reflection in a recent article by David Bell in The New Republic on "The Bookless Future." (Full disclosure: Professor Bell is the incoming Director of Graduate Studies in the Johns Hopkins history department, where I am pursuing said graduate studies.) Unfortunately, and perhaps ironically, "The Bookless Future" is only available online to subscribers. But I found the full text by using Hopkins' institutional subscription to Lexis Nexis and strongly recommend it if you can find a copy.

The bulk of the article (if, following my musings above, it is not a category mistake to talk about the "bulk" of hypertext) wonders about the future of electronic books, and it canvasses several kinds of technology, currently in development, that will hopefully make electronic books easier to read. I think Bell is right that the only thing missing is a vehicle for text that is as optimal for reading as a printed book. The technology to scan full-page images of books and make them searchable is clearly already upon us; it won't be too much longer, I predict, before you can pay a fee and pull a book from Google Print onto your PDA or some other electronic device.

But Bell also expresses warranted concern about the deleterious effects these changes might have on the practice of reading.
The very nature of the computer presents a different problem. If physical discomfort discourages the reading of [online] texts sequentially, from start to finish, computers make it spectacularly easy to move through texts in other ways--in particular, by searching for particular pieces of information. Reading in this strategic, targeted manner can feel empowering. Instead of surrendering to the organizing logic of the book you are reading, you can approach it with your own questions and glean precisely what you want from it. You are the master, not some dead author. And this is precisely where the greatest dangers lie, because when reading, you should not be the master. Information is not knowledge; searching is not reading; and surrendering to the organizing logic of a book is, after all, the way one learns.

If my own experience is any guide, "search-driven" reading can make for depressingly sloppy scholarship. Recently, I decided to examine the way in which the radical eighteenth-century thinker d'Holbach discussed warfare. I could have read his book Universal Morality in the rare-book room of my university library, but I decided instead to download a copy (it took about two minutes). And then, faced with a text hundreds of pages long, instead of reading from start to finish, I searched for the words "war" and "peace." I found a great many juicy quotations, which I conveniently cut and pasted directly into my notes. But at the end, I had very little idea of why d'Holbach had written his book in the first place. If I had had to read the physical book, I could still have skimmed, cut, and pasted, but I would have been forced to confront the text as a whole at some basic level. The computer encouraged me to read in exactly the wrong way, leaving me with little but a series of disembodied passages.
This has often been my troubling experience as well: Henry Adams makes a great quip about Clay, for instance--as a teacher, Clay had "no equal though possibly some rivals." But having previously submitted myself to the organizing logic of Adams' book by reading it cover to cover, I know better than to take Adams' quips at face value. (Sure enough, according to an editor's footnote, Adams referred to Clay in private as a "noisy jackass.") I wonder, though, whether I'm as careful with books that I haven't read. The keyword revolution at least means that I need to be especially careful--I need to balance the subversive virtues of keyword search (the "surprise" of which Ginzburg speaks) with the virtues of "surrendering to the organizing logic of a book."

All of this got me wondering, though, about whether the dangers of "strategic, targeted" reading are really that new. After all, the printed index compiled by an author or editor presents the reader with the same potential for targeted reading, and it is the rare researcher who does not rely heavily on these indexes to quickly jump to parts of a book that are relevant to his or her research. (Here are three papers online that allude to the similarity between online and offline indexes.)

The index, like the codex, predates the printed book. According to Guglielmo Cavallo and Roger Chartier in their edited collection, The History of Reading in the West,
Even beyond its immediate derivation from the manuscript, the book--both before and after Gutenberg--and the manuscript were similar objects composed of sheets folded and gathered into quires and assembled within one binding or cover. It is thus hardly surprising that all the systems of reference that have somewhat hastily been credited to printing existed well before its invention. One of these was the use of signatures and catchwords to help assemble the pages in the right order. Other signalling devices aided reading: folios, columns, or lines might be numbered; the page could be divided up more visibly by the use of devices such as ornamented initials, rubrics and marginal letters; an analytical (rather than a simple spatial) relationship between the text and its glosses could be set up; different characters or different colours of ink could be used to distinguish between text and commentary. Thanks to its organization in quires and to its clear divisions, the codex, whether manuscript or printed, was easy to index. Concordances, alphabetical tables and systematic indexes were common practice even in the age of the manuscript, and it was in monastic scriptoria and stationers' workshops that these modes for the organization of written material were invented. Printers picked them up later. (p. 23)
And programmers picked them up even later. It would be an interesting research question to see (and maybe a medieval historian can correct me if this has already been done) whether the invention of the index in the age of manuscript provoked the same kinds of anxieties we feel today about targeted access to texts. One of the contributors to the Cavallo and Chartier volume, Jacqueline Hamesse, suggests that scholastic modes of reading were shaped in part by these innovations. Unlike monastic readers, scholastics could jump from page to page and cross-reference works without the same kind of intensive, devotional reading:
"Here we enter into a new world that suggests modern reading habits. After the pioneering labours of the Cistercians to organize the content of a manuscript, other aids appeared and flourished: the table of contents, the concept index, concordances of terms, alphabetically arranged analytical tables, summaries and abridgements. Even the great twelfth-century summae were abridged: they were admittedly easier to handle when reduced to a single volume. The abridgements were a pale reflection of the originals, however.

The rise of this new literary genre inevitably meant that reading was no longer direct: now a compiler served as an intermediary, and reading was filtered by selection. Reference to the book changed. Its contents were no longer studied for themselves with the aim of acquiring a certain wisdom, as Hugh of Saint Victor had recommended. Henceforth knowledge was primary, and it took precedence over everything else, even when it was fragmentary. Meditation gave way to utility in a profound shift of emphasis that completely changed the impact of reading.

Certain scholars are quite aware of the important role of these working tools for learning in the Middle Ages, but others have failed to grasp their influence among intellectuals. As any fourteenth-century inventory will show, florilegia, concordances and tables abounded, not only in the libraries of the religious Orders, but also in college and university libraries. Such compilations often replaced consultation and, a fortiori, direct reading of authors' works, and even though they constitute a second-tier literature, their sizeable role in the intellectual preparation of medieval men cannot be denied. Today we have such different methods for acquiring culture that it is difficult for us to comprehend that even the great writers of the age of scholasticism made use of these handy tools for easy access to documentation that was indispensable to their work. The large number of manuscripts that have come down to us bear witness to the use and dissemination of such compilations." (p. 110)
Of course, electronic keyword searching takes concordances to another level. But perhaps this is a good thing. The etymological roots of "concordance" are, after all, entangled with the roots of "concord," and it is sometimes good to introduce discordance into our readings of texts. If Ginzburg is right, then we have a real advantage over our scholastic forbears; unlike them, we don't have to rely on the compilations of other scholars, who might use indexes as a way to assert too much control over the text. But if Bell is right, then we also have a greater responsibility to handle that advantage with care, and to prevent our liberty from becoming license.

You can be the judge of whether I've done that here, because (in a burst of self-referentiality) I found the quotes from the Cavallo and Chartier book by using Google Print, and I've never read the whole thing. When bloggers advise readers to "read the whole thing," do they really mean it? And do we ever really follow that advice?

(Cross-posted at Cliopatria.)

Collective Improvisation:
I've always found it interesting that, throughout my graduate training, I've been told how NOT to read an entire book or article, but rather how to find the most important points and arguments. I've never had a professor who told me that I should read an entire book, cover to cover, page by page, in the order that it's written. And I don't think this really has much to do with the Internet itself, but with the explosion of all kinds of information outlets. Scholarly publication, in monograph and article form, has grown immensely in the last 100 years. (Or at least it seems that way. Does anyone know how much exactly?)

I think a stronger argument against targeted reading and keyword searching could be made in the case of online and printed primary sources. Granted, keyword searching can be invaluable for initial research and sorting, but it can't replace careful reading around those keywords and of the document as a whole.

Posted by Jeremy Boggs on 5/20/2005 09:21:00 AM

Interesting point about the difference between reading primary and secondary sources. I think that difference is significant for historians insofar as our primary sources are usually from the past and thus more likely to be unfamiliar to us than secondary sources. Targeted reading can help us understand more only if we already understand a lot about a text, but in cases where we don't know a lot about a text (like a primary source), it is liable to produce misunderstanding.

Posted by Caleb on 5/23/2005 07:55:00 AM

The idea of surrendering oneself to the logic of a book is good - assuming there is a logic to begin with. Just because words are written down does not mean there is anything particularly logical about them. Indiscriminate reading can be a big waste of time. Perhaps keyword searching is a way of at least helping us to sort out the wheat from the chaff.

I wonder if the practice of keyword searching will lead to a new form of academic writing which is perhaps more concise and targeted and less driven by logic. This is not necessarily a good thing, just an idea about the way things might go.

Posted by Anonymous on 2/16/2007 11:38:00 AM

