The Google Books Project: Will it Make Libraries Obsolete?




MANAGING TECHNOLOGY · The Google Books Project: Will it Make Libraries Obsolete? by William C. Dougherty Available online 13 January 2010

Like many of you, I have been following the Google Book Project saga over the last few years. Though a lot has been written in both the popular and technical press, especially related to the settlement with authors and publishers, there is still in my mind a lack of clarity regarding how this project may impact libraries in the long term. If it is true that the successful completion of the project will “create the world's largest library online,”1 does this necessarily also entail that libraries as we know and love them are dying? It seems unlikely that there is a direct cause and effect relationship here. If there were, would libraries and librarians be so directly involved with digitization projects and be included as major players in the current Google effort? U.S.-based academic libraries are particularly well represented, with Columbia, Cornell, Princeton, Stanford, the University of California, the University of Michigan, the University of Texas, the University of Virginia, and the University of Wisconsin-Madison listed prominently on the “Library Partners” page on the Google Books site.2 Would these prestigious institutions involve themselves with something that could ultimately lead to the termination of traditional learning and scholarly endeavor?

To be clear, what Google has begun is not really anything new, either from a technological perspective or as a joint effort with higher-education institutions and publishers. Other projects with similar aims, such as JSTOR, Project MUSE, and the Internet Archive, have been around longer than Google has existed as a company. JSTOR was founded in 1995 to build trusted digital archives for scholarship, with participation and support from the international scholarly community.
JSTOR has created an interdisciplinary archive of scholarship, is actively preserving over one thousand academic journals in both digital and print formats, and continues to expand access to scholarly works needed for research and teaching globally. In 2009, JSTOR merged with ITHAKA, a not-for-profit organization helping the academic community use digital technologies to preserve the scholarly record and to advance scholarship and teaching in sustainable ways. Thousands of libraries and cultural heritage institutions, hundreds of publishers of scholarly literature, and countless scholars are involved (www.jstor.org).3

Project MUSE is a collaborative effort by libraries and publishers, providing 100% full-text online access to a selection of humanities and social sciences journals. MUSE is also the sole source of complete, full-text versions of titles from many university presses and scholarly societies. MUSE provides full-text access to current content from over 400 titles representing nearly 100 not-for-profit publishers. MUSE began in 1993 as a joint project of the Johns Hopkins University Press and the Milton S. Eisenhower Library (MSEL) at JHU. Journals from other publishers were first incorporated in 2000, with additional publishers joining in each subsequent year (muse.jhu.edu).4

The Internet Archive is a 501(c)(3) non-profit group that was founded in 1996 to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. The Internet Archive includes texts (over 1.6 million individual titles), audio, moving images, and software as well as archived web pages, and is working to provide specialized services relating to training, education, or adaptive reading or information access needs of blind or other persons with disabilities (www.archive.org).5

All of the above are non-profit, scholarly endeavors by definition, as is illustrated by their web address suffix of “.org” or “.edu.” Is the “.com” in Google's address causing concern? Is Google just filling a niche that others have not stepped up to fill? Could the state of the world economy be playing a role here? In a capitalist society, shouldn't commercial vendors go where others cannot afford to go?

William C. Dougherty is Director of Systems Support, Network Infrastructure & Services, Virginia Tech, 1700 Pratt Drive, Blacksburg, VA 24060.

There is little question about Google's innovative spirit related to the Internet and the web. As Google Enterprise president Dave Girouard put it recently, when it comes to the web, Google was “born of this platform, and everything we've ever built is about the Web.”6 Mobile offshoots are meeting the needs of modern and particularly younger users. The “Google Apps for Education” program is making inroads among higher-education institutions.
The latest buzz is around “Google Wave,” a personal communication and collaboration tool announced by Google at the Google I/O conference on May 27, 2009.7 Promising to blend e-mail, chat/texting, and audio-video “threads” together in real time, Wave could be used as a workflow tool or as a supra-collaboration platform for anyone with access to the web. On the other hand, Google is only number 32 in Information Week's 21st annual ranking of Top Innovators.8 Corporations such as J.C. Penney (#6), Coca-Cola (#8), FedEx (#10), and Allstate (#20), IT-centric companies Hewlett-Packard (#12) and NetApp (#26), and three university medical facilities – the University of Pittsburgh Medical Center-UPMC (#9), the University of Arkansas for Medical Sciences (#23), and the University of Massachusetts Memorial Health Care (#30) – all rank higher. However, Lockheed Martin, Oracle, Merck, Chevron, AT&T, Xerox, GM, and dozens of other commercial entities among the 250 ranked each year by Information Week rank lower than Google.

86 The Journal of Academic Librarianship Volume 36, Number 1, pages 86–89

A question that often comes up in blogs, news reports, and conversations with colleagues is “Will Google be able to meet the scholarly versus commercial goals that a project such as the Google Books Project will have?” Google's entire life span as a “company” is less than a dozen years. In the Google Digital Book Project/Google Book Search Project, they are fully dependent on others for content; they are not a producer, nor are they a preservationist of any standing.

Already issues having to do with the quality of the data being produced in support of the Book Project have surfaced. Geoffrey Nunberg, linguist and professor at UC Berkeley's School of Information, has enumerated many of the problems with the metadata currently contained in the Google indices. In his presentation at the “Google Book Settlement Conference” held in August of 2009, he called the problem “The Metadata Mess,”9 and then upgraded this to “A Metadata Train Wreck”10 when posting to the “Language Log” blog site hosted by the University of Pennsylvania. Much the same information appeared later in a “Chronicle of Higher Education/Chronicle Review” article entitled “Google's Book Search: A Disaster for Scholars.”11 In a nutshell (all three citations are worth a read even if there is overlap), Dr. Nunberg has found problems with the quality of the scanning and OCR processing; misdating (the year 1899 seems to be very popular); classification errors (a Mae West biography is classified under “Religion,” for example); and other, sometimes humorous, miscellaneous errors that have to be seen to be believed. Google's engineering director for the project, Dan Clancy, argues that the bulk of these errors are the fault of libraries and Union Catalogues,12 and claims a fine rebuttal of the above has been prepared by Google (copies of which have failed to be delivered to attendees of the EDUCAUSE webinar on this topic despite repeated requests from the moderator). Dr. Nunberg has an explanation for these problems, however, and it relates to the commercial orientation of Google mentioned earlier.
The clue comes in some of the mistaken classifications themselves. Indeed everyone recognizes that classification errors exist, and subject heading classifications are no different from any other. In one example, though, Shakespeare's Hamlet is classified under “Antiques and Collectibles.” While there may be some truth to this heading, a scholar is not likely to use this as a search criterion. Someone searching for a present for Christmas might, however, and this book store mentality is precisely where the fault lies. Google is using the subject headings recommended by the BISAC (Book Industry Standards and Communications) committee as opposed to the Library of Congress headings in use at most academic libraries. The gap in number – approximately 3000 under BISAC guidelines versus well over 200,000 for the Library of Congress – and in detail leaves much to be desired. Even publishing houses have difficulty using BISAC headings and many, such as Ingram, still use their own.13

If these problems exist at the outset of such an ambitious project, what are the chances of things getting better over time? The content of this system will have to be verified and cross-checked regularly. The searching tools and mechanisms will have to evolve with the expansion of the collection. Storage media and methods will likewise have to keep pace as the technology changes. Access and retrieval speeds will have to continually improve as potentially more and more users crowd into the space to access the collection.

In the end, is it really that we just do not believe that Google is up to the task? If such a system is to be depended upon, it must be reliable. While I do not believe I have ever had the Google search page fail to load, Gmail outages have been widely publicized over the last 16 months.
Nine outages have been documented in the press, some lasting more than 24 hours. On July 16, 2008, a “502 error” struck Gmail, leading to what was described as a “long outage” by affected users; on August 6, 2008, technical trouble knocked an “undetermined number” of Gmail users (including both regular users and paying Google Apps clients) out of their mail for about 15 hours; on August 11, 2008, an issue with Google's “contacts system” caused Gmail access to go offline for a “couple of hours” for numerous users, again affecting both individual accounts and Google Apps accounts; on August 15, 2008, the third outage within a span of 2 weeks left users locked out of their accounts for more than 24 hours; and on October 16, 2008, users went a full 30 hours without access.14 Outages in 2009 include ones on February 24th, March 10th, and September 1st and 24th.15

Google is also branching out into new venues, including competing head-to-head with Microsoft for business in the arena of “productivity suites.” There will be business to gain here, particularly from those who wish to use something other than a Windows operating system on their desktop, but as Preston Gralla put it in his recent “Opinion” column in Computerworld, “Ultimately, it will come down to trust, and that's not just trust in a particular company, but trust in the company's particular technology approach, be it cloud- or client-based.”16

At least one of the universities initially listed as a library partner with the Google Book Project, Harvard University, has scaled back its involvement. In November of 2008, the Harvard University Library Director expressed concern over the terms of the proposed settlement Google had negotiated with publishers and authors, particularly how the settlement might affect access and use by patrons and scholars. “As we understand it, the settlement contains too many potential limitations on access to and use of the books by members of the higher-education community and by patrons of public libraries.… The settlement provides no assurance that the prices charged for access will be reasonable, especially since the subscription services will have no real competitors [and] the scope of access to the digitized books is in various ways both limited and uncertain.”17 Harvard has continued allowing Google to scan out-of-print books in its collection, but in-copyright books present a potential quagmire for everyone involved.
Publications whose copyright ownership is unknown, so-called “orphan works,” could be an even bigger concern. Two bills addressing this issue were introduced during the last session of Congress. Both were drafted to ease problems caused by the revamping of the Copyright Act in 1976. A report released by the Copyright Office in 2006 noted that while the changes made over 30 years ago brought the U.S. more into compliance with the Berne Convention's prohibition against conditions or penalties for individuals exercising copyright, the automatic assumption of copyright once a work was “fixed in a tangible medium of expression” was problematic. The act increased the likelihood of there being more works whose copyright ownership could not easily be ascertained.18 The legislative outgrowth of the report became Senate Bill 2913, which passed the Senate, and House Bill 5889, which never made it out of committee. Neither bill became law. While groups such as the American Library Association and the Electronic Frontier Foundation went on record as being in support of the bills, other groups such as the Artists Rights Society, the Illustrators' Partnership of America, and the Advertising Photographers of America opposed them as introduced, but submitted suggested amendments that would allow their support.19

More vehement, albeit perhaps misguided, opposition can be found, particularly among the online community. The claim that “Academic Marxists” helped draft the bills and that they would essentially allow “Disney or GE” to immediately claim for their corporate use any and all works currently online20 is not substantiated by the language in either bill introduced, but it illustrates how deep feelings of mistrust run on this subject. Google makes the argument that the Book Rights Registry will ultimately lead to fewer orphaned works over time.
The Registry will be initially funded by $35 million from Google, but it will be a non-profit independent organization that collects and disburses revenue from users of the Book Search system to authors, publishers and other “rights holders.” Under the first draft of the settlement, the Registry would develop and maintain a rights database for all material covered by the agreement and play a role in resolving disputes between rights holders. Some refer to this setup as competition to the U.S. Copyright Office and federal courts, intimating that opponents of the settlement “are paid off” with the Book Rights Registry since it will “read everyone's contracts to say who is owed what of Google's revenues.”21

Georgia Harper, an attorney and the Scholarly Communications Advisor for the University of Texas at Austin Libraries, recently expressed her concern about the way orphaned works would be handled if the Google settlement in its present form were to stand. “Google clearly understood and accepted that this plan was based on an idea I found repugnant: if orphan works don't have owners, by definition, then why is it that the Registry should keep the money that comes in for books that ultimately no one claims? The publishers and authors just don't see orphans as really belonging to everyone in the absence of an owner. They see them as belonging to all the other authors and publishers, but not the public. That really rubbed me the wrong way. After all, it's not the publishers and authors who have collected these books, maintained them, preserved them, and are now making it POSSIBLE for anyone to even have potential to find them and buy them by partnering with Google to make them a part of Book Search. Where do they get off claiming that they are entitled to keep unearned, undeserved revenues to the exclusion of everyone else in the world?”22

The actual form of the registry will change from what was originally proposed anyway. Opposition to and support for the settlement are varied and wide ranging: Canada, France, Japan, and Germany have expressed concerns not only about the possible business advantage the agreement would provide Google, but also about the negative impact on copyright holders from outside the U.S.23 The U.S.
Justice Department offered its opinion and advised against accepting the settlement as originally drafted, citing potential federal antitrust violations.24 This advice caused the presiding judge to indefinitely delay the case to allow time for the parties to make changes which would presumably remove the Department of Justice's objections.25

Despite current opposition and potential concerns, no one should believe that this project is not going forward. Although the DoJ's criticisms were the cause of the current delay, the department still thinks that the project is a good idea. In a statement posted on September 18, 2009, it acknowledged that “a properly structured settlement agreement in this case offers the potential for important societal benefits.”26 A department official called the talks with the groups “very constructive” and considered the groups “motivated” to address the issues raised.27 Research on the use of the Internet for scholarship showed wide acceptance of web-based resources by students and faculty even 7 years ago.28 It is therefore unlikely that many will hesitate to use Google Books as a resource once it is established.

Perhaps a bigger and better question to ask is “What happens if Google goes away?” Some may consider this question unrealistic, but others in the academic and business community believe that we should take it seriously. The 11 universities in the Big Ten Conference, the University of Chicago, the University of California system, and the University of Virginia are part of a project called “the Hathi Trust” (www.hathitrust.org) with a goal of ensuring that the work being done as part of the Google project is not wasted. “Google won't be around forever,” said John P. Wilkin, an associate university librarian at the University of Michigan at Ann Arbor and executive director of Hathi Trust.
“This is a commitment to the permanence of the materials,” he said, noting that libraries have been around longer than any technology company has. “We've been doing this for a couple of hundred years, and we intend to continue doing it.”29

At least one writer at Forbes magazine agrees with Dr. Wilkin's assessment. Lee Gomes recently wrote an article entitled “Why Google won't last forever” for his Digital Tools column. He argues that while free services from Google have become ubiquitous, they are here due to specific business and technological trends unique to our historical era. These same trends are responsible for the demise of newspapers. “Sit people in front of computers all day, where they are inundated not only with news but also every manner and stripe of advertising, and the economic foundation of newspapers disappears – and with it, the profit engine that subsidizes newsrooms in the first place.”30 The economic cycle that has made Google profitable, with reported revenue of $3.7 billion and net income of $1.31 billion in 2008,31 will change as all cycles do. And when the dust settles, I believe libraries will still be there.

NOTES AND REFERENCES

1. http://www.cnn.com/2009/TECH/09/04/future.library.technology/index.html accessed September 24, 2009.
2. http://books.google.com/googlebooks/partners.html accessed September 29, 2009.
3. http://www.jstor.org/page/info/about/organization/index.jsp accessed September 24, 2009.
4. http://muse.jhu.edu/about/muse/index.html accessed September 24, 2009.
5. http://www.archive.org/about/about.php accessed September 24, 2009.
6. Mary Hayes Weier, “Quicktakes column; Microsoft, Google Ready to Square Off,” Information Week, (1239/August 31, 2009): p. 18.
7. http://www.techcrunch.com/2009/05/28/google-wave-dripswith-ambition-can-it-fulfill-googles-grand-web-vision/ accessed October 2, 2009.
8. Information Week's 21st Annual Ranking of 250 Top Innovators, Information Week, (1241/September 14, 2009): pp. 54-57.
9. http://languagelog.ldc.upenn.edu/myl/GoogBookSM.pdf accessed September 15, 2009.
10. http://languagelog.ldc.upenn.edu/nll/?p=1701 accessed September 15, 2009.
11. http://chronicle.com/article/Googles-Book-Search-A/48245/?sid=at&utm_source=at&utm_medium=en accessed September 14, 2009.
12. http://languagelog.ldc.upenn.edu/myl/GoogBookSM.pdf accessed September 15, 2009.
13. http://waltshiel.com/2009/09/14/to-bisac-or-not-to-bisac/ accessed September 29, 2009.
14. http://www.pcworld.com/article/160153/gmail_outage_marks_sixth_downtime_in_eight_months accessed August 30, 2009.
15. http://www.computerworld.com/s/article/9129347/Gmail_down_outage_could_last_36_hours_for_some_; http://www.msnbc.msn.com/id/32647533/ns/technology_and_science-tech_and_gadgets/; http://news.cnet.com/8301-30685_3-10360729-264.html accessed October 1, 2009.
16. Preston Gralla, “Google vs. Microsoft: It's a matter of trust,” Computerworld, (vol. 43, no. 29): p. 42.
17. http://latimesblogs.latimes.com/jacketcopy/2008/11/googlesettleme.html accessed October 1, 2009.
18. www.copyright.gov/orphan/orphan.report.pdf: p. 3, accessed October 1, 2009.
19. http://ipaorphanworks.blogspot.com/2008/07/hr-5889-amendments.html accessed October 1, 2009.
20. http://www.youtube.com/watch?v=CqBZd0cP5Yc accessed September 18, 2009.
21. http://online.wsj.com/article/SB123819841868261921.html# accessed September 19, 2009.
22. http://chaucer.umuc.edu/blogcip/collectanea/2008/11/google_book_search_and_orphan_1.html accessed October 1, 2009.
23. http://chronicle.com/article/Choosing-Sides-in-the-Google/48357/ accessed October 1, 2009.
24. http://www.startribune.com/science/59735727.html?page=1&c=y accessed September 24, 2009.
25. http://www.cbsnews.com/stories/2009/09/30/ap/hightech/main5354267.shtml accessed October 6, 2009 and http://www.educause.edu/blog/sworona/FYIGoogleBookssettlementdelaye/185174 accessed October 9, 2009.
26. http://www.washingtonpost.com/wp-dyn/content/article/2009/09/18/AR2009091804149.html?wprss=rss_technology accessed September 24, 2009.
27. Ibid.
28. Amy Friedlander, “Dimensions and Use of the Scholarly Information Environment: Introduction to a Data Set Assembled by the Digital Library Federation and Outsell, Inc.,” Digital Library Federation and Council on Library and Information Resources, Washington, D.C., November 2002.
29. http://chronicle.com/article/In-Case-Google-Bails-Out-on/8580/ accessed September 30, 2009.
30. http://www.forbes.com/2009/07/22/google-internet-newspaperstechnology-breakthroughs-google.html accessed October 9, 2009.
31. http://www.nytimes.com/2008/04/17/technology/17cnd-google.html accessed October 9, 2009.

FURTHER READING

http://www.educause.edu/Resources/TheGoogleBookScanningProjectIs/178822
http://www.googlebooksettlement.com/r/view_settlement_agreement
http://googleblog.blogspot.com/2008/10/new-chapter-for-googlebook-search.html
http://latimesblogs.latimes.com/jacketcopy/2008/10/google-publishe.html
http://wave.google.com

January 2010 89