Library of Congress Workshop on Etexts - BestLightNovel.com
You’re reading novel Library of Congress Workshop on Etexts Part 8 online at BestLightNovel.com. Please use the follow button to get notification about the latest chapter next time when you visit BestLightNovel.com. Use F11 button to read novel in full-screen(PC only). Drop by anytime you want to read free – fast – latest novel. It’s great if you could leave a comment, share your opinion about the new chapters, new novel with others on the internet. We’ll do our best to bring you the finest, latest novel everyday. Enjoy
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ANDRE * Overview and history of NATDP * Various agricultural CD-ROM products created inhouse and by service bureaus * Pilot project on Internet transmission * Additional products in progress *
Pamela ANDRE, a.s.sociate director for automation, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), presented an overview of NATDP, which has been underway at NAL the last four years, before Judith ZIDAR discussed the technical details. ANDRE defined agricultural information as a broad range of material going from basic and applied research in the hard sciences to the one-page pamphlets that are distributed by the cooperative state extension services on such things as how to grow blueberries.
NATDP began in late 1986 with a meeting of representatives from the land-grant library community to deal with the issue of electronic information. NAL and forty-five of these libraries banded together to establish this project--to evaluate the technology for converting what were then source doc.u.ments in paper form into electronic form, to provide access to that digital information, and then to distribute it.
Distributing that material to the community--the university community as well as the extension service community, potentially down to the county level--const.i.tuted the group's chief concern.
Since January 1988 (when the microcomputer-based scanning system was installed at NAL), NATDP has done a variety of things, concerning which ZIDAR would provide further details. For example, the first technology considered in the project's discussion phase was digital videodisc, which indicates how long ago it was conceived.
Over the four years of this project, four separate CD-ROM products on four different agricultural topics were created, two at a scanning-and-OCR station installed at NAL, and two by service bureaus.
Thus, NATDP has gained comparative information in terms of those relative costs. Each of these products contained the full ASCII text as well as page images of the material, or between 4,000 and 6,000 pages of material on these disks. Topics included aquaculture, food, agriculture and science (i.e., international agriculture and research), acid rain, and Agent Orange, which was the final product distributed (approximately eighteen months before the Workshop).
The third phase of NATDP focused on delivery mechanisms other than CD-ROM. At the suggestion of Clifford LYNCH, who was a technical consultant to the project at this point, NATDP became involved with the Internet and initiated a project with the help of North Carolina State University, in which fourteen of the land-grant university libraries are transmitting digital images over the Internet in response to interlibrary loan requests--a topic for another meeting. At this point, the pilot project had been completed for about a year and the final report would be available shortly after the Workshop. In the meantime, the project's success had led to its extension. (ANDRE noted that one of the first things done under the program t.i.tle was to select a retrieval package to use with subsequent products; Windows Personal Librarian was the package of choice after a lengthy evaluation.)
Three additional products had been planned and were in progress:
1) An arrangement with the American Society of Agronomy--a professional society that has published the Agronomy Journal since about 1908--to scan and create bit-mapped images of its journal.
ASA granted permission first to put and then to distribute this material in electronic form, to hold it at NAL, and to use these electronic images as a mechanism to deliver doc.u.ments or print out material for patrons, among other uses. Effectively, NAL has the right to use this material in support of its program.
(Significantly, this arrangement offers a potential cooperative model for working with other professional societies in agriculture to try to do the same thing--put the journals of particular interest to agriculture research into electronic form.)
2) An extension of the earlier product on aquaculture.
3) The George Was.h.i.+ngton Carver Papers--a joint project with Tuskegee University to scan and convert from microfilm some 3,500 images of Carver's papers, letters, and drawings.
It was antic.i.p.ated that all of these products would appear no more than six months after the Workshop.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ZIDAR * (A separate arena for scanning) * Steps in creating a database *
Image capture, with and without performing OCR * Keying in tracking data * Scanning, with electronic and manual tracking * Adjustments during scanning process * Scanning resolutions * Compression * De-skewing and filtering * Image capture from microform: the papers and letters of George Was.h.i.+ngton Carver * Equipment used for a scanning system *
Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), ill.u.s.trated the technical details of NATDP, including her primary responsibility, scanning and creating databases on a topic and putting them on CD-ROM.
(ZIDAR remarked a separate arena from the CD-ROM projects, although the processing of the material is nearly identical, in which NATDP is also scanning material and loading it on a Next microcomputer, which in turn is linked to NAL's integrated library system. Thus, searches in NAL's bibliographic database will enable people to pull up actual page images and text for any doc.u.ments that have been entered.)
In accordance with the session's topic, ZIDAR focused her ill.u.s.trated talk on image capture, offering a primer on the three main steps in the process: 1) a.s.semble the printed publications; 2) design the database (database design occurs in the process of preparing the material for scanning; this step entails reviewing and organizing the material, defining the contents--what will const.i.tute a record, what kinds of fields will be captured in terms of author, t.i.tle, etc.); 3) perform a certain amount of markup on the paper publications. NAL performs this task record by record, preparing work sheets or some other sort of tracking material and designing descriptors and other enhancements to be added to the data that will not be captured from the printed publication.
Part of this process also involves determining NATDP's file and directory structure: NATDP attempts to avoid putting more than approximately 100 images in a directory, because placing more than that on a CD-ROM would reduce the access speed.
This up-front process takes approximately two weeks for a 6,000-7,000-page database. The next step is to capture the page images.
How long this process takes is determined by the decision whether or not to perform OCR. Not performing OCR speeds the process, whereas text capture requires greater care because of the quality of the image: it has to be straighter and allowance must be made for text on a page, not just for the capture of photographs.
NATDP keys in tracking data, that is, a standard bibliographic record including the t.i.tle of the book and the t.i.tle of the chapter, which will later either become the access information or will be attached to the front of a full-text record so that it is searchable.
Images are scanned from a bound or unbound publication, chiefly from bound publications in the case of NATDP, however, because often they are the only copies and the publications are returned to the shelves. NATDP usually scans one record at a time, because its database tracking system tracks the doc.u.ment in that way and does not require further logical separating of the images. After performing optical character recognition, NATDP moves the images off the hard disk and maintains a volume sheet. Though the system tracks electronically, all the processing steps are also tracked manually with a log sheet.
ZIDAR next ill.u.s.trated the kinds of adjustments that one can make when scanning from paper and microfilm, for example, redoing images that need special handling, setting for dithering or gray scale, and adjusting for brightness or for the whole book at one time.
NATDP is scanning at 300 dots per inch, a standard scanning resolution.
Though adequate for capturing text that is all of a standard size, 300 dpi is unsuitable for any kind of photographic material or for very small text. Many scanners allow for different image formats, TIFF, of course, being a de facto standard. But if one intends to exchange images with other people, the ability to scan other image formats, even if they are less common, becomes highly desirable.
CCITT Group 4 is the standard compression for normal black-and-white images, JPEG for gray scale or color. ZIDAR recommended 1) using the standard compressions, particularly if one attempts to make material available and to allow users to download images and reuse them from CD-ROMs; and 2) maintaining the ability to output an uncompressed image, because in image exchange uncompressed images are more likely to be able to cross platforms.
ZIDAR emphasized the importance of de-skewing and filtering as requirements on NATDP's upgraded system. For instance, scanning bound books, particularly books published by the federal government whose pages are skewed, and trying to scan them straight if OCR is to be performed, is extremely time-consuming. The same holds for filtering of poor-quality or older materials.
ZIDAR described image capture from microform, using as an example three reels from a sixty-seven-reel set of the papers and letters of George Was.h.i.+ngton Carver that had been produced by Tuskegee University. These resulted in approximately 3,500 images, which NATDP had had scanned by its service contractor, Science Applications International Corporation (SAIC). NATDP also created bibliographic records for access. (NATDP did not have such specialized equipment as a microfilm scanner.
Unfortunately, the process of scanning from microfilm was not an unqualified success, ZIDAR reported: because microfilm frame sizes vary, occasionally some frames were missed, which without spending much time and money could not be recaptured.
OCR could not be performed from the scanned images of the frames. The bleeding in the text simply output text, when OCR was run, that could not even be edited. NATDP tested for negative versus positive images, landscape versus portrait orientation, and single- versus dual-page microfilm, none of which seemed to affect the quality of the image; but also on none of them could OCR be performed.
In selecting the microfilm they would use, therefore, NATDP had other factors in mind. ZIDAR noted two factors that influenced the quality of the images: 1) the inherent quality of the original and 2) the amount of size reduction on the pages.
The Carver papers were selected because they are informative and visually interesting, treat a single subject, and are valuable in their own right.
The images were scanned and divided into logical records by SAIC, then delivered, and loaded onto NATDP's system, where bibliographic information taken directly from the images was added. Scanning was completed in summer 1991 and by the end of summer 1992 the disk was scheduled to be published.
Problems encountered during processing included the following: Because the microfilm scanning had to be done in a batch, adjustment for individual page variations was not possible. The frame size varied on account of the nature of the material, and therefore some of the frames were missed while others were just partial frames. The only way to go back and capture this material was to print out the page with the microfilm reader from the missing frame and then scan it in from the page, which was extremely time-consuming. The quality of the images scanned from the printout of the microfilm compared unfavorably with that of the original images captured directly from the microfilm. The inability to perform OCR also was a major disappointment. At the time, computer output microfilm was unavailable to test.
The equipment used for a scanning system was the last topic addressed by ZIDAR. The type of equipment that one would purchase for a scanning system included: a microcomputer, at least a 386, but preferably a 486; a large hard disk, 380 megabyte at minimum; a multi-tasking operating system that allows one to run some things in batch in the background while scanning or doing text editing, for example, Unix or OS/2 and, theoretically, Windows; a high-speed scanner and scanning software that allows one to make the various adjustments mentioned earlier; a high-resolution monitor (150 dpi ); OCR software and hardware to perform text recognition; an optical disk subsystem on which to archive all the images as the processing is done; file management and tracking software.
ZIDAR opined that the software one purchases was more important than the hardware and might also cost more than the hardware, but it was likely to prove critical to the success or failure of one's system. In addition to a stand-alone scanning workstation for image capture, then, text capture requires one or two editing stations networked to this scanning station to perform editing. Editing the text takes two or three times as long as capturing the images.
Finally, ZIDAR stressed the importance of buying an open system that allows for more than one vendor, complies with standards, and can be upgraded.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WATERS *Yale University Library's master plan to convert microfilm to digital imagery (POB) * The place of electronic tools in the library of the future * The uses of images and an image library * Primary input from preservation microfilm * Features distinguis.h.i.+ng POB from CXP and key hypotheses guiding POB * Use of vendor selection process to facilitate organizational work * Criteria for selecting vendor * Finalists and results of process for Yale * Key factor distinguis.h.i.+ng vendors *
Components, design principles, and some estimated costs of POB * Role of preservation materials in developing imaging market * Factors affecting quality and cost * Factors affecting the usability of complex doc.u.ments in image form *
Donald WATERS, head of the Systems Office, Yale University Library, reported on the progress of a master plan for a project at Yale to convert microfilm to digital imagery, Project Open Book (POB). Stating that POB was in an advanced stage of planning, WATERS detailed, in particular, the process of selecting a vendor partner and several key issues under discussion as Yale prepares to move into the project itself.
He commented first on the vision that serves as the context of POB and then described its purpose and scope.
WATERS sees the library of the future not necessarily as an electronic library but as a place that generates, preserves, and improves for its clients ready access to both intellectual and physical recorded knowledge. Electronic tools must find a place in the library in the context of this vision. Several roles for electronic tools include serving as: indirect sources of electronic knowledge or as "finding"
aids (the on-line catalogues, the article-level indices, registers for doc.u.ments and archives); direct sources of recorded knowledge; full-text images; and various kinds of compound sources of recorded knowledge (the so-called compound doc.u.ments of Hypertext, mixed text and image, mixed-text image format, and multimedia).
POB is looking particularly at images and an image library, the uses to which images will be put (e.g., storage, printing, browsing, and then use as input for other processes), OCR as a subsequent process to image capture, or creating an image library, and also possibly generating microfilm.
While input will come from a variety of sources, POB is considering especially input from preservation microfilm. A possible outcome is that the film and paper which provide the input for the image library eventually may go off into remote storage, and that the image library may be the primary access tool.
The purpose and scope of POB focus on imaging. Though related to CXP, POB has two features which distinguish it: 1) scale--conversion of 10,000 volumes into digital image form; and 2) source--conversion from microfilm. Given these features, several key working hypotheses guide POB, including: 1) Since POB is using microfilm, it is not concerned with the image library as a preservation medium. 2) Digital imagery can improve access to recorded knowledge through printing and network distribution at a modest incremental cost of microfilm. 3) Capturing and storing doc.u.ments in a digital image form is necessary to further improvements in access.
(POB distinguishes between the imaging, digitizing process and OCR, which at this stage it does not plan to perform.)
Currently in its first or organizational phase, POB found that it could use a vendor selection process to facilitate a good deal of the organizational work (e.g., creating a project team and advisory board, confirming the validity of the plan, establis.h.i.+ng the cost of the project and a budget, selecting the materials to convert, and then raising the necessary funds).
POB developed numerous selection criteria, including: a firm committed to image-doc.u.ment management, the ability to serve as systems integrator in a large-scale project over several years, interest in developing the requisite software as a standard rather than a custom product, and a willingness to invest substantial resources in the project itself.
Two vendors, DEC and Xerox, were selected as finalists in October 1991, and with the support of the Commission on Preservation and Access, each was commissioned to generate a detailed requirements a.n.a.lysis for the project and then to submit a formal proposal for the completion of the project, which included a budget and costs. The terms were that POB would pay the loser. The results for Yale of involving a vendor included: broad involvement of Yale staff across the board at a relatively low cost, which may have long-term significance in carrying out the project (twenty-five to thirty university people are engaged in POB); better understanding of the factors that affect corporate response to markets for imaging products; a compet.i.tive proposal; and a more sophisticated view of the imaging markets.