SGML: ACRL Report (Mallery)

From owner-etextctr@lists.Princeton.EDU Thu Mar 16 18:13:29 1995
Date: 	 Thu, 16 Mar 1995 17:12:04 EST
Reply-To: etextctr@lists.Princeton.EDU
Sender: owner-etextctr@lists.Princeton.EDU
From: ETEXTCTR Discussion List <etextctr@phoenix.Princeton.EDU>
To: Electronic Text Centers List <etextctr@lists.Princeton.EDU>
Subject: Report on the ACRL E-Text Center Discussion Group 2/95

Sender:  Mary Mallery <mallery@eden.rutgers.edu>
Subject: Report on the ACRL E-Text Centers Discussion Group 2/95

Report on the ACRL Electronic Text Centers Discussion Group
at the American Library Association Mid-Winter Meeting, February
4, 1995: "Markup and Access Techniques for Electronic Texts: TEI
and SGML"

     Marianne Gaunt, Associate University Librarian at Rutgers
University Libraries, chaired this session of the ACRL Electronic
Text Centers Discussion Group and introduced the speakers.  She
noted that this is a "practically oriented" session to introduce
people to SGML (the Standard Generalized Markup Language), to
explain why it is important to librarians, and to demonstrate
software packages for creating, searching and manipulating
documents that contain SGML markup.  

     John Price-Wilkin, the Humanities Text Initiative (HTI)
Librarian at the University of Michigan, spoke to the first of
these questions, giving the audience of more than 60 attendees
strategies for introducing SGML-encoded texts into a research
library environment as well as an overview of current access to
SGML-encoded texts and tools. Susan Hockey, the Director of the
Center for Electronic Texts in the Humanities (CETH) and the
scheduled speaker, was unable to attend, but Gregory Murphy, Text
Systems Manager at CETH, demonstrated the validation and error
message functions of James Clark's "sgmls" application for
parsing an SGML-encoded document and answered questions about
other SGML-aware editors that he has reviewed for CETH.

     John Price-Wilkin began with a description of the HTI at the
University of Michigan (Web page available at
http://www.hti.umich.edu).  The HTI is a joint project of the
University of Michigan Libraries, the UM Press, and the School of
Library and Information Studies, with support from the College of
Literature, Science & Arts. The HTI creates, maintains and
provides access to SGML-encoded texts to the University community
over a wide-area network with TCP/IP.  They are encoding
humanities texts in SGML for a new kind of "retrospective
conversion" of library materials. In addition, the University of
Michigan has undertaken a "JSTOR project," in which they will put
online the text of ten journals in economics and history (bitmaps
with character-based full-text indexes behind them to locate the
appropriate bitmap).  The HTI will probably be responsible for an
SGML markup project for five years of the ten journals (roughly
100,000 pages) as a test of cost and value.

     Price-Wilkin reviewed the nuts and bolts issues of how to
set up such a text project in an academic library, addressing
such questions as "Which texts do you convert:  reference
materials or primary literature?" and "Who should do the markup?" 
The group at Michigan found that a text should be marked up in
layers, with each layer requiring a different level of expertise
in the person performing the markup.  For the first layer,
library school students were employed to do the basic markup,
which reflects the typographical layout of the text, the basic
structure, index and basic transcription (e.g., page numbers and
paragraphs).  Next, catalogers were employed to fill in the
correct information and to do the authority control work required
for standardizing author names, titles and subject headings for
the TEI (Text Encoding Initiative) header. The final level of
markup is "ideational" as opposed to "structural," and here
Price-Wilkin recommends involving a faculty member or subject
specialist, who would have a definite project in mind for the
text and thus could clearly define thematic or stylistic fields
that they would want to mark consistently throughout the text.
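The layering Price-Wilkin describes can be pictured with a small, hypothetical TEI-style fragment (the element names follow TEI conventions, but the text and attribute values here are invented for illustration):

```sgml
<!-- Layer 1: structural markup (library school students):
     page breaks, paragraphs, basic transcription -->
<pb n="42">
<p>It was a dark and stormy night ...</p>

<!-- Layer 2: the TEI header (catalogers; authority control
     for names, titles and subject headings) -->
<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Paul Clifford: an electronic edition</title>
   <author>Bulwer-Lytton, Edward George, 1803-1873</author>
  </titleStmt>
 </fileDesc>
</teiHeader>

<!-- Layer 3: ideational markup (faculty or subject specialist
     marking a theme defined for a particular project) -->
<p>It was a <seg type="weather-motif">dark and stormy night</seg> ...</p>
```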

     Because SGML is a relatively new method of preparing texts,
training is a key issue in its implementation.  Price-Wilkin said
that he found that the training issues at the HTI broke down into
three parts:  

     (1) training primary staff, or "document experts" (Note:
     there are not many places to send people for training at
     this level:  CETH provides intensive training in its Summer
     Seminar; the University of Michigan School of Information
     and Library Studies has initiated two linked courses on
     these issues; the Rare Books School at the U. of Virginia
     has a course in electronic texts; and the TEI has sponsored
     workshops for training the trainers in TEI implementation); 

     (2) these document experts are then on-hand to train two
     basic types of SGML taggers:  support staff who perform
     document conversion and whose function is more production
     oriented, and the graduate students and faculty who have
     specific projects; and finally

     (3) these latter ideational taggers require a more in-depth
     introduction to the principles of the TEI and SGML for text
     analysis markup.

     Price-Wilkin also gave a short introduction to pre-
processing tools for SGML encoding.  He explained the uses of
perl, with which one can write programs to recognize patterns in
a text in a batch mode, and other UNIX stream-oriented software
(e.g., sed and awk). However, he noted that, though these
programs can save time, they cannot do all the work of SGML
encoding.  Much markup work goes beyond simple pattern matching,
where "humans are needed to make the complex decisions" about
what they consider worthwhile to mark up in a text.  Commercial
packages for SGML-encoding texts do exist, and their numbers are
growing.  Two software packages that Price-Wilkin mentioned were
Avalanche's Fast Tag (with a price tag of approximately $1500)
and OmniMark, produced by Exoterica.  These commercial packages
are more effective and less complicated than their free-ware
counterparts.
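A batch pre-processing pass of the kind Price-Wilkin describes might look like the following Python sketch (a stand-in for the perl, sed and awk scripts he mentions; the input convention and tag name are assumptions for illustration). It recognizes one simple pattern, blank-line-separated paragraphs, and wraps each in a tag; deciding which passages deserve richer markup is left to the human encoder:

```python
import re

def tag_paragraphs(text):
    """Wrap each blank-line-separated block of text in <p>...</p> tags.

    This is the kind of mechanical, pattern-based markup that batch
    tools can handle; the "complex decisions" about ideational markup
    still require a person.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return "\n".join("<p>%s</p>" % b for b in blocks)

sample = "First paragraph.\n\nSecond paragraph."
print(tag_paragraphs(sample))
```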

     Next, Price-Wilkin reviewed interactive editing tools,
comparing the free-ware products: emacs and psgml used in
combination, and sgmls used with any editor you have. He noted
that "No matter what:  KNOW YOUR DTD (Document Type Definition)."
Gregory Murphy provided the anonymous ftp site where one can get
a copy of psgml and sgmls:  ftp://ftp.ex.ac.uk. The commercial
editors that Price-Wilkin recommended were SoftQuad's
Author/Editor and ArborText's Adept editor.  Price-Wilkin also
looked forward to the release of SoftQuad's Panorama, which
includes math and science markup capabilities (unlike html) and a
Table of Contents generator.
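The advice to know your DTD is easier to follow with a picture of what a DTD is: a grammar declaring which elements may occur and in what order. A minimal, invented example declaring a two-level document might read:

```sgml
<!-- A toy DTD: a <report> contains a <title> followed by one or
     more <p> elements; the "- -" pairs say that neither the start
     tag nor the end tag may be omitted -->
<!ELEMENT report - - (title, p+)>
<!ELEMENT title  - - (#PCDATA)>
<!ELEMENT p      - - (#PCDATA)>
```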

     Finally, Price-Wilkin discussed html (the hypertext markup
language) and richer SGML DTDs.  Though the use of html is more
widespread than the use of SGML because it is the language used
for writing home pages accessible through Mosaic and Netscape
browsers on the World Wide Web, html is an instance of SGML that
is "fundamentally inadequate" for document encoding for research
purposes because it lacks structure recognition, is too mutable
and presents a poor suite of representational tools.  Price-
Wilkin noted that the most "effective and exciting" use of html
for displaying texts on the Web (converted from original SGML
markup) that he had seen were at the IATH Web site (Institute
for Advanced Technology in the Humanities), especially Jerome
McGann's Rossetti Archive and Hoyt Duggan's Piers Plowman
Project.  Price-Wilkin has reviewed and provided links to these
as well as other text projects in his article "Using the World
Wide Web to Deliver Complex Electronic Documents: Implications
for Libraries" from The Public-Access Computer Systems Review 5,
no. 3 (1994): 5-21, available in html form at
http://www.lib.virginia.edu/staffpubs/jpw/yale.html/.
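The gap Price-Wilkin is pointing at can be seen in a small, hypothetical contrast: html records how text should look, while a richer SGML DTD such as the TEI's records what the text is (the TEI tags below are illustrative, not taken from any particular project):

```sgml
<!-- html: appearance only -->
<i>Piers Plowman</i>, line 1

<!-- TEI-style markup: the same words carry their structure
     and meaning -->
<title level="m">Piers Plowman</title>,
<l n="1">In a somer seson, whan softe was the sonne ...</l>
```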

     The question and answer period was quite animated. 
Questions included:  What is a dtd?  How do you get Roget's
Thesaurus online? How do we get publishers to use SGML? Is there
a clearinghouse for dtds?  What is the cost in production mode of
SGML documents?  

     Gregory Murphy, Text Systems Manager at CETH, then
demonstrated sgmls, a free-ware program written by James Clark,
available through anonymous ftp from ftp://ftp.clark.com/pub/. 
Sgmls reads the dtd, builds a model of the document, and then
decides whether the SGML markup in the document conforms to the
rules in the dtd.  Murphy ran sgmls on the Oxford University Press's
SGML-encoded version of the "Romance of the Rose," noting that it
runs quickly but its interface is not the best. For example, the
user knows that the SGML in the text is syntactically correct
only if nothing happens after the program has run.  However, if
the text is not correct, even by just one tag, the error message
generated by sgmls is not helpful, and the program will not run
past the error. 
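Sgmls validates against the full grammar of the dtd, but the flavor of its behavior, silence on success and an abrupt stop at the first error, can be sketched with a much simpler check. The following is a toy well-formedness test written for illustration, not a real SGML parser:

```python
import re

def check_tags(sgml):
    """Toy validator: report the first mismatched start/end tag, if any.

    Real validation (as sgmls does it) checks the document against the
    content models in the dtd; this only checks that tags nest properly.
    Returns None on success -- like sgmls, it says nothing when all is
    well -- and stops at the first error it finds.
    """
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.-]*)[^>]*>", sgml):
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return "error: unexpected </%s>" % name
    if stack:
        return "error: <%s> never closed" % stack[-1]
    return None

print(check_tags("<p>ok</p>"))        # silence (None) means success
print(check_tags("<p><hi>oops</p>"))  # stops at the first bad tag
```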

     CETH is developing tools for marking up text using the Text
Encoding Initiative's dtds for humanities texts.  Murphy has made
these tools as well as some perl scripts and C programs for SGML
parsing available through anonymous ftp at
ftp://ceth.princeton.edu.  

     Questions for Murphy centered around the problem of
standards:  what happens if different parsers don't parse each
other's documents?  Murphy mentioned SGML Open (home page at
http://www.sgmlopen.org), where they are developing software tools
to modify documents to parse with commercial software, and he
recommended Robin Cover's SGML Web Page (at
http://www.sil.org/sgml/sgml.html) for its comprehensive coverage
of SGML software, research projects and information sources, such
as newsletters and electronic discussion groups on SGML.

     Announcements at the close of the meeting included one from
Richard Entlich of Cornell University's Mann Library, who asked
that everyone have a look at the CORE (Chemistry Online Retrieval
Experiment) project, which has converted journal articles to SGML
for the American Chemical Society.  On the Internet,
there are 60 articles available from this project.  The interface
is for an X-Windows environment.  The team at Cornell converted
the phototypesetting markup of the journals' format to SGML. 
Also, Susan Severtson noted that she has almost concluded a
project with READEX of an SGML-coded database of "Black Studies: 
Documenting the African American Experience."  A demonstration
will be given at the ACRL meeting in Pittsburgh in late April,
and the database is scheduled to be available for sale in June.

     The next meeting of the ACRL Electronic Text Centers
Discussion Group will be in Chicago in June at the American
Library Association meeting.