
Adding Semantics to SGML Databases

By Subhasish Mazumdar, Weifeng Bao, Zhengang Yuan, and Jonathan Price

New Mexico Institute of Mining and Technology

Presented at the Electronic Publishing ’98 Conference, Saint Malo, France (April 1-4, 1998), and included in the proceedings volume: Hersch, R., Andre, J., & Brown, H. (Eds.), Electronic Publishing, Artistic Imaging, and Digital Typography, Vol. 1375 in the Lecture Notes in Computer Science Series, Springer-Verlag, Berlin, 1998, pp. 563-574.

Digest (Original fulltext encumbered with copyright by Springer-Verlag).

Technical writers who must maintain complex, delicately interconnected information often look to object-oriented SGML databases as a way of storing, retrieving, reusing, and reassembling the constituent objects of new documents, created on the fly to respond to a particular customer’s needs. The SGML tags help identify structural packages such as procedures, illustrations, or glossary items; in a large database, then, writers can filter out unwanted material, locating only the structural pieces they need for the job in hand. For instance, to produce a quick reference, a writer might pull up the names of procedures and their steps, but not the introductions or explanations. Similarly, a user could search for illustrations only. But illustrations of what? With no subject matter defined, such searches result in hundreds, even tens of thousands of hits.
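The quick-reference case above can be sketched in a few lines. This is only an illustration, not the system described in the paper: it uses Python's standard XML parser on a well-formed fragment as a stand-in for an SGML database, and the element names (manual, procedure, title, intro, step, illustration) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; a real SGML document would follow its own DTD.
SAMPLE = """
<manual>
  <procedure>
    <title>Replace the filter</title>
    <intro>Filters wear out after six months.</intro>
    <step>Open the rear panel.</step>
    <step>Slide the old filter out.</step>
  </procedure>
  <illustration>
    <caption>Rear panel</caption>
  </illustration>
</manual>
"""


def quick_reference(markup):
    """Keep each procedure's title and steps; skip introductions and everything else."""
    root = ET.fromstring(markup)
    card = []
    for proc in root.iter("procedure"):
        title = proc.findtext("title", default="")
        steps = [step.text for step in proc.findall("step")]
        card.append((title, steps))
    return card


for title, steps in quick_reference(SAMPLE):
    print(title)
    for number, text in enumerate(steps, start=1):
        print(f"  {number}. {text}")
```

The filter is purely structural: it knows that something is a procedure, but nothing about what the procedure is about, which is the gap the rest of this digest addresses.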

To speed up access to the precise passages wanted, end users and writers need a way to narrow their searches by defining the precise subject matter (the meaning, or semantics) as well as the structural elements they seek. We recommend giving every object class an attribute called Subject Matter (referred to below simply as Subject). We suggest that whatever values we assign to the Subject attribute of the document as a whole should trickle down to every object within it, and that a writer should enter additional values in the Subject attribute for each chapter, values which would then apply to every object within that chapter. No one wants to have to fill out a form identifying the subject matter of every paragraph. But occasionally a paragraph strays so far from the main topic of the chapter that a user would never discover it unless the writer adds a few more values to that paragraph’s Subject attribute.
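One way to picture the trickle-down of Subject values is as inheritance along the containment tree. The sketch below is a minimal illustration under that assumption; the class name DocObject, its fields, and the sample subject values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocObject:
    """One structural object (document, chapter, paragraph, ...) in the database."""
    tag: str                                           # structural type
    own_subjects: set = field(default_factory=set)     # values typed in by the writer
    parent: Optional["DocObject"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child object and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def effective_subjects(self):
        """Own values plus every value inherited from the enclosing objects."""
        inherited = self.parent.effective_subjects() if self.parent else set()
        return inherited | self.own_subjects


manual = DocObject("document", {"payroll"})
chapter = manual.add(DocObject("chapter", {"tax withholding"}))
ordinary = chapter.add(DocObject("paragraph"))
stray = chapter.add(DocObject("paragraph", {"direct deposit"}))

print(sorted(ordinary.effective_subjects()))
print(sorted(stray.effective_subjects()))
```

The ordinary paragraph needs no form filled out, because it picks up the values of its chapter and document. Only the stray paragraph carries a value of its own.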

Unfortunately, whenever writers are asked to create keywords for passages, the results are, at best, uneven. Many writers just echo the title of the page or section; few add synonyms; very few attempt to rethink the material from the point of view of a newcomer. The ability to pour values down from the larger package, such as a document or chapter, to all the objects within offers us some increase in consistency; and the ability to add values as we descend to lower-level objects gives us more precision.

If the organization has adopted an enterprise data model as part of an effort to re-engineer its processes, technical writers should be encouraged to draw on the concepts in that schema as values in the Subject field, so that the terms in the documentation match those in the business. In this way, a visitor may ask to look at the diagram of the enterprise data model as a way to locate the "correct" term for a subject, then request a search of the documentation database for that subject, in all structural elements or in just a few. A writer may also want to create a thesaurus entry for major topics, entering synonyms or related terms as additional Subject values. In these ways, authors can add semantic information to speed up access to the particular chunks they, or a user, might need.
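As a rough sketch of how model terms and thesaurus entries might work together, the example below resolves whatever word a visitor types to the "correct" term before searching. The term list, the thesaurus entries, and the function names are all assumptions made for illustration.

```python
# Hypothetical terms taken from an enterprise data model.
MODEL_TERMS = {"purchase order", "invoice", "vendor"}

# Hypothetical writer-maintained thesaurus: synonym -> model term.
THESAURUS = {
    "po": "purchase order",
    "bill": "invoice",
    "supplier": "vendor",
}


def resolve(query):
    """Return the model term for a user's word, or None if nothing matches."""
    key = query.strip().lower()
    if key in MODEL_TERMS:
        return key
    return THESAURUS.get(key)


def search(objects, query):
    """Return ids of objects whose Subject values include the resolved term."""
    term = resolve(query)
    if term is None:
        return []
    return [obj_id for obj_id, subjects in objects if term in subjects]


objects = [
    ("proc-1", {"vendor", "invoice"}),
    ("fig-7", {"purchase order"}),
]
print(search(objects, "Supplier"))
```

A visitor who knows only the word "supplier" still reaches the objects filed under "vendor", without the writer having to repeat the synonym on every object.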

But what happens when change occurs? Here are some examples of the kinds of transformation that this approach must handle.

  • When a term is excised from the enterprise data model but remains in legacy documents, the system administrator must continue to allow a search on the term, so users of older products can continue to locate the information they need.
  • Similarly, when some object in the enterprise data model, such as a department, has its components changed (as in a reorganization), customers who were used to the old schema must be allowed to continue to search on the earlier term.
  • And when a writer comes up with an important term that has not been recognized in the enterprise data model, there must be some mechanism for having the concept recognized, and included.
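The three cases above suggest keeping a small registry alongside the enterprise data model. The following is one possible sketch, not the mechanism described in the paper: the class TermRegistry, its method names, and the sample terms are hypothetical.

```python
class TermRegistry:
    """Tracks current, retired, and proposed subject terms (illustrative only)."""

    def __init__(self, current_terms):
        self.current = set(current_terms)
        self.retired = set()      # removed from the model, still searchable
        self.successors = {}      # retired term -> current terms that replaced it
        self.proposed = set()     # writers' suggestions awaiting recognition

    def retire(self, term, successors=()):
        """Remove a term from the model but keep it available for searches."""
        self.current.discard(term)
        self.retired.add(term)
        if successors:
            self.successors[term] = set(successors)
            self.current.update(successors)

    def propose(self, term):
        """Record a writer's term that the model does not yet recognize."""
        if term not in self.current:
            self.proposed.add(term)

    def accept(self, term):
        """Recognize a proposed term and include it in the model."""
        self.proposed.discard(term)
        self.current.add(term)

    def search_terms(self, term):
        """Terms to match against Subject values for a user's query term."""
        if term in self.current:
            return {term}
        if term in self.retired:
            return {term} | self.successors.get(term, set())
        return set()


registry = TermRegistry({"customer care", "billing"})
registry.retire("customer care", successors={"support", "returns"})
registry.propose("warranty")
print(sorted(registry.search_terms("customer care")))
```

A customer who remembers the old department name still gets results, both from legacy documents tagged with the old term and from newer ones tagged with its successors.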

By running nightly checks on the pointers and creating an error table, the system administrator must make sure that anyone using the current enterprise data model can locate the information, and that anyone who recalls earlier schemas can do so as well. If an entity is scratched from the enterprise data model, but has never referred to anything mentioned in the documentation database, it can be ignored. If every entity mentioned in a document has been eliminated from the enterprise data model, the administrator should consider archiving the document offline.
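A nightly check of this kind could look something like the sketch below. It assumes each document is represented simply by the set of terms its Subject values point to; the function name, the shape of the report, and the sample data are invented for the example.

```python
def nightly_check(documents, current_terms, retired_terms):
    """Compare document subject terms with the model and report what needs attention."""
    error_table = []          # (document id, term) pairs the model has never known
    archive_candidates = []   # documents whose every term has been retired
    used_terms = set()

    for doc_id, subjects in documents.items():
        used_terms |= subjects
        for term in sorted(subjects - current_terms - retired_terms):
            error_table.append((doc_id, term))
        if subjects and subjects <= retired_terms:
            archive_candidates.append(doc_id)

    # Retired terms that no document refers to can safely be ignored.
    ignorable = retired_terms - used_terms
    return {
        "errors": error_table,
        "archive": archive_candidates,
        "ignorable": ignorable,
    }


documents = {
    "guide-a": {"invoice", "fax orders"},
    "guide-b": {"fax orders"},
    "guide-c": {"invoice", "rebates"},
}
report = nightly_check(
    documents,
    current_terms={"invoice"},
    retired_terms={"fax orders", "telex"},
)
print(report)
```

Here "rebates" lands in the error table because the model has never recognized it, guide-b becomes a candidate for offline archiving, and "telex" can be ignored because no document mentions it.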

The downside of our approach, then, is that it involves regular maintenance. But we argue that increasingly the individual writer will be responsible for constant maintenance, and will be able to perform such routines after having been given instruction by the system administrator.

The benefits of our approach are many:

  • Increased consistency of values in the Subject attribute, leading to more successful searches
  • Increased coherence between the enterprise data model and the terms used in the documentation
  • The use of the enterprise data model as a visual table of contents for people who are trying to locate something but do not know the exact terms—a visual model that allows top-down exploration, and the creation of an inner mental map of the organization, which in turn speeds up and improves navigation through the documentation.
  • The combination of semantic with structural material offers a powerful way to filter out unwanted material.
  • Using other attributes such as Date, Owner, Creator, or Filetype could help refine such a search even further.
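To make the combined filter concrete, here is a small sketch that narrows a search by structural tag, Subject value, and two further attributes. The records, field names, and the find function are assumptions for illustration, not part of the paper's system.

```python
from datetime import date

# Hypothetical database records, one per object.
OBJECTS = [
    {"id": "fig-12", "tag": "illustration", "subjects": {"invoice"},
     "owner": "lee", "date": date(1997, 11, 3)},
    {"id": "fig-30", "tag": "illustration", "subjects": {"invoice", "vendor"},
     "owner": "kim", "date": date(1998, 2, 10)},
    {"id": "proc-4", "tag": "procedure", "subjects": {"invoice"},
     "owner": "lee", "date": date(1998, 1, 15)},
    {"id": "fig-41", "tag": "illustration", "subjects": {"payroll"},
     "owner": "kim", "date": date(1998, 3, 1)},
]


def find(objects, tag=None, subject=None, owner=None, since=None):
    """Return objects that pass every filter supplied; None means 'do not filter'."""
    hits = []
    for obj in objects:
        if tag is not None and obj["tag"] != tag:
            continue
        if subject is not None and subject not in obj["subjects"]:
            continue
        if owner is not None and obj["owner"] != owner:
            continue
        if since is not None and obj["date"] < since:
            continue
        hits.append(obj)
    return hits


recent = find(OBJECTS, tag="illustration", subject="invoice", since=date(1998, 1, 1))
print([obj["id"] for obj in recent])
```

Asking for illustrations alone returns three objects in this sample. Adding the subject and the date leaves one, which is the narrowing effect the list above describes.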

But why not just do a full-text search? Because they rely on word indexes, most full-text searches fail to locate passages that happen not to mention the word being queried; ignore attribute values; know nothing of the enterprise data model; and provide, at best, a thousand points of light, rather than an overall picture, lit up with individual topics.

The most powerful filter is the human mind, and our approach enables writers to provide meaningful access to their work, through the semantics of values in the attributes of each object in the database.



Copyright 1998-2001 Jonathan and Lisa Price, The Communication Circle