MOTS 15

An interactive concordance system

built from mostly off the shelf parts

C. M. Sperberg-McQueen

17 October 2001

rev. 20 October 2001



Mots 15 is an interactive concordance or full-text retrieval system built mostly out of off-the-shelf software. This document provides a high-level overview of the system and lists some unsolved problems and open opportunities.
The goals of the Mots 15 project are:
From these design goals follow several design principles:

1. Basic interfaces in a query system

1.1. Monoliths

At a very simple level, an interactive query system accepts queries from a user and returns responses drawn from the data.
A monolithic query system.
In systems like Arras and Tact, the single monolithic software package controls everything in the diagram.

1.2. Web interface

With the advent of graphical browsers for the World Wide Web, however, it is possible to provide a fairly attractive interface at a much lower cost than would otherwise be possible. It may still make sense to devise special-purpose user interface software for specific purposes, but we can go a long way without it, just relying on the user to have chosen a Web browser they like reasonably well. The Web, that is, exposes an interface between the user interface and the data in the back end.[1]
A Web-based query system.
This interface sets certain limits to our freedom—we must now use HTML to describe what the user sees[2] and the user's interactions with the server are limited to what can be done using HTML forms—but within those limits we can develop better user interfaces at a lower cost than if we were building from scratch.
Even more important, we can now swap front and back ends in and out. We can experiment with different user interfaces by writing different front-end forms and HTML style sheets. In theory, we can also experiment with different back ends by substituting one for another behind the same front end. In practice, the existing systems built on this model don't easily allow this, because the interface between the front end and the back end varies with the specific product used as the back end, and different commercial products rarely support identical interfaces.

1.3. Mots 15

The Mots 15 system differs from the generic Web-based system primarily by exposing a generic query interface in front of the back-end-specific query interface, in order to buffer the front end and back end from each other.
Basic plan of MOTS query system.
Ideally, this generic query interface should follow some open specification. It should provide all the functionality we want (to keep life simple) and no more (so that it is easy to build back ends if we want to do it ourselves). Since these goals pull in opposite directions, the exact choice depends on the tradeoff between them.
Assuming that we have some suitable query language, and a way to translate from it into the query language of the back end, then any XML query engine may be used as back end.
Using sgrep as the MOTS back end.
An SQL DBMS may be the most flexible back end. The design made by MSM for this would involve a few lightweight scripts which run ‘on top of’ the SQL database system. In this design the SQL system itself would produce not elements but element pointers, which would then be used to extract the elements from a saved copy of the XML or SGML file.
Using SQL DBMS as the MOTS back end.
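The element-pointer idea can be illustrated with a minimal sketch. It assumes that a pointer is a pair of character offsets into the saved copy of the document. The table name `element_index`, its columns, the use of SQLite, and the naive scan for `<p>` tags are all illustrative assumptions, not part of the design described above.

```python
import sqlite3

# Saved copy of the document (a string here; a file on disk in practice).
SAVED_TEXT = "<text><p>Call me Ishmael.</p><p>Some years ago.</p></text>"

def build_index(conn, text):
    """Record one row per <p> element: generic identifier plus offsets.

    The string scan is a toy stand-in for a real SGML/XML parser.
    """
    conn.execute(
        "CREATE TABLE element_index (gi TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    pos = 0
    while True:
        start = text.find("<p>", pos)
        if start == -1:
            break
        close = text.find("</p>", start)
        if close == -1:
            break
        end = close + len("</p>")
        conn.execute("INSERT INTO element_index VALUES (?, ?, ?)", ("p", start, end))
        pos = end

def fetch_elements(conn, text, gi):
    """Ask SQL for pointers, then cut the elements out of the saved copy."""
    rows = conn.execute(
        "SELECT start_pos, end_pos FROM element_index WHERE gi = ? ORDER BY start_pos",
        (gi,),
    )
    return [text[start:end] for start, end in rows]

conn = sqlite3.connect(":memory:")
build_index(conn, SAVED_TEXT)
hits = fetch_elements(conn, SAVED_TEXT, "p")
```

The database never stores the elements themselves, only where to find them, so the lightweight script around it stays small.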
The task of translating from the open query language to the proprietary back end query language is, of course, simplified if the back end accepts the open query language itself.

2. Pieces of Mots 15

Mots 15 is designed to make it relatively simple to specify and implement each piece of the system. The better we succeed in this goal, the easier it will be for us to experiment with different parts of the system, and the easier it will be for eventual users to customize it for their own purposes. Eventually, the designers hope that Mots 15 will grow into a library of reusable and customizable pieces, which individuals and small projects can modify to make useful special-purpose systems.
The Mots 15 design requires the following pieces of software:
  • browser: an off-the-shelf Web browser; this handles the actual display of results on the user's screen and interaction with the user
  • forms: one or more HTML forms which allow the user to specify searches; these produce an HTML-forms data stream which the Web server hands to an appropriate CGI script
  • form-to-query translator: a program to translate the forms data into a query, expressed in the open query language
  • query-to-query translator: a program to translate the query from the open query language into the query language supported by the back end
  • back end: a program which accepts queries in some (possibly proprietary) query language and returns as results some set of SGML or XML elements[3]
  • wrapper: a program which takes the results and places them in a two-level wrapper: (a) an outermost mots:result element and (b) a mots:hit element wrapped around each hit, each with attributes providing useful information about the query and its results
  • SGML-to-HTML translator: a program which takes the wrapped results and translates them into HTML suitable for display in the user's off-the-shelf browser
  • transaction manager: a CGI script to manage the query/response transaction, by calling (or incorporating) the various other programs in this list; it may also be responsible for session management
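How the pieces listed above might fit together can be sketched as a chain of functions called by the transaction manager. This is only an illustration under stated assumptions: the function name `run_transaction`, the parameter names, and the stub stages are all hypothetical, and a real system would run as a CGI script rather than as a plain function call.

```python
from typing import Callable, Dict, List

def run_transaction(
    form_data: Dict[str, str],
    form_to_query: Callable[[Dict[str, str]], str],
    query_to_query: Callable[[str], str],
    back_end: Callable[[str], List[str]],
    wrapper: Callable[[str, List[str]], str],
    to_html: Callable[[str], str],
) -> str:
    """Chain the pieces in the order listed above and return HTML."""
    open_query = form_to_query(form_data)      # forms data -> open query language
    native_query = query_to_query(open_query)  # open query -> back-end query
    hits = back_end(native_query)              # back end returns elements
    wrapped = wrapper(open_query, hits)        # two-level result/hit wrapping
    return to_html(wrapped)                    # wrapped results -> HTML

# Stub stages, purely for illustration.
html = run_transaction(
    {"words": "whale"},
    form_to_query=lambda form: "//p[contains(., '%s')]" % form["words"],
    query_to_query=lambda q: q,  # back end assumed to accept the open language
    back_end=lambda q: ["<p>the whale</p>"],
    wrapper=lambda q, hits: "<result>"
    + "".join("<hit>%s</hit>" % h for h in hits)
    + "</result>",
    to_html=lambda xml: "<html><body>" + xml + "</body></html>",
)
```

Because each stage is passed in as a parameter, any one of them can be replaced without touching the others, which is the point of keeping the pieces separate.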

2.1. The user interface (forms design)

There is no obvious single right way to write a Web interface for a full-text query system; we plan to write several, both to experiment and to allow different users to have different interfaces.
If Mots 15 is ever widely deployed, we expect that much of the user customization will involve modifying the forms interface by rewriting the static HTML.
We expect to produce:
  • a very simple user interface into which users type words (which will be ANDed together and put into a search for paragraphs, speeches, or lines)
  • a more complex form which allows the user to select elements within which to search, by generic identifier
  • a form which allows the user to type in a search expression using some particular query language (in the short term, the ‘open query language’ identified in the diagrams; in the long term, possibly other query languages)
  • as many others as we can think of or find rationales for
The definition of a new form should always include the specification of the fields it uses and their meanings.

2.2. Form to query and query-to-query translation

The translation from a form to the standard query language may be simple or complex, depending on the form. No generalizations are possible.
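For the very simple form described above (words ANDed together within paragraphs, speeches, or lines), the translation might look like the following sketch. It assumes XPath 1.0 as the target language, as in version 0.5, and TEI-style element names (`p`, `sp`, `l`). The form field names `words` and `unit` are hypothetical. Note that `contains()` tests for substrings, not whole words, so this is a rough approximation of word search.

```python
def xpath_literal(s):
    """Quote a string as an XPath 1.0 literal (XPath 1.0 has no escapes)."""
    if "'" not in s:
        return "'%s'" % s
    if '"' not in s:
        return '"%s"' % s
    # Both quote kinds present: split on ' and rejoin the parts via concat().
    parts = s.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % p for p in parts) + ")"

# Hypothetical mapping from form choices to generic identifiers.
UNITS = {"paragraph": "p", "speech": "sp", "line": "l"}

def form_to_query(form):
    """Translate forms data into an XPath 1.0 expression."""
    gi = UNITS[form.get("unit", "paragraph")]
    words = form.get("words", "").split()
    if not words:
        raise ValueError("no search words given")
    tests = " and ".join("contains(., %s)" % xpath_literal(w) for w in words)
    return "//%s[%s]" % (gi, tests)

query = form_to_query({"words": "white whale", "unit": "paragraph"})
```

A more complex form would need a more complex translator, but the output would still be an expression in the same open query language.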
Translation from the standard query language into the back end's query language is apt to be relatively complex: a task for a programmer rather than a power user.
There is no technical reason, but there is a design reason, not to combine the form-to-query and the query-to-query translators. If they are combined, then the front and back ends have direct exposure to each other, which means in turn that it's harder to substitute in a different back end or front end.

2.3. Hit wrappers

The mots:result and mots:hit element types carry information about the query which it's useful to have; as the Mots 15 system matures, we expect to gain a better understanding of what information needs to go here. In the current system (0.5), the elements and their attributes are:
  • the mots:result element, which provides general query information and has as attributes:
    • query, which shows the open-query-language query to which a response is being returned
  • the mots:hit element, which is wrapped around each hit, with attributes
    • text: identifies the document from which the hit came
    • sourceid: gives the unique ID within the source document
    • canonical-reference: gives a canonical reference to this location in the document, for display to the user
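A wrapper producing these elements might be sketched as follows. The namespace URI bound to the `mots` prefix is a placeholder invented for this example, as are the function name `wrap_hits` and the sample data; only the element and attribute names come from the list above.

```python
import xml.etree.ElementTree as ET

MOTS_NS = "http://example.org/ns/mots"  # hypothetical namespace URI
ET.register_namespace("mots", MOTS_NS)

def wrap_hits(query, hits):
    """Wrap hits in mots:result / mots:hit.

    hits is a list of (text, sourceid, canonical_reference, element) tuples.
    """
    result = ET.Element("{%s}result" % MOTS_NS, {"query": query})
    for text, sourceid, canonical_reference, element in hits:
        hit = ET.SubElement(
            result,
            "{%s}hit" % MOTS_NS,
            {
                "text": text,
                "sourceid": sourceid,
                "canonical-reference": canonical_reference,
            },
        )
        hit.append(element)  # the element returned by the back end
    return result

# Made-up sample hit, for illustration only.
para = ET.fromstring('<p id="p1">Call me Ishmael.</p>')
result = wrap_hits("//p", [("moby-dick", "p1", "Ch. 1, par. 1", para)])
xml_text = ET.tostring(result, encoding="unicode")
```

The serialized `xml_text` is what the SGML-to-HTML translator would receive as input.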

2.4. Transaction management

It appears useful to have a central program which does nothing but manage all of the others, and keep track of information which some of them, but not all of them, need, including:
  • style sheet to be used in formatting results (when more than one is available)
  • other user- or session-specific settings

3. Mots 15 development plan

3.1. Version 0.5

Version 0.5 of Mots 15 is a minimal system with
  • a simple, or rather primitive, Web interface
  • support for straightforward XML documents only
  • XSLT stylesheets for XML-HTML translation
  • limited query language
The test bed for version 0.5 is limited to TEI Lite documents acquired from the Oxford Text Archive (with thanks to OTA and to the contributors of those texts).
The query language used is XPath 1.0.
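As a rough illustration of a back end that returns elements in response to a path query, the sketch below uses Python's `xml.etree.ElementTree`. This is a stand-in, not the engine used in version 0.5: ElementTree implements only a limited subset of XPath, and the sample document and the function name `back_end` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tiny made-up document standing in for a TEI Lite text.
DOC = ET.fromstring(
    "<text><body>"
    "<div type='chapter' n='1'>"
    "<p id='p1'>Loomings.</p><p id='p2'>Call me Ishmael.</p>"
    "</div>"
    "<div type='chapter' n='2'><p id='p3'>The Carpet-Bag.</p></div>"
    "</body></text>"
)

def back_end(query, root=DOC):
    """Evaluate a path query and return the matching elements."""
    if not query.startswith("//"):
        raise ValueError("this sketch only handles queries starting with '//'")
    # ElementTree requires a relative path, so '//p' becomes './/p'.
    return root.findall("." + query)

hits = back_end("//div[@n='1']/p")
ids = [h.get("id") for h in hits]
```

The results are always elements, never strings or attribute values, which matches the restriction on back-end results described in this design.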
A fuller description of the Mots 15 interfaces is needed, but belongs in a separate document.

3.2. Open problems and opportunities

There are several obvious challenges for the future development of Mots 15:
  • serious Web interface (room for experiment)
  • ‘XML++’ support
  • external annotation
  • display of parallel versions, textual variation
  • user-supplied annotation
  • proximity searching
  • exploit grammatical annotation of text (how?)
  • support documents with overlap
  • handling multiple stylesheets
  • how to handle requests for word lists and frequency lists within a given region—an extension to the query language? or just extension functions within a standard query language?
  • allowing users to search as if the text were marked up more simply than it is (e.g. with a uniform chapter/section/paragraph/sentence hierarchy)
  • supporting more powerful back ends (either by means of wormholes in the open query language, or by means of a second interface)
  • managing selection of texts from a corpus or collection; federated searches?
On a less demanding but practically important level, also:
  • support for convenient entry of foreign characters[4]
  • confirming that outpointing links from the searched texts are preserved in the output

A. References

Price-Wilkin, John. “Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries”. Public-Access Computer Systems Review 5.3 (1994): 5-21. http://jpw.umdl.umich.edu/pubs/yale.html.

Price-Wilkin, John. “A Gateway between the World Wide Web and PAT: Exploring SGML Through the Web”. Public-Access Computer Systems Review 5.7 (1994): 5-27.

Price-Wilkin, John. “The Feasibility of Wide-area Textual Analysis Systems in Libraries: A Practical Analysis”. Presented at Literary Texts in an Electronic Age: Scholarly Implications and Library Services, the 31st Annual Clinic on Library Applications of Data Processing (University of Illinois at Urbana-Champaign). April 10-12, 1994. Published in the Proceedings of the Clinic. http://jpw.umdl.umich.edu/pubs/dpc.html.

Price-Wilkin, John. “Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW”. D-Lib Magazine May 1997. http://www.dlib.org/dlib/may97/michigan/05pricewilkin.html

B. Acknowledgements

I am grateful to a number of people for the help they have given me in clarifying the ideas of Mots 15. First of all, of course, to Claus Huitfeldt, Paul Meurer, Sindre Sørensen, and Kjersti Berg for agreeing that it's worth trying and for their work in implementing it.
The fundamental idea of Mots 15 became clear in my head while I was listening to a discussion organized by Geoffrey Rockwell and John Bradley at ALLC/ACH '98 in Debrecen, Hungary. I am grateful to them for provoking that clarity. They should not, however, be held responsible for the result: their ideas on software development and the right way to go about building interactive concordance systems are rather different from mine, with some complicated patterns of agreement and disagreement.
Mots 15 incorporates ideas on software development and on interactive concordance and text analysis systems in particular which I have discussed over the years with a number of people. I am grateful to Willard McCarty, Steve DeRose, and Fotis Jannidis for discussions that have relatively obvious links to elements of this design. Less obvious in detail, but pervasive, are my debts to Lou Burnard.
A crucial debt is to discussions with Geoff Bilder, to which I owe my conviction that the query language interface is a crucial determinant; in that connection I also acknowledge a debt to Susan Hockey, who organized the meeting at which those discussions took place, and to Peter Batke (who provided the number 15 in the name of the system).
The key ideas of the system, of course, I learned from John Price-Wilkin years ago; he made them seem so natural that when I formulated them again for myself I thought, for a while, that they were new.
My thanks to all of these, and to all of the others I should have named but have not.

Notes

[1] The basic idea for such a system is, of course, not original. It inheres in the Common Gateway Interface (CGI) developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. It was applied to interactive full-text systems by John Price-Wilkin at the University of Michigan and the University of Virginia (using Pat as the back end) and by Mark Olsen at the University of Chicago (using Philo-Logic as the back end). Price-Wilkin has described the basic outline of the Michigan and Virginia systems in a number of articles; see the references.
[2] We can use arbitrary XML if we are willing to require that the user have an XML-capable browser like Internet Explorer 5.5 or 6.0, but for simplicity in exposition I assume that we wish to support older browsers as well.
[3] Note that in this model it is not possible to search the database for any results other than elements: no strings, no attribute values. This may prove too restrictive in practice, but has not so far. If it does prove to be a problem, we may relax this restriction, given that we assume that each hit is wrapped in an element before the results are processed, anyway.
[4] By foreign characters I mean characters foreign to the user or the user's environment, in particular characters not present on the user's keyboard. A character which is foreign in one environment may or may not be foreign in another.