XPath++

Extensions to XPath for use in Mots 15

C. M. Sperberg-McQueen

30 October 2001

Revised 11 November 2001



This document specifies some extensions to XPath 1.0 for use in Mots 15. All extensions are defined as extension functions; that is, they require no syntactic extension to XPath 1.0 at all. They simply extend the set of functions which Mots-15 processors are expected to understand and implement.
Knowledge of the Mots 15 system is assumed.

1. Functions which return Mots-15 elements

The functions defined here return elements in the Mots-15 namespace; they thus extend the XPath type system.
The word-frequencies function generates a word-frequency list; the gi-frequencies function generates an element-type (gi) frequency list. Elements for these lists are defined in the Mots 15 namespace.
Some other elements defined here (hitlist, string, and number) are intended as simple wrappers around XPath result values: when the evaluation of an XPath expression returns a node set, the result may be wrapped in a mots:hitlist element to be sent to the caller; string and number results may be wrapped in mots:string and mots:number elements, respectively.
If the evaluation of an XPath expression results in an error, the back end may return a mots:error element with a description.

1.1. Function: mots:wordfreqlist word-frequencies(node set, key, dir)

Returns a word-frequency list (word form plus frequency) covering all the word types found in the string values of the nodes in the node set.
The key argument is either type or count; the dir argument is either ascending or descending. By default, a list sorted by type is in ascending order and a list sorted by count is in descending order.
The relevant elements in the mots namespace may be declared thus:
<!ENTITY % Nodeset "CDATA">
<!--* an XPath expression returning a nodeset *-->
<!ENTITY % integer "CDATA">
<!--* assumed declaration: a string of decimal digits *-->

<!ELEMENT mots:wordfreqlist (mots:sourceinfo, mots:word*)>
<!ATTLIST mots:wordfreqlist
          xmlns:mots        CDATA       "http://www.hit.uib.no/mots15" >
<!ELEMENT mots:sourceinfo   (mots:textsource+) >
<!ELEMENT mots:textsource   EMPTY >
<!ATTLIST mots:textsource   
          text              CDATA       #REQUIRED
          context           %Nodeset;   #REQUIRED >
<!ELEMENT mots:word         EMPTY >
<!ATTLIST mots:word
          type              CDATA       #REQUIRED
          count             %integer;   #REQUIRED >
How the implementation identifies words or performs the tokenization is not under user control. Later extensions may allow the caller to specify a set of legal word characters, or a regular expression defining a word.
Because this function returns a literal wordfreqlist element, it is not useful except as a top-level function.
For example, calling word-frequencies("//l/stage",type,ascending) on the text of our English version of Peer Gynt produces the following output:
<mots:wordfreqlist xmlns:mots="http://www.hit.uib.no/mots15"
  context="//l/stage">
<!--* !! Word data produced by worddata.xsl *-->
<mots:word type="across" count="1"/>
<mots:word type="Calls" count="1"/>
<mots:word type="close" count="1"/>
<mots:word type="eyes" count="1"/>
<mots:word type="gradually" count="1"/>
<mots:word type="His" count="1"/>
<mots:word type="Shrieks" count="1"/>
<mots:word type="the" count="1"/>
<mots:word type="yard" count="1"/>
<mots:word type="" count=""/>
</mots:wordfreqlist>
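The counting and sorting behavior described above can be sketched in Python. This is a minimal illustration, not the Mots 15 implementation: the function name word_frequencies, the tokenization rule (a word is a run of letters, digits, or underscores), and the case-insensitive ordering of word types (inferred from the sample output, where "across" precedes "Calls") are all assumptions.

```python
import re
from collections import Counter


def word_frequencies(strings, key="type", direction=None):
    """Count word types in the given strings; return sorted (type, count) pairs."""
    if key not in ("type", "count"):
        raise ValueError("key must be 'type' or 'count'")
    if direction is None:
        # Defaults from the text: types ascending, counts descending.
        direction = "ascending" if key == "type" else "descending"
    if direction not in ("ascending", "descending"):
        raise ValueError("direction must be 'ascending' or 'descending'")

    # Assumption: a word is a maximal run of word characters.
    counts = Counter(w for s in strings for w in re.findall(r"\w+", s))
    reverse = direction == "descending"
    if key == "type":
        # Assumption: word types are ordered without regard to case.
        return sorted(counts.items(), key=lambda item: item[0].lower(), reverse=reverse)
    return sorted(counts.items(), key=lambda item: item[1], reverse=reverse)


# String values of the three stage directions shown in the hits() example.
stage_texts = ["Shrieks.", "His eyes gradually close.", "Calls across the yard:"]
for word, count in word_frequencies(stage_texts, "type", "ascending"):
    print(f'<mots:word type="{word}" count="{count}"/>')
```

Under these assumptions the loop prints the nine non-empty mots:word entries in the same order as the sample output above.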

1.2. Function: mots:gifreqlist gi-frequencies(node set, sort-key)

As for word-frequencies, except that element types (generic identifiers) are counted, sorted, and returned instead of words.
The relevant elements in the mots namespace may be declared thus:
<!ELEMENT mots:gifreqlist (mots:gi*)>
<!ATTLIST mots:gifreqlist
          xmlns:mots      CDATA       "http://www.hit.uib.no/mots15"
          context         %Nodeset;   #IMPLIED >
<!ELEMENT mots:gi         EMPTY >
<!ATTLIST mots:gi
          type              CDATA       #REQUIRED
          count             %integer;   #REQUIRED >
Because this function returns a literal element, it is not useful except as a top-level function.
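A rough Python sketch of the element-type count, using the standard xml.etree.ElementTree module. The function name gi_frequencies, the sample document, and the default sort directions (carried over from word-frequencies) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def gi_frequencies(nodes, sort_key="type"):
    """Count element types (generic identifiers) among the given element nodes."""
    counts = Counter(node.tag for node in nodes)
    if sort_key == "type":
        # Assumed default: element types in ascending order.
        return sorted(counts.items(), key=lambda item: item[0].lower())
    if sort_key == "count":
        # Assumed default: frequencies in descending order.
        return sorted(counts.items(), key=lambda item: item[1], reverse=True)
    raise ValueError("sort_key must be 'type' or 'count'")


# Hypothetical sample document, not taken from the Peer Gynt text.
sample = ET.fromstring(
    "<sp><speaker>Peer</speaker><l>one <stage>Shrieks.</stage></l><l>two</l></sp>"
)
for gi, count in gi_frequencies(sample.iter(), "count"):
    print(f'<mots:gi type="{gi}" count="{count}"/>')
```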

1.3. Function: mots:hitlist hits(node set)

Returns a mots:hitlist element with all the nodes in the input node set.
(N.B. in practice, this function is redundant: when a Mots 15 system is passed any XPath expression that evaluates to a node set, it returns a mots:hitlist element.)
The relevant elements in the mots namespace may be declared thus:
<!ELEMENT mots:hitlist (mots:hit* | mots:string | mots:number)>
<!ATTLIST mots:hitlist
          xmlns:mots      CDATA       "http://www.hit.uib.no/mots15"
          query           CDATA       #IMPLIED >
<!ELEMENT mots:hit        ANY >
<!ATTLIST mots:hit
          text            CDATA      #REQUIRED
          sourceid        CDATA      #REQUIRED
          canonical-reference CDATA  #IMPLIED>
<!--* perhaps text and sourceid should be #IMPLIED? *-->
(N.B. In Mots 15 0.5, the elements were named result and hit.)
Because this function returns a literal element, it is not useful except as a top-level function.
For example, calling hits("//l/stage") on the text of our English version of Peer Gynt produces the following output:
<?xml version="1.0" encoding="utf-8"?>
<!--Search results produced by backend.xsl-->
<mots:hitlist xmlns:mots="http://www.hit.uib.no/mots15" 
	      query="//l/stage">

  <mots:hit text="norm-peergynt.en.xml" 
	    sourceid="stage-46" 
	    canonical-reference="OTA2017-stage-45">
    <stage id="stage-46" 
	   n="OTA2017-stage-45" 
	   type="mix" 
	   TEIform="stage">Shrieks.</stage>
  </mots:hit>
  <mots:hit text="norm-peergynt.en.xml" 
	    sourceid="stage-63" 
	    canonical-reference="OTA2017-stage-14">
    <stage id="stage-63" 
	   n="OTA2017-stage-14" 
	   type="mix" 
	   TEIform="stage">His eyes gradually close.</stage>
  </mots:hit>
  <mots:hit text="norm-peergynt.en.xml" 
	    sourceid="stage-166" 
	    canonical-reference="OTA2017-stage-86">
    <stage id="stage-166" 
	   n="OTA2017-stage-86" 
	   type="mix" 
	   TEIform="stage">Calls across the yard:</stage>
  </mots:hit>
</mots:hitlist>
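The wrapping shown in this output might be sketched as follows with Python's xml.etree.ElementTree. The sketch assumes, on the evidence of the sample only, that sourceid is taken from the hit's id attribute and canonical-reference from its n attribute; the function signature and the one-line test document are hypothetical. ElementTree does not accept absolute paths on an element, so the query is evaluated as ".//l/stage".

```python
import copy
import xml.etree.ElementTree as ET

MOTS_NS = "http://www.hit.uib.no/mots15"
ET.register_namespace("mots", MOTS_NS)


def hits(nodes, text, query=None):
    """Wrap each node of a node set in a mots:hit inside one mots:hitlist."""
    hitlist = ET.Element(f"{{{MOTS_NS}}}hitlist")
    if query is not None:
        hitlist.set("query", query)
    for node in nodes:
        hit = ET.SubElement(hitlist, f"{{{MOTS_NS}}}hit")
        hit.set("text", text)
        # Assumption: sourceid mirrors the id attribute of the hit.
        hit.set("sourceid", node.get("id", ""))
        # Assumption: canonical-reference mirrors the n attribute, if present.
        if node.get("n") is not None:
            hit.set("canonical-reference", node.get("n"))
        node_copy = copy.deepcopy(node)
        node_copy.tail = None  # drop text that followed the node in the source
        hit.append(node_copy)
    return hitlist


doc = ET.fromstring(
    '<div><l>x <stage id="stage-46" n="OTA2017-stage-45" type="mix">Shrieks.</stage> y</l></div>'
)
result = hits(doc.findall(".//l/stage"), "norm-peergynt.en.xml", "//l/stage")
print(ET.tostring(result, encoding="unicode"))
```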

1.4. Wrapper elements mots:string and mots:number

When an XPath expression returns a string or a number, the Mots-15 back end may wrap the result in a mots:string or mots:number element, respectively. If for some reason it is inappropriate for the result to be returned in a hitlist element, the string or number may in turn be wrapped in a result element. These elements may be declared thus:
<!ELEMENT mots:result     (mots:string | mots:number | mots:error) >
<!ELEMENT mots:string     (#PCDATA) >
<!ELEMENT mots:number     (#PCDATA) >
<!ELEMENT mots:error      ANY >

1.5. Wrapper element mots:error

When an XPath expression raises an error, it is desirable that the Mots-15 back end return an error element to report on the problem. The contents of the element may be any well-formed XML.
<!ELEMENT mots:error      ANY >
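One way a back end might choose among the wrapper elements is sketched below in Python. The function name wrap_result is hypothetical, as is the decision to always emit the outer mots:result element and to use the exception message as the content of mots:error; the text above leaves these choices to the implementation.

```python
from xml.sax.saxutils import escape

MOTS_NS = "http://www.hit.uib.no/mots15"


def wrap_result(value):
    """Serialize a string, number, or error as a mots:result element."""
    if isinstance(value, Exception):
        # Assumption: the error description is the exception message.
        inner = f"<mots:error>{escape(str(value))}</mots:error>"
    elif isinstance(value, bool):
        # No wrapper for boolean results is defined in the text.
        raise TypeError("no wrapper element is defined for booleans")
    elif isinstance(value, (int, float)):
        inner = f"<mots:number>{value}</mots:number>"
    elif isinstance(value, str):
        inner = f"<mots:string>{escape(value)}</mots:string>"
    else:
        raise TypeError(f"cannot wrap a value of type {type(value).__name__}")
    return f'<mots:result xmlns:mots="{MOTS_NS}">{inner}</mots:result>'


print(wrap_result("Shrieks."))
print(wrap_result(3))
print(wrap_result(ValueError("unknown function: foo()")))
```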

2. Functions returning node sets

2.1. Function: node set union(node set 1, node set 2, ...)

Returns the union of all the node sets passed as arguments.
(N.B. in practice, this function is redundant because one can use the "|" operator instead.)

2.2. Function: node set intersection(node set 1, node set 2, ...)

Returns the intersection of all the node sets passed as arguments.

2.3. Function: node set diff(node set 1, node set 2)

Returns the set of nodes in node set 1 (ns1) which are not also in node set 2 (ns2).
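The three set operations defined so far (union, intersection, diff) can be illustrated with a small Python sketch over xml.etree.ElementTree elements, which compare and hash by identity. Node sets are modeled here as lists; the order of the results and the sample document are assumptions of the sketch, not part of the definitions.

```python
import xml.etree.ElementTree as ET


def union(*node_sets):
    """All nodes that occur in at least one of the node sets."""
    seen, result = set(), []
    for node_set in node_sets:
        for node in node_set:
            if node not in seen:
                seen.add(node)
                result.append(node)
    return result


def intersection(first, *others):
    """Nodes of the first node set that occur in every other node set."""
    other_sets = [set(node_set) for node_set in others]
    return [node for node in first if all(node in s for s in other_sets)]


def diff(ns1, ns2):
    """Nodes of ns1 that are not also in ns2."""
    excluded = set(ns2)
    return [node for node in ns1 if node not in excluded]


# Hypothetical sample document.
doc = ET.fromstring('<div><p id="a"><note/></p><p id="b"/><stage id="c"/></div>')
paragraphs = doc.findall(".//p")
annotated = doc.findall(".//p[note]")
print([n.get("id") for n in union(paragraphs, doc.findall(".//stage"))])
print([n.get("id") for n in intersection(paragraphs, annotated)])
print([n.get("id") for n in diff(paragraphs, annotated)])
```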

2.4. Function: node set outer(node set 1)

Returns the set of nodes in ns1 which are not contained within (descendants of) any other node in ns1. For ns1 = //p, identical (?) in meaning to
diff("//p","//p/descendant::*")
Cf. sgrep outer operator.

2.5. Function: node set inner(node set 1)

Returns the set of nodes in ns1 which do not contain (are not ancestors of) any other node in ns1. For ns1 = //p, identical (?) in meaning to
diff("//p","//p/ancestor::*")
Cf. sgrep inner operator.
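A minimal Python sketch of outer and inner, again over xml.etree.ElementTree elements. It reads "any of the nodes in ns1" as "any other node in ns1", which is what the diff formulations suggest; the nested div document is a hypothetical test case.

```python
import xml.etree.ElementTree as ET


def outer(ns1):
    """Nodes of ns1 that are not descendants of another node in ns1."""
    nested = {d for node in ns1 for d in node.iter() if d is not node}
    return [node for node in ns1 if node not in nested]


def inner(ns1):
    """Nodes of ns1 that have no descendant in ns1."""
    members = set(ns1)
    return [
        node
        for node in ns1
        if not any(d in members for d in node.iter() if d is not node)
    ]


# Hypothetical sample: d1 contains d2 and d4; d2 contains d3.
root = ET.fromstring('<div id="d1"><div id="d2"><div id="d3"/></div><div id="d4"/></div>')
all_divs = list(root.iter("div"))
print([n.get("id") for n in outer(all_divs)])
print([n.get("id") for n in inner(all_divs)])
```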

2.6. Function: node set exclude(node set 1, node set 2)

Returns copies of the nodes in ns1 from which all the nodes in ns2 have been suppressed. For example:
exclude("//sp", union("//note[@resp='editor']","//stage"))
returns copies of all the speeches (TEI sp elements) in the document, with all the editorial notes (<note resp='editor'> ... </note>) and all the stage directions left out.
(? Is this feasible? As the example shows, it would be very handy for generating word lists and the like.)
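As a rough indication of feasibility, the following Python sketch builds the copies recursively with xml.etree.ElementTree, skipping any element that belongs to ns2. It is an illustration under stated assumptions, not a proposal for the Mots 15 back end: text following a suppressed element is kept in the copy, a node of ns1 that is itself in ns2 is dropped altogether, and the sample speech is invented.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, text):
    """Attach text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_without(node, excluded):
    """Copy node and its descendants, leaving out excluded elements."""
    clone = ET.Element(node.tag, dict(node.attrib))
    clone.text = node.text
    for child in node:
        if child in excluded:
            # Assumption: text after a suppressed element is preserved.
            _append_text(clone, child.tail)
        else:
            child_clone = _copy_without(child, excluded)
            child_clone.tail = child.tail
            clone.append(child_clone)
    return clone


def exclude(ns1, ns2):
    """Copies of the nodes in ns1 with all nodes of ns2 suppressed."""
    excluded = set(ns2)
    return [_copy_without(node, excluded) for node in ns1 if node not in excluded]


# Hypothetical sample speech.
doc = ET.fromstring(
    "<sp><l>Go <stage>Shrieks.</stage>away<note resp=\"editor\">sic</note>!</l></sp>"
)
suppressed = doc.findall(".//note[@resp='editor']") + doc.findall(".//stage")
copies = exclude([doc], suppressed)
print(ET.tostring(copies[0], encoding="unicode"))
```

Because the source tree is never modified, the copies can be handed to a function such as word-frequencies without affecting later queries.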