This document specifies some extensions to XPath 1.0 for use
in Mots 15. All extensions are defined as extension functions;
that is, they require no syntactic extension to XPath 1.0 at all.
They simply extend the set of functions which Mots-15 processors are
expected to understand and implement.
Knowledge of the Mots 15 system is assumed.
1. Functions which return Mots-15 elements
The functions defined here return elements in the Mots-15
namespace; they thus extend the XPath type system.
The word-frequencies function generates a
word-frequency list; the gi-frequencies function
generates an element-type (gi) frequency list. Elements for
these lists are defined in the Mots 15 namespace.
Some other elements defined here (hitlist,
string, and number) are intended
as simple wrappers around XPath result values: when the
evaluation of an XPath expression returns a node set, the result
may be wrapped in a mots:hitlist element to be
sent to the caller; string and number results may be wrapped
in mots:string and mots:number
elements, respectively.
If the evaluation of an XPath expression results in an
error, the back end may return a mots:error
element with a description.
1.1. Function: mots:wordfreqlist word-frequencies(node set, key, dir)
Returns a word frequency list (word form plus frequency) for all the
word types found in the string forms of the nodes found.
The key argument is either type
or count; the dir argument is
either ascending or descending.
By default, words are sorted in ascending order, frequencies in
descending order.
The relevant elements in the
mots namespace may be
declared thus:
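<!ENTITY % Nodeset "CDATA">
<!--* an XPath expression returning a nodeset *-->
<!ENTITY % integer "CDATA">
<!--* an integer count; declared here because the ATTLISTs below reference it *-->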
<!ELEMENT mots:wordfreqlist (mots:sourceinfo, mots:word*)>
<!ATTLIST mots:wordfreqlist
xmlns:mots CDATA "http://www.hit.uib.no/mots15" >
<!ELEMENT mots:sourceinfo (mots:textsource+) >
<!ELEMENT mots:textsource EMPTY >
<!ATTLIST mots:textsource
text CDATA #REQUIRED
context %Nodeset; #REQUIRED >
<!ELEMENT mots:word EMPTY >
<!ATTLIST mots:word
type CDATA #REQUIRED
count %integer; #REQUIRED >
How the implementation identifies words or performs the tokenization
is not under user control. Later extensions may allow the caller
to specify a set of legal word characters, or a regular expression
defining what counts as a word.
Because this function returns a literal wordfreqlist
element, it is not useful except as a top-level function.
For example: calling
word-frequencies("//l/stage",type,ascending)
on the text of our English version of Peer Gynt
produces the following output:
<mots:wordfreqlist xmlns:mots="http://www.hit.uib.no/mots15"
context="//l/stage">
<!--* !! Word data produced by worddata.xsl *-->
<mots:word type="across" count="1"/>
<mots:word type="Calls" count="1"/>
<mots:word type="close" count="1"/>
<mots:word type="eyes" count="1"/>
<mots:word type="gradually" count="1"/>
<mots:word type="His" count="1"/>
<mots:word type="Shrieks" count="1"/>
<mots:word type="the" count="1"/>
<mots:word type="yard" count="1"/>
<mots:word type="" count=""/>
</mots:wordfreqlist>
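The counting and sorting behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the Mots 15 implementation: the tokenizer (runs of word characters), the case-insensitive ordering of word forms, and the Python name word_frequencies with its parameters are all assumptions. Like the sample output above, the sketch omits the mots:sourceinfo element.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

MOTS_NS = "http://www.hit.uib.no/mots15"  # namespace URI from the DTD fragment


def word_frequencies(nodes, key="type", direction=None):
    """Build a mots:wordfreqlist element from the string values of `nodes`.

    Assumption: a word is a run of word characters; the specification
    leaves tokenization to the implementation.
    """
    counts = Counter()
    for node in nodes:
        text = "".join(node.itertext())  # string value of the element
        counts.update(re.findall(r"\w+", text))

    # Defaults from the text: words ascending, frequencies descending.
    if direction is None:
        direction = "ascending" if key == "type" else "descending"
    reverse = direction == "descending"

    if key == "type":
        # Case-insensitive ordering is an assumption based on the sample output.
        items = sorted(counts.items(), key=lambda kv: kv[0].lower(), reverse=reverse)
    else:
        items = sorted(counts.items(), key=lambda kv: kv[1], reverse=reverse)

    ET.register_namespace("mots", MOTS_NS)
    freqlist = ET.Element(f"{{{MOTS_NS}}}wordfreqlist")
    for word, count in items:
        ET.SubElement(freqlist, f"{{{MOTS_NS}}}word", type=word, count=str(count))
    return freqlist
```

Calling word_frequencies(doc.findall(".//l/stage")) on a parsed document would then yield one mots:word child per word type, ready to be serialized with ET.tostring.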
1.2. Function: mots:gifreqlist gi-frequencies(node set, sort-key)
As for word-frequencies, except that element types
(generic identifiers) are found and sorted and returned, instead of
words.
The relevant elements in the
mots namespace may be
declared thus:
<!ELEMENT mots:gifreqlist (mots:gi*)>
<!ATTLIST mots:gifreqlist
xmlns:mots CDATA "http://www.hit.uib.no/mots15"
context %Nodeset; #IMPLIED >
<!ELEMENT mots:gi EMPTY >
<!ATTLIST mots:gi
type CDATA #REQUIRED
count %integer; #REQUIRED >
Because this function returns a literal gifreqlist
element, it is not useful except as a top-level function.
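A corresponding sketch for element-type counts follows. It rests on assumptions the text does not settle: that each node in the input set is counted together with its descendant elements, and that the defaults mirror those of word-frequencies (types ascending, counts descending). The Python name gi_frequencies is illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MOTS_NS = "http://www.hit.uib.no/mots15"  # namespace URI from the DTD fragment


def gi_frequencies(nodes, sort_key="type"):
    """Build a mots:gifreqlist element counting element types under `nodes`.

    Assumption: each node and all of its descendant elements are counted.
    """
    counts = Counter()
    for node in nodes:
        for element in node.iter():  # the node itself, then its descendants
            if isinstance(element.tag, str):  # skip comments and PIs, if any
                counts[element.tag] += 1

    if sort_key == "type":
        items = sorted(counts.items(), key=lambda kv: kv[0])
    else:
        items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    ET.register_namespace("mots", MOTS_NS)
    freqlist = ET.Element(f"{{{MOTS_NS}}}gifreqlist")
    for gi, count in items:
        ET.SubElement(freqlist, f"{{{MOTS_NS}}}gi", type=gi, count=str(count))
    return freqlist
```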
1.3. Function: mots:hitlist hits(node set)
Returns a mots:hitlist element with all the nodes
in the input node set.
(N.B. in practice, this function is redundant because
when one passes a Mots 15 system any XPath expression that
evaluates to a node set, it returns a mots:hitlist
element.)
The relevant elements in the
mots namespace may be
declared thus:
<!ELEMENT mots:hitlist (mots:hit* | mots:string | mots:number)>
<!ATTLIST mots:hitlist
xmlns:mots CDATA "http://www.hit.uib.no/mots15"
query CDATA #IMPLIED >
<!ELEMENT mots:hit ANY >
<!ATTLIST mots:hit
text CDATA #REQUIRED
sourceid CDATA #REQUIRED
canonical-reference CDATA #IMPLIED>
<!--* perhaps text and sourceid should be #IMPLIED? *-->
(N.B. In Mots 15 0.5, the elements were named result
and hit.)
Because this function returns a literal hitlist
element, it is not useful except as a top-level function.
For example: calling
hits("//l/stage")
on the text of our English version of Peer Gynt
produces the following output:
<?xml version="1.0" encoding="utf-8"?>
<!--Search results produced by backend.xsl-->
<mots:hitlist xmlns:mots="http://www.hit.uib.no/mots15"
query="//l/stage">
<mots:hit text="norm-peergynt.en.xml"
sourceid="stage-46"
canonical-reference="OTA2017-stage-45">
<stage id="stage-46"
n="OTA2017-stage-45"
type="mix"
TEIform="stage">Shrieks.</stage>
</mots:hit>
<mots:hit text="norm-peergynt.en.xml"
sourceid="stage-63"
canonical-reference="OTA2017-stage-14">
<stage id="stage-63"
n="OTA2017-stage-14"
type="mix"
TEIform="stage">His eyes gradually close.</stage>
</mots:hit>
<mots:hit text="norm-peergynt.en.xml"
sourceid="stage-166"
canonical-reference="OTA2017-stage-86">
<stage id="stage-166"
n="OTA2017-stage-86"
type="mix"
TEIform="stage">Calls across the yard:</stage>
</mots:hit>
</mots:hitlist>
1.4. Wrapper elements mots:string and mots:number
When an XPath expression returns a string or a number,
the Mots-15 back end may wrap the value in a
mots:string or mots:number element, respectively. If for some reason it is
inappropriate for the result to be returned as a
hitlist
element, the string or number may in turn be wrapped in a
result element. These
elements may be declared thus:
<!ELEMENT mots:result (mots:string | mots:number | mots:error) >
<!ELEMENT mots:string (#PCDATA) >
<!ELEMENT mots:number (#PCDATA) >
<!ELEMENT mots:error ANY >
1.5. Wrapper element mots:error
When an XPath expression raises an error, it is desirable that
the Mots-15 back end return an
error element to
report on the problem. The contents of the element may be any
well-formed XML.
<!ELEMENT mots:error ANY >
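The wrapping rules for strings, numbers, and errors might be applied by a back end roughly as follows. This is a sketch under assumptions: the function name wrap_result, the use of a Python callable to stand in for XPath evaluation, and the choice to report an unsupported result type as an error are all hypothetical. Node-set results (which belong in a mots:hitlist) are not handled here.

```python
import xml.etree.ElementTree as ET

MOTS_NS = "http://www.hit.uib.no/mots15"  # namespace URI from the DTD fragments


def wrap_result(evaluate):
    """Call `evaluate()` and wrap its outcome in a mots:result element.

    `evaluate` is a stand-in for evaluating an XPath expression; it should
    return a str or a number, or raise an exception on error.
    """
    ET.register_namespace("mots", MOTS_NS)
    result = ET.Element(f"{{{MOTS_NS}}}result")
    try:
        value = evaluate()
    except Exception as exc:
        # Any evaluation failure is reported as a mots:error element.
        error = ET.SubElement(result, f"{{{MOTS_NS}}}error")
        error.text = str(exc)
        return result

    if isinstance(value, str):
        ET.SubElement(result, f"{{{MOTS_NS}}}string").text = value
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        ET.SubElement(result, f"{{{MOTS_NS}}}number").text = str(value)
    else:
        # Assumption: anything else is reported rather than silently dropped.
        error = ET.SubElement(result, f"{{{MOTS_NS}}}error")
        error.text = f"unsupported result type: {type(value).__name__}"
    return result
```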
2. Functions returning node sets
2.1. Function: node set union(node set 1, node set 2, ...)
Returns the union of all the node sets passed as arguments.
(N.B. in practice, this function is redundant because
one can use the "|" operator instead.)
2.2. Function: node set intersection(node set 1, node set 2, ...)
Returns the intersection of all the node sets passed as arguments.
2.3. Function: node set diff(node set 1, node set 2)
Returns the set of nodes in ns1 which are not also in ns2.
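The three set operations above can be illustrated with a short Python sketch over lists of ElementTree elements. Nodes are compared by identity rather than by content, and the result keeps the order in which nodes are first encountered; whether a Mots 15 processor returns nodes in document order is not stated in the text, so that ordering is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET  # used only by the usage example below


def union(*node_sets):
    """All nodes that occur in at least one of the node sets."""
    seen, result = set(), []
    for node_set in node_sets:
        for node in node_set:
            if id(node) not in seen:  # compare nodes by identity
                seen.add(id(node))
                result.append(node)
    return result


def intersection(first, *rest):
    """Nodes of `first` that also occur in every other node set."""
    result = list(first)
    for node_set in rest:
        ids = {id(node) for node in node_set}
        result = [node for node in result if id(node) in ids]
    return result


def diff(ns1, ns2):
    """Nodes in ns1 which are not also in ns2."""
    ids = {id(node) for node in ns2}
    return [node for node in ns1 if id(node) not in ids]


if __name__ == "__main__":
    doc = ET.fromstring("<d><p id='1'/><p id='2'/><q id='3'/></d>")
    paragraphs, children = doc.findall("p"), list(doc)
    print([n.get("id") for n in diff(children, paragraphs)])
```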
2.4. Function: node set outer(node set 1)
Returns the set of nodes in ns1 which are not contained within
(are not descendants of) any
of the nodes in ns1. For the node set //p, identical (?) in meaning to
diff("//p","//p/descendant::*")
Cf. the sgrep
outer operator.
2.5. Function: node set inner(node set 1)
Returns the set of nodes in ns1 which do not contain
(are not ancestors of) any
of the nodes in ns1. For the node set //p, identical (?) in meaning to
diff("//p","//p/ancestor::*")
Cf. the sgrep
inner operator.
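The outer and inner definitions can be checked against a small nested example with the following sketch. ElementTree elements carry no parent pointers, so the sketch takes the document root as an extra argument and builds a child-to-parent map; that extra argument and the helper names are assumptions of the illustration, not part of the function signatures specified here.

```python
import xml.etree.ElementTree as ET


def _parent_map(root):
    """Map every element under `root` to its parent element."""
    return {child: parent for parent in root.iter() for child in parent}


def _ancestors(node, parent_of):
    """Yield the ancestors of `node`, nearest first."""
    while node in parent_of:
        node = parent_of[node]
        yield node


def outer(root, ns1):
    """Nodes in ns1 that are not descendants of any node in ns1."""
    parent_of = _parent_map(root)
    members = {id(node) for node in ns1}
    return [
        node for node in ns1
        if not any(id(anc) in members for anc in _ancestors(node, parent_of))
    ]


def inner(root, ns1):
    """Nodes in ns1 that are not ancestors of any node in ns1."""
    parent_of = _parent_map(root)
    containing = set()  # ids of every element that contains some member of ns1
    for node in ns1:
        for anc in _ancestors(node, parent_of):
            containing.add(id(anc))
    return [node for node in ns1 if id(node) not in containing]
```

For three nested p elements followed by a sibling p, outer keeps the outermost nested p and the sibling, while inner keeps the innermost nested p and the sibling.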
2.6. Function: node set exclude(node set 1, node set 2)
Returns copies of the set of nodes in ns1 in which all the nodes in ns2
have been suppressed. For example:
exclude("//sp", union("//note[@resp='editor']","//stage"))
returns copies of all the speeches (TEI
sp elements) in the
document, with all the editorial notes
(<note resp='editor'> ... </note>)
and all the stage directions
left out.
(? Is this feasible? As the example shows, it would be very handy for
generating word lists and the like.)
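On the feasibility question, one possible approach is to copy each node in ns1 recursively and skip any child that belongs to ns2. The sketch below is a minimal illustration under assumptions: it works on ElementTree elements, compares nodes by identity, and keeps the text that follows a suppressed element so that the surrounding running text survives. The original document is left unmodified.

```python
import xml.etree.ElementTree as ET


def exclude(ns1, ns2):
    """Return copies of the nodes in ns1 with all nodes in ns2 suppressed."""
    excluded = {id(node) for node in ns2}

    def copy_without(node):
        clone = ET.Element(node.tag, dict(node.attrib))
        clone.text = node.text
        clone.tail = node.tail
        for child in node:
            if id(child) in excluded:
                # Drop the subtree, but keep the text that followed it.
                if child.tail:
                    if len(clone):
                        clone[-1].tail = (clone[-1].tail or "") + child.tail
                    else:
                        clone.text = (clone.text or "") + child.tail
            else:
                clone.append(copy_without(child))
        return clone

    copies = []
    for node in ns1:
        if id(node) in excluded:
            continue  # a node that is itself excluded yields no copy
        clone = copy_without(node)
        clone.tail = None  # a top-level copy has no following text
        copies.append(clone)
    return copies
```

With the speech copies in hand, a word list could be generated from their string values without the editorial notes or stage directions contributing any words.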