Chapter 3
Document Type Definitions
Writing valid XML
In the previous chapter, we went through the basic
rules of XML, and how to create a well-formed XML document instance.
In this chapter we will go a little bit further by creating the
structural rules for the same document. If you followed the example
in Chapter two, you ended up with a document very similar to this
one:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
<CD>
<ARTIST>Queen</ARTIST>
<TITLE>A Night at the Opera</TITLE>
<YEAR function="release">1975</YEAR>
</CD>
<CD>
<ARTIST>Queen</ARTIST>
<TITLE>A Day at the Races</TITLE>
<YEAR function="release">1976</YEAR>
<YEAR function="purchase">1993</YEAR>
</CD>
<CD>
<ARTIST>Tom Petty &amp; the Heartbreakers</ARTIST>
<TITLE>Echo</TITLE>
<YEAR function="release">1999</YEAR>
</CD>
</COLLECTION>
If it didn't look exactly like this one, it does
not matter very much since we will start more or less from scratch
again. This is not because I don't like the structure we have so
far, or just to be mean. It is simply because we have to go through
the process of creating the DTD step by step. So, just remember
what the well-formed document looks like and we will create one
just like it that is not only well-formed, but valid.
Before we start discussing this in more detail, there are a few
points I would like to make clear. For practical purposes, it is very
useful if the examples you create as you go along reside on a server
that is accessible on the World Wide Web. This is not strictly
necessary, but it will make things a lot easier when we start to
create document type definitions that are separate from the XML
document instance. The reason is that the XML validator I use to
illustrate the examples (the Scholarly Technology Group XML Validation
Form, at the Web address I specified in the previous chapter:
http://mama.stg.brown.edu/service/xmlvalid/) only validates
documents that are accessible through a Web address.
If you do not have access to a Web server, you may consider setting
one up on your local machine (assuming you have an Internet
connection, that is). This is not as complicated as many people think.
All you need to do is download a server program like Apache (freeware)
or WebSite (shareware) and install it. Then you will be able to access
your files over the Internet by using your machine's IP address as the
URL. As an example, my machine has the IP address 129.177.24.81,
and my test documents are available at http://129.177.24.81/xml/.
Even if you do have access to a more specialised Web server, this
may be worth doing while you create the files, so that you do not
have to work against a remote server whenever you change a document.
If we look at the file we have already created through a Web browser,
this is what it should look like:

Figure: Our well-formed document instance on the WWW
Please note that at the time of this writing, only
Microsoft Internet Explorer 5 supports the XML standard. If you
try to view this page in any other browser, you will either get
an error message or be prompted to download the file.
About the Document Type Definition (DTD)
A document type definition provides a list of the
elements, tags, attributes and entity references contained in an
XML document and describes their relationships to each other. We
say that the DTD specifies a set of rules for the structure of a
document. DTDs can be included in the file that contains the document
they describe, or they can be placed in a separate file of their own.
A DTD kept in a separate file is known as an external DTD. In order
to use an external DTD, a link inside the prolog of the XML document
must point to the address where the DTD is located, either locally on
your own machine or globally via a URL (Uniform Resource Locator,
commonly known as a Web address). The primary benefit of external DTDs
is that they can be shared by several documents of the same type. This
allows people who work within the same field to agree upon standards
of encoding. If everyone follows these standards, information exchange
and processing become a lot easier. To use an example concerning the
description of archival material: the Society of American Archivists
has published a DTD called EAD (Encoded Archival Description) that
deals specifically with the encoding of all sorts of archival
material. It is a very general, yet functional description of how to
logically structure the documentation of any archival unit. By using
an encoding standard like this, archivists can enter data and
immediately share it with others.
We will get back to the use of external DTDs later, but for the
time being we will keep things a bit simpler by writing the DTD
into the document instance we have already created.
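Purely as a preview, and as a sketch under assumptions rather than a finished example, this is roughly what the link to an external DTD could look like. The file name collection.dtd is hypothetical, and we assume that the file sits next to the document and contains the single line <!ELEMENT COLLECTION (#PCDATA)>. Note that standalone is set to "no" here, since the document now relies on a declaration kept outside the file:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- collection.dtd is a hypothetical file name, resolved relative to this document -->
<!DOCTYPE COLLECTION SYSTEM "collection.dtd">
<COLLECTION>
This is the outermost element of our example
</COLLECTION>
```

Any number of documents could point to the same file in this way, which is what makes a shared encoding standard possible.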
Where do we place the DTD?
The DTD is placed inside the prolog of the document,
directly after the XML declaration, but before the actual document
data begins. The DTD consists of a number of markup declarations,
dealing with particular elements, entities and attributes. For the
time being we will concentrate on element declarations. If we start
all over again with our CD example, this is what the outermost structure
would look like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
This is the outermost element of our example
</COLLECTION>
This contains only the XML declaration and the
root element. The example below shows the same document with a simple
DTD which declares that this document can contain an element named
"COLLECTION", and that this element may contain text.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (#PCDATA)>
]>
<COLLECTION>
This is the outermost element of our example
</COLLECTION>
If you paste this into the text field at the STG
Validator site, you should get the message that: "Document
Validates OK". We have just made a very simple, valid XML document.
A DTD written inside the document always starts with <!DOCTYPE and
always ends with ]>. These tell the XML processor where the DTD starts
and ends, respectively. Directly after <!DOCTYPE comes the name of the
root element, in this case COLLECTION, followed by a [. This is not
optional: a valid document must always name its root element like
this. Between the two square brackets come all of the element and
attribute declarations, including one for the root element. In our
example, the root element is the only one present, and it may contain
parsed character data (#PCDATA).
An XML validator reads through the document and reports the errors
it finds. If it doesn't find any errors, the document can then be
passed on to an application that understands XML (in our case the
Microsoft browser, Internet Explorer 5).
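To illustrate the point about the root element name, here is a small sketch of a document that is well-formed but should be rejected by a validating parser. The name CATALOG is a hypothetical mistake introduced only for this illustration, and the exact wording of the error will depend on the validator you use:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Invalid on purpose: the DOCTYPE names CATALOG, but the root element is COLLECTION -->
<!DOCTYPE CATALOG [
<!ELEMENT COLLECTION (#PCDATA)>
]>
<COLLECTION>
This is the outermost element of our example
</COLLECTION>
```

Changing CATALOG back to COLLECTION gives the valid document shown earlier.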
Preparing to make a DTD
Before you start to actually create a DTD, it is
probably a good idea to think through the structure of the things
you are going to describe in your document. In other words: how
can you make a sensible structure for your data? If we think in terms
of our CD collection again, how can we make this work? On the most
basic level, a collection consists of many CDs. Our <COLLECTION>
element will then have to contain any number of <CD> elements. So far,
so good. The next question would be: what kind of information does
each CD contain? Well, the name of the artist and the name of the
album are the two most important items, but there is also total
playing time, production year and label. Maybe you wish to add some
personal information as well, like the year of purchase or a rating of
how well you like it. This means that we must allow for all of the
following tags to be used inside the <CD> element:
<ARTIST>, <TITLE>, <LABEL>, <YEAR> (of purchase and release),
<TIME>, <COMMENT> and <RATING>.
So far we have looked at the general description
of the CDs in the collection. But wait a minute! What about records
that contain more than one CD, like for instance an opera? Maybe
we should rename the <CD> element to <RECORD> so that the element
better describes the content? In that case we can allow a new sub
element called <DISC> that allows description of each individual CD
in terms of individual track names, track lengths etc. To get an
overview of the structure, we can try to make a well-formed document
out of the information:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
<RECORD>
<ARTIST/>
<TITLE/>
<TIME/>
<LABEL/>
<YEAR/>
<RATING/>
<COMMENT/>
<DISC>
<TRACK>
<NAME/>
<TIME/>
</TRACK>
</DISC>
<DISC>
<TRACK>
<NAME/>
<TIME/>
</TRACK>
</DISC>
</RECORD>
</COLLECTION>
The above example is a record that contains two
CDs with one track on each CD. All of the elements that would normally
contain information have been left empty. This is done because,
for the time being, we only need to check if this kind of structure
is allowed. If we run it through a parser, it tells us that this
is indeed a well-formed document. Let's decide that this is a structure
that we can live with, and not complicate things any further. As
you may have noticed, the element <TIME>
is used in two different places in the structure. This is allowed
in XML, but the element is still declared only once in the DTD, so
the same declaration applies in both places.
Creating the DTD
A DTD is designed to specify exactly what is and
what isn't allowed in an XML document instance. This isn't as complicated
as it may seem. It simply means that you have to specify all elements,
attributes and entity references, as well as their relationships
to each other. Even if it is not very difficult to do this, it is
relatively time consuming and you have to be very accurate. This
is why it is a good idea to think through the structure of the document
before you start creating the DTD. As a final point before we start
creating the DTD: remember that DTDs are very conservative. This
means that everything that you have not explicitly permitted is
forbidden, and interpreted as a mistake by the XML processor.
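As a small illustration of how conservative a DTD is, the sketch below reuses the one-element DTD from earlier in this chapter and adds an element that has not been declared. The element name NOTE is hypothetical and chosen only for this example. The document is still well-formed, but a validating parser should report it as invalid:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (#PCDATA)>
]>
<COLLECTION>
<!-- NOTE has no declaration in the DTD, and COLLECTION only allows text -->
<NOTE>Not permitted by the DTD</NOTE>
</COLLECTION>
```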
When you are building a DTD, it is usually easiest
to work from the outermost element (the root) and work your way
hierarchically down the structure of the document. This allows you
to build the DTD and the content of the document together, and check
for validity along the way. The first thing to do is to specify
the root tag. As we have already seen, a DTD always begins with
<!DOCTYPE rootname [
and ends with
]>
This does, however, only specify the name
of the root tag, not what it is allowed to contain. We must
therefore create a so-called element type declaration for
the root element. An element type declaration specifies the name
of the tag, which children are allowed inside that tag and whether
or not the tag is empty (tags are expected to hold content in XML
unless they are explicitly declared as empty).
Every single tag used in a valid XML document must be declared once
(and only once) in the DTD. An element type declaration for our root
element looks like this:
<!ELEMENT COLLECTION ANY>
An element type declaration always starts with <!ELEMENT
(like the document type declaration, this is case sensitive) and
ends with >. It basically includes two things: the name of
the element that is being declared - in this case COLLECTION
- and the allowed contents of that tag. The ANY keyword that we have
used in this example means that all declared elements and parsed
character data are allowed inside the tag. This is our example
complete with DTD so far:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION ANY>
]>
<COLLECTION>
This is where the other elements will appear
</COLLECTION>
Adding Elements
With the root element in place, it is time to add
the other elements in the document. Let's start with the ones we
referred to as 'general information' above. We stated that every
CD contains information on artist, title, playing time, label, year
and other optional information and comments. The rules for these
elements have to be added in element type declarations similar to
the one for <COLLECTION>, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION ANY>
<!ELEMENT RECORD ANY>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
<YEAR>1998</YEAR>
<LABEL>Columbia Records</LABEL>
<TIME>250 minutes</TIME>
<RATING>7/10</RATING>
</RECORD>
</COLLECTION>
Since the <RECORD> element is the one that contains all of the other
elements, its declaration uses the ANY keyword just like the root
element of our document. For all of the other new elements, however,
we have used the #PCDATA keyword. This means that these elements may
contain only parsed character data, or simply put: text. We have done
this because these elements are only supposed to contain textual
information about our CD, and not other sub elements. This means that
we cannot nest an element inside one of the other declared elements,
like this for example:
<LABEL>Columbia Records
<YEAR>1998</YEAR>
</LABEL>
Even if this is well-formed according to the rules
of XML, the DTD we have just created does not allow us to do it.
The element type declaration for <LABEL>
specifically states that it can contain #PCDATA
and nothing else.
In the example above, we said that the <COLLECTION>
element could contain any kind of child elements and character data.
This is very useful if you work with data that has a very loose
structure, but in our case we know that the root element <COLLECTION>
will only contain elements of the type <RECORD>.
To gain better control over the structure of the document, it is
a good idea to specify this in the DTD. This is done by replacing
ANY with (RECORD) in the element type declaration for <COLLECTION>.
All content rules except the keywords ANY and EMPTY (which we will
meet later in this chapter) must be written inside parentheses,
otherwise you will not be able to validate the document. If an element
can take more than one subelement, all of the allowed subelements must
be listed inside the element type declaration. A stricter rule for the
<RECORD> element will look like this:
<!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING)>
If we make the suggested changes to these two elements
the document will still be valid, but we will have reduced the flexibility
of the document structure.
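Put together, the two stricter declarations could be combined with the earlier example as in the sketch below. Nothing new is introduced here; it is simply the previous document with ANY replaced in the declarations for COLLECTION and RECORD:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!-- COLLECTION now accepts exactly one RECORD and nothing else -->
<!ELEMENT COLLECTION (RECORD)>
<!-- RECORD requires these six children, in this order -->
<!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
<YEAR>1998</YEAR>
<LABEL>Columbia Records</LABEL>
<TIME>250 minutes</TIME>
<RATING>7/10</RATING>
</RECORD>
</COLLECTION>
```

Swapping two of the child elements, for instance placing YEAR before TITLE, should be enough to make this document invalid, since the commas in the declaration fix the order.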
Customising Elements
Before we continue with our example, we need to
have a look at a few features and problems regarding the rules in
element type declarations. Some of them are relevant in the above
example and some will become relevant when we continue with the
rest of the CD information.
Optional Elements.
So, what if you have elements that do not contain
any information? Say for example that you haven't decided on a rating
yet. Let's remove the <RATING> element from the document and see what
happens when we try to validate it. You will get this error message
from the validator:
error (1154): content ends prematurely for element: RECORD (expecting: RATING)
This tells us that the document is invalid because
it lacks an element that is specified in the DTD. In other words,
the elements we have specified as legal subelements in the element
type declaration for <RECORD> are not only allowed, but also required.
To avoid this kind of problem, we need to specify in the DTD that some
elements are optional. This is done by adding a question mark to the
element name in the parent element's element type declaration. For our
example, this means that we can make this change to <RECORD>, which is
the parent element of <RATING>:
<!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING?)>
Once this change has been made, the document is
valid again, even without the <RATING> element present. While we are
at it, we can make some of the other elements, like YEAR, LABEL and
TIME, optional as well. This means that only ARTIST and TITLE
will remain required information in our example.
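With those changes in place, a record that carries only the two required elements should validate. A minimal sketch, assuming the root element is still declared as (RECORD) at this stage:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (RECORD)>
<!-- The question marks make YEAR, LABEL, TIME and RATING optional -->
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
</RECORD>
</COLLECTION>
```

The optional elements must still be declared, and if they are used they must still appear in the order given in the declaration.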
Designating zero or more children.
Now that we have dealt with optional elements,
let's turn towards another common problem. Since most CD collections
consist of more than one CD, we will have to add several <RECORD>
elements inside the top-level <COLLECTION>.
Let's see what happens if we add a new CD to the collection at this
stage. Remember, since we have made some of the elements optional,
we can save some time by adding only the artist and the title, so
try to add these lines to our document:
<RECORD>
<ARTIST>Queen</ARTIST>
<TITLE>A Night at the Opera</TITLE>
</RECORD>
This should give us the following error message:
error (1152): element violates enclosing tag's content model: RECORD (expecting: [nothing])
So, what does this mean? It tells us that the
COLLECTION element, which encloses the new RECORD, does not expect
anything more than the single RECORD it already contained before we
changed the example. We can change this in a similar way to the
example with optional elements, but instead of a question mark, we
add an asterisk:
<!ELEMENT COLLECTION (RECORD*)>
With this little change in place, the XML processor should accept the document.
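As a sketch of the result, here is the collection with both records and the asterisk added to the root declaration. The declarations are the ones we have built up so far:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!-- The asterisk allows zero or more RECORD elements -->
<!ELEMENT COLLECTION (RECORD*)>
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
</RECORD>
<RECORD>
<ARTIST>Queen</ARTIST>
<TITLE>A Night at the Opera</TITLE>
</RECORD>
</COLLECTION>
```

Because the asterisk means zero or more, a COLLECTION with no RECORD elements at all would also be accepted by this DTD.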
Several elements of the same type.
So far we have dealt with elements that are optional
or elements that are allowed to occur zero or more times within
the same parent element. The final basic rule concerning the occurrences
of child elements deals with elements that are required to appear
at least once within a parent element. If we use our familiar CD
example again, you may remember that we said that the different
titles in our collection may be contained on more than one single
CD, like an opera for example. Each individual item in the collection
must then contain at least one element of the type DISC.
Each disc is divided into several tracks that contain information
on song names and track length. The length and names of the individual
tracks can be made optional, but each CD must contain at
least one track. If we want to specify this in the DTD, this is
done by adding a plus sign to the element in the element type
declaration of the parent. A more complete DTD for our example, with
the individual songs, would look like this:
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (RECORD*)>
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?, COMMENT?, DISC+)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
<!ELEMENT COMMENT (#PCDATA)>
<!ELEMENT DISC (TRACK+)>
<!ELEMENT TRACK (NAME*,TIME?)>
<!ELEMENT NAME (#PCDATA)>
]>
If we go through this step by step, this is how
we can interpret the DTD:
- We are dealing with a document type named COLLECTION. This means
that the root element must be called <COLLECTION>.
- The element type declaration for the root element states that
it may contain any number of RECORD elements - nothing else.
- Each RECORD must contain information on exactly one ARTIST and one
TITLE. Furthermore, it may contain information on YEAR, LABEL, TIME,
RATING and COMMENT, but this is not required. In addition to this,
each RECORD must contain at least one DISC.
- Except for DISC, all the other elements in RECORD may contain only
character data and not other elements.
- Each DISC must contain at least one TRACK.
- The TRACK elements can hold information on NAME and TIME
of the individual songs on the CD, but this is optional.
This is what a more complete, valid document for
this DTD will look like (to save some space I have only entered
the first song from each of the first three CDs):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (RECORD*)>
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?, COMMENT?, DISC+)>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT LABEL (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
<!ELEMENT COMMENT (#PCDATA)>
<!ELEMENT DISC (TRACK+)>
<!ELEMENT TRACK (NAME*,TIME?)>
<!ELEMENT NAME (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
<YEAR>1998</YEAR>
<LABEL>Columbia Records</LABEL>
<TIME>250 minutes</TIME>
<RATING>7/10</RATING>
<DISC>
<TRACK>
<NAME>Mary Queen of Arkansas</NAME>
<TIME>3:26</TIME>
</TRACK>
</DISC>
<DISC>
<TRACK>
<NAME>Restless nights</NAME>
<TIME>4:05</TIME>
</TRACK>
</DISC>
<DISC>
<TRACK>
<NAME>Cynthia</NAME>
<TIME>4:26</TIME>
</TRACK>
</DISC>
</RECORD>
</COLLECTION>
More about elements
Before we move on to entities and attributes in
the next chapter, we will have a look at a few more features concerning
the ordering of the elements in the element type declaration. In the
previous example, we have used commas to separate the different
child elements. This means that they have to appear in that particular
order. The three symbols we have used (?, * and +) indicate how many
times those particular elements may appear at that particular place in
the sequence. If you would like a bit more flexibility in your DTD
structure, the following points will help.
Choosing between elements - In some cases
it might be useful to allow the author of the document to choose
between elements at one particular point in the sequence. If we
have a look at the element type declaration for TRACK:
<!ELEMENT TRACK (NAME*, TIME?)>
this states that this element can contain any number
of NAME elements and an optional
TIME element. If we want to specify
that tracks should be described by either name or length, this is
done by inserting a vertical bar between the two elements in the
declaration:
<!ELEMENT TRACK (NAME | TIME)>
After this change in the element type declaration,
the TRACK element must contain exactly one of the two elements:
either a NAME or a TIME, but not both.
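A small sketch of what this choice permits is shown below. To keep it short, we assume DISC as the root element of a separate test document, which is not how the CD example is organised:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE DISC [
<!ELEMENT DISC (TRACK+)>
<!-- Each TRACK holds exactly one child: a NAME or a TIME -->
<!ELEMENT TRACK (NAME | TIME)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
]>
<DISC>
<TRACK>
<NAME>Cynthia</NAME>
</TRACK>
<TRACK>
<TIME>4:26</TIME>
</TRACK>
</DISC>
```

A TRACK containing both a NAME and a TIME, or neither of them, should be reported as invalid under this declaration.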
Grouping elements - Elements in an element
type declaration can be grouped together by using parentheses.
Parentheses combine elements so they appear as a single element at
some level. We can use the RECORD element as an example. We stated in
the DTD that this element can contain RATING and COMMENT. If we want
to specify that only one of these can be used for each record, this is
done by combining the use of the vertical bar described above and a
set of parentheses:
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, (RATING | COMMENT), DISC+)>
In this example, we have singled out RATING
and COMMENT and stated in the DTD that they are closely related to
each other, and are to be treated as a single element. Since the group
carries no question mark, exactly one of them is required: the author
will have to choose which one of them he or she will use to describe
the record.
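The sketch below shows two records that each pick one alternative from the group. To keep it brief, it assumes a reduced RECORD declaration without the optional elements and without DISC; the ratings and comments are made-up sample data:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (RECORD*)>
<!-- Reduced declaration for illustration: the group requires RATING or COMMENT -->
<!ELEMENT RECORD (ARTIST, TITLE, (RATING | COMMENT))>
<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT RATING (#PCDATA)>
<!ELEMENT COMMENT (#PCDATA)>
]>
<COLLECTION>
<RECORD>
<ARTIST>Bruce Springsteen</ARTIST>
<TITLE>Tracks</TITLE>
<RATING>7/10</RATING>
</RECORD>
<RECORD>
<ARTIST>Queen</ARTIST>
<TITLE>A Night at the Opera</TITLE>
<COMMENT>Sample comment</COMMENT>
</RECORD>
</COLLECTION>
```

If the choice itself should be optional, a question mark can be placed after the closing parenthesis of the group, as in (RATING | COMMENT)?.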
Mixed content - In the examples we have
used so far, the elements we have described contain either
other elements or character data, but not both. It is entirely
possible to allow both, and the following line demonstrates how:
<!ELEMENT TRACK (#PCDATA | NAME | TIME)*>
Elements declared like this are commonly referred to as mixed-content
tags. #PCDATA must come first in the group, and the group must end
with an asterisk. Even if this may seem like a good idea in some
cases, it severely restricts your control over the structure of the
document you are working with. Inside a mixed-content element you can
only specify the names of the child elements - not the number of times
they appear, the order in which they occur or whether or not they are
optional.
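As a sketch of what mixed content looks like in a document instance, the example below lets running text and child elements alternate freely. TRACK is assumed to be the root element here only to keep the example short. Note that the XML 1.0 grammar requires the trailing asterisk whenever child elements are listed together with #PCDATA:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE TRACK [
<!-- Text, NAME and TIME may appear in any order and any number of times -->
<!ELEMENT TRACK (#PCDATA | NAME | TIME)*>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT TIME (#PCDATA)>
]>
<TRACK>The first song is <NAME>Mary Queen of Arkansas</NAME>, which runs for <TIME>3:26</TIME>.</TRACK>
```

A TRACK with two NAME elements, or with no child elements at all, would be just as valid, which is the loss of control described above.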
Empty elements - For reasons that we will
get back to later on, it can be useful to declare some tags as empty.
Empty in this case means that they have no content. Since empty
tags hold no information, they are very easy to declare. This is
what the element type declaration for an empty tag looks like:
<!ELEMENT SONG EMPTY>
The keyword EMPTY is case-sensitive. Please note that valid XML
documents must declare both empty and non-empty tags, and that the
empty-tag syntax should be reserved for elements declared as EMPTY.
An element that merely happens to hold no text, or that is optional,
has not thereby been declared empty. To demonstrate what this means in
practice, we can substitute the line <LABEL>Columbia Records</LABEL>
with the empty tag <LABEL/> and see what happens. The document will
still be valid, but you will get a warning that this element is not
declared to be empty:
warning (1106): empty-tag syntax used for element not declared with EMPTY content model: LABEL
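For comparison, here is a sketch of a document in which the empty-tag syntax is used as intended. DISC as the root element and its content of SONG elements are assumptions made only for this illustration:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE DISC [
<!ELEMENT DISC (SONG+)>
<!-- SONG may not contain anything, not even white space -->
<!ELEMENT SONG EMPTY>
]>
<DISC>
<SONG/>
<SONG></SONG>
</DISC>
```

Both ways of writing the empty SONG element are valid, but the short form <SONG/> is the one normally recommended for elements declared as EMPTY.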
In the next chapter we will have a closer look
at Document Type Definitions, and how we can give a more detailed
description of the elements in a document.