Ground Zero - My test site

Chapter 3

Document Type Definitions


Writing valid XML

In the previous chapter, we went through the basic rules of XML, and how to create a well-formed XML document instance. In this chapter we will go a little bit further by creating the structural rules for the same document. If you followed the example in Chapter two, you ended up with a document very similar to this one:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<COLLECTION>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Night at the Opera</TITLE>
    <YEAR function="release">1975</YEAR>
  </CD>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Day at the Races</TITLE>
    <YEAR function="release">1976</YEAR>
    <YEAR function="purchase">1993</YEAR>
  </CD>
  <CD>
    <ARTIST>Tom Petty &amp; the Heartbreakers</ARTIST>
    <TITLE>Echo</TITLE>
    <YEAR function="release">1999</YEAR>
  </CD>
</COLLECTION>

If it didn't look exactly like this one, it does not matter very much, since we will start more or less from scratch again. This is not because I don't like the structure we have so far, or just to be mean. It is simply because we have to go through the process of creating the DTD step by step. So, just remember what the well-formed document looks like, and we will create one just like it that is not only well-formed, but valid.

Before we discuss this in more detail, there are a few points I would like to make clear. For practical purposes, it will be very useful if the examples you create as you go along reside on a server that is accessible on the World Wide Web. This is not strictly necessary, but it will make things a lot easier when we start to create document type definitions that are separate from the XML document instance. The reason is that the XML validator I use to illustrate the examples (the Scholarly Technology Group XML Validation Form, at the Web address I specified in the previous chapter: http://mama.stg.brown.edu/service/xmlvalid/) only validates documents that are accessible through a Web address.

If you do not have access to a web server, you may consider setting one up on your local machine (assuming you have an internet connection, that is). This is not as complicated as many people think. All you need to do is download a server program like Apache (freeware) or WebSite (shareware) and install it. You will then be able to access your files over the Internet by using your machine's IP address as the URL. As an example, my machine has the IP address 129.177.24.81, and my test documents are available at http://129.177.24.81/xml/. Even if you do have access to a more specialised Web server, this may be worth doing while you create the files, so that you do not have to work against a remote server whenever you make changes to a document.

If we look at the file we have already created through a web-browser, this is what it should look like:


[Figure: screendump of the XML document. Caption: Our well-formed document instance on the WWW]

Please note that at the time of this writing, only Microsoft Internet Explorer 5 supports the XML standard. If you try to view this page in any other browser, you will either get an error message or be prompted to download the file.



About the Document Type Definition (DTD)

A document type definition provides a list of the elements, tags, attributes and entity references contained in an XML document and describes their relationships to each other. We say that the DTD specifies a set of rules for the structure of a document. DTDs can be included in the file that contains the document they describe, or they can be separated into a file of their own. If you choose to separate the DTD into a new file, this is known as an external DTD. In order to use an external DTD, a link inside the prolog of the XML document must point to the address where this DTD is located, either locally on your own machine or globally via a URL (Uniform Resource Locator, commonly known as a Web address). The primary benefit of external DTDs is that they can be shared by several documents of the same type. This allows people who work within the same field of research to agree upon standards of encoding. If everyone follows these standards, information exchange and processing can be made a lot easier. To use an example concerning the description of archival material: the Society of American Archivists in the USA has published a DTD called EAD (Encoded Archival Description) that deals specifically with the encoding of all sorts of archival material. This is a very general, yet functional description of how to logically structure the documentation of any archival unit. By using an encoding standard like this, archivists are able to enter data and immediately share them with others. We will get back to the use of external DTDs later, but for the time being we will keep things a bit simpler by writing the DTD into the document instance we have already created.


Where do we place the DTD ?

The DTD is placed inside the prolog of the document, directly after the XML declaration, but before the actual document data begins. The DTD consists of a number of markup declarations, dealing with particular elements, entities and attributes. For the time being we will concentrate on element declarations. If we start all over again with our CD example, this is what the outermost structure would look like:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

  <COLLECTION>
    This is the outermost element of our example
  </COLLECTION>

This contains only the XML declaration and the root element. The example below shows the same document with a simple DTD which declares that this document can contain an element named "COLLECTION", and that this element may contain text.


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (#PCDATA)>
]>
  <COLLECTION>
    This is the outermost element of our example
  </COLLECTION>

If you paste this into the text field at the STG Validator site, you should get the message that: "Document Validates OK". We have just made a very simple, valid XML document.


A DTD written inside the document always starts with <!DOCTYPE and always ends with ]>. This tells the XML processor where the DTD starts and ends. Directly after <!DOCTYPE comes the name of the root element, in this case COLLECTION, followed by a [ . This is not optional: a valid document must always have the root element specified like this inside the DTD. Between the two square brackets come all of the element and attribute declarations, including one for the root element. In our example, the root element is the only one present, and it may contain parsed character data (#PCDATA). An XML validator reads through the document and reports the errors it finds. If it doesn't find any errors, the document can be passed on to an application that understands XML (in our case the Microsoft browser, Internet Explorer 5).
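
For comparison, here is a minimal sketch of the external form mentioned earlier, in which the declarations live in a separate file. The file name collection.dtd is an assumption made for illustration, and standalone is set to "no" because the document now depends on an external file:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE COLLECTION SYSTEM "collection.dtd">

<COLLECTION>
  This is the outermost element of our example
</COLLECTION>

The hypothetical file collection.dtd would then contain only the declaration itself:


<!ELEMENT COLLECTION (#PCDATA)>

In this form there are no square brackets; the DOCTYPE line simply ends with >.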


Preparing to make a DTD

Before you start to actually create a DTD, it is probably a good idea to think through the structure of the entities you are going to describe in your document. In other words: how can you make a sensible structure for your data? If we think in terms of our CD collection again, how can we make this work? On the most basic level, a collection consists of many CDs. Our <COLLECTION> element will then have to contain any number of <CD> elements. So far, so good. The next question would be: what kind of information does each CD contain? Well, you have the name of the artist and the name of the album as the two most important ones, but also total playing time, production year and the label. Maybe you wish to add some personal information as well, like the year of purchase or a rating of how well you like it. This means that we must allow for all of the following tags to be used inside the <CD> element:

<ARTIST>, <TITLE>, <LABEL>, <YEAR> (of purchase and release), <TIME>, <COMMENT> and <RATING>.

So far we have looked at the general description of the CDs in the collection. But wait a minute! What about records that contain more than one CD, like for instance an opera? Maybe we should rename the <CD> element to <RECORD> so that the element better describes the content. In that case we can allow a new sub element called <DISC> that allows description of each individual CD in terms of individual track names, track length etc. To get an overview of the structure, we can try to make a well-formed document out of the information:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
  <RECORD>
    <ARTIST/>
    <TITLE/>
    <TIME/>
    <LABEL/>
    <YEAR/>
    <RATING/>
    <COMMENT/>
      <DISC>
        <TRACK>
        <NAME/>
        <TIME/>
        </TRACK>
      </DISC>
      <DISC>
        <TRACK>
        <NAME/>
        <TIME/>
        </TRACK>
      </DISC>
  </RECORD>
</COLLECTION>

The above example is a record that contains two CDs with one track on each CD. All of the elements that would normally contain information have been left empty. This is done because, for the time being, we only need to check if this kind of structure is allowed. If we run it through a parser, it tells us that this is indeed a well-formed document. Let's decide that this is a structure that we can live with, and not complicate things any further. As you may have noticed, the element <TIME> is used in two different places in the structure. This is not illegal in XML, but it must be specified in the DTD.



Creating the DTD

A DTD is designed to specify exactly what is and what isn't allowed in an XML document instance. This isn't as complicated as it may seem. It simply means that you have to specify all elements, attributes and entity references, as well as their relationships to each other. Even if it is not very difficult to do this, it is relatively time consuming and you have to be very accurate. This is why it is a good idea to think through the structure of the document before you start creating the DTD. As a final point before we start creating the DTD: remember that DTDs are very conservative. This means that everything that you have not explicitly permitted is forbidden, and interpreted as a mistake by the XML processor.

When you are building a DTD, it is usually easiest to work from the outermost element (the root) and work your way hierarchically down the structure of the document. This allows you to build the DTD and the content of the document together, and check for validity along the way. The first thing to do is to specify the root tag. As we have already seen, a DTD always begins with


    <!DOCTYPE rootname [

    and ends with

    ]>

This does, however, only specify the actual name of the root tag, not what it is allowed to contain. We must therefore create a so-called element type declaration for the root element. An element type declaration specifies the name of the tag, which children are allowed inside that tag and whether or not the tag is empty (tags are nonempty by default in XML, which means they have to contain 'something' unless otherwise specified). Every single tag used in an XML document must be declared once (and only once) in the DTD. An element type declaration for our root element looks like this:


<!ELEMENT COLLECTION ANY>

An element type declaration always starts with <!ELEMENT (like the document type declaration, this is case sensitive) and it ends with >. It basically includes two things: the name of the element that is being declared, in this case COLLECTION, and the allowed contents of that tag. The ANY keyword that we have used in this example means that all declared elements and parsed character data are allowed inside the tag. This is our example complete with DTD so far:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION ANY>
]>

<COLLECTION>
     This is where the other elements will appear
</COLLECTION>


Adding Elements

With the root element in place, it is time to add the other elements in the document. Let's start with the ones we referred to as 'general information' above. We stated that every CD contains information on artist, title, playing time, label, year and other optional information and comments. The rules for these elements have to be added in element type declarations similar to the one for <COLLECTION>, like this:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION ANY>
  <!ELEMENT RECORD ANY>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT LABEL (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ELEMENT RATING (#PCDATA)>
]>

<COLLECTION>
  <RECORD>
    <ARTIST>Bruce Springsteen</ARTIST>
    <TITLE>Tracks</TITLE>
    <YEAR>1998</YEAR>
    <LABEL>Columbia Records</LABEL>
    <TIME>250 minutes</TIME>
    <RATING>7/10</RATING>
  </RECORD>
</COLLECTION>

Since the <RECORD> element is the one that contains all of the other elements, its declaration uses the ANY keyword just like the root element of our document. For all of the other new elements, however, we have used the #PCDATA keyword. This means that these elements may contain only parsed character data, or simply put: text. We have done this because these elements are only supposed to contain textual information about our CD, and not other sub elements. This means that we cannot nest an element inside one of the other declared elements, like this for example:


<LABEL>Columbia Records
  <YEAR>1998</YEAR>
</LABEL>

Even if this is well-formed according to the rules of XML, the DTD we have just created does not allow us to do it. The element type declaration for <LABEL> specifically states that it can contain #PCDATA and nothing else.

In the example above, we said that the <COLLECTION> element could contain any kind of child elements and character data. This is very useful if you work with data that has a very loose structure, but in our case we know that the root element <COLLECTION> will only contain elements of the type <RECORD>. To gain better control over the structure of the document, it is a good idea to specify this in the DTD. This is done by replacing ANY with (RECORD) in the element type declaration for <COLLECTION>. All rules except the keywords ANY and EMPTY (which we will meet later) must be written inside parentheses, otherwise you will not be able to validate the document. If an element can take more than one subelement, all of the allowed subelements must be listed inside the element type declaration. A stricter rule for the <RECORD> element will look like this:


<!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING)>

If we make the suggested changes to these two elements the document will still be valid, but we will have reduced the flexibility of the document structure.
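
Put together, and purely as a sketch of the two changes just described applied to the Springsteen example above, the complete document would look like this:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (RECORD)>
  <!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING)>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT LABEL (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ELEMENT RATING (#PCDATA)>
]>

<COLLECTION>
  <RECORD>
    <ARTIST>Bruce Springsteen</ARTIST>
    <TITLE>Tracks</TITLE>
    <YEAR>1998</YEAR>
    <LABEL>Columbia Records</LABEL>
    <TIME>250 minutes</TIME>
    <RATING>7/10</RATING>
  </RECORD>
</COLLECTION>

The document instance itself is unchanged; only the two declarations for COLLECTION and RECORD are stricter than before.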



Customising Elements

Before we continue with our example, we need to have a look at a few features and problems regarding the rules in element type declarations. Some of them are relevant in the above example and some will become relevant when we continue with the rest of the CD information.


Optional Elements.

So, what if you have elements that do not contain any information? Say for example that you haven't decided on a rating yet. Let's remove the element from the document and see what happens when we try to validate it. You will get this error message from the validator:


error (1154): content ends prematurely for element: RECORD (expecting: RATING)

This tells us that the document is invalid because it lacks an element that is specified in the DTD. In other words, the elements we have specified as legal subelements in the element type declaration for <RECORD> are not only allowed, but also required. To avoid this kind of problem, we need to specify in the DTD that some elements are optional. This is done by adding a question mark to the element name in the parent element's element type declaration. For our example, this means that we can make this change to <RECORD>, which is the parent element of <RATING>:


<!ELEMENT RECORD (ARTIST, TITLE, YEAR, LABEL, TIME, RATING?)>

Once this change has been made, the document is valid again, even without the <RATING> element present. While we are at it we can make some of the other elements, like YEAR, LABEL and TIME optional also. This means that only ARTIST and TITLE will remain required information in our example.
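
As a sketch of where this leads, with YEAR, LABEL and TIME made optional as well, the declaration could look like this:


<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?)>

The shortest record this declaration accepts contains only the two required elements:


<RECORD>
  <ARTIST>Bruce Springsteen</ARTIST>
  <TITLE>Tracks</TITLE>
</RECORD>

The question mark only affects whether an element appears. The elements that do appear must still follow the order given in the declaration.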


Designating zero or more children.

Now that we have dealt with optional elements, let's turn towards another common problem. Since most CD collections consist of more than one CD, we will have to add several <RECORD> elements inside the top-level <COLLECTION>. Let's see what happens if we add a new CD to the collection at this stage. Remember, since we have made some of the elements optional, we can save some time by adding only the artist and the title, so try to add these lines to our document:


<RECORD>
  <ARTIST>Queen</ARTIST>
  <TITLE>A Night at the Opera</TITLE>
</RECORD>

This should give us the following error message:


error (1152): element violates enclosing tag's content model: RECORD (expecting: [nothing])

So, what does this mean? It tells us that the new RECORD element does not fit the content model of the enclosing COLLECTION element: since we declared COLLECTION as (RECORD), it expects exactly one RECORD and nothing after it. We can change this in a similar way to the example with optional elements, but instead of a question mark, we add an asterisk:


<!ELEMENT COLLECTION (RECORD*)>

With this little change in place, the XML processor should accept the document.
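
For reference, here is a sketch of the complete document at this stage, assuming YEAR, LABEL and TIME were made optional as suggested in the previous section:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (RECORD*)>
  <!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?)>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT LABEL (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ELEMENT RATING (#PCDATA)>
]>

<COLLECTION>
  <RECORD>
    <ARTIST>Bruce Springsteen</ARTIST>
    <TITLE>Tracks</TITLE>
    <YEAR>1998</YEAR>
    <LABEL>Columbia Records</LABEL>
    <TIME>250 minutes</TIME>
    <RATING>7/10</RATING>
  </RECORD>
  <RECORD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Night at the Opera</TITLE>
  </RECORD>
</COLLECTION>

Because the asterisk means zero or more, a COLLECTION with no RECORD elements at all would also be valid under this declaration.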


Several elements of the same type.

So far we have dealt with elements that are optional or elements that are allowed to occur zero or more times within the same parent element. The final basic rule concerning the occurrences of child elements deals with elements that are required to appear at least once within a parent element. If we use our familiar CD example again, you may remember that we said that the different titles in our collection may be contained on more than one CD, like an opera for example. Each individual item in the collection must then contain at least one element of the type DISC. Each disc is divided into several tracks that contain information on song names and track length. The length and names of the individual tracks can be made optional, but each CD must contain at least one track. If we want to specify this in the DTD, this is done by adding a plus sign to the element in the element type declaration of the parent. A more complete DTD for our example, with the individual songs, would look like this:


<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (RECORD*)>
  <!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?, COMMENT?, DISC+)>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT LABEL (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ELEMENT RATING (#PCDATA)>
  <!ELEMENT COMMENT (#PCDATA)>
  <!ELEMENT DISC (TRACK+)>
  <!ELEMENT TRACK (NAME*, TIME?)>
  <!ELEMENT NAME (#PCDATA)>
]>

If we go through this step by step, this is how we can interpret the DTD:

  1. We are dealing with a DTD of the type COLLECTION. This means that the root element must be called <COLLECTION>.
  2. The element type declaration for the root element states that it may contain any number of RECORD elements, and nothing else.
  3. Each RECORD must contain information on exactly one ARTIST and one TITLE. Furthermore, it may contain information on YEAR, LABEL, TIME, RATING and COMMENT, but this is not required. In addition to this, each RECORD must contain at least one DISC.
  4. Except for DISC, all the child elements of RECORD may contain only character data, not other elements.
  5. Each DISC must contain at least one TRACK.
  6. The TRACK elements can hold information on NAME and TIME of the individual songs on the CD, but this is optional.

This is what a more complete, valid document for this DTD will look like (to save some space I have only entered the first song from each of the first three CDs):


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (RECORD*)>
  <!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?, COMMENT?, DISC+)>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT LABEL (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ELEMENT RATING (#PCDATA)>
  <!ELEMENT COMMENT (#PCDATA)>
  <!ELEMENT DISC (TRACK+)>
  <!ELEMENT TRACK (NAME*,TIME?)>
  <!ELEMENT NAME (#PCDATA)>
]>

<COLLECTION>
  <RECORD>
    <ARTIST>Bruce Springsteen</ARTIST>
    <TITLE>Tracks</TITLE>
    <YEAR>1998</YEAR>
    <LABEL>Columbia Records</LABEL>
    <TIME>250 minutes</TIME>
    <RATING>7/10</RATING>
    <DISC>
      <TRACK>
        <NAME>Mary Queen of Arkansas</NAME>
        <TIME>3:26</TIME>
      </TRACK>
    </DISC>
    <DISC>
      <TRACK>
        <NAME>Restless nights</NAME>
        <TIME>4:05</TIME>
      </TRACK>
    </DISC>
    <DISC>
      <TRACK>
        <NAME>Cynthia</NAME>
        <TIME>4:26</TIME>
      </TRACK>
    </DISC>
  </RECORD>
</COLLECTION>


More about elements

Before we move on to entities and attributes in the next chapter, we will have a look at a few more functions concerning the ordering of the elements in the element type declaration. In the previous example, we used commas to separate the different child elements. This means that they have to appear in that particular order. The three symbols we have used (?, * and +) indicate how many times those particular elements may appear at that particular place in the sequence. If you would like a bit more flexibility in your DTD structure, here are a few points that will help.


Choosing between elements - In some cases it might be useful to allow the author of the document to choose between elements at one particular point in the sequence. If we have a look at the element type declaration for TRACK:


<!ELEMENT TRACK (NAME*, TIME?)>

this states that this element can contain any number of NAME elements and an optional TIME element. If we want to specify that tracks should be described by either name or length, this is done by inserting a vertical bar between the two elements in the declaration:


<!ELEMENT TRACK (NAME | TIME)>

After this change in the element type declaration, the TRACK element must contain exactly one of the two elements: either a NAME or a TIME, but not both.
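
As an illustration of the choice, here are three TRACK fragments checked against the declaration above (the track data is borrowed from the earlier example):


<!-- Valid: a track described by name -->
<TRACK>
  <NAME>Cynthia</NAME>
</TRACK>

<!-- Valid: a track described by length -->
<TRACK>
  <TIME>4:26</TIME>
</TRACK>

<!-- Not valid: both elements are present -->
<TRACK>
  <NAME>Cynthia</NAME>
  <TIME>4:26</TIME>
</TRACK>

An empty TRACK would not be valid either, since the choice requires one of the two elements to be present.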


Grouping elements - Elements in an element type declaration can be grouped together by using parentheses. Parentheses combine elements so they appear as a single element at some level. We can use the RECORD element as an example. We stated in the DTD that this element can contain RATING and COMMENT. If we want to specify that only one of these can be used for each record, this is done by combining the use of the vertical bar described above and a set of parentheses:


<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, (RATING | COMMENT), DISC+)>

In this example, we have singled out RATING and COMMENT and told the DTD that they are closely related to each other, and are to be treated as a single element. The author will then have to choose which one of them he or she will use to describe the record.
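
Note that the group as written is required: every RECORD must contain either a RATING or a COMMENT. As a minimal sketch of a variant that keeps the choice but makes the whole group optional, a question mark can be placed after the closing parenthesis:


<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, (RATING | COMMENT)?, DISC+)>

With this variant, a record may carry a RATING, a COMMENT, or neither, but never both.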


Mixed content - In the examples we have used so far, the elements we have described are required to contain other elements or character data, but not both. It is entirely possible to allow both in the same element, and the following line demonstrates how:


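<!ELEMENT TRACK (#PCDATA | NAME | TIME)*>

Elements declared this way are commonly referred to as mixed-content elements. The #PCDATA keyword must come first in the list, and the asterisk after the closing parenthesis is required whenever child elements are listed together with #PCDATA. Even if this may seem like a good idea in some cases, it severely restricts the control you have over the structure of the document you are working with. Inside a mixed-content element you can only specify the names of the child elements, not the number of times they appear, the order in which they occur or whether or not they are optional.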
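
To make this concrete, here is a small sketch of a mixed-content declaration (including the required trailing asterisk) and an element that satisfies it. The wording of the track description is invented for illustration:


<!ELEMENT TRACK (#PCDATA | NAME | TIME)*>

<TRACK>
  Opening song: <NAME>Mary Queen of Arkansas</NAME>, running time <TIME>3:26</TIME>.
</TRACK>

The same declaration would equally accept a TRACK with only text, with several NAME elements, or with TIME before NAME, which is exactly the loss of control described above.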


Empty elements - For reasons that we will get back to later on, it can be useful to declare some tags as empty. Empty in this case means that they have no content. Since empty tags hold no information, they are very easy to declare. This is what the element type declaration for an empty tag looks like:


<!ELEMENT SONG EMPTY>

The keyword EMPTY is case-sensitive. Please note that valid XML documents must declare both empty and non-empty tags. This also means that you should not use the empty-tag syntax for an element that has not been declared EMPTY, even if it has been declared in the DTD as, for example, an optional element. To demonstrate what this means in practice, we can replace the line <LABEL>Columbia Records</LABEL> with the empty tag <LABEL/> and see what happens. The document will still be valid, but you will get a warning that this element has not been declared as empty:


warning (1106): empty-tag syntax used for element not declared with EMPTY content model: LABEL
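
By contrast, here is a sketch of how an element that has been declared EMPTY may appear in the document instance. SONG is only the illustrative name used in the declaration above; it is not part of our CD example:


<!ELEMENT SONG EMPTY>

<!-- Valid: empty-tag syntax -->
<SONG/>

<!-- Also valid: a start tag immediately followed by its end tag -->
<SONG></SONG>

<!-- Not valid: the element has content -->
<SONG>Cynthia</SONG>

Even white space between the start tag and the end tag counts as content here, so the two tags must be written directly after each other.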

In the next chapter we will have a closer look at Document Type Definitions, and how we can give a more detailed description of the elements in a document.




