Chapter 4
More on Document Type Definitions
Using Attributes in XML
So far, we have used elements to describe textual
content. Towards the end of chapter two I mentioned that attributes
are a central concept concerning the physical structure of a document.
Attributes are name-value pairs that occur within an element's start
tag. Attributes provide additional information for the application
that reads the XML document and are, as a general rule, not intended
for the human viewer. So, if they are not intended for the user
of the document, then why do we sometimes feel the need to use attributes
in XML elements?
The answer to this question is already given, in
part, in the preceding paragraph. Attributes are used when we want
to add additional information to a text, without changing the text
itself. This could be almost anything from a short ID-number to
a lengthier description. We have already seen, in the previous chapters,
that we can differentiate between year of production and year of
purchase in our CD collection by using attributes. When I say that
these attributes are not intended for the human user, this is not
entirely true. If you use attributes that make some sort of sense,
they can add a lot of extra information to the document for human
readers. But in most cases, you will not see the document - only
the result after it has been processed. This is where the main benefit
of attributes enters into the picture - they can make it easier
to process your XML document. As an example, you can tell a style
sheet to display the different years for each CD in different colours,
or display one and not the other. Since style sheets are something
that we will not get around to in this manual, let us focus on how
attributes work in XML.
How do attributes fit into a document?
First a quick recap from chapter two. Attributes
are name-value pairs that are separated by an equals sign (=). They
may occur inside start-tags or empty tags, but never inside end-tags.
This is an example of a start-tag with an attribute:
<YEAR function="release">
We say that the element YEAR
has a function attribute, which has the value "release".
This value provides additional, useful information about the content
of the YEAR tag. In order to use
an attribute in an XML document instance, we must declare it in
the DTD, just like we have already done with the elements. An attribute
declaration looks fairly similar to an element declaration. The
main difference between them is that where you use an <!ELEMENT>
tag for element type declarations in the DTD, you use an <!ATTLIST>
tag for attributes.
To illustrate this, we can go back to our CD Collection
again. In the previous chapter we created a DTD that allowed the
tag <YEAR> to be used in
the document. The element type declaration looked like this:
<!ELEMENT YEAR (#PCDATA)>
Let's say that we want to add an attribute that gives
a more accurate description of this element, like we have already
done in the examples above. In the previous chapter, we learned
that it is not possible to add elements without having declared
them in the document type definition first. This is, of course,
also true of attributes. If you add an attribute to an element without
declaring it, you will not be able to validate your document. To
add an attribute to our <YEAR>
element, we need to add this line in the DTD:
<!ATTLIST YEAR function CDATA "release">
As with element declarations, it does not matter where
in the DTD you put this line, but it may be a good idea to place
it directly after the element declaration for YEAR.
The reason for this is, of course, that it makes your DTD
more structured and easier for other people to understand.
Let's go through the attribute declaration above
and see how it works. The <!ATTLIST>
tag in itself does nothing more than tell the DTD that we are starting
to declare an attribute for an element in the DTD. If we go through
the example above from left to right, we can see how an attribute
declaration works in XML:
Directly after the <!ATTLIST>
tag, you have to specify the name of the element that is supposed
to include this attribute. In this example, so far, we have specified
that the element YEAR should have
an attribute of some sort. The next step would be to tell the DTD
the name of the attribute for YEAR.
This is done by typing the name of the attribute directly after
the name of the element (separated by a space though). In our little
example, we have stated that the element YEAR is allowed to contain
an attribute called 'function'.
The final steps towards creating a valid attribute
declaration deal with what type of data and which values are valid
in the specified attribute. After you have specified the name of
the attribute, you must decide what kind of attribute you have just
created. This will be one of ten possible types, where CDATA
is the most common one. CDATA means
that the attribute can contain text, and nothing else. We will get
back to a more in-depth description of the most common attribute
types later in this chapter, but here is a list of the different
types that can be used in an attribute declaration:
- CDATA
- Enumerated
- NMTOKEN
- NMTOKENS
- ID
- IDREF
- IDREFS
- ENTITY
- ENTITIES
- NOTATION
As I mentioned above, not all of these types are
commonly used, but the ones that are useful to us will be described
later. The last item in the attribute declaration deals
with the value the attribute takes on if it has not been specified
in the element. This is called the default value of the attribute.
In our example we have told the DTD that unless anything else is
specified, YEAR should mean the
year the CD was actually released on the market. If you paste the
example above into the document from chapter three, you should be
able to validate it without any problems - even if you haven't entered
any attributes yet.
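To see the default value at work, here is a minimal sketch of a document that uses the declaration above. The shortened RECORD content model and the value "purchase" are assumptions made for this illustration only, not part of the collection DTD from chapter three:

```xml
<?xml version="1.0"?>
<!DOCTYPE RECORD [
  <!ELEMENT RECORD (YEAR+)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ATTLIST YEAR function CDATA "release">
]>
<RECORD>
  <!-- No attribute written: the parser supplies function="release" -->
  <YEAR>1998</YEAR>
  <!-- An explicit value overrides the default -->
  <YEAR function="purchase">2001</YEAR>
</RECORD>
```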
How to work with Attributes.
As we have seen above, it is relatively easy to
add attributes to an existing DTD. It works in very much the same
way as element declarations, and like element declarations there
are a number of things you can do to modify the attributes. The
example we have used so far can, in many ways, be regarded as an
"average" attribute. It contains regular text and we have
one default value that can be used. Not all attributes will fit
neatly into this pattern, so over the next few pages we will have
a look at how attribute declarations can be changed to fit our needs.
One of the more immediate questions that come to
mind when we work with attributes is whether or not it is possible
to include several attributes inside the same element. The answer
to this is quite simple - yes it is possible. This is done in more
or less exactly the same way as with one attribute. To illustrate
this, let's use our standard example again. We have stated in our
DTD that each individual disc contains any number of tracks, and
that these tracks can be described according to name and time. This
was done to simplify our work in the previous chapters, but at this
stage we realise that more information can be added to the TRACK
element. We can for instance add which number it is on the CD or
the size (if we have a digital copy). Let's say that we would like
to add this information to the collection example, but as attributes
to the TRACK element - not as new
sub elements. This can be done in two different ways:
- Creating two new attribute declarations.
<!ELEMENT TRACK (NAME*,TIME?)>
<!ATTLIST TRACK number CDATA #IMPLIED>
<!ATTLIST TRACK size CDATA #IMPLIED>
- Including both attributes in the same declaration.
<!ELEMENT TRACK (NAME*,TIME?)>
<!ATTLIST TRACK number CDATA #IMPLIED
size CDATA #IMPLIED>
Both of these solutions are perfectly valid XML.
Which one of them you decide to use depends solely on personal taste,
in terms of what you feel gives the best document structure. As
far as the parser is concerned, however, these two are identical.
In some cases, depending on which parser you use, you may get a
warning when you try to validate the first option. (See the illustration
below). This is not because anything is wrong with your document,
but it is a reminder that you have two attribute declarations for
the same element. The reason why the parser returns this message
is simply that in XML you are allowed to declare the same attribute
more than once. If this happens, the first declaration takes precedence
over the other.
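As a small sketch of this rule, assume a simplified TRACK element whose 'number' attribute is declared twice. The content model and the two default values are made up for the illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE TRACK [
  <!ELEMENT TRACK (#PCDATA)>
  <!-- First declaration of 'number': this is the one that counts -->
  <!ATTLIST TRACK number CDATA "1">
  <!-- Second declaration of the same attribute: ignored, possibly with a warning -->
  <!ATTLIST TRACK number CDATA "99">
]>
<TRACK>Mary Queen of Arkansas</TRACK>
```

A parser that reads this document should treat the TRACK element as if it had been written with number="1".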
In the example above you may have noticed that
I have used the value #IMPLIED
where the default value of the attributes should have been. This
brings us to our next item on the agenda: how to deal with default
values. In the example where we went through the process of creating
an attribute declaration, we specified that "release"
was to be used as default value for the 'function' attribute. The
problem with this approach is that the author of a document may
not always have a particular value that can be used as default.
Instead of specifying a default value yourself, you can use certain
keywords to do one of three things:
- Require the author to specify a value (any value)
- Allow the value to be omitted
- Force the use of a given value
This is done by using one of these keywords instead
of the attribute value: #REQUIRED,
#IMPLIED or #FIXED.
[Illustration: Even if the parser returns a warning, this is still valid XML]
#REQUIRED:
This keyword is used in situations where the author of the DTD wants
to force the users to provide a value for a particular attribute.
If we go back to our first attribute example again, we can try to
replace the default value ('release')
with the #REQUIRED keyword. If
we make this change to our existing document and try to parse it
again, the program should return an error message like this:
line 26, http://129.177.24.81/xml/testing.xml:
error (1201): required attribute missing: YEAR (missing "function")
Why? Quite simply because we have taken away the
default value of the 'function' attribute and told the DTD that
this must be provided by the users for the document to be valid.
In our example there is only one YEAR element, so all we have to
do in this case is to add a function attribute to it. The content
of this attribute can be whatever you feel best describes the element,
as long as it is plain text. Add this inside the element tag and
validate it again: function="whatever".
Your document should be valid again.
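The corrected element could look like the following sketch. The RECORD content model is cut down to a single YEAR element here, which is an assumption made to keep the example short:

```xml
<?xml version="1.0"?>
<!DOCTYPE RECORD [
  <!ELEMENT RECORD (YEAR)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ATTLIST YEAR function CDATA #REQUIRED>
]>
<RECORD>
  <!-- <YEAR>1998</YEAR> would be invalid here: 'function' is missing -->
  <YEAR function="release">1998</YEAR>
</RECORD>
```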
#IMPLIED:
This keyword is used in situations where the author of the DTD wants
to provide the users with the possibility of adding an attribute
value, without forcing them to do so. This is what we did in the
second example, with the two new attributes of the TRACK
element. The 'number' and 'size'
attributes are allowed within the element, but the user is not required
to supply these attributes, nor is a specific default value given
by the author.
#FIXED:
The final keyword we are going to discuss is also the one that
is least used. This option is used when you want to provide a specific
default value and you don't want it to be changed by anyone. If
we look at our example, there are not really any of the elements
where we would want to use a fixed attribute. But for demonstration
purposes, let's say that we want to create an attribute in the RECORD
element that identifies the owner of each individual CD in the collection.
To do this, we add the following attribute declaration to the DTD:
<!ATTLIST RECORD owner CDATA "Vemund">
At this point we have allowed the RECORD
element to hold information on the owner of the CDs in the collection,
and stated that this is me by default. People will, however, be
able to change the value of the 'owner' attribute if they wish to.
To prevent this, we can insert the #FIXED
keyword in the attribute declaration, like this:
<!ATTLIST RECORD owner CDATA #FIXED "Vemund">
The DTD will now understand that "Vemund"
is the default value of the 'owner'
attribute, and that this can not be changed. If we add this to our
example and then try to give the 'owner' attribute a different value
than the one we have specified in the DTD, the parser will return
an error message when it tries to validate the document:
line 22, http://129.177.24.81/xml/testing.xml:
error (1200): attribute value doesn't match fixed default: owner="whoever" (default "Vemund")
This kind of solution may be useful in situations
where you would like to insert some sort of standard information
into each document you create, like a copyright line for example.
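A minimal sketch of how a fixed attribute behaves in practice is shown below. To keep it short, RECORD is assumed to contain plain text instead of the full content model used in our collection:

```xml
<?xml version="1.0"?>
<!DOCTYPE COLLECTION [
  <!ELEMENT COLLECTION (RECORD*)>
  <!ELEMENT RECORD (#PCDATA)>
  <!ATTLIST RECORD owner CDATA #FIXED "Vemund">
]>
<COLLECTION>
  <!-- Attribute omitted: the fixed value "Vemund" is supplied -->
  <RECORD>Tracks</RECORD>
  <!-- Attribute repeated with the same value: still valid -->
  <RECORD owner="Vemund">Tracks</RECORD>
  <!-- owner="whoever" would be rejected by a validating parser -->
</COLLECTION>
```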
Different types of attributes
So far, all the attributes we have used have been
of the type CDATA, which basically
means plain text. CDATA is the
most commonly used attribute type, but nine other attribute types
are allowed in XML. It goes without saying that some of these are
more commonly used than others, since each type is designed to serve
a particular purpose in XML. In the following overview I will therefore
give certain attribute types more weight than others, simply because
I feel that these are more important to our particular need in XML.
Furthermore, I will at this point only include the first six attribute types
from the list below in the discussion. The reason for this is that
most of the remaining types require some knowledge about
entities and entity references. This is something we will get back
to later, so we will wait a little while before we go into detail
about these attribute types.
Currently, these ten attribute types are allowed
in XML:
- CDATA
- Enumerated
- NMTOKEN
- NMTOKENS
- ID
- IDREF
- IDREFS
- ENTITY
- ENTITIES
- NOTATION
CDATA
As we have already seen, this is the most common
and general attribute type. An attribute of this type can contain
any string of text, save for a few rules and exceptions. A CDATA
attribute can not contain the following characters:
- Less than sign ( < )
- Ampersand ( & )
- Quotation mark ( " )
These signs can, however, be used if they are replaced
by their usual entity references ( &lt; , &amp; and &quot; ,
respectively). Double quotes can actually be used without resorting
to entity references, but then you will have to surround the attribute
value by single quotes. These two examples are interpreted in the
same way by the parser, and are both valid XML:
length="7&quot;" or length='7"'
If your attribute value contains both single and
double quotes, the kind that also surrounds the value must be
escaped by its entity reference ( &apos; or &quot; respectively).
Escaping both is always safe.
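The escaping rules can be sketched with a small stand-alone document. The empty TRACK element and the 'note' attribute are hypothetical and only serve this illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE TRACK [
  <!ELEMENT TRACK EMPTY>
  <!ATTLIST TRACK length CDATA #IMPLIED
                  note   CDATA #IMPLIED>
]>
<!-- Both values are delimited by double quotes, so the inner double quote
     is escaped; the ampersand and the less than sign are always escaped -->
<TRACK length="7&quot;" note="Rock &amp; roll, 3 &lt; 4"/>
```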
Enumerated
Unlike the other attribute types we will be discussing
here, 'Enumerated' is not used
as a keyword in XML. 'Enumerated' is used to provide a list of possible
attribute values in the DTD. These values must be written inside
parentheses and be separated by vertical bars. We can use the YEAR
element to illustrate this:
<!ATTLIST YEAR function (released | purchased) "released">
This looks fairly similar to what we have already
done. We have just replaced CDATA
with (released | purchased). The
effect of this is that the attribute 'function'
must contain one of the two predefined values, and that the value
is assumed to be 'released' unless
otherwise stated.
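Put into a complete document, the enumerated declaration could be used as in the sketch below. The shortened RECORD content model and the rejected value "borrowed" are assumptions for the sake of the example:

```xml
<?xml version="1.0"?>
<!DOCTYPE RECORD [
  <!ELEMENT RECORD (YEAR+)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ATTLIST YEAR function (released | purchased) "released">
]>
<RECORD>
  <!-- Defaults to function="released" -->
  <YEAR>1998</YEAR>
  <YEAR function="purchased">2001</YEAR>
  <!-- function="borrowed" would be invalid: it is not in the list -->
</RECORD>
```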
NMTOKEN
The NMTOKEN keyword
is used when you want the attribute value to be a name token, which is
a slightly relaxed form of an XML name. A valid XML name must conform to
the same rules as valid element names as they are described at the start
of chapter two. XML names must begin with an underscore ( _ ) or a letter.
Subsequent characters in the name may include: letters, digits, underscores,
hyphens and periods. XML names may never contain white space. A name token
is built from the same characters, but it may begin with any of them,
including a digit.
The NMTOKEN keyword
is used mainly when you want to manipulate your XML data with other
programming languages. Since the name restrictions in XML are more
or less the same as in Java and JavaScript, you can use NMTOKEN
to associate particular Java classes with XML elements. Since we
will not go into Java, or any other programming language, in this
manual this is not an attribute type that we will have to use either.
NMTOKENS
This is the plural form of NMTOKEN.
It is very rarely used, but it can be used to create an attribute
value from several name tokens, separated by white space.
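Although we will not need these two types in this manual, a short sketch may make the difference between them clearer. The attributes 'format' and 'genres' are hypothetical and do not belong to our collection DTD:

```xml
<?xml version="1.0"?>
<!DOCTYPE TRACK [
  <!ELEMENT TRACK (#PCDATA)>
  <!ATTLIST TRACK format NMTOKEN  #IMPLIED
                  genres NMTOKENS #IMPLIED>
]>
<!-- format holds one token; genres holds three tokens separated by spaces -->
<TRACK format="mp3" genres="rock folk live">Mary Queen of Arkansas</TRACK>
```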
ID
The ID type is
used to uniquely identify elements in an XML document instance.
An attribute value of this type must be a valid XML name, as described
above in the NMTOKEN section. Since
this type is a unique identifier, a particular value may not be
used as an ID attribute more than
once. Furthermore, each individual element may not contain more
than one attribute of the ID type.
For obvious reasons, the ID attribute
type can not be used together with #FIXED.
Fixed attributes must always have the same value, whereas ID
attributes must always have different values. To illustrate this
with an example, we can use our CD Collection again. In a CD archive,
each individual disc has a unique Id number. This Id number can,
among other things, be used to retrieve information about the CD
from databases on the Internet. We can therefore decide to include
this Id number in the DISC element.
Before we go ahead with the example, let me just specify that this
only works if you do not have two copies of the same CD in the collection.
Unless you are creating a list for a record store, this is probably
not the case, so let's go ahead with the example.
In the example we have been using so far, we have
entered information on the first track from each of the four CDs
in Bruce Springsteen's "Tracks" collection. This means
that we have four DISC elements
that can be given unique ID numbers.
Let's decide to call the attribute "discid"
and create an attribute declaration for it in the DTD. It would
look like this:
<!ATTLIST DISC discid ID #REQUIRED>
The ID type does
not necessarily have to be #REQUIRED,
it could be #IMPLIED (but never
#FIXED). Before this will work
we need to assign the Id numbers to each of the elements. This is
done by adding the following four lines inside the DISC
elements:
discid="a410410c"
discid="a50b910c"
discid="a50b911c"
discid="ad0bf70d"
It is important to notice that since the ID
type must conform to the rules regarding XML names, a digit cannot
be used to start an ID attribute value.
If the parser encounters an illegal ID
attribute value, it will return an error message:
line 30, http://129.177.24.81/xml/testing.xml:
error (1221): character in attribute value is illegal according to declaration: 4
Similarly, if a unique ID
value is encountered more than once in the same document, the parser
should return an error message:
line 43, http://129.177.24.81/xml/testing.xml:
error (661): duplicate value for ID attribute: discid
IDREF
As you probably have guessed, the IDREF
type is closely associated with the ID
type. The IDREF attribute effectively
allows an element elsewhere in the XML document instance to be the
value of an attribute. This means that the value of the IDREF
attribute must be identical to the ID
value of another element. If we elaborate a little bit on the example
we used above, this is how it works in practice:
We have already seen that each individual Compact
Disc in our collection has its own unique ID
number. If we want to make sure that the tracks on these CDs are linked
to this unique number, this can be done by using the IDREF
attribute type. We have already made sure that the TRACK
element can contain the attributes 'number'
and 'size', so let us allow another
attribute called 'parent', like
this:
<!ATTLIST TRACK number CDATA #IMPLIED
size CDATA #IMPLIED
parent IDREF #IMPLIED>
Now that we have allowed this attribute to be used
we can use this attribute to associate the individual tracks with
their parent element:
<DISC discid="a410410c">
<TRACK parent="a410410c">
<NAME>Mary Queen of Arkansas</NAME>
<TIME>3:26</TIME>
</TRACK>
</DISC>
The ID/IDREF combination
is normally used when you want to establish a parent - child relationship
between elements within a document. It is important to remember,
however, that XML parsers in these cases only check that the attributes
are 'grammatically' correct, and that each IDREF
value matches the ID of some element in the document.
They will not be able to check whether or not your IDREF value refers
to the correct ID. This means that
I could have written parent="ad0bf70d"
(the discid of the fourth disc) in the example above and the document
would still be valid. A value that matches no ID at all, such as
parent="abcdef", would be reported as an error.
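The sketch below illustrates the limits of this check. It assumes a cut-down DTD in which TRACK holds plain text, and the first track deliberately points at the wrong disc:

```xml
<?xml version="1.0"?>
<!DOCTYPE RECORD [
  <!ELEMENT RECORD (DISC+)>
  <!ELEMENT DISC (TRACK+)>
  <!ATTLIST DISC discid ID #REQUIRED>
  <!ELEMENT TRACK (#PCDATA)>
  <!ATTLIST TRACK parent IDREF #IMPLIED>
]>
<RECORD>
  <DISC discid="a410410c">
    <!-- Points at the second disc by mistake, yet the document is still valid -->
    <TRACK parent="a50b910c">Mary Queen of Arkansas</TRACK>
  </DISC>
  <DISC discid="a50b910c">
    <TRACK parent="a50b910c">Restless nights</TRACK>
  </DISC>
</RECORD>
```

Keeping the references meaningful is therefore up to the author of the document, or to the application that processes it.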
Before we start to explain the last four attribute
types, we need to have a closer look at how entities and entity
references are dealt with in XML.
Entities in XML
Towards the end of chapter two, we briefly discussed
the concept of entities in XML. In the next chapter, we go on to
a more in-depth look at how entities work, but now that we are more
familiar with XML, it might be a good idea to repeat some of the most
important points from chapter two.
As we stated in chapter two, an entity is an 'item'
that holds data, like a database record, an image file or a text
document. The purpose of an entity is, in other words, to hold content.
The example we have been using so far, is an entity that contains
textual content. Furthermore there are two types of entities: external
and internal. Internal entities are defined completely within
one document, whereas external entities get their content from another
source through a URL.
In addition to this, entities may be parsed
or unparsed. Parsed entities are made up of text that follows
the rules of XML. The content of unparsed entities, on the other
hand, is binary data or text that does not adhere to the XML rules.
The rest of this chapter will deal specifically with one type of
entity reference that was also covered to some extent in chapter two: General
Entity References. General entity references
are used to merge text into already existing documents. I compared
them to macros in word processing, and this is not very far off
the mark - they simply substitute one portion of text for another.
Like all other things in XML, general entity references must obey
certain rules:
- General Entity references must always begin with an ampersand
( & )
- General Entity references must always end with a semicolon (
; )
- General Entity references are case-sensitive
- General Entity references are composed of alphanumeric characters
- General Entity references must be declared in a DTD, unless
you are using one of the five pre-defined entity references listed
below.
These five entity references are pre-defined in
XML:
- Ampersand ( & ) - &amp;
- Less than sign ( < ) - &lt;
- Greater than sign ( > ) - &gt;
- Apostrophe ( ' ) - &apos;
- Quotation mark ( " ) - &quot;
To create your own entity references in a document,
you will need to declare them in the DTD with the <!ENTITY>
tag. We can illustrate this with an example from the collection
we have created. Let's assume that we have not only one, but all
of Bruce Springsteen's albums in our collection. Instead of writing
"Bruce Springsteen and the E Street Band" in all the appropriate
places, we decide to save ourselves some time by creating an entity
reference for this artist. First we have to declare the entity reference
in the DTD:
<!ENTITY bs "Bruce Springsteen and the E Street Band">
With this change in the DTD, we can now use the
entity reference &bs; in our
document. This example hides the real significance of entity references
to a certain degree. After all - Bruce Springsteen hasn't released
that many albums. Entity references are used when you have repeated
occurrences of one particular string of text. As an example we could
imagine a large archive, consisting of thousands of XML documents
that share the same DTD. These documents all have the e-mail of
the owner as a footer. If this person's e-mail changes for some reason,
it would be easier to change this once in the DTD, than to go through
each individual document and change it.
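As a minimal sketch of the &bs; entity in use, consider the document below. The two-element RECORD content model and the text of the COMMENT element are assumptions made for this illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE RECORD [
  <!ELEMENT RECORD (ARTIST, COMMENT?)>
  <!ELEMENT ARTIST (#PCDATA)>
  <!ELEMENT COMMENT (#PCDATA)>
  <!ENTITY bs "Bruce Springsteen and the E Street Band">
]>
<RECORD>
  <ARTIST>&bs;</ARTIST>
  <COMMENT>My favourite album by &bs;.</COMMENT>
</RECORD>
```

When the document is parsed, both occurrences of &bs; are replaced by the full artist name.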
Before we end this chapter, we will have a brief
look at a few special points regarding entity references. All of
these points deal with how you can use entity references inside
the DTD of a document. If we have a look at the example we used
for our general entity reference, we see that the 'and' in the artist
name might be replaced by an ampersand. As we have seen, the ampersand
is one of the five predefined entity references in XML. This means
that we don't have to create a new declaration for this entity reference,
but can use it directly with the one we have already created, like
this:
<!ENTITY bs "Bruce Springsteen &amp; the E Street Band">
You can also use your own entity references inside
the DTD, but they are subject to a couple of restrictions:
- Unlike elements and attributes, the position of the entity declaration
matters in XML. Entities must be declared before
they are referenced inside the DTD.
- General Entity references can only insert text that becomes part of
the document content, not text that is part of the DTD itself.
This last point means that you can not use general
entity references to replace XML keywords like #PCDATA
for example. So far, we have been discussing so-called general
entity references. These can only be used to merge text into
the document content, so if you want to use entity references to
replace parts of the DTD itself, you will have to use something
called parameter entity references. Parameter entity references
are very similar to General entity references, with these two very
important distinctions:
- Parameter entity references begin with a percent sign ( % )
- Parameter entity references can only occur inside the DTD
Apart from this, parameter entity references are
used and declared in the same way as general entity references.
To demonstrate this, let us use a parameter entity reference to
substitute all occurrences of the keyword #PCDATA
in our DTD. The only difference between this entity declaration
and the general one above is that we have to insert a percent sign
before the entity name, like this:
<!ENTITY % pc "(#PCDATA)">
If we insert this line into the DTD, and then replace
all (#PCDATA) with %pc;,
the document should still be valid XML. There is, however, a problem
with this use of parameter entities. According to the XML 1.0 specification,
parameter entity references are not allowed inside declarations in an
internal DTD. In other words, for this example to work, we need
to separate the DTD from the rest of the XML document. So, before
we proceed with entities, we will start the next chapter by exporting
our DTD. To make sure that we all do the same thing, this is the
file that failed to validate in our last attempt:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE COLLECTION [
<!ELEMENT COLLECTION (RECORD*)>
<!ELEMENT RECORD (ARTIST, TITLE, YEAR?, LABEL?, TIME?, RATING?, COMMENT?, DISC+)>
<!ATTLIST RECORD owner CDATA #FIXED "Vemund">
<!ENTITY % pc "(#PCDATA)">
<!ELEMENT ARTIST %pc;>
<!ELEMENT TITLE %pc;>
<!ELEMENT YEAR %pc;>
<!ATTLIST YEAR function CDATA #REQUIRED>
<!ELEMENT LABEL %pc;>
<!ELEMENT RATING %pc;>
<!ELEMENT COMMENT %pc;>
<!ELEMENT DISC (TRACK+)>
<!ATTLIST DISC discid ID #REQUIRED>
<!ELEMENT TRACK (NAME*,TIME?)>
<!ATTLIST TRACK number CDATA #IMPLIED
size CDATA #IMPLIED
parent IDREF #IMPLIED>
<!ELEMENT NAME %pc;>
<!ELEMENT TIME %pc;>
<!ENTITY bs "Bruce Springsteen &amp; the E Street Band">
]>
<COLLECTION>
<RECORD>
<ARTIST>&bs;</ARTIST>
<TITLE>Tracks</TITLE>
<YEAR function="release">1998</YEAR>
<LABEL>Columbia Records</LABEL>
<TIME>250 minutes</TIME>
<RATING>8/10</RATING>
<DISC discid="a410410c">
<TRACK parent="a410410c">
<NAME>Mary Queen of Arkansas</NAME>
<TIME>3:26</TIME>
</TRACK>
</DISC>
<DISC discid="a50b910c">
<TRACK parent="a50b910c">
<NAME>Restless nights</NAME>
<TIME>4:05</TIME>
</TRACK>
</DISC>
<DISC discid="a50b911c">
<TRACK parent="a50b911c">
<NAME>Cynthia</NAME>
<TIME>4:26</TIME>
</TRACK>
</DISC>
<DISC discid="ad0bf70d">
<TRACK parent="ad0bf70d">
<NAME>Leavin' train</NAME>
<TIME>3:46</TIME>
</TRACK>
</DISC>
</RECORD>
</COLLECTION>
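For comparison, here is a rough sketch of how the first few declarations might look once they are moved into a separate DTD file, where %pc; can be used inside the element declarations. The file name collection.dtd and the shortened RECORD content model are assumptions made for this illustration:

```xml
<!-- collection.dtd: a hypothetical external DTD file -->
<!ENTITY % pc "(#PCDATA)">
<!ELEMENT COLLECTION (RECORD*)>
<!ELEMENT RECORD (ARTIST, TITLE)>
<!ELEMENT ARTIST %pc;>
<!ELEMENT TITLE %pc;>
```

The document itself would then refer to this file with a line such as <!DOCTYPE COLLECTION SYSTEM "collection.dtd"> instead of carrying the declarations inside the square brackets.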