Chapter 2
XML Basics
Before we start going into detail about how to
write XML, we need to take a look at some of the tools that will
be required.
XML Editors
When it comes to creating the actual XML files,
in other words the marked-up textual information, there are basically
three different types of editors:
- Plain text editors
- Specialised XML editors
- Standard word processing software with XML support
All of these types have their pros and cons. Plain
text editors are the simplest solution and have a number of
major benefits. Text editors used for programming are widely available
and many of them are freeware. Most operating systems have a simple
text editor included, like the Notepad in Windows. This means that
you will be able to get started in XML without having to spend time
to get editing software. However, the simplicity of text editors
has a few drawbacks. First and foremost this means that you will
have to add all the XML markup by hand. This can be a major drawback
if you work with long text files that need substantial amounts of
markup. Some text editors have SGML add-ons that will ease this
task significantly. If you are new to XML, it would probably be
a good idea to opt for this solution simply because adding markup
by hand is the best way to learn how XML works.
XML Spy : Specialised XML Editor
The second solution, specialised XML authoring software,
is very well suited for adding XML markup, since the functions
of XML are built into the editor in an easy-to-use manner. The major
drawback with this kind of authoring tool is that inexperienced users
may find it difficult to get an overview of all the
functions that are built into it. The final solution, that of standard
word processing software with built-in XML support, lies somewhere
between the two other options. It's easy to use, and you have a
relatively clear idea of what you are doing. The drawback is, in
principle, that word processing programs are relatively expensive.
Since most people have some sort of word processing package on their
machines, the more practical side of this problem is that this software
may not be completely up to date, and therefore not able to support
XML.
XML Parsers
Your first question would probably be: What's a
parser? A parser is a program that checks if XML documents are
valid and well-formed. We will get back to the concepts of validity
and well-formedness later in this chapter, but all you need to know
for now is that this basically means whether or not your document
is 'legal' XML. All software packages that support XML should have
a built-in parser in the form of an XML processor. This means that
if you use a text editor that does not have XML support, you will
need to get a standalone XML parser. Fortunately these parsers are
widely available and usually free of charge. The most widely used
SGML/XML parser is nsgmls, which is a part of the SP suite written
by James Clark. It is, however, not the most user-friendly one around
since it runs from an MS-DOS command line. There are also a number
of Java-based parsers available that might be worth looking into,
but bear in mind that these are usually a bit slower than system-specific
parsers.
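As a rough illustration of what a standalone parser does, here is a minimal sketch using the parser that ships with Python's standard library (xml.etree.ElementTree). Python is not one of the tools used for this manual, and the helper name is_well_formed is made up for the example. Note that this particular parser is non-validating: it checks well-formedness only and never compares the document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<YEAR>1976</YEAR>"))  # matching start and end tag
print(is_well_formed("<YEAR>1976<YEAR>"))   # end tag is missing its slash
```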
DTD and XSL Editors
As I mentioned in the previous chapter, all valid XML
documents must conform to a set of rules laid down in a special file called
the Document Type Definition (DTD). Creating a DTD for a certain type
of document is a very time-consuming and complex task. As with
the XML files themselves, you do not need anything other than a text
editor to do this, but software has been developed that deals
specifically with the creation and documentation of DTDs.
XSL, the style language for XML and the third major
component on the way to an XML page, is a language that has been under
constant revision up until now. Since nobody has
been absolutely sure of how XSL would finally be implemented, there
are not many editors that deal specifically
with this style language. That is not to say they don't exist at
all; they are just not as common as XML or DTD editors. Again, the
best solution for inexperienced XML programmers is to use
a standard text editor for XSL authoring as well.
If you are interested in more information on XML-related
software this is one of the best sites around: http://www.xmlsoftware.com/
For the examples used in this manual I have used
the following software:
- Editor : Text Pad 4.0 - 32 Bit (© Helios Software Solutions)
- Editor : XML Spy 2.5 (© Icon Information-Systems)
- Parser : http://mama.stg.brown.edu/service/xmlvalid/
- Browser : Microsoft Internet Explorer 5 (© Microsoft Corp.)
The XML Structure
We have already discussed the general principles
behind generalised markup, and why this is a good idea. This chapter
will go into more detail about the structure of XML and what is
necessary to create simple XML instances. As I mentioned previously,
one of the primary benefits of XML is that it adds structure to
documents. The structure of any XML document instance is divided
into two parts: a logical structure and a physical structure.
Logical Structure : Logical structure
is the part of an XML document instance that describes how the document
is built. In other words, how the different parts of the document
are related to each other. This logical structure is divided into
two separate parts:
- The Prolog
- The Document Element
The Prolog is the first structural element
in the XML document, and is usually divided into two basic components:
an XML declaration and a document type declaration. An XML declaration
is a line that identifies which version of the XML specification
you are using in your document (at the moment 1.0 is your only option).
A simple XML declaration looks like this:
<?xml version="1.0"?>
Note that XML is case-sensitive, and the XML declaration
must be in lowercase. In addition to the specification version,
the declaration can contain two other items of information. A standalone
declaration allows the author to specify whether or not the document
depends on external markup declarations. This option must be set to 'no' if
you intend to use external document type definitions, but for simple
XML document instances it should be set to 'yes'. The final item
of information in the XML declaration is the encoding declaration.
This only needs to be used if you, as an author, use a character
encoding other than UTF-8 or UTF-16, which are the defaults in XML.
US-ASCII is a subset of UTF-8, so plain ASCII files do not need an
encoding declaration either. Characters outside ASCII, such as the accented
letters used in Western European languages, must be saved in the encoding
that the declaration names. A more complete XML declaration looks
like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
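To see the three items of the declaration from a program's point of view, here is a small sketch using Python's standard xml.dom.minidom module. This is an illustration only: Python is not among the tools used in this manual, the empty COLLECTION element is added just to make the snippet a complete document, and the attribute names version, encoding and standalone are simply how minidom happens to expose the declaration.

```python
from xml.dom import minidom

# The declaration from the text, followed by an empty document element
# that is added only so that the snippet is a complete document.
source = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><COLLECTION/>'

doc = minidom.parseString(source)

# minidom stores the three items of the XML declaration on the document object.
declaration = {
    "version": doc.version,
    "encoding": doc.encoding,
    "standalone": doc.standalone,
}
print(declaration)
```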
The other part of the prolog, the document type
declaration, is used to specify which DTD is used for the document
in question, or which document class it belongs to. DTDs and document
classes will be discussed later, so for the time being we will not
concern ourselves with this part.
The Document Element comes directly after
the prolog, and contains all the data in your XML document. This
is similar to a root directory on a hard drive - it contains all
of the data on that drive, but the data is divided into any number
of folders and subfolders. If we use the example from chapter 1,
<collection> would be the Document Element for this particular
file. This top-level element can have any number of other elements,
or entire documents, nested inside. An element nested inside another
is commonly referred to as a child element. The element that holds
this child element is then, sensibly enough, known as the parent
element.
To illustrate how the structure of XML works, we
can try to convert the CD collection example into a properly structured
XML file. Before we do that, here are a few fundamental points about
the linguistic rules of XML:
- An element in XML must contain a start tag and a matching end
tag prefixed by a slash. As an example: <YEAR>1976</YEAR>
- If an element happens to be empty, you can use an empty-element
tag like <YEAR/> instead of writing both tags with nothing
in between. You can not just drop the end tag.
- XML is case-sensitive, so <YEAR> and <year> are different elements.
It is a good idea to pick one convention and stick to it; the examples
in this manual use only uppercase letters inside the tags.
- Element names must begin with an underscore ( _ ) or a letter.
Subsequent characters in the element name may include letters, digits,
underscores, hyphens and periods. White space is not allowed
in element names, so it is common to replace spaces by underscores
(e.g. CD_COLLECTION instead of CD COLLECTION).
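The naming rule in the last point can be written down as a pattern. The sketch below is a simplified, ASCII-only version in Python (not one of the tools used in this manual); real XML also accepts many non-ASCII letters, and the names NAME_PATTERN and is_legal_name are invented for the illustration.

```python
import re

# Simplified, ASCII-only version of the naming rule described above:
# first character a letter or underscore, then letters, digits,
# underscores, hyphens or periods. Real XML also accepts many
# non-ASCII letters, which this sketch ignores.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_legal_name(name):
    """Return True if name follows the simplified element-name rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["CD_COLLECTION", "CD COLLECTION", "_year", "2nd_CD"]:
    print(candidate, is_legal_name(candidate))
```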
In the example below, I have converted the CD collection
to XML. To describe the structure of the document, I have placed
line numbers at the beginning of each line with an explanation of
the individual lines below the example.
1.  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
2.  <COLLECTION>
3.    <CD>
4.      <ARTIST>Queen</ARTIST>
5.      <TITLE>A Night at the Opera</TITLE>
6.      <YEAR>1975</YEAR>
7.    </CD>
8.    <CD>
9.      <ARTIST>Queen</ARTIST>
10.     <TITLE>A Day at the Races</TITLE>
11.     <YEAR>1975</YEAR>
12.   </CD>
13. </COLLECTION>
Here is the explanation:
- Line 1: This is the XML declaration of the document. It states that
the document conforms to the XML 1.0 specification and that it
does not require information from other sources. The encoding
information is not required unless you use an encoding scheme
other than UTF-8 or UTF-16, which are the defaults (US-ASCII is a
subset of UTF-8). I included it
here anyway, just for the sake of completeness.
- Line 2: This is the document element of the XML instance. Obviously,
the document element does not have to be named 'document', but
can be anything that the author feels gives the best description
of the content. The document element can contain any number of
sub-elements, but it can not be repeated.
- Lines 3-12: Inside the document element you can put as many other elements
as you like, but the relationships between the different elements
are not determined at random. In the example above, the collection
may contain any number of CDs where each individual CD has information
about artist, title and year. The description of each individual
CD in the collection must be completed before information on the
next CD can be entered. The reasons for this become a lot clearer
when we go into detail about how a DTD works in the next chapter.
- Line 13: After the last piece of information is entered, the document
element end tag closes the XML instance.
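To make the parent/child structure of the example a little more tangible, the following sketch reads the collection with Python's standard xml.etree.ElementTree module and walks from the document element down to each CD. Python is used here purely for illustration and is not one of the tools used in this manual; the XML is the example above without the line numbers.

```python
import xml.etree.ElementTree as ET

collection_xml = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Night at the Opera</TITLE>
    <YEAR>1975</YEAR>
  </CD>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Day at the Races</TITLE>
    <YEAR>1975</YEAR>
  </CD>
</COLLECTION>
"""

root = ET.fromstring(collection_xml.encode("utf-8"))

# The document element is the parent; each CD is one of its child elements.
cds = [
    (cd.findtext("ARTIST"), cd.findtext("TITLE"), cd.findtext("YEAR"))
    for cd in root.findall("CD")
]

print(root.tag)
for artist, title, year in cds:
    print(artist, "-", title, "-", year)
```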
How do we check the logical structure ?
As I have mentioned before, one of the main benefits
of XML is that it adds structure to documents. This is of course
very nice, but how do we check the correctness of our XML files?
The answer to this is to run the document you have created through
an XML parser. The parser checks the document for two things: validity
and well-formedness.
Valid Documents
The primary difference between valid and well-formed
documents is their relationship to a DTD. The DTD is a set of rules
that a document follows, and, among other things, explicitly states
which elements may be contained within each other and what kind
of data the various elements can contain. For an XML instance to
be declared valid, the parser checks that the document follows
the structure laid down in the DTD. The main advantage of
checking for validity is that authors must create their documents
against a predefined structure and benefit from a clear document
model.
Well-formed Documents
A well-formed XML document must obey the syntactical
rules of XML, outlined in the example with the CD collection, but
it does not have to be checked against a DTD. This means that, as
long as elements are properly nested within each other, XML
authors are able to create new elements as the need arises.
This flexibility allows authors greater control over document processing
and design than in traditional SGML environments, where the structure
had to be defined in a DTD before documents could be written. On
our level, it means that we can check if the example we have used
is correctly structured without having to write a DTD. To do so,
copy the example into the text field of an online XML parser, such as
the one listed under the software used for this manual,
and then hit the 'parse' button. If you have copied
the text without mistakes, you should get a message that there are
no errors in your document. Now, try to remove the slash from one
of the end tags and parse it again. This should return an error
message and a reference to the incorrect line. Before we start to
learn about DTDs, we will have to look briefly at a few concepts
regarding the physical structure of the XML language.
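The experiment of removing a slash can also be tried without an online form. The sketch below uses Python's standard xml.etree.ElementTree parser, which is an assumption of mine and not the parser used for this manual; the shortened collection is only there to keep the line numbers easy to follow.

```python
import xml.etree.ElementTree as ET

# The slash has been removed from the YEAR end tag on line 3.
broken = """<COLLECTION>
  <CD>
    <YEAR>1975<YEAR>
  </CD>
</COLLECTION>"""

try:
    ET.fromstring(broken)
    error_line = None
except ET.ParseError as err:
    # err.position is a (line, column) tuple reported by the parser.
    error_line, _column = err.position
    print("not well-formed:", err)

print("error reported on line", error_line)
```

Note that this parser reports line 4 rather than line 3: the mistake only becomes apparent when the end tag of the CD element arrives while a YEAR element is still open.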
Physical Structure : Since logical
structure is about the organisation of the document, it makes sense
that physical structure is about the actual content of your XML
instance. The content of an XML file is contained in chunks of information
called entities. Generally, as in the example we have used
so far, this content is text, but it may also be binary data like
an image file. Entities have names by which they can be identified,
and they must be declared in the prolog (usually in a DTD). They
are then referenced later somewhere inside the document element
(the root element). On the most basic level, the root element of
the XML instance is an entity because it is the outermost element
of the document, and it contains all of the other elements. Because
all well-formed XML documents must contain a root element, they
also have at least one entity. Since entities are for content -
not structure, the prolog and the DTD are not regarded as entities
as they are part of the XML structure.
There are two kinds of entities: internal and
external. Internal entities are defined completely within
the document itself. This means that the entire content of the entity
is found within the main document and that it is declared in the
prolog. Internal entities are always text. External entities are,
as opposed to internal entities, not located within the main document,
but draw their content from an external file or source. The main
document only contains a reference to the external file through
a URL, or a link. An image file is a typical example of an external
entity.
Both external and internal entities can, in principle, be either
parsed or unparsed. The content of parsed entities is text that
follows the XML rules. Unparsed entities are binary data or text
that does not follow the XML rules, but follows a format declared
in the DTD by the author. This gives us four possible combinations
of entities:
- Internal, parsed entity - An internal entity made up of parsable
text.
- Internal, unparsed entity - An internal entity made up of unparsable
text. (The XML 1.0 specification does not actually provide for this
combination: internal entities are always parsed.)
- External, parsed entity - An external entity reference that
points to parsable text. (Once parsed, the text becomes part of
the document.)
- External, unparsed entity - An entity reference that points
to a binary file or unparsable text.
So, the example we have been working with so far
contains only internal parsed entities. Since we will stay with
this kind of example for a while, the distinction between external and internal
entities will not be of major importance until later on. Just keep
the basic concept in the back of your mind for the time being; we
will get back to entities later.
Another basic concept concerning the physical structure
of an XML document is that of attributes. Attributes provide
a method of associating values with the elements of an XML document
instance without making those values part of the content of the
individual elements. To use our familiar example again: the different
tags in this example have been given names that reflect the content
in a suitable way, but they may not be as clear as we would like
them to be. If we look at the 'YEAR' element for instance, this
does not really say anything about what this year reflects. It would
most likely be the production year of the CD, but it could also
be the year it was purchased. Attributes are simply a way of specifying
this kind of information for elements in an XML document. This is
done by adding a name/value pair inside the element start-tag like
this:
<YEAR function="release"> or <YEAR function="purchase">
If you substitute one of the original lines in
the test example with either one of these, you should be able to
parse the document without any errors. Each attribute consists of
an attribute name and an attribute value. Attribute names are
strings that are placed in front of the equals sign and follow the
same rules as the element names described above. That is, they must start
with a letter or an underscore, and can not include white space.
Letters, digits, underscores, hyphens and periods are legal characters. Like
element names, attribute names are case-sensitive. It is often
a good idea to distinguish elements from attributes by using capital
letters for element names and lowercase for attribute names. The
attribute value is whatever comes after the equals sign and must
be contained in quotation marks (single or double). Unlike attribute
names, there are very few limitations on the content of an attribute
value. It may contain almost all kinds of characters and white
space (provided that they are available in the encoding scheme
you are using). The important exceptions are the < and & characters,
which must be written as entity references, and quotation marks.
It is common practice to surround the attribute value by double
quotes. If the value itself contains double quotes, single quotes
can be used to surround the attribute value instead. Thus
<SIZE length='12"'>
would be a valid attribute, whereas
<SIZE length="12" ">
wouldn't.
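The following sketch shows how a program sees attributes, and confirms the two quoting cases above. It uses Python's standard xml.etree.ElementTree module, which is not one of the tools used in this manual, and it writes the SIZE examples as empty-element tags so that each snippet is a complete document.

```python
import xml.etree.ElementTree as ET

# The attribute is read separately from the element content.
year = ET.fromstring('<YEAR function="release">1975</YEAR>')
print(year.get("function"), year.text)

# Single quotes around the value let it contain a double quote.
size = ET.fromstring("<SIZE length='12\"'/>")
print(size.get("length"))

# Double quotes around a value that itself contains a double quote fail.
try:
    ET.fromstring('<SIZE length="12" "/>')
    bad_quotes_rejected = False
except ET.ParseError:
    bad_quotes_rejected = True
print(bad_quotes_rejected)
```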
The text itself
So far, we have looked at the two most important
aspects of the physical description in XML: attributes and entities.
Before we move on to talk about the Document type definition in
the next chapter, we need to have a look at a few special things
regarding the textual content of an XML document. The text that
goes between two element tags basically consists of letters,
digits, punctuation marks and so on. XML uses the Unicode character
set and therefore understands the characters of English, the
other Western European languages and most other writing systems. XML
processors can also read documents stored in other encodings, but then
the encoding has to be specified
in the XML declaration in the prolog. Unicode is what we call the default character set
for XML. The text in an XML document is commonly referred to as
character data, and everything else is then named markup.
Markup includes all tags, processing instructions, DTDs and so
on.
Entity References
When we said that you can use any sign in the
English language in your character data, this is not entirely true.
The ampersand (&), single and double quotes ( ' and " )
and angle brackets ( < and > ) have a special meaning in XML,
because these five characters are reserved
for markup. Strictly speaking, only < and & can never appear literally
in character data, and quotes only cause trouble inside attribute
values, but it is good practice to treat all five the same way. If you wish to use these characters
inside a section of character data, you should replace them
by so-called entity references. Entity references are used in XML
instead of specific characters that would otherwise be interpreted
as part of the markup. The entity reference is a combination of
characters written between an ampersand and a semicolon. Thus, the
entity reference for the < that starts a tag is: &lt; The following five
entity references are predefined in XML:
- &amp; = &
- &lt; = <
- &gt; = >
- &quot; = "
- &apos; = '
The five predefined entity references refer to internal parsed
entities: each one stands for a small chunk of text that every XML
processor already knows. Upon parsing, the entity reference is replaced
by that text. This means that when the document
is parsed, every occurrence of the entity reference &amp; inside
a piece of character data is replaced by the ampersand sign ( &
). To see that this really works, try to add the following record
to the collection example:
Tom Petty &amp; the Heartbreakers - Echo - 1999
When you parse the document, the output should
display an & instead of &amp;
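The same experiment can be sketched in a few lines of Python, using the standard xml.etree.ElementTree parser and the escape helper from xml.sax.saxutils. Python is my choice for the illustration and is not one of the tools used in this manual.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# In the source text the ampersand is written as the reference &amp;
artist = ET.fromstring("<ARTIST>Tom Petty &amp; the Heartbreakers</ARTIST>")
print(artist.text)

# A bare ampersand in character data is rejected by the parser.
try:
    ET.fromstring("<ARTIST>Tom Petty & the Heartbreakers</ARTIST>")
    bare_ampersand_ok = True
except ET.ParseError:
    bare_ampersand_ok = False
print(bare_ampersand_ok)

# escape() goes the other way and replaces &, < and > with references.
escaped = escape("Tom Petty & the Heartbreakers")
print(escaped)
```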
When we begin to work with DTDs we will have a
closer look at how we can make our own, custom-made entity references.
For the time being, we can say that entity references can be used
in XML in much the same way as macros are used in word processing programs.
As a very simple example, if you have repeated
occurrences of a term like "Java Runtime Environment"
in a text, it would make sense to replace it by an entity reference
like &JRE; to save yourself some writing. In order for the XML
processor to understand this, this kind of entity reference has
to be declared in the prolog of the XML instance, most typically
in the DTD.
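As a small preview, the sketch below declares the JRE entity in an internal DTD subset in the prolog and lets Python's standard xml.etree.ElementTree parser expand it. The element name MANUAL and the sentence are invented for the illustration, and Python is not one of the tools used in this manual.

```python
import xml.etree.ElementTree as ET

# The entity JRE is declared in an internal DTD subset in the prolog.
# The element name MANUAL is invented for this illustration.
source = """<!DOCTYPE MANUAL [
  <!ENTITY JRE "Java Runtime Environment">
]>
<MANUAL>Install the &JRE; before you start.</MANUAL>"""

manual = ET.fromstring(source)

# The parser has replaced the reference by the declared text.
print(manual.text)
```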
CDATA
As we have explained above, almost everything that
is not markup is considered to be character data
in XML. Also, occurrences of & and < (and, to be on the safe side, >, " and ')
must be replaced by their respective entity references for the XML
document to be regarded as well-formed. So, what if you have a large
number of these special characters in the text? This could happen
if you write an online tutorial in XML, for instance. To avoid going
insane by writing too many entity references into the character
data, XML offers something called CDATA sections.
CDATA sections are portions of text where the XML parser
does not try to interpret the text - all text is pure character
data. An occurrence of &amp; inside a CDATA section will simply
be read as the five characters &amp; and a bare & as a plain &. CDATA sections begin with
<![CDATA[ and end with ]]>. The only text that is not allowed
within a CDATA section is the closing delimiter ]]>.
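A short sketch shows the effect. It uses Python's standard xml.etree.ElementTree parser, which is not one of the tools used in this manual, and the element name NOTE is made up for the example.

```python
import xml.etree.ElementTree as ET

# NOTE is a made-up element name for this illustration. Inside the CDATA
# section, <, > and & are plain characters, and &amp; is not expanded.
note = ET.fromstring("<NOTE><![CDATA[if (a < b && b > c) print '&amp;']]></NOTE>")
print(note.text)
```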
Comments
If you are familiar with HTML, you are probably
familiar with the concept of comments already. A comment is a special
piece of markup that starts with <!-- and ends with -->. All data
written between these two delimiters is ignored by the XML processor.
Comments are usually used to make small notes inside the XML document
or to comment out entire sections of XML code, like this:
<!-- Hm, I must remember to enter the data on my new CD right here -->
<!--
<CD>
<ARTIST>The Beatles</ARTIST>
<TITLE>Rubber Soul</TITLE>
<YEAR type="release">1965</YEAR>
<YEAR type="purchase">1996</YEAR>
</CD>
-->
In the first example, I just made a note to myself
inside the XML document. In the other example, I commented out an
entire CD from the collection. When the XML processor parses the
document, it will simply ignore the part about this CD. It is important
to note that comments can be used to surround and hide tags, but
they can never be used inside the tags themselves. This is not legal
use of comments: <YEAR <!-- don't really know -->>.
Since comments effectively "delete" sections of text,
it is important to remember that the remaining text must still be
a well-formed XML document. A comment ends at the first --> it
contains, and two hyphens in a row ( -- ) may not appear anywhere else inside it,
just as the closing ]]> can not be included in a CDATA section.
Because of this limitation, you can not nest comments inside comments
or CDATA sections inside CDATA sections.
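The sketch below shows that a commented-out CD really disappears for the program reading the document, and that a comment inside a tag is rejected. It uses Python's standard xml.etree.ElementTree parser (not one of the tools used in this manual), which drops comments by default, and a shortened collection made up for the illustration.

```python
import xml.etree.ElementTree as ET

source = """<COLLECTION>
  <!-- Hm, I must remember to enter the data on my new CD right here -->
  <CD>
    <ARTIST>Queen</ARTIST>
  </CD>
  <!--
  <CD>
    <ARTIST>The Beatles</ARTIST>
  </CD>
  -->
</COLLECTION>"""

root = ET.fromstring(source)

# Only the CD outside the comment is visible to the program.
artists = [cd.findtext("ARTIST") for cd in root.findall("CD")]
print(artists)

# A comment placed inside a tag is not well-formed.
try:
    ET.fromstring("<YEAR <!-- don't really know -->>1975</YEAR>")
    comment_in_tag_ok = True
except ET.ParseError:
    comment_in_tag_ok = False
print(comment_in_tag_ok)
```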
Summary
Throughout this chapter I have tried to give a brief
outline of how a document must be structured in order to be
well-formed XML. If the document does not follow both the logical
and the physical structure, all attempts to process it will fail.
Here is a quick recap of the most important rules:
- Start with an XML Declaration.
- Match start tags and end tags.
- End empty tags with />.
- One Element must completely contain all other elements (the
root element).
- Tags may nest, but never overlap.
- Attribute values must be in quotes (single or double).
- Always use < to start tags.
- Always use & to start entities.
- Use only &amp; &lt; &gt; &apos; and &quot;
as entity references (all others have to be declared in a DTD).