Ground Zero - My test site

Chapter 2

XML Basics


Before we start going into detail about how to write XML, we need to take a look at some of the tools that will be required.


XML Editors

When it comes to creating the actual XML files, in other words the marked-up textual information, there are basically three different types of editors:

  • Plain text editors
  • Specialised XML editors
  • Standard word processing software with XML support

All of these types have their pros and cons. Plain text editors are the simplest solution and have a number of major benefits. Text editors used for programming are widely available and many of them are freeware. Most operating systems include a simple text editor, such as Notepad in Windows. This means that you will be able to get started in XML without having to spend time obtaining editing software. However, the simplicity of text editors has a few drawbacks. First and foremost, you will have to add all the XML markup by hand. This can be a major drawback if you work with long text files that need substantial amounts of markup. Some text editors have SGML add-ons that will ease this task significantly. If you are new to XML, it would probably be a good idea to opt for this solution, simply because adding markup by hand is the best way to learn how XML works.


[Figure: screenshot of the XML Spy interface. Caption: XML Spy, a specialised XML editor]

The other solution, specialised XML authoring software, is very well suited for adding XML markup, since all of the functions of XML are built into the editor in an easy to use manner. The major drawback with this kind of authoring tool is that it may be difficult and confusing to inexperienced users to get an overview of all the functions that are built into it. The final solution, that of standard word processing software with built-in XML support, lies somewhere between the two other options. It's easy to use, and you have a relatively clear idea of what you are doing. The drawback is, in principle, that word processing programs are relatively expensive. Since most people have some sort of word processing package on their machines, the more practical side of this problem is that this software may not be completely up to date, and therefore not able to support XML.


XML Parsers

Your first question would probably be: What's a parser? A parser is a program that reads XML documents and checks whether they are well-formed and, in the case of a validating parser, whether they are valid. We will get back to the concepts of validity and well-formedness later in this chapter, but all you need to know for now is that this basically means whether or not your document is 'legal' XML. All software packages that support XML should have a built-in parser in the form of an XML processor. This means that if you use a text editor that does not have XML support, you will need to get a standalone XML parser. Fortunately these parsers are widely available and usually free of charge. The most widely used SGML/XML parser is nsgmls, which is a part of the SP suite written by James Clark. It is, however, not the most user-friendly one around, since it runs from an MS-DOS command line. There are also a number of Java-based parsers available that might be worth looking into, but bear in mind that these are usually a bit slower than system-specific parsers.


DTD and XSL Editors

As I mentioned in the previous chapter, every valid XML document must conform to a set of rules laid down in a special file called the Document Type Definition (DTD). Creating a DTD for a certain type of document is a very time-consuming and complex task. As with the XML files themselves, you do not need anything other than a text editor to do this, but software has been developed that deals specifically with the creation and documentation of DTDs.

XSL, the style language for XML and the third major part on the way to an XML page, is a language that has been under constant revision up until now. Since nobody has been absolutely sure of how XSL would finally be implemented, there are not many editors that deal specifically with this style language. That is not to say they don't exist at all; they are just not as common as XML or DTD editors. Again, the best solution for inexperienced XML programmers is to use a standard text editor for XSL authoring as well.

If you are interested in more information on XML-related software this is one of the best sites around: http://www.xmlsoftware.com/

For the examples used in this manual I have used the following software:

  • Editor : TextPad 4.0 - 32 Bit (© Helios Software Solutions)
  • Editor : XML Spy 2.5 (© Icon Information-Systems)
  • Parser : http://mama.stg.brown.edu/service/xmlvalid/
  • Browser : Microsoft Internet Explorer 5 (© Microsoft Corp.)


The XML Structure

We have already discussed the general principles behind generalised markup, and why this is a good idea. This chapter will go into more detail about the structure of XML and what is necessary to create simple XML instances. As I mentioned previously, one of the primary benefits of XML is that it adds structure to documents. The structure of any XML document instance is divided into two parts: a logical structure and a physical structure.

Logical Structure : Logical structure is the part of an XML document instance that describes how the document is built. In other words, how the different parts of the document are related to each other. This logical structure is divided into two separate parts:

  • The Prolog
  • The Document Element

The Prolog is the first structural element in the XML document, and is usually divided into two basic components: an XML declaration and a document type declaration. An XML declaration is a line that identifies which version of the XML specification you are using in your document (at the moment 1.0 is your only option). A simple XML declaration looks like this:


  <?xml version="1.0"?>


Note that XML is case-sensitive, and the XML declaration must be in lowercase. In addition to the specification version, the declaration can contain two other items of information: an encoding declaration and a standalone declaration, in that order. The encoding declaration states which character encoding the file is stored in. Every XML processor must be able to read UTF-8 and UTF-16, and a document without an encoding declaration is assumed to be in one of these two (US-ASCII is a subset of UTF-8, so plain ASCII files need no declaration either). You only need the encoding declaration if you, as an author, use some other encoding, such as ISO-8859-1. The standalone declaration allows the author to specify whether or not the document depends on external markup declarations. It must be set to 'no' if you intend to use external document type definitions, but for simple XML document instances it can be set to 'yes'. A more complete XML declaration looks like this:


  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
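
To see the three items of the XML declaration from the parser's point of view, here is a minimal sketch. It assumes Python and its standard xml.parsers.expat module, neither of which is among the tools used in this manual, and the function name read_xml_declaration is made up for the illustration.

```python
import xml.parsers.expat


def read_xml_declaration(document: bytes):
    """Return (version, encoding, standalone) as reported by the parser."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # standalone is 1 for "yes", 0 for "no" and -1 if the item is omitted;
        # encoding is None if there is no encoding declaration.
        found["declaration"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return found.get("declaration")


# Hypothetical one-line documents, used only to exercise the function.
full = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><COLLECTION/>'
short = b'<?xml version="1.0"?><COLLECTION/>'

print(read_xml_declaration(full))
print(read_xml_declaration(short))
```

The short form reports no encoding and no standalone item, which is why the parser falls back on its defaults for both.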


The other part of the prolog, the document type declaration, is used to specify which DTD is used for the document in question, or which document class it belongs to. DTDs and document classes will be discussed later, so for the time being we will not concern ourselves with this part.

The Document Element comes directly after the prolog, and contains all the data in your XML document. This is similar to a root directory on a hard drive - it contains all of the data on that drive, but the data is divided into any number of folders and subfolders. If we use the example from chapter 1, <collection> would be the Document Element for this particular file. This top-level element can have any number of other elements, or entire documents, nested inside. An element nested inside another is commonly referred to as a child element. The element that holds this child element is then, sensibly enough, known as the parent element.

To illustrate how the structure of XML works, we can try to convert the CD collection example into a properly structured XML file. Before we do that, here are a few fundamental points about the linguistic rules of XML:

  • An element in XML must contain a start tag and a matching end tag prefixed by a slash. As an example: <YEAR>1976</YEAR>
  • If an element happens to be empty, you can use an empty-element tag like <YEAR/> instead of writing both tags with nothing in between. You can not just drop the end tag.
  • XML is case-sensitive, so <YEAR>, <Year> and <year> are three different elements. It is therefore a good idea to pick one convention, for example only uppercase letters inside the tags, and stick to it.
  • Element names must begin with an underscore ( _ ) or a letter. Subsequent characters in the element name may include letters, digits, underscores, hyphens and periods. White space is not allowed in element names, so it is common to replace it by underscores (ex: CD_COLLECTION instead of CD COLLECTION).
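
The naming rule in the last point can be written down as a pattern. The sketch below is a simplified illustration in Python (an assumption of mine, not a tool from this manual): it only accepts unaccented ASCII letters, whereas the XML specification allows a much wider range of letters, and the names NAME_PATTERN and is_simple_xml_name are hypothetical.

```python
import re

# Simplified, ASCII-only version of the naming rule described above:
# first a letter or an underscore, then letters, digits, underscores,
# hyphens or periods.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_simple_xml_name(name: str) -> bool:
    """Return True if name follows the simplified naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["CD_COLLECTION", "CD COLLECTION", "_year", "2CD"]:
    print(candidate, is_simple_xml_name(candidate))
```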

In the example below, I have converted the CD collection to XML. To describe the structure of the document, I have placed line numbers at the beginning of each line with an explanation of the individual lines below the example.

  1. <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  2. <COLLECTION>
  3.   <CD>
  4.     <ARTIST>Queen</ARTIST>
  5.     <TITLE>A Night at the Opera</TITLE>
  6.     <YEAR>1975</YEAR>
  7.   </CD>
  8.   <CD>
  9.     <ARTIST>Queen</ARTIST>
  10.     <TITLE>A Day at the Races</TITLE>
  11.     <YEAR>1976</YEAR>
  12.   </CD>
  13. </COLLECTION>

Here is the explanation:

  • Line 1: This is the XML declaration of the document. It states that the document conforms to the XML 1.0 specification and that it does not require information from other sources. The encoding information is not required when the file is stored as UTF-8 (or as plain US-ASCII, which is a subset of UTF-8), since that is what an XML processor assumes by default. I included it here anyway, just for the sake of completeness.
  • Line 2: This is the document element of the XML instance. Obviously, the document element does not have to be named 'document', but can be anything that the author feels gives the best description of the content. The document element can contain any number of sub-elements, but there can only be one document element in each XML instance.
  • Lines 3-12: Inside the document element you can put as many other elements as you like, but the relationships between the different elements are not determined at random. In the example above, the collection may contain any number of CDs, where each individual CD has information about artist, title and year. The description of each individual CD in the collection must be completed before information on the next CD can be entered. The reasons for this become a lot clearer when we go into detail about how a DTD works in the next chapter.
  • Line 13: After the last piece of information is entered, the document element end tag closes the XML instance.
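
The parent and child relationships in the example can also be made visible by letting a program walk through the structure. This is only a sketch: it assumes Python and its standard xml.etree.ElementTree module, which are not among the tools used in this manual, and the name describe_collection is made up for the illustration. The line numbers from the listing are left out, since they are not part of the XML.

```python
import xml.etree.ElementTree as ET

COLLECTION_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<COLLECTION>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Night at the Opera</TITLE>
    <YEAR>1975</YEAR>
  </CD>
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Day at the Races</TITLE>
    <YEAR>1976</YEAR>
  </CD>
</COLLECTION>
"""


def describe_collection(xml_bytes):
    """Return the document element's name and one tuple per CD child."""
    root = ET.fromstring(xml_bytes)  # the document element
    records = []
    for cd in root:  # every child element of the document element
        records.append(
            (cd.findtext("ARTIST"), cd.findtext("TITLE"), cd.findtext("YEAR"))
        )
    return root.tag, records


root_tag, records = describe_collection(COLLECTION_XML)
print(root_tag)
for record in records:
    print(record)
```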


How do we check the logical structure?

As I have mentioned before, one of the main benefits of XML is that it adds structure to documents. This is of course very nice, but how do we check the correctness of our XML files? The answer to this is to run the document you have created through an XML parser. The parser checks the document for two things: validity and well-formedness.


Valid Documents

The primary difference between valid and well-formed documents is their relationship to a DTD. The DTD is a set of rules that a document follows, and, among other things, explicitly states which elements may be contained within each other and what kind of data the various elements can contain. For an XML instance to be declared valid, the parser checks if the document is described according to the structure in the DTD. The main advantage of having to check for validity is that authors must create their documents against a predefined structure and benefit from a clear document model.


Well-formed Documents

A well-formed XML document must obey the syntactical rules of XML, outlined in the example with the CD collection, but it does not have to be checked against a DTD. This means that, as long as elements are properly structured within each other, XML authors will be able to create elements in response to their development. This flexibility allows authors greater control over document processing and design than in traditional SGML environments, where the structure had to be defined in a DTD before documents could be written. On our level, it means that we can check if the example we have used is correctly structured without having to write a DTD. To do so, copy the example (without the line numbers) into the text field of an online XML parser, such as the one listed among the tools earlier in this chapter, and then hit the 'parse' button. If you have copied the text without mistakes, you should get a message that there are no errors in your document. Now, try to remove the slash from one of the end tags and parse it again. This should return an error message and a reference to the incorrect line. Before we start to learn about DTDs, we will have to look briefly at a few concepts regarding the physical structure of the XML language.
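
If you would rather run the same experiment on your own machine, the sketch below does roughly what the 'parse' button does. It assumes Python and its standard xml.etree.ElementTree module (not one of the tools used in this manual), and check_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well-formed, otherwise (line, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        line, _column = error.position
        return line, str(error)
    return None


good = "\n".join([
    '<?xml version="1.0"?>',
    "<COLLECTION>",
    "  <CD>",
    "    <ARTIST>Queen</ARTIST>",
    "    <TITLE>A Night at the Opera</TITLE>",
    "    <YEAR>1975</YEAR>",
    "  </CD>",
    "</COLLECTION>",
])

# Remove the slash from the YEAR end tag, as suggested above.
broken = good.replace("</YEAR>", "<YEAR>")

print(check_well_formed(good))
print(check_well_formed(broken))
```

Note that a parser reports the place where it discovers the problem. With the slash missing, the second <YEAR> is read as a new start tag, so the error only shows up at the </CD> end tag on the following line.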



Physical Structure : Since logical structure is about the organisation of the document, it makes sense that physical structure is about the actual content of your XML instance. The content of an XML file is contained in chunks of information called entities. Generally, like in the example we have used so far, this content is text, but it may also be binary data like an image file. Entities have names by which they can be identified, and they must be declared in the prolog (usually in a DTD). They are then referenced later somewhere inside the document element (the root element). On the most basic level, the XML instance itself is an entity, the so-called document entity, which holds the root element and, through it, all of the other elements. Because every well-formed XML document has such a document entity, it always has at least one entity. Since entities are about content rather than structure, the declarations in the prolog are regarded as part of the XML structure and not as content in their own right.


There are two kinds of entities: internal and external. Internal entities are defined completely within the document itself. This means that the entire content of the entity is found within the main document and that it is declared in the prolog. Internal entities are always text. External entities are, as opposed to internal entities, not located within the main document, but draw their content from an external file or source. The main document only contains a reference to the external file through a URL, or a link. An image file is a typical example of an external entity.

Entities can also be either parsed or unparsed. The content of a parsed entity is text that follows the XML rules. An unparsed entity is binary data, or text that does not follow the XML rules, in a format that the author declares in the DTD. On paper this gives us four possible combinations of entities, although XML 1.0 only allows three of them, because an internal entity is always parsed:

  • Internal, parsed entity - An internal entity made up of parsable text.
  • Internal, unparsed entity - An internal entity made up of unparsable text. (This is the combination that XML 1.0 does not allow.)
  • External, parsed entity - An external entity reference that points to parsable text. (Once parsed, the text becomes part of the document.)
  • External, unparsed entity - An entity reference that points to a binary file or unparsable text.

So, the example we have been working with so far contains only internal parsed entities. Since we will stay with this kind of example for a while, the concept of external and internal entities will not be of major importance until later on. Just keep the basic concept in the back of your mind for the time being; we will get back to entities later.


Another basic concept concerning the physical structure of an XML document is that of attributes. Attributes provide a method of associating values with the elements of an XML document instance without making the attributes part of the content of the individual elements. To use our familiar example again: the different tags in this example have been given names that reflect the content in a suitable way, but they may not be as clear as we would like them to be. If we look at the 'YEAR' element for instance, this does not really say anything about what this year reflects. It would most likely be the production year of the CD, but it could also be the year it was purchased. Attributes are simply a way of specifying this kind of information for elements in an XML document. This is done by adding a name/value pair inside the element start tag like this:


<YEAR function="release"> or <YEAR function="purchase">

If you substitute one of the original lines in the test example with either one of these, you should be able to parse the document without any errors. Each attribute consists of an attribute name and an attribute value. Attribute names are strings that are placed in front of the equals sign and follow the same rules as the tag names described above. That is, they must start with a letter or an underscore, and can not include white space. Letters, digits, underscores, hyphens and periods are legal characters. Like element names, attribute names are also case-sensitive. It is often a good idea to distinguish elements from attributes by using capital letters for element names and lowercase for attribute names. The attribute value is whatever comes after the equals sign and must be contained in quotation marks (single or double). Unlike attribute names, there are very few limitations on the content of an attribute value. It may contain almost all kinds of characters and white space (provided that they are available in the encoding scheme you are using), although < and & must be written as entity references. The other important exception is quotation marks. It is common practice to surround the attribute value by double quotes. If the value itself contains double quotes, single quotes can be used to surround the attribute value instead. Thus


  <SIZE length='12"'>

   would be a valid attribute, whereas

  <SIZE length="12" ">

   wouldn't.
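
The same attribute examples can be tried out in a few lines of code. This is a sketch under the assumption that Python and its standard xml.etree.ElementTree module are available (they are not among the tools used in this manual); the helper parses is hypothetical, and the SIZE tags have been closed with /> so that each string is a complete element.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is a well-formed XML element."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The attribute is a name/value pair and not part of the element content.
year = ET.fromstring('<YEAR function="release">1975</YEAR>')
print(year.get("function"), year.text)

# Single quotes around a value that contains a double quote are fine.
size = ET.fromstring("<SIZE length='12\"'/>")
print(size.get("length"))

# A stray double quote inside a double-quoted value is not.
print(parses('<SIZE length="12" "/>'))
```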


The text itself

So far, we have looked at the two most important aspects of the physical description in XML: attributes and entities. Before we move on to talk about the Document Type Definition in the next chapter, we need to have a look at a few special things regarding the textual content of an XML document. The text that goes between two element tags basically consists of different characters, digits, punctuation marks and so on. XML uses the Unicode character set and therefore understands the characters of English, the other Western European languages and most other written languages as well. It is possible to store an XML document in an encoding other than the Unicode defaults, but then this will have to be specified in the encoding declaration in the prolog. Unicode is what we call the default character set for XML. The text in an XML document is commonly referred to as character data, and everything else is then named markup. Markup includes all tags, processing instructions, DTDs and so on.



Entity References

When we said that you can use any sign in the English language in your character data, this is not entirely true. The ampersand (&), single and double quotes ( ' and " ) and the angle brackets that delimit tags ( < and > ) have special meanings in XML markup. Strictly speaking, only & and < can never appear as themselves in character data; the quotes only cause trouble inside attribute values that are delimited by the same kind of quote, and > only in the sequence ]]>. The safe habit, however, is to treat all five as reserved. If you wish to use these characters inside a section of character data, you will have to replace them by so-called entity references. Entity references are used in XML instead of specific characters that would otherwise be interpreted as part of the markup. The entity reference is a name written between an ampersand and a semicolon. Thus, the entity reference for the less-than sign that opens a tag is: &lt; The following five entity references are predefined in XML:

  • &amp; = &
  • &lt; = <
  • &gt; = >
  • &quot; = "
  • &apos; = '

An entity reference is a reference to a parsed entity, in other words to a chunk of text that is defined somewhere other than where it is used. Upon parsing, the entity reference is replaced by that text. As I mentioned above, the five listed entity references are predefined in XML. This means that when the document is parsed, all occurrences of the entity reference &amp; inside a piece of character data are replaced by the ampersand sign ( & ). To see that this really works, try to add the following record to the collection example:


Tom Petty &amp; the Heartbreakers - Echo - 1999


When you parse the document, the output should display an & instead of &amp;
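
The same experiment can be sketched in code. The example assumes Python with its standard xml.etree.ElementTree and xml.sax.saxutils modules, which are not among the tools used in this manual. The record is written out as a CD element in the style of the collection example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

cd = ET.fromstring(
    "<CD>"
    "<ARTIST>Tom Petty &amp; the Heartbreakers</ARTIST>"
    "<TITLE>Echo</TITLE>"
    "<YEAR>1999</YEAR>"
    "</CD>"
)

# After parsing, the entity reference has been replaced by the character.
artist = cd.findtext("ARTIST")
print(artist)

# Going the other way: turn plain text into character data that is safe
# to place between two tags.
escaped = escape("Tom Petty & the Heartbreakers")
print(escaped)
```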

When we begin to work with DTDs we will have a closer look at how we can make our own, custom-made entity references. For the time being, we can say that entity references can be used in XML in a similar way to how macros are used in word processing programs. As a very simple example, we could say that if you have repeated occurrences of a term like "Java Runtime Environment" in a text, it would make sense to replace it by an entity reference like &JRE; to save yourself some writing. In order for the XML processor to understand this, these kinds of entity references have to be defined in the prolog of the XML instance, most typically in the DTD.
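
As a small preview, here is what the &JRE; idea could look like in practice. This is a sketch under my own assumptions: the NOTE element and its text are invented for the illustration, the entity is declared directly in the prolog of the document, and Python's standard xml.etree.ElementTree module (not one of the tools used in this manual) does the parsing.

```python
import xml.etree.ElementTree as ET

# The entity JRE is declared in the prolog, inside the document type
# declaration, and then referenced twice in the character data.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE NOTE [
  <!ENTITY JRE "Java Runtime Environment">
]>
<NOTE>Install the &JRE; first, then restart the &JRE; installer.</NOTE>"""

note = ET.fromstring(DOCUMENT)
print(note.text)
```

Each &JRE; in the character data comes out of the parser as the full replacement text.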



CDATA

As we have explained above, almost everything that is not part of a tag is considered to be character data in XML. Also, occurrences of & and < (and, to be on the safe side, >, " and ') must be replaced by their respective entity references for the XML document to be regarded as well-formed. So, what if you have a large number of these special characters in the text? This could happen if you write an online tutorial in XML, for instance. To avoid going insane by writing too many entity references into the character data, there is something called CDATA sections in XML.

CDATA sections are portions of text where the XML parser does not try to interpret the text - all text is pure character data. An &amp; inside a CDATA section will simply be read as the five characters &amp; and a plain & will be read as &. CDATA sections begin with: <![CDATA[ and end with: ]]>. The only text that is not allowed within a CDATA section is the end marker ]]>.
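
To compare the two ways of writing the same character data, here is a minimal sketch. It assumes Python and its standard xml.etree.ElementTree module (not one of the tools used in this manual); the CODE element and its content are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The same text, once with entity references and once as a CDATA section.
with_references = ET.fromstring(
    "<CODE>if (a &lt; b &amp;&amp; b &gt; c)</CODE>"
)
with_cdata = ET.fromstring(
    "<CODE><![CDATA[if (a < b && b > c)]]></CODE>"
)
print(with_references.text)
print(with_cdata.text)

# Inside a CDATA section an entity reference is not interpreted.
literal = ET.fromstring("<CODE><![CDATA[&amp;]]></CODE>")
print(literal.text)
```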



Comments

If you are familiar with HTML, you are probably familiar with the concept of comments already. Comments are a special kind of markup that starts with <!-- and ends with -->. All data written between these two markers is ignored by the XML processor. Comments are usually used to make small notes inside the XML document or to comment out entire sections of XML code, like this:


  <!-- Hm, I must remember to enter the data on my new CD right here -->

  <!--
  <CD>
    <ARTIST>The Beatles</ARTIST>
    <TITLE>Rubber Soul</TITLE>
    <YEAR type="release">1965</YEAR>
    <YEAR type="purchase">1996</YEAR>
  </CD>
  -->


In the first example, I just made a note to myself inside the XML document. In the other example, I commented out an entire CD from the collection. When the XML processor parses the document, it will simply ignore the part about this CD. It is important to note that comments can be used to surround and hide tags, but they can never be used inside the tags themselves. This is not legal use of comments: <YEAR <!-- don't really know -->>. Since comments effectively "delete" sections of text, it is important to remember that the remaining text must still be a well-formed XML document. As with CDATA, the closing part can not be included: a comment must not contain two hyphens in a row (--), just as a CDATA section must not contain ]]>. Because of this limitation, you can not nest comments inside comments, or CDATA sections inside CDATA sections.
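
To see that a commented-out CD really disappears from the processed document, here is a small sketch. It assumes Python and its standard xml.etree.ElementTree module (not one of the tools used in this manual), which drops comments by default when it builds the element tree.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<COLLECTION>
  <!-- Hm, I must remember to enter the data on my new CD right here -->
  <CD>
    <ARTIST>Queen</ARTIST>
    <TITLE>A Night at the Opera</TITLE>
    <YEAR>1975</YEAR>
  </CD>
  <!--
  <CD>
    <ARTIST>The Beatles</ARTIST>
    <TITLE>Rubber Soul</TITLE>
    <YEAR type="release">1965</YEAR>
    <YEAR type="purchase">1996</YEAR>
  </CD>
  -->
</COLLECTION>"""

root = ET.fromstring(DOCUMENT)

# Only the CD that is not commented out is left in the tree.
titles = [cd.findtext("TITLE") for cd in root.findall("CD")]
print(titles)
```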



Summary

Through this chapter I have tried to give a brief outline of how a document 'must' be structured in order to become well-formed XML. If the document does not follow both the logical and the physical structure, all attempts to process it will fail. Here is a quick recap of the most important rules:

  1. Start with an XML Declaration.
  2. Match start tags and end tags.
  3. End empty tags with />.
  4. One Element must completely contain all other elements (the root element).
  5. Tags may nest, but never overlap.
  6. Attribute values must be in quotes (single or double).
  7. Use < only to start tags.
  8. Use & only to start entity references.
  9. Use only &amp; &lt; &gt; &apos; and &quot; as entity references (all others have to be declared in a DTD).

