Chapter 1
XML in relation to Markup Languages
This manual is intended for users with no previous experience in
web publishing. We must therefore assume that if you read this,
you have no experience with areas such as HTML, Java or CGI. Not
to worry. You don't need to have any experience with HTML to learn
XML. The one advantage you have if you know something about HTML
already is that most textbooks teach XML in a way that relies heavily
on pointing out the differences between the two languages. This
is not so strange considering that HTML and XML are "siblings"
in the sense that they are derived from the same "parent"
language: SGML. This leads us to the angle of this manual. Instead
of comparing XML to HTML, we will try to see XML in relation to
markup languages in general and how this can be a worthwhile addition
to the World Wide Web.
What is Markup?
This brings us to our first item on the agenda: what exactly is
a markup language? In the more general sense of the term, it is
by no means a new word. It has been used for quite some time in
the print and design world as a means for the author/publisher of
a text to highlight sections that have some sort of special structural
or contextual meaning. This could be anything from individual words
that carry a special meaning to a simple indication of where one
chapter ends and another one starts. The tradition of marking up
texts goes all the way back to the time when scribes wrote their
comments in the margins around the edge of the manuscript, or used
different inks to make certain words stand out from the rest of
the text.
With the invention of the printing press, the importance of document
editing grew as printing and publishing became separate industries.
Up until this point in time a finished manuscript was usually the
product of one particular copyist (and possibly one or more illuminators).
After the printing press became widely available, a manuscript went
through several stages where it would be edited and marked up by
hand on a draft copy. This draft would eventually go to a typesetter,
usually with several layers of handwritten comments, where it would
be formatted and sent to the printing press. Even if the document
had been marked up beforehand, the typesetter would normally decide
which typefaces were to be used in the printed edition.
Throughout the history of printing and publishing, certain standard
conventions for displaying documents have developed. These standards
are so commonly used that most of us rarely think about them as
rules, but automatically know what they mean. These are some of
the more common ones in Western publishing:
- Italics for titles of books, newspapers and other documents
- Bold or large type for headings
- Indentation at the start of paragraphs
These rules made it possible for the publisher or editor of a manuscript
to mark up the content of the text, and then the typesetter would
know how it was to be displayed in the printed publication. The
introduction of computers made a lot of the manual work superfluous,
and also made it easier to edit the document on its way from the
author to the printing press. But in order to
retain the same level of control over document layout, a way to
code the text was needed so that the output device would know how
the document was structured and how it was supposed to look. This
is why electronic markup was conceived.
How does electronic markup work?
Most of us use electronic markup every day without thinking about
it (or knowing about it). Markup basically consists of codes, or
tags, that are added to the text to tell the program you are using
how the text is structured and how it is supposed to be displayed.
The most common use of electronic markup is to change the appearance
of a text by adding formatting, such as bold type, italics, font
sizes and margins, to mention just a few things. All word processors
use a markup language to control the layout and appearance of documents,
but as a general rule this is not visible to the user. Using
this document as an example, I will try to illustrate one of the
problems with this kind of markup. Here is a small bit of information
from my CD archive:
Artist : Queen
Title : A Night at the Opera
Year : 1975
This all looks rather neat and orderly. This is possible because
the word processor's markup language is able to code the information
you put into the document. Now, if we save these three lines in
a file called test.doc and try to open it in Notepad, we will
see something like this:
[Screenshot: Notepad displaying test.doc as rows of unreadable characters]
Why did this happen? To explain this we need to take a look at
the different types of markup languages that exist. As a very
general rule, we can distinguish between two fundamentally different
types of markup: specific markup on the one hand and generalised
markup on the other. The difference between the two types is
important for understanding the principles behind XML, so I will
briefly try to describe it.
Specific markup languages are used to generate code for
one specific application or program, and are thus built to serve a particular
need. The example above is an illustration of this. The original
document was written in Word 7.0, a word processor that supports
the RTF (Rich Text Format) markup language. In addition to using
the RTF language, Word (and other word processing programs) saves
the document as a binary file, which is not readable as text. When
we opened the CD information in Notepad, which is a plain-text application,
it simply displayed everything in the binary Word-file without applying
any kind of text formatting. There are many reasons why document
information is stored in this way. Speed and efficiency are two
of the major reasons why this kind of solution was implemented,
but with the appearance of powerful Pentium processors this has
become less of an issue (though it is still important). Another fairly
important reason why software developers still use this system
is that it makes it harder for you, as a user, to move to a competitor's
product. Apart from the fact that the different specific markup languages
are not directly portable between different applications and operating
systems, they have another major drawback: they do not describe structure,
but style and formatting.
Generalized markup languages were introduced to remedy the
two major shortcomings of specific markup languages:
- Lack of portability to different systems
- Markup was not content oriented
RTF, which is supported by most word processors, is widely portable
across systems and applications. The problem with it is that it
is implemented in a slightly different manner depending on the word
processing program. Even if this were not a problem (and it
isn't really, since you will always have the option to save documents
in this format directly: look at the 'Save As...' option in Word),
RTF still does not mark up the content of the document. By content
in this case I mean the logical structure of the document.
The first step towards a content-oriented markup language was made
by Dr. C. F. Goldfarb and two of his colleagues in the 1970s. Their
proposal for an independent text markup was based on two basic principles:
- The markup should describe the structure of a document, and
not style or formatting.
- The syntax of the markup should be strictly enforced, so that
the code could clearly be read by a software program or a human.
The result was the Document Composition Facility Generalized Markup
Language (GML). This is the predecessor to the Standard Generalized
Markup Language (SGML), which was accepted as an international standard
by the International Organization for Standardization in 1986 (ISO
8879). Theoretically this is all very nice, but how does generalised
markup work in practice?
The basics of SGML
SGML is a very complex language, so I am not going to go
into detail about every aspect of it. There are, however, a few
of the governing principles behind SGML that are fundamental to
our understanding of XML. SGML works on the principle that documents
are typically made up of repeated occurrences of some basic elements.
If we use my CD collection again, this could be seen as one large
collection consisting of several individual CDs. Each CD contains
information on artist name, title and production year. An SGML representation
of this would look something like this:
<!DOCTYPE collection PUBLIC "-//records//DTD collection//EN">
<collection>
  <cd>
    <artist>Queen</artist>
    <title>A Night at the Opera</title>
    <year>1975</year>
  </cd>
  <cd>
    <artist>Queen</artist>
    <title>A Day at the Races</title>
    <year>1976</year>
  </cd>
</collection>
In this representation, the information is neatly structured within
tags that give some sort of meaningful representation of the content.
Even if this seems very logical to us, computers do not understand
the concept of CD collections unless we find a way to tell them.
Thus, every SGML document needs 'something' to tell the software
about the structure of the described document. This 'something'
is called a Document Type Definition (DTD), and is usually an external
file. A DTD lets you specify how the data of the documents you are
working with is structured, and what each individual element may
sensibly contain.
I will not go into detail about how DTDs work at this stage, since
this is something we will get back to in a later chapter.
The example above is typical of what is referred to as
an SGML document instance, and contains some of the main features
of SGML:
- The document type is declared at the beginning
of the document. This is in most cases a reference to an external
file, either locally on your own machine or at a web address.
- Each element of text is contained within a set
of tags, which are made up of the name of the element surrounded
by angle brackets.
- Tags always come in pairs (this is, strictly speaking,
not necessary in SGML, but it is required in XML - so get used
to it). A start-tag at the beginning of each element is
matched by a similar end-tag at the end. The only difference
between the two is that the end-tag is prefixed by a slash.
- All text appears within at least one set of tags. Information
may also be 'nested' inside several levels of tags, but this is
something we will get back to.
As you may have noticed, there is no information in the example
above that says anything about how the text is supposed to be displayed.
If you think that this kind of information is taken care of by the
DTD, you are mistaken. The sole purpose of the DTD is to hold information
on the structure of the document. In order to display the coded
information in an SGML file, we need to create a stylesheet that
carries information on how the different elements of the SGML document
are to be displayed by, for instance, a browser or a printing program.
Even if stylesheets are also an essential part of XML, this is something
we will not cover in much detail until later in this manual.
To summarise this section we can say that the visible result of
an SGML (or XML) page is usually the result of three different parts:
- The SGML document instance, which contains the coded text.
- A Document Type Definition (DTD), that holds all the structural
information.
- One or more stylesheets that deal with the display of the document.
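To make the split between coded text and display a little more concrete, here is a minimal sketch in Python, a general-purpose programming language that this manual does not otherwise use. It assumes the CD collection is written as XML, leaves out the document type declaration, and uses Python's standard xml.etree.ElementTree module to read the tags. The two "stylesheets" (LIST_STYLE and HEADLINE_STYLE) and the render function are hypothetical stand-ins made up for this illustration; real style languages work quite differently.

```python
import xml.etree.ElementTree as ET

# The document instance: coded text only, nothing about how it should look.
# It is written as XML here, and the document type declaration is left out.
INSTANCE = """
<collection>
  <cd>
    <artist>Queen</artist>
    <title>A Night at the Opera</title>
    <year>1975</year>
  </cd>
  <cd>
    <artist>Queen</artist>
    <title>A Day at the Races</title>
    <year>1976</year>
  </cd>
</collection>
"""

# Two hypothetical "stylesheets". Each maps an element name to a rule that
# turns the text of that element into display text.
LIST_STYLE = {
    "artist": lambda text: text + ": ",
    "title": lambda text: text,
    "year": lambda text: " (" + text + ")",
}
HEADLINE_STYLE = {
    "artist": lambda text: text.upper() + " / ",
    "title": lambda text: text,
    "year": lambda text: ", " + text,
}


def render(cd, style):
    """Apply the style rule for each child of a <cd> element, in order."""
    return "".join(style[child.tag](child.text) for child in cd)


# Parse the text into a tree; "collection" is the outermost element.
collection = ET.fromstring(INSTANCE)
for cd in collection.findall("cd"):
    print(render(cd, LIST_STYLE))
    print(render(cd, HEADLINE_STYLE))
```

The point of the sketch is that the coded text never changes: the same two CDs are shown in two different ways simply by swapping the set of display rules.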
Where does XML come into the picture?
As I mentioned in the very beginning of this chapter, XML is derived
from SGML. It is a so-called subset of SGML. The main difference
between the two languages is that XML is optimised for use over
the World Wide Web by being able to work with HTML for data display
and presentation. Up until this point in time (and probably for
several years to come) HTML has been the primary way of presenting
information on the Web. Even though HTML is derived from SGML, it
does not have very much in common with its parent language anymore.
HTML today is entirely about presentation of information and this
is reflected in the tagging scheme of the language. Most HTML-tags
are created for the sole purpose of controlling the layout of text
and images in the browser. This means that content-related searches
on the WWW have become something of a problem. The principle behind
search engines like Alta Vista or Lycos is to search the Web for
so-called meta-information in all the different pages out there.
Meta-information basically consists of keywords that the author of a text
has put into an HTML file to describe the content of the document.
Thus, your ability to find relevant information on the Web relies
entirely on the publishers' own idea of what is relevant in their
publications. XML is an attempt to remedy this problem, not necessarily
by replacing HTML, but by working together with it. This is possible
because the two languages work at different levels with regard
to how they structure data.
So why develop a new language based on SGML when SGML had been
used for many years in very advanced publishing applications? The
simple answer to this is that SGML is regarded by most people, even
regular users, as an overly complex language with many optional
functions. It was therefore decided that a simpler, web-oriented
subset of SGML would be preferable. This subset was to retain the
characteristics of SGML, but at the same time be almost as simple
to implement as HTML (which is actually a very simple language).
The resulting language, XML (Extensible Markup Language), removed
many of the complexities, and all of the optional functions of SGML,
while still retaining the primary benefits:
- XML is a generalised markup language, which enables the authors
to define their own set of tags.
- XML documents are self-describing. This means that a valid document
contains the set of rules to which its data must conform.
- XML documents can be validated. This means that you can
use a program called an XML validator to check if a document is
structured according to the rules laid out in the DTD that the
document refers to.
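As a rough illustration of what "checking a document against a set of rules" means, here is a small sketch in Python (a programming language not otherwise used in this manual). It is not a real XML validator and it does not read a DTD. The rule in REQUIRED_CHILDREN is a hypothetical stand-in for what a DTD for the CD collection might state, and follows_rules is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule, standing in for what a DTD would state:
# every <cd> holds exactly these child elements, in this order.
REQUIRED_CHILDREN = ["artist", "title", "year"]


def follows_rules(cd):
    """Return True if the <cd> element has the required children in order."""
    return [child.tag for child in cd] == REQUIRED_CHILDREN


complete = ET.fromstring(
    "<cd><artist>Queen</artist><title>A Day at the Races</title>"
    "<year>1976</year></cd>"
)
# The same CD, but with the <artist> element left out.
missing_artist = ET.fromstring(
    "<cd><title>A Day at the Races</title><year>1976</year></cd>"
)

print(follows_rules(complete))
print(follows_rules(missing_artist))
```

A real validator does the same kind of comparison, but takes its rules from the DTD that the document refers to instead of from a list written into the program.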
In addition to retaining the primary benefits of SGML, XML has
a number of advantages compared to SGML when it comes to delivery
across the World Wide Web:
- XML is a much smaller language than SGML. This is because the
designers of the XML specification tried to cut everything that
was not needed for Web delivery.
- XML includes a specification for a hyperlinking scheme, which
is described as a separate language called XLL (Extensible Linking
Language). The ability to link together separate pages was one
of the major reasons behind the success of HTML after it was introduced
in the early 1990's. Even if SGML allows a hyperlinking mechanism
to be defined, it has never been a part of the original language
specification.
- XML also includes a specification for a style language called
XSL (Extensible Style Language). This provides support for a style
sheet mechanism, something which is also not included in the SGML
language specification.
XML is not intended only as a 'substitute' for HTML in delivering
information on the internet. To understand the intended use of XML
we need to take a look at the original goals of the work group that
developed the XML specification. Their original intentions for the
new markup language were expressed in the following ten points:
- XML shall be straightforwardly usable over the Internet.
This meant that the new language specification had to simplify
the existing SGML structure. In addition to this the new language
had to consider the needs of the applications that run in a network
environment.
- XML shall support a wide variety of applications. This
basically means that XML is not intended solely as an internet
language, but something that can be implemented in a wide range
of software, ranging from text editing tools to database programs.
- XML must be compatible with SGML. The idea behind this
is that any valid XML document is also a valid SGML document.
This means that an XML file can be checked for validity against
a DTD by running it through an SGML validator. The reverse is,
of course, not the case, since an XML validator will not understand
all the advanced functions of SGML.
- It shall be easy to write programs that process XML documents.
The reason why this particular point was included in the list,
was simply based on the belief that the easier the language, the
more tools will become available. The adoption of XML as a standard
relies heavily on the availability of programs that are able to
use the new language, and by keeping it simple, people will be
able to produce their own tools without too much effort (or at
least download freeware programs made by others).
- The number of optional features in XML must be kept at a
minimum, ideally zero. Based on the experiences with compatibility
problems in SGML software due to all the optional features of
that language, XML tries to avoid optional features.
- XML documents should be human-legible and reasonably clear.
Since XML uses plain text to work with data, it is easier
for the user to understand the language if it is clearly readable.
It also allows the user to work relatively efficiently with XML
in a standard text editor, even if advanced XML software becomes
available.
- The XML design should be prepared quickly. This point
does not really have anything to do with the nature of the markup
language as such. It was added because there was general concern
that if XML was not made available quickly as a way to extend
HTML, someone else would come up with a different solution.
- The design of XML shall be formal and concise. This point focuses
on the wording of the actual XML specification. The thought behind
it is that the language would be more widely adopted if it was
easy to understand and use.
- XML documents shall be easy to create. This is a logical
extension of point 6 (human legibility). If the document is easy to
read by human users, creating the document should be similarly easy.
In principle,
it is not a problem to create an XML document in a standard text
editor. In practice, however, this is a relatively time-consuming
process. Since specialised XML editors are not commonly available
yet, it is a bit too early to say if this goal has been accomplished.
- Terseness in XML markup is of minimal importance. SGML
supports a number of minimisation techniques that reduce the
amount of typing the author has to do. I have mentioned one of
them already: the possibility to leave out end tags for many elements.
This is not allowed in XML.
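The last point, that end tags may not be left out, is easy to try out. The sketch below is written in Python (not otherwise used in this manual) and relies on the XML parser in Python's standard library; parses_as_xml is a helper name invented for this example. It assumes that a parser refusing the text is a fair stand-in for "not allowed in XML".

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


with_end_tags = "<cd><artist>Queen</artist><year>1975</year></cd>"
# SGML-style minimisation: the end tag for <artist> has been left out.
without_end_tag = "<cd><artist>Queen<year>1975</year></cd>"

print(parses_as_xml(with_end_tags))
print(parses_as_xml(without_end_tag))
```

The parser accepts the first text and rejects the second, because it reaches the end of the cd element while the artist element is still open.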
Some of the points in the list above probably don't make much
sense at the moment. That is not really the idea behind this chapter
either. Its purpose is to give you, the reader, a general idea
about the theory behind markup languages and XML. Like most theory,
it will be a lot easier to understand once you have had a little
practice with XML. So, if you read through this chapter again after
you have been through the examples in the next few chapters, it
will make more sense. With the worst theoretical part behind us,
let's start on an actual XML example...