Ground Zero - My test site

Chapter 1

XML in relation to Markup Languages

This manual is intended for users with no previous experience in web publishing. We therefore assume that you have no experience with areas such as HTML, Java or CGI. Not to worry: you don't need any experience with HTML to learn XML. The one advantage of knowing some HTML already is that most textbooks teach XML by pointing out the differences between the two languages. This is not so strange, considering that HTML and XML are "siblings" in the sense that they are derived from the same "parent" language: SGML. This leads us to the angle of this manual. Instead of comparing XML to HTML, we will try to see XML in relation to markup languages in general, and how it can be a worthwhile addition to the World Wide Web.


What is Markup?

This brings us to our first item on the agenda: what exactly is a markup language? In the more general sense of the term, markup is by no means a new word. It has been used for quite some time in the print and design world as a means for the author or publisher of a text to highlight sections that have some sort of special structural or contextual meaning. This could be anything from individual words that carry a special meaning to a simple indication of where one chapter ends and another one starts. The tradition of marking up texts goes all the way back to the time when scribes wrote their comments in the margins of the manuscript, or used different inks to make certain words stand out from the rest of the text.

With the invention of the printing press, the importance of document editing grew as printing and publishing became separate industries. Up until this point in time a finished manuscript was usually the product of one particular copyist (and possibly one or more illuminators). After the printing press became widely available, a manuscript went through several stages where it would be edited and marked up by hand on a draft copy. This draft would eventually go to a typesetter, usually with several layers of handwritten comments, where it would be formatted and sent to the printing press. Even if the document had been marked up beforehand, the typesetter would normally decide which typefaces were to be used in the printed edition.

Throughout the history of printing and publishing, certain standard conventions for displaying documents have developed. These conventions are so commonly used that most of us rarely think of them as rules, but automatically know what they mean. These are some of the more common ones in Western publishing:

  • Italics for titles of books, newspapers and other documents
  • Bold or large type for headings
  • Indentation at the start of paragraphs

These rules made it possible for the publisher or editor of a manuscript to mark up the content of the text, so that the typesetter would know how it was to be displayed in the printed publication. The introduction of computers made a lot of the manual work superfluous, and it became easier to edit the document on its way from the author to the printing press. But in order to retain the same level of control over document layout, a way to code the text was needed so that the output device would know how the document was structured and how it was supposed to look. This is why electronic markup was conceived.



How does electronic markup work?

Most of us use electronic markup every day without thinking about it (or knowing about it). Markup basically consists of codes, or tags, that are added to the text to tell the program you are using how the text is structured and how it is supposed to be displayed. The most common use of electronic markup is to change the appearance of a text by adding formatting, such as bold font, italics, font sizes and margins, just to mention a few things. All word processors use some form of markup to control the layout and appearance of documents, but as a general rule this is not visible to the user. I will use a word processor document as an example to illustrate one of the problems with this kind of markup. Here is a small bit of information from my CD archive:

  Artist : Queen
  Title : A night at the opera
  Year : 1975

This all looks rather neat and orderly. This is possible because the word processor's markup language is able to code the information you put into the document. Now, if we save these three lines in a file called test.doc and try to open it in Notepad, we will see something like this:


[Figure: screen dump of the Word document opened in a text editor]

 

Why did this happen? To explain this we need to take a look at the different types of markup languages that exist. As a very general rule, we can distinguish between two fundamentally different types of markup: specific markup on the one hand and generalised markup on the other. Understanding the difference between the two is fairly important for understanding the principles behind XML, so I will briefly try to describe the main differences between them.


Specific markup languages are used to generate code for one specific application or program, and are thus built to serve a particular need. The example above is an illustration of this. The original document was written in Word 7.0, a word processor that supports the RTF (Rich Text Format) markup language. In addition to supporting RTF, Word (like other word processing programs) saves the document as a binary file, which is not readable as text. When we opened the CD information in Notepad, which is a plain-text application, it simply displayed everything in the binary Word file without applying any kind of text formatting. There are many reasons why document information is stored in this way. Speed and efficiency are two of the major ones, but with the appearance of powerful Pentium processors this has become less of an issue (though it is still important). Another fairly important reason why software developers still use this system is that it makes it harder for you, as a user, to move to a competitor's product. Apart from the fact that the different specific markup languages are not directly portable between applications and operating systems, they have another major drawback: they do not describe structure, but style and formatting.
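To make the last point concrete, here is a minimal sketch in Python. It assumes a hand-written RTF rendering of the three CD lines with the labels in bold; the string is a hypothetical example, and the control words a real word processor emits will differ and be far more numerous.

```python
import re

# Hypothetical RTF version of the three CD lines (labels in bold).
# This is an illustration, not the output of any particular word processor.
rtf_document = (
    r"{\rtf1\ansi "
    r"{\b Artist} : Queen\par "
    r"{\b Title} : A night at the opera\par "
    r"{\b Year} : 1975\par }"
)

# A plain-text viewer has no notion of formatting, so it shows the codes as-is.
print(rtf_document)

# The markup can tell us which runs of text are bold...
bold_runs = re.findall(r"\{\\b ([^}]*)\}", rtf_document)
print(bold_runs)

# ...but nothing in it says that "Queen" is an artist or that "1975" is a year.
```

The codes describe how the text should look, not what it means, which is exactly the drawback described above.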


Generalized markup languages were introduced to remedy the two major shortcomings of specific markup languages:

  • Lack of portability to different systems
  • Markup was not content oriented

RTF, which is supported by most word processors, is widely portable across systems and applications. The problem with it is that it is implemented in a slightly different manner depending on the word processing program. Even if this were not a problem (and it isn't really, since you always have the option to save documents in this format directly; look at the 'save as..' option in Word), RTF still does not mark up the content of the document. By content in this case I mean the logical structure of the document.

The first step towards a content-oriented markup language was made by Dr. C. F. Goldfarb and two of his colleagues in the late 1960s and early 1970s. Their proposal for an independent text markup was based on two basic principles:

  • The markup should describe the structure of a document, and not style or formatting.
  • The syntax of the markup should be strictly enforced, so that the code could clearly be read by a software program or a human.

The result was the Document Composition Facility Generalized Markup Language (GML). This is the predecessor of the Standard Generalized Markup Language (SGML), which was accepted as an international standard by the International Organization for Standardization in 1986 (ISO 8879). Theoretically this is all very nice, but how does generalised markup work in practice?



The basics of SGML

SGML is a very complex language, so I am not going to go into detail about every aspect of it. There are, however, a few governing principles behind SGML that are fundamental to our understanding of XML. SGML works on the principle that documents are typically made up of repeated occurrences of some basic elements. If we use my CD collection again, it could be seen as one large collection consisting of several individual CDs. Each CD contains information on artist name, title and production year. An SGML representation of this would look something like this:


  <!DOCTYPE collection PUBLIC "-//records//DTD collection//EN">
  <collection>
    <cd>
      <artist>Queen</artist>
      <title>A Night at the Opera</title>
      <year>1975</year>
    </cd>
    <cd>
      <artist>Queen</artist>
      <title>A Day at the Races</title>
      <year>1976</year>
    </cd>
  </collection>


In this representation, the information is neatly structured within tags that give a meaningful description of the content. Even if this seems very logical to us, computers do not understand the concept of CD collections unless we find a way to tell them. Thus, every SGML document needs 'something' to tell the software about the structure of the described document. This 'something' is called a Document Type Definition (DTD), and is usually an external file. A DTD lets you specify how the data in your documents is structured, and what each individual element may sensibly contain. I will not go into detail about how DTDs work at this stage, since this is something we will get back to in a later chapter.

The example above is typical of what is referred to as an SGML document instance, and contains some of the main features of SGML:

  • The document type is declared at the beginning of the document. This is in most cases a reference to an external file, either locally on your own machine or a web address.
  • Each element of text is contained within a set of tags, which are made up of the name of the element surrounded by angle brackets.
  • Tags always come in pairs (this is, strictly speaking, not necessary in SGML, but it is required in XML - so get used to it). A start-tag at the beginning of each element is matched by a similar end-tag at the end. The only difference between the two is that the end-tag is prefixed by a slash.
  • All text appears within at least one set of tags. Information may also be 'nested' inside several levels of tags, but this is something we will get back to.
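The features listed above are what make the document usable by software. As a minimal sketch, the following Python snippet loads the collection with the standard library's xml.etree.ElementTree module. It treats the instance as XML and leaves out the document type declaration, since this parser does not validate against a DTD; the variable names are my own.

```python
import xml.etree.ElementTree as ET

# The document instance from the example, without the doctype line.
collection_source = """
<collection>
  <cd>
    <artist>Queen</artist>
    <title>A Night at the Opera</title>
    <year>1975</year>
  </cd>
  <cd>
    <artist>Queen</artist>
    <title>A Day at the Races</title>
    <year>1976</year>
  </cd>
</collection>
"""

root = ET.fromstring(collection_source.strip())

# Every <cd> element becomes a dictionary keyed by the names of its child tags.
records = [{field.tag: field.text for field in cd} for cd in root.findall("cd")]

# Because the tags name the content, we can ask a content-related question.
titles_from_1976 = [r["title"] for r in records if r["year"] == "1976"]
print(titles_from_1976)
```

Note that the program never needs to know how the text will be displayed; it works on the structure alone.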

As you may have noticed, there is no information in the example above that says anything about how the text is supposed to be displayed. If you think that this kind of information is taken care of by the DTD, you are mistaken. The sole purpose of the DTD is to hold information on the structure of the document. In order to display the coded information in an SGML file, we need to create a stylesheet that carries information on how the different elements of the SGML document are to be displayed by, for instance, a browser or a printing program. Even though stylesheets are also an essential part of XML, this is something we will not cover in much detail until later in this manual.

To summarise this section, we can say that the visible result of an SGML (or XML) page is usually the product of three different components:

  • The SGML document instance, which contains the coded text.
  • A Document Type Definition (DTD), that holds all the structural information.
  • One or more stylesheets that deal with the display of the document.
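The separation between the document instance and the stylesheet can be sketched in a few lines of Python. The "stylesheet" here is only a stand-in: a hypothetical dictionary that maps element names to display rules, not a real style language. The DTD is left out of the sketch entirely.

```python
import xml.etree.ElementTree as ET

# Document instance: the coded text, with no display information at all.
instance = (
    "<cd>"
    "<artist>Queen</artist>"
    "<title>A Night at the Opera</title>"
    "<year>1975</year>"
    "</cd>"
)

# Stand-in stylesheet (hypothetical): one display rule per element name.
stylesheet = {
    "artist": "Artist : {}",
    "title": "Title : {}",
    "year": "Year : {}",
}

cd = ET.fromstring(instance)

# Apply the display rule that matches each element's tag name.
rendered = [stylesheet[child.tag].format(child.text) for child in cd]
print("\n".join(rendered))
```

Changing how the CD is displayed only requires a different stylesheet; the document instance stays untouched.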


Where does XML come into the picture?

As I mentioned at the very beginning of this chapter, XML is derived from SGML. It is a so-called subset of SGML. The main difference between the two languages is that XML is optimised for use over the World Wide Web by being able to work with HTML for data display and presentation. Up until this point in time (and probably for several years to come) HTML has been the primary way of presenting information on the Web. Even though HTML is derived from SGML, it does not have very much in common with its parent language anymore. HTML today is almost entirely about presentation of information, and this is reflected in the tagging scheme of the language. Most HTML tags are created for the sole purpose of controlling the layout of text and images in the browser.

This means that content-related searches on the WWW have become something of a problem. The principle behind search engines like AltaVista or Lycos is to search the Web for so-called meta-information in all the different pages out there. Meta-information is basically keywords that the author of a text has put into an HTML file to describe the content of the document. Thus, your ability to find relevant information on the Web relies entirely on the publishers' own idea of what is relevant in their publications. XML is an attempt to remedy this problem, not necessarily by replacing HTML, but by working together with it. This is possible because the two languages work at different levels with regard to how they structure data.

So why develop a new language based on SGML when SGML had been used for many years in very advanced publishing applications? The simple answer is that most people, even regular users, regard SGML as an overly complex language with many optional functions. It was therefore decided that a simpler, Web-oriented subset of SGML would be preferable. This subset was to retain the characteristics of SGML, but at the same time be almost as simple to implement as HTML (which is actually a very simple language).

The resulting language, XML (Extensible Markup Language), removed many of the complexities and nearly all of the optional functions of SGML, while still retaining the primary benefits:

  • XML is a generalised markup language, which enables the authors to define their own set of tags.
  • XML documents are self-describing. This means that a valid document contains, or refers to, the set of rules to which its data must conform.
  • XML documents can be validated. This means that you can use a program called an XML validator to check if a document is structured according to the rules laid out in the DTD that the document refers to.
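The difference between a document that merely parses and one that is valid can be sketched as follows. The document below carries its rules in an internal DTD, but the second CD is missing its year. Python's xml.etree.ElementTree does not validate against a DTD, so the check at the end is a hand-written stand-in for what a real XML validator would do; the list of required children is copied by hand from the cd declaration and is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# A document that carries its own rules in an internal DTD.
# The second <cd> breaks the rule for cd: it has no <year>.
document = """<!DOCTYPE collection [
  <!ELEMENT collection (cd+)>
  <!ELEMENT cd (artist, title, year)>
  <!ELEMENT artist (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>
<collection>
  <cd><artist>Queen</artist><title>A Night at the Opera</title><year>1975</year></cd>
  <cd><artist>Queen</artist><title>A Day at the Races</title></cd>
</collection>"""

# Parsing succeeds: ElementTree checks the syntax only, not the DTD rules.
root = ET.fromstring(document)

# Stand-in for a validator: the child order required by the cd declaration.
required_children = ["artist", "title", "year"]

invalid_titles = [
    cd.findtext("title")
    for cd in root.findall("cd")
    if [child.tag for child in cd] != required_children
]
print(invalid_titles)
```

A real validator reads the rules from the DTD itself instead of relying on a hand-copied list.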

In addition to retaining the primary benefits of SGML, XML has a number of advantages compared to SGML when it comes to delivery across the World Wide Web:

  • XML is a much smaller language than SGML. This is because the designers of the XML specification tried to cut everything that was not needed for Web delivery.
  • XML includes a specification for a hyperlinking scheme, which is described as a separate language called XLL (Extensible Linking Language). The ability to link separate pages together was one of the major reasons behind the success of HTML after it was introduced in the early 1990s. Even though SGML allows a hyperlinking mechanism to be defined, it has never been a part of the original language specification.
  • XML also includes a specification for a style language called XSL (Extensible Style Language). This provides support for a style sheet mechanism, something which is also not included in the SGML language specification.


XML is not intended only as a 'substitute' for HTML in delivering information on the Internet. To understand the intended use of XML we need to take a look at the original goals of the working group that developed the XML specification. Their intentions for the new markup language were expressed in the following ten points:

  1. XML shall be straightforwardly usable over the Internet. This meant that the new language specification had to simplify the existing SGML structure. In addition, the new language had to consider the needs of applications that run in a network environment.
  2. XML shall support a wide variety of applications. This basically means that XML is not intended solely as an Internet language, but as something that can be implemented in a wide range of software, from text editing tools to database programs.
  3. XML must be compatible with SGML. The idea behind this is that any valid XML document is also a valid SGML document. This means that an XML file can be checked for validity against a DTD by running it through an SGML validator. The reverse is, of course, not the case, since an XML validator will not understand all the advanced functions of SGML.
  4. It shall be easy to write programs that process XML documents. This point was included in the list based on the belief that the easier the language, the more tools will become available. The adoption of XML as a standard relies heavily on the availability of programs that are able to use the new language. By keeping it simple, people will be able to produce their own tools without too much effort (or at least download freeware programs made by others).
  5. The number of optional features in XML must be kept at a minimum, ideally zero. Based on the experiences with compatibility problems in SGML software due to all the optional features of that language, XML tries to avoid optional features.
  6. XML documents should be human-legible and reasonably clear. Since XML uses plain text to work with data, it is easier for the user to understand the language if it is clearly readable. It also allows the user to work relatively efficiently with XML in a standard text editor, even when advanced XML software becomes available.
  7. The XML design should be prepared quickly. This point does not really have anything to do with the nature of the markup language as such. It was added because there was general concern that if XML was not made available quickly as a way to extend HTML, someone else would come up with a different solution.
  8. The design of XML shall be formal and concise. This focuses on the wording of the actual XML specification. The thought behind it is that the language would be more widely adopted if it was easy to understand and use.
  9. XML documents shall be easy to create. This is a logical extension of point 6. If the document is easy to read by human users, creating the document should be similarly easy. In principle, it is not a problem to create an XML document in a standard text editor. In practice, however, this is a relatively time-consuming process. Since specialised XML editors are not commonly available yet, it is a bit too early to say if this goal has been accomplished.
  10. Terseness in XML markup is of minimal importance. SGML supports a number of minimisation techniques that reduce the amount of typing the author has to do. I have mentioned one of them already: the possibility to leave out end-tags for many elements. This is not allowed in XML.
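Point 10 is easy to try out. The sketch below feeds two fragments to Python's standard XML parser: one with every end-tag in place, and one that leaves the end-tags out in the way SGML minimisation would allow. The helper function is_well_formed is my own name, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source: str) -> bool:
    """Return True if the XML parser accepts the source, False otherwise."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# Every start-tag is matched by an end-tag.
with_end_tags = "<cd><artist>Queen</artist><year>1975</year></cd>"

# The end-tags for artist and year are left out.
without_end_tags = "<cd><artist>Queen<year>1975</cd>"

print(is_well_formed(with_end_tags))
print(is_well_formed(without_end_tags))
```

The parser accepts the first fragment and rejects the second, which is why it pays to get used to closing every tag.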

Some of the points above probably don't make much sense at the moment, and they are not really expected to yet. The purpose of this chapter was to give you, the reader, a general idea about the theory behind markup languages and XML. Like most theory, it will be a lot easier to understand once you have had a little practice with XML. So, if you read through this chapter again after you have been through the examples in the next few chapters, it will make more sense. With the worst theoretical part behind us, let's start on an actual XML example...


