Now that you're all warmed up with parsers and have enough knowledge to make you slightly dangerous, we'll analyze one of the two important styles of XML processing: event streams. We'll look at some examples that show the basic theory of stream processing and graduate with a full treatment of the standard Simple API for XML (SAX).
In the world of computer science, a stream is a sequence of data chunks to be processed. A file, for example, is a sequence of characters (one or more bytes each, depending on the encoding). A program using this data can open a filehandle to the file, creating a character stream, and it can choose to read in data in chunks of whatever size it chooses. Streams can be dynamically generated too, whether from another program, received over a network, or typed in by a user. A stream is an abstraction, making the source of the data irrelevant for the purpose of processing.
To summarize, here are a stream's important qualities:
It consists of a sequence of data fragments.
The order of fragments transmitted is significant.
The source of data (e.g., file or program output) is not important.
XML streams are just clumpy character streams. Each data clump, called a token in parser parlance, is a conglomeration of one or more characters. Each token corresponds to a type of markup, such as an element start or end tag, a string of character data, or a processing instruction. It's very easy for parsers to dice up XML in this way, requiring minimal resources and time.
What makes XML streams different from character streams is that the context of each token matters; you can't just pump out a stream of random tags and data and expect an XML processor to make sense of it. For example, a stream of ten start tags followed by no end tags is not very useful, and definitely not well-formed XML. Any data that isn't well-formed will be rejected. After all, the whole purpose of XML is to package data in a way that guarantees the integrity of a document's structure and labeling, right?
These contextual rules are helpful to the parser as well as the front-end processor. XML was designed to be very easy to parse, unlike other markup languages that can require look-ahead or look-behind. For example, SGML does not have a rule requiring nonempty elements to have an end tag. To know when an element ends requires sophisticated reasoning by the parser. This requirement leads to code complexity, slower processing speed, and increased memory usage.
Copyright © 2002 O'Reilly & Associates. All rights reserved.