HTML Parsing

HtmlParser is a concrete subclass of StreamFilter. It uses an HtmlScanner to scan the stream of characters in the HTML document, and it builds the parse tree from the tokens the scanner returns. For each token type, HtmlParser has an instance method whose selector is the same as the type. For example, when a token has the type #tag, the message #tag is sent to the HtmlParser. For more information, see the method #parse: and the methods in the category 'token type actions'.
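One common way to turn a token type into a message send in Smalltalk is #perform:, which takes a Symbol and sends it as a unary message. Whether HtmlParser does exactly this inside #parse: is an assumption. The workspace fragment below is only a sketch of the mechanism; the String receiver and the three selectors are stand-ins and are not part of the parser.

```smalltalk
| receiver tokenTypes |

"Stand-in receiver, chosen only because it understands the selectors below.
 In the parser, the receiver would be the HtmlParser itself."
receiver := 'html'.

"Stand-in 'token types': each Symbol is also the selector of a unary method."
tokenTypes := #(#size #isEmpty #asUppercase).

"Send each Symbol to the receiver as a message and show the answer."
tokenTypes do: [:type |
    Transcript
        show: type printString, ' -> ', (receiver perform: type) printString;
        cr]
```

With this kind of dispatch, supporting a new token type only requires adding a method with the matching selector.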

HtmlDtd defines the HTML elements and entities of the HTML DTD [http://www11.w3.org/hypertext/WWW/MarkUp/HTML.dtd.html]. It also holds other information needed by the parser. Each HTML element in HtmlDtd is described by an entry of the form (name, content, attributes, parents, mappedObjectClass, hTextFlag). name is the symbol of the tag being defined. content and attributes are as given in the official DTD. parents lists the tags that can be the parent of this tag; if parents is nil, any tag can be its parent. mappedObjectClass is the class symbol of the object the parser generates when this tag is parsed. hTextFlag specifies whether the element can be rendered into the same paragraph as other tags.

#(#dd #empty #() #(#dl) #WebComposite false)

Above is an example of the definition of an HTML element. DL stands for definition list. It is usually used with pairs of DT and DD, where DT is the term and DD is its definition. Because DD can only be contained inside a DL, its only possible parent tag is DL. The node the parser generates for a DD belongs to the class WebComposite. Every new definition starts on a new line when displayed to the user, so DD should not be combined with other tags in the same paragraph after rendering; that is why hTextFlag is set to false. There are no attributes that can be used with DD, so that collection is empty. The content value #empty is defined by the standard HTML DTD.
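To make the field positions concrete, the workspace fragment below reads the DD definition back field by field and applies the parent rule (a nil parents field allows any parent). It is a minimal sketch: the labels array and the canBeChildOf block are hypothetical helpers written for this illustration and are not part of HtmlDtd.

```smalltalk
| entry labels canBeChildOf |

"The DD definition from the text, as a literal Array."
entry := #(#dd #empty #() #(#dl) #WebComposite false).

"Field names in the order used by each HtmlDtd entry."
labels := #('name' 'content' 'attributes' 'parents' 'mappedObjectClass' 'hTextFlag').

1 to: labels size do: [:i |
    Transcript
        show: (labels at: i), ': ', (entry at: i) printString;
        cr].

"Hypothetical helper: may anEntry appear inside parentTag?
 Field 4 is parents; nil means that any tag is an acceptable parent."
canBeChildOf := [:anEntry :parentTag |
    (anEntry at: 4) isNil or: [(anEntry at: 4) includes: parentTag]].

Transcript
    show: (canBeChildOf value: entry value: #dl) printString; cr;   "DD inside DL"
    show: (canBeChildOf value: entry value: #p) printString; cr     "DD inside P"
```

The first check answers true and the second false, because #dl is the only symbol in the parents field of the DD entry.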

HtmlScanner gets its information about HTML elements and entities from HtmlDtd. HtmlScanner uses HtmlScannerStates to represent part of its current state and behavior. These State [ ralph's book ] objects extract tokens and determine the next state. Each state must implement the method #handle, which reads and processes a character from the inputStream. A state may append the character to the buffer, tell the HtmlScanner to change state, or create a token.

Below is the implementation of #handle in StateAttr.

    handle
        "Accumulating an attribute, until '=' or '>' ."

        self getNonBlankChar.
        scanner hereChar = $>
            ifTrue:
                [self changeState: StateText.
                ^scanner endOneBlockWithType: #attribute value: scanner buffer asLowercase].
        scanner hereChar = $=
            ifTrue:
                [scanner endOneBlockWithType: #attribute value: scanner buffer asLowercase.
                ^self changeState: StateEquals].
        ^scanner appendHereCharToBuffer

StateAttr first reads the next character that is not a space, cr, or lf. It compares that character with $> and $= to decide whether it is time to create a token and change to the next state. Otherwise, it appends the character to the scanner's buffer, which is used later.
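The three outcomes of a state handler (append to the buffer, create a token, change state) can be tried out in a workspace without the scanner classes. The fragment below is a sketch under stated assumptions: it swaps the State subclasses for a Dictionary of blocks keyed by state name, represents a token as an Association, and uses a made-up input string. None of these names come from HtmlScanner.

```smalltalk
| input buffer tokens state handlers |

input := ReadStream on: '  Href=index.html>'.   "made-up input"
buffer := WriteStream on: String new.
tokens := OrderedCollection new.
state := #attr.
handlers := Dictionary new.

"Block standing in for StateAttr>>handle: create a token on $> or $=,
 otherwise accumulate the character."
handlers at: #attr put: [:char |
    char = $>
        ifTrue:
            [state := #text.
            tokens add: #attribute -> buffer contents asLowercase]
        ifFalse:
            [char = $=
                ifTrue:
                    [tokens add: #attribute -> buffer contents asLowercase.
                    state := #equals]
                ifFalse: [buffer nextPut: char]]].

"Keep handling characters while the current state has a handler.
 Only #attr is defined here, so the loop stops at the first state change."
[(handlers includesKey: state) and: [input atEnd not]] whileTrue:
    [input skipSeparators.
    input atEnd ifFalse: [(handlers at: state) value: input next]].

Transcript
    show: tokens printString; cr;
    show: state printString; cr
```

For this input, the loop ends with one #attribute token whose value is 'href' and with the state set to #equals. Separate state classes, as used by the scanner, keep each state's logic in its own #handle method instead of in one shared table.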

HtmlParser gets a token by sending HtmlScanner the message #nextToken. HtmlScanner keeps sending #handle to its current state object until a token is found. The code for HtmlScanner>>nextToken is shown below.

    nextToken
        "Returns the next token from the inputStream."

        self clearToken.
        currentState handle.
        [self isTokenFound] whileFalse: [currentState handle].
        self isEndOfInput ifTrue: [^nil].
        ^self token