EternalLines.com: Components: HTML Parser: Documentation 


ThtmlTokenType

The following lexical tokens are recognised:

htEOF                Returned if no more tokens are available.
htText               Plain text.
htCharRef            Character reference, e.g. &#1234;
htCharRefHex         Hex character reference, e.g. &#x1A2B;
htEntityRef          Entity reference, e.g. &amp;
htStartTag           Beginning of start tag.
htEndTag             End tag.
htLineBreak          A line-break sequence.
htTagAttrName        Tag attribute name.
htTagAttrValueStart  Start of a tag attribute value. It is followed by
                     text tokens, then by htTagAttrValueEnd.
htTagAttrValueEnd    End of a tag attribute value.
htEmptyTag           Empty tag.
htComment            HTML comment.
htCommentEnd         HTML comment end.
htEmptyComment       Empty comment, i.e. <!>
htPITarget           Start of Processing Instruction (PI) tag.
htPI                 PI information.
htDeclaration        Start of SGML declaration tag.
htDeclarationText    Declaration tag information.
htCDATA              CDATA marked section.


TelHTMLLexicalParser

  Properties

    Before parsing

    property  Options: ThtmlLexicalParserOptions

                  loResolveReferences - If this option is active, resolvable
                  references (e.g. &amp;) are returned as text tokens instead
                  of reference tokens.

    property  Text: String
    property  FileName: String

                  Text or FileName specifies the source of the HTML text.
                  If the HTML text is in a file, set the FileName property;
                  otherwise, if it is in memory, set the Text property.

    property  Encoding: String

                  If the character encoding of the HTML text is known, set the
                  Encoding property to the name of the encoding, for
                  example 'ISO-8859-1'. If the encoding is unknown, leave this
                  property blank, in which case the Parser will attempt to
                  automatically detect the encoding.
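
                  As a minimal sketch of the properties above, the lexical
                  parser could be configured to read a file whose encoding
                  is known. The file name 'page.html' and the procedure name
                  are hypothetical, and the set syntax used for Options is
                  an assumption, since this page does not show how
                  ThtmlLexicalParserOptions is declared. The unit that
                  declares TelHTMLLexicalParser must be in the uses clause.

```pascal
procedure ConfigureLexicalParser(Parser: TelHTMLLexicalParser);
begin
  // Assumption: ThtmlLexicalParserOptions is a set type.
  // With this option, resolvable references such as &amp; are
  // returned as text tokens instead of reference tokens.
  Parser.Options := [loResolveReferences];

  // Read the HTML from a file (hypothetical file name) ...
  Parser.FileName := 'page.html';

  // ... whose encoding is known, so automatic detection is not needed.
  Parser.Encoding := 'ISO-8859-1';
end;
```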

    During parsing

    property  TokenType: ThtmlTokenType

                  Returns the current token type.

    property  TokenStr: String
    property  TokenWideStr: WideString

                  Returns the string associated with the current token.
                  Its meaning differs depending on the value of TokenType.

    property  TagID: Integer
    property  AttributeID: Integer

                  Returns an integer ID for the last tag or tag-attribute
                  encountered. See unit cHTMLUtils.pas for a list of
                  ID values. All known HTML tags and tag-attributes have
                  a defined ID. Use the integer IDs instead of the actual names
                  for better performance.


  Methods

    procedure Reset;

                  Call Reset to restart parsing from the first character.

    procedure Abort;

                  Abort can be called from any event handler to stop the
                  current call to Parse at the current token.

    procedure GetNextToken;
    procedure Parse;

                  Two methods of parsing HTML text are available:

                  i) Call Parse, which will repeatedly call
                     GetNextToken until the end of the text is reached.

                  ii) Manually call GetNextToken for every token.

                  Both methods will fire events.
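
                  The second method can be sketched as a loop that calls
                  GetNextToken until htEOF is returned. This is only an
                  illustration: the procedure name is hypothetical, and it
                  assumes that TokenType and TokenStr are valid immediately
                  after each call to GetNextToken. The unit that declares
                  TelHTMLLexicalParser must be in the uses clause.

```pascal
procedure ListTextTokens(const HTML: String);
var
  Parser: TelHTMLLexicalParser;
begin
  Parser := TelHTMLLexicalParser.Create(nil);
  try
    Parser.Text := HTML;

    // Fetch the first token, then loop until no more tokens are available
    Parser.GetNextToken;
    while Parser.TokenType <> htEOF do
    begin
      // Print plain text tokens only; all other token types are skipped
      if Parser.TokenType = htText then
        WriteLn(Parser.TokenStr);
      Parser.GetNextToken;
    end;
  finally
    // Free the parser even if an exception is raised while parsing
    Parser.Free;
  end;
end;
```

                  Compared with Parse, the manual loop lets the caller
                  inspect each token directly and stop at any point
                  without needing an event handler.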


  Events

    property OnToken;
    property OnTokenStr;
    property OnTokenWideStr;

                  The OnToken events are called for every token encountered.
                  OnTokenStr and OnTokenWideStr are called with
                  the token string as a parameter.

    property OnText;
    property OnWideText;
    property OnContentText;
    property OnContentWideText;

                  The OnText events are called for every text token.
                  OnContentText is only called for text that appears
                  as content.

    property OnStartTag;
    property OnStartTagStr;
    property OnEndTag;
    property OnEndTagStr;

                  The OnTag events are called for every start or
                  end tag encountered.

    property OnTagAttr;
    property OnTagAttrStr;
    property OnTagAttrValue;
    property OnTagAttrValueWide;

                  OnTagAttr is called when a tag attribute name is
                  encountered. OnTagAttrValue is called when the
                  attribute value is completely parsed.

    property OnComment;
    property OnCommentWide;

                  The OnComment events are called when HTML comment tags
                  are encountered.
    	


Lexical Parser Example

procedure Parse;
var
  Parser: TelHTMLLexicalParser;
begin
  // Create the parser component
  Parser := TelHTMLLexicalParser.Create(nil);
  try
    // Set properties
    Parser.Text := '<HTML><BODY>Example HTML text</BODY></HTML>';

    // Set event handlers (OnParserToken is a handler defined elsewhere)
    Parser.OnToken := OnParserToken;

    // Parse all tokens
    Parser.Parse;
  finally
    // Destroy the component, even if parsing raises an exception
    Parser.Free;
  end;
end;
        


TelHTMLParser

  Properties

    Before parsing

    property Options: ThtmlParserOptions

                  poDisableNotifications - If this option is active, no
                  notification events will be triggered during parsing.

                  poStopOnError - If this option is set, parsing will
                  halt with an exception if an error is encountered. If this
                  option is not set, errors are reported but ignored.

                  poDontProduceDocument - Set this option if you do not
                  want to produce an HTML Document Object.

    property Text: String
    property FileName: String

                  Text or FileName specifies the source of the HTML text.
                  If the HTML text is in a file, set the FileName property;
                  otherwise, if it is in memory, set the Text property.

    property Encoding: String

                  If the character encoding of the HTML text is known, set the
                  Encoding property to the name of the encoding, for
                  example 'ISO-8859-1'. If the encoding is unknown, leave this
                  property blank, in which case the Parser will attempt to
                  automatically detect the encoding.

    During parsing

    property State: ThtmlParserState

                  The state object is available during parsing and can be used
                  by event handlers to get state information.


  Methods

    function ParseDocument: ThtmlDocument

                  ParseDocument parses the HTML text.
                  If poDisableNotifications is not set, the function
                  triggers event handlers while parsing.
                  If poDontProduceDocument is not set, the function
                  returns the HTML Document Object. 
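
                  A call to ParseDocument might look like the sketch below.
                  It rests on several assumptions that this page does not
                  confirm: that TelHTMLParser is created with Create(nil)
                  like the lexical parser, that ThtmlParserOptions is a set
                  type, and that SysUtils is in the uses clause (for
                  Exception). The procedure name is hypothetical.

```pascal
procedure ParseExample;
var
  Parser: TelHTMLParser;
  Doc: ThtmlDocument;
begin
  // Assumption: created like the lexical parser component
  Parser := TelHTMLParser.Create(nil);
  try
    Parser.Text := '<HTML><BODY>Example HTML text</BODY></HTML>';

    // Assumption: ThtmlParserOptions is a set type.
    // poStopOnError makes parsing halt with an exception on errors.
    Parser.Options := [poStopOnError];

    try
      Doc := Parser.ParseDocument;
      // Use Doc here. Who is responsible for freeing the document is
      // not stated on this page, so check the source before freeing it.
    except
      on E: Exception do
        WriteLn('Parsing failed: ', E.Message);
    end;
  finally
    Parser.Free;
  end;
end;
```

                  Without poStopOnError, errors are reported but ignored,
                  so the try..except block would matter less.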


  Events

    property OnMessage;

                  The OnMessage event is called to report human-readable
                  messages from the parser. A message is either a warning
                  or an error.

    property OnToken;
    property OnTokenStr;
    property OnTokenWideStr;

                  The OnToken events are called for every token encountered.
                  OnTokenStr and OnTokenWideStr are called with
                  the token string as a parameter.

    property OnStartTag;
    property OnStartTagAttributes;
    property OnEndTag;

                  The On***Tag events are triggered when the parser
                  encounters tags.

    property OnResolveEntityRef;
    property OnText;
    property OnUnresolvedText;

                  These events are triggered for document content. The
                  event handler has the option to resolve/modify the text
                  before the document objects are created.

    property OnElementOpen;
    property OnElementClose;

                  The OnElementOpen and OnElementClose events are
                  triggered when actual Element Objects are opened and
                  closed.

    property OnDocumentObject;

                  OnDocumentObject is triggered for every Document Object
                  created, at all levels of nesting.