PublicShow sourcesgml.pl -- SGML, XML and HTML parser

This library allows you to parse SGML, XML and HTML data into a Prolog data structure. The library defines several families of predicates:

High-level predicates
Most users will only use load_html/3, load_xml/3 or load_sgml/3 to parse arbitrary input into a DOM structure. These predicates all call load_structure/3, which provides more options and may be used for processing non-standard documents.

The DOM structure can be used by library(xpath) to extract information from the document.

The low-level parser
The actual parser is written in C and consists of two parts: one for processing DTD (Document Type Definitions) and one for parsing data. The data can either be parsed to a Prolog (DOM) term or the parser can perform callbacks for the DOM events.
Utility predicates
Finally, this library provides prmitives for classifying characters and strings according to the XML specification such as xml_name/1 to verify whether an atom is a valid XML name (identifier). It also provides primitives to quote attributes and CDATA elements.
Source dtd(+Type, -DTD) is det
DTD is a DTD object created from the file dtd(Type). Loaded DTD objects are cached. Note that DTD objects may not be shared between threads. Therefore, dtd/2 maintains the pool of DTD objects using a thread_local predicate. DTD objects are destroyed if a thread terminates.
Errors
- existence_error(source_sink, dtd(Type))
Source load_dtd(+DTD, +DtdFile, +Options)
Load DtdFile into a DTD. Defined options are:
dialect(+Dialect)
Dialect to use (xml, xmlns, sgml)
encoding(+Encoding)
Encoding of DTD file
Arguments:
DTD- is a fresh DTD object, normally created using new_dtd/1.
Source load_structure(+Source, -ListOfContent, :Options) is det
Parse Source and return the resulting structure in ListOfContent. Source is handed to open_any/5, which allows for processing an extensible set of input sources.

A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content.

The encoding(+Encoding) option is treated special for compatibility reasons:

  • If Encoding is one of iso-8859-1, us-ascii or utf-8, the stream is opened in binary mode and the option is passed to the SGML parser.
  • If Encoding is present, but not one of the above, the stream is opened in text mode using the given encoding.
  • Otherwise (no Encoding), the stream is opened in binary mode and doing the correct decoding is left to the parser.
Source load_sgml_file(+File, -DOM) is det
Load SGML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_sgml/3.
Source load_xml_file(+File, -DOM) is det
Load XML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_xml/3.
Source load_html_file(+File, -DOM) is det
Load HTML from File and unify the resulting DOM structure with DOM.
deprecated
- New code should use load_html/3.
Source load_html(+Input, -DOM, +Options) is det
Load HTML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
dtd(DTD)
Pass the DTD for HTML as obtained using dtd(html, DTD).
dialect(Dialect)
Current dialect from the Prolog flag html_dialect
max_errors(-1)
syntax_errors(quiet)
Most HTML encountered in the wild contains errors. Even in the context of errors, the resulting DOM term is often a reasonable guess at the intend of the author.

You may also want to use the library(http/http_open) to support loading from HTTP and HTTPS URLs. For example:

:- use_module(library(http/http_open)).
:- use_module(library(sgml)).

load_html_url(URL, DOM) :-
    load_html(URL, DOM, []).
Source load_xml(+Input, -DOM, +Options) is det
Load XML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
  • dialect(xml)
Source load_sgml(+Input, -DOM, +Options) is det
Load SGML text from Input and unify the resulting DOM structure with DOM. Options are passed to load_structure/3, after adding the following default options:
  • dialect(sgml)
Source xml_quote_attribute(+In, -Quoted) is det
Source xml_quote_cdata(+In, -Quoted) is det
Backward compatibility for versions that allow to specify encoding. All characters that cannot fit the encoding are mapped to XML character entities (&#dd;). Using ASCII is the safest value.
Source xml_name(+Atom) is semidet
True if Atom is a valid XML name.
Source xml_basechar(+CodeOrChar) is semidet
Source xml_ideographic(+CodeOrChar) is semidet
Source xml_combining_char(+CodeOrChar) is semidet
Source xml_digit(+CodeOrChar) is semidet
Source xml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Source xml_is_dom(@Term) is semidet
True if term statisfies the structure as returned by load_structure/3 and friends.
Source xml_quote_attribute(+In, -Quoted) is det
Source xml_quote_cdata(+In, -Quoted) is det
Backward compatibility for versions that allow to specify encoding. All characters that cannot fit the encoding are mapped to XML character entities (&#dd;). Using ASCII is the safest value.
Source xml_basechar(+CodeOrChar) is semidet
Source xml_ideographic(+CodeOrChar) is semidet
Source xml_combining_char(+CodeOrChar) is semidet
Source xml_digit(+CodeOrChar) is semidet
Source xml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Source xml_basechar(+CodeOrChar) is semidet
Source xml_ideographic(+CodeOrChar) is semidet
Source xml_combining_char(+CodeOrChar) is semidet
Source xml_digit(+CodeOrChar) is semidet
Source xml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Source xml_basechar(+CodeOrChar) is semidet
Source xml_ideographic(+CodeOrChar) is semidet
Source xml_combining_char(+CodeOrChar) is semidet
Source xml_digit(+CodeOrChar) is semidet
Source xml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816
Source xml_basechar(+CodeOrChar) is semidet
Source xml_ideographic(+CodeOrChar) is semidet
Source xml_combining_char(+CodeOrChar) is semidet
Source xml_digit(+CodeOrChar) is semidet
Source xml_extender(+CodeOrChar) is semidet
XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
See also
- http://www.w3.org/TR/2006/REC-xml-20060816

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

Source open_dtd(Arg1, Arg2, Arg3)
Source iri_xml_namespace(Arg1, Arg2, Arg3)
Source xml_name(Arg1, Arg2)
Source xml_quote_cdata(Arg1, Arg2, Arg3)
Source get_sgml_parser(Arg1, Arg2)
Source new_sgml_parser(Arg1, Arg2)
Source dtd_property(Arg1, Arg2)
Source iri_xml_namespace(Arg1, Arg2)
Source xsd_number_string(Arg1, Arg2)
Source sgml_parse(Arg1, Arg2)
Source free_sgml_parser(Arg1)
Source new_dtd(Arg1, Arg2)
Source load_dtd(Arg1, Arg2)
Source xsd_time_string(Arg1, Arg2, Arg3)
Source sgml_register_catalog_file(Arg1, Arg2)
Source set_sgml_parser(Arg1, Arg2)
Source free_dtd(Arg1)
Source xml_quote_attribute(Arg1, Arg2, Arg3)