uri.pl -- Process URIs

This library provides high-performance C-based primitives for manipulating URIs. We chose a C-based implementation because it performs much better on raw character manipulation, and URI handling primitives are used in time-critical parts of RDF processing. The implementation is based on RFC 3986:

http://labs.apache.org/webarch/uri/rfc/rfc3986.html

The URI processing in this library is rather liberal: we break URIs into components according to the rules, but we do not validate the components themselves. Percent-decoding for IRIs is liberal as well: it first tries UTF-8, then ISO-Latin-1, and finally accepts % characters verbatim.

Earlier experience has shown that strict enforcement of the URI syntax raises errors on many URIs that other web-document processing tools accept.

uri_components(+URI, -Components) is det

uri_components(-URI, +Components) is det

Break a URI into its 5 basic components according to the RFC-3986 regular expression:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

Arguments:

Components - is a term uri_components(Scheme, Authority, Path, Search, Fragment). If a URI is parsed, i.e., using mode (+,-), components that are not found are left uninstantiated (variable). See uri_data/3 for accessing this structure.
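
As a minimal sketch of both modes (the URI is made up for illustration):

?- uri_components('http://www.example.com/a/b?x=1#frag', C).
C = uri_components(http, 'www.example.com', '/a/b', 'x=1', frag).

?- uri_components(URI, uri_components(http, 'www.example.com', '/a/b', 'x=1', frag)).
URI = 'http://www.example.com/a/b?x=1#frag'.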

uri_data(?Field, +Components, ?Data) is semidet

Provide access to the uri_components structure. Defined field names are: scheme, authority, path, search and fragment.

uri_data(+Field, +Components, +Data, -NewComponents) is semidet

NewComponents is the same as Components with Field set to Data.
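
A small sketch of how uri_components/2 and uri_data/4 might be combined to replace a single component. The helper uri_with_path/3 is a hypothetical name used only for illustration; it is not part of uri.pl.

% Hypothetical helper, not part of uri.pl:
% URI is URI0 with its path component replaced by Path.
uri_with_path(URI0, Path, URI) :-
        uri_components(URI0, Components0),
        uri_data(path, Components0, Path, Components),
        uri_components(URI, Components).

?- uri_with_path('http://www.example.com/a/b?x=1', '/c', URI).
URI = 'http://www.example.com/c?x=1'.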

uri_normalized(+URI, -NormalizedURI) is det

NormalizedURI is the normalized form of URI. Normalization is syntactic and involves the following steps (section numbers refer to RFC 3986):

6.2.2.1. Case Normalization
6.2.2.2. Percent-Encoding Normalization
6.2.2.3. Path Segment Normalization
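
Normalization is typically useful for comparing URIs that differ only in spelling. A minimal sketch, assuming a helper named same_uri/2 that is hypothetical and not part of uri.pl:

% Hypothetical helper, not part of uri.pl:
% true if both URIs have the same normalized form.
same_uri(URI1, URI2) :-
        uri_normalized(URI1, Normalized),
        uri_normalized(URI2, Normalized).

?- same_uri('http://www.example.com/a/./b/../c', 'http://www.example.com/a/c').
true.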

iri_normalized(+IRI, -NormalizedIRI) is det

NormalizedIRI is the normalized form of IRI. Normalization is syntactic and involves the following steps (section numbers refer to RFC 3986):

6.2.2.1. Case Normalization
6.2.2.3. Path Segment Normalization

See also: - This is similar to uri_normalized/2, but does not do normalization of %-escapes.

uri_normalized_iri(+URI, -NormalizedIRI) is det

As uri_normalized/2, but percent-encoding is translated into IRI Unicode characters. The translation is liberal: valid UTF-8 sequences of %-encoded bytes are mapped to the Unicode character. Other %XX-sequences are mapped to the corresponding ISO-Latin-1 character and sole % characters are left untouched.

See also: - uri_iri/2.

uri_is_global(+URI) is semidet

True if URI has a scheme. The semantics is the same as the code below, but the implementation is more efficient as it does not need to parse the other components, nor needs to bind the scheme.

uri_is_global(URI) :-
        uri_components(URI, Components),
        uri_data(scheme, Components, Scheme),
        nonvar(Scheme).
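
For instance, with made-up URIs:

?- uri_is_global('http://www.example.com/a/b').
true.

?- uri_is_global('a/b').
false.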

uri_resolve(+URI, +Base, -GlobalURI) is det

Resolve a possibly local URI relative to Base. This implements http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-transform
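
The queries below use two of the reference-resolution examples from RFC 3986, section 5.4.1. The answers shown are the results that the RFC specifies for these inputs:

?- uri_resolve('g', 'http://a/b/c/d;p?q', URI).
URI = 'http://a/b/c/g'.

?- uri_resolve('../g', 'http://a/b/c/d;p?q', URI).
URI = 'http://a/b/g'.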

uri_normalized(+URI, +Base, -NormalizedGlobalURI) is det

NormalizedGlobalURI is the normalized global version of URI. Behaves as if defined by:

uri_normalized(URI, Base, NormalizedGlobalURI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized(GlobalURI, NormalizedGlobalURI).

iri_normalized(+IRI, +Base, -NormalizedGlobalIRI) is det

NormalizedGlobalIRI is the normalized global version of IRI. This is similar to uri_normalized/3, but does not do %-escape normalization.

uri_normalized_iri(+URI, +Base, -NormalizedGlobalIRI) is det

NormalizedGlobalIRI is the normalized global IRI of URI. Behaves as if defined by:

uri_normalized_iri(URI, Base, NormalizedGlobalIRI) :-
        uri_resolve(URI, Base, GlobalURI),
        uri_normalized_iri(GlobalURI, NormalizedGlobalIRI).

uri_query_components(+String, -Query) is det

uri_query_components(-String, +Query) is det

Perform encoding and decoding of a URI query string. Query is a list of fully decoded (Unicode) Name=Value pairs. In mode (-,+), query elements of the forms Name(Value) and Name-Value are also accepted to enhance interoperability with the option and pairs libraries. E.g.

?- uri_query_components(QS, [a=b, c('d+w'), n-'VU Amsterdam']).
QS = 'a=b&c=d%2Bw&n=VU%20Amsterdam'.

?- uri_query_components('a=b&c=d%2Bw&n=VU%20Amsterdam', Q).
Q = [a=b, c='d+w', n='VU Amsterdam'].

uri_authority_components(+Authority, -Components) is det

uri_authority_components(-Authority, +Components) is det

Break down the authority component of a URI. The fields of the structure Components can be accessed using uri_authority_data/3.

uri_authority_data(+Field, ?Components, ?Data) is semidet

Provide access to the uri_authority structure. Defined field names are: user, password, host and port.
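
A short sketch of splitting an authority and reading one field; the authority string is made up for illustration:

?- uri_authority_components('user:secret@www.example.com:8080', C).
C = uri_authority(user, secret, 'www.example.com', 8080).

?- uri_authority_components('user:secret@www.example.com:8080', C),
   uri_authority_data(host, C, Host).
C = uri_authority(user, secret, 'www.example.com', 8080),
Host = 'www.example.com'.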

uri_encoded(+Component, +Value, -Encoded) is det

uri_encoded(+Component, -Value, +Encoded) is det

Encoded is the URI encoding for Value. When encoding (Value->Encoded), Component specifies the URI component where the value is used. It is one of query_value, fragment or path. Besides alphanumerical characters, the following characters are passed verbatim (the set is split in logical groups according to RFC3986).

query_value, fragment: "-._~" | "!$'()*,;" | ":@" | "/?"
path: "-._~" | "!$&'()*,;=" | ":@" | "/"
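
The effect of the Component argument can be illustrated with a made-up value containing a space and an ampersand. According to the sets above, "&" is escaped in a query value but passed verbatim in a path:

?- uri_encoded(query_value, 'a b&c', Encoded).
Encoded = 'a%20b%26c'.

?- uri_encoded(path, 'a b&c', Encoded).
Encoded = 'a%20b&c'.

?- uri_encoded(query_value, Value, 'a%20b%26c').
Value = 'a b&c'.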

uri_iri(+URI, -IRI) is det

uri_iri(-URI, +IRI) is det

Convert between a URI, encoded in US-ASCII, and an IRI. An IRI is a fully expanded Unicode string. When converting an IRI to a URI, Unicode strings are first encoded into UTF-8, after which %-encoding takes place.

Errors: - syntax_error(Culprit) in mode (+,-) if URI is not a legally percent-encoded UTF-8 string.
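
As an illustration with a made-up IRI, the character é (U+00E9) is encoded in UTF-8 as the bytes C3 A9, which are then %-encoded:

?- uri_iri(URI, 'http://www.example.com/café').
URI = 'http://www.example.com/caf%C3%A9'.

?- uri_iri('http://www.example.com/caf%C3%A9', IRI).
IRI = 'http://www.example.com/café'.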

uri_file_name(+URI, -FileName) is semidet

uri_file_name(-URI, +FileName) is det

Convert between a URI and a local file name. This protocol is covered by RFC 1738. Please note that file URIs use absolute paths. The mode (-,+) translates a possibly relative path into an absolute one.
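
A minimal sketch of both modes, assuming a Unix-like system and a made-up absolute path:

?- uri_file_name('file:///tmp/data.txt', FileName).
FileName = '/tmp/data.txt'.

?- uri_file_name(URI, '/tmp/data.txt').
URI = 'file:///tmp/data.txt'.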