rdf_litindex.pl -- Search literals

This module finds literals in the RDF database based on words, stemming and sounds-like (metaphone) matching. The normal user-level predicate is rdf_find_literals/2.

rdf_set_literal_index_option(+Options:list)

Set options for the literal package. Currently defined options are:

verbose(Bool): If true, print progress messages while building the index tables.
index_threads(+Count): Number of threads to use for the initial indexing of literals.
index(+How): How to deal with indexing new literals. How is one of self (execute in the same thread), thread(N) (execute in N concurrent threads) or default (depends on number of cores).
stopgap_threshold(+Count): Add a token to the dynamic stopgap set if it appears in more than Count literals. The default is 50,000.
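
As an illustration, the options above could be set once before the first search. This is a minimal sketch: the option values are arbitrary examples and configure_literal_index/0 is a hypothetical helper name, not part of the library.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% configure_literal_index is a hypothetical helper; values are illustrative.
configure_literal_index :-
    rdf_set_literal_index_option(
        [ verbose(true),            % print progress while building the index
          index_threads(2),         % threads for the initial indexing
          index(self),              % index new literals in the calling thread
          stopgap_threshold(10000)  % lower than the documented default of 50,000
        ]).
```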

rdf_find_literal(+Spec, -Literal) is nondet

rdf_find_literals(+Spec, -Literals) is det

Find literals in the RDF database matching Spec. Spec is defined as:

Spec ::= and(Spec,Spec)
Spec ::= or(Spec,Spec)
Spec ::= not(Spec)
Spec ::= sounds(Like)
Spec ::= stem(Like)             % same as stem(Like, en)
Spec ::= stem(Like, Lang)
Spec ::= prefix(Prefix)
Spec ::= between(Low, High)     % Numerical between
Spec ::= ge(High)               % Numerical greater-equal
Spec ::= le(Low)                % Numerical less-equal
Spec ::= Token

sounds(Like) and stem(Like) both map to a disjunction. First we compile the spec to normal form: a disjunction of conjunctions on elementary tokens. Then we execute all the conjunctions and generate the union using ordered-set algorithms.
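
The two steps described above can be sketched on a toy index. This is not the library's implementation: toy_index/2, spec_dnf/2 and toy_find/2 are hypothetical names, the data is made up, and only and/2, or/2 and plain tokens are handled.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).
:- use_module(library(ordsets)).

% toy_index(?Token, ?Literals): hypothetical stand-in for the token map.
toy_index(amsterdam, ['Amsterdam harbour', 'Amsterdam zoo']).
toy_index(harbour,   ['Amsterdam harbour', 'Rotterdam harbour']).
toy_index(zoo,       ['Amsterdam zoo']).

% spec_dnf(+Spec, -DNF): DNF is a list of conjunctions,
% each conjunction being a list of elementary tokens.
spec_dnf(or(A, B), DNF) :-
    !,
    spec_dnf(A, DA),
    spec_dnf(B, DB),
    append(DA, DB, DNF).
spec_dnf(and(A, B), DNF) :-
    !,
    spec_dnf(A, DA),
    spec_dnf(B, DB),
    % Distribute and over or: combine every conjunction of A with every one of B.
    findall(C, ( member(CA, DA), member(CB, DB), append(CA, CB, C) ), DNF).
spec_dnf(Token, [[Token]]) :-
    atomic(Token).

% token_set(+Token, -Set): ordered set of literals for Token ([] if unknown).
token_set(Token, Set) :-
    (   toy_index(Token, Literals)
    ->  list_to_ord_set(Literals, Set)
    ;   Set = []
    ).

% conjunction_set(+Tokens, -Set): literals that contain all Tokens.
conjunction_set([Token|Tokens], Set) :-
    token_set(Token, Set0),
    foldl(intersect_token, Tokens, Set0, Set).

intersect_token(Token, Set0, Set) :-
    token_set(Token, Set1),
    ord_intersection(Set0, Set1, Set).

% toy_find(+Spec, -Literals): union of the results of all conjunctions.
toy_find(Spec, Literals) :-
    spec_dnf(Spec, DNF),
    maplist(conjunction_set, DNF, Sets),
    ord_union(Sets, Literals).
```

With this toy data, toy_find(and(amsterdam, or(harbour, zoo)), L) first produces the normal form [[amsterdam,harbour],[amsterdam,zoo]] and then binds L to ['Amsterdam harbour','Amsterdam zoo'].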

Stopgap tokens in Spec are ignored. If Spec reduces to nothing but a stopgap, the predicate fails.

To be done: Exploit the ordering of numbers and allow for > N, < N, etc.
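
A possible way to call both predicates is sketched below. The example.org URIs, the literal texts and the helper names are assumptions made for this sketch. No result is shown because it depends on the tokenization and on the contents of the database.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Hypothetical example data; the URIs are made up for this sketch.
load_demo_literals :-
    rdf_assert('http://example.org/a', 'http://example.org/label',
               literal('Amsterdam harbour')),
    rdf_assert('http://example.org/b', 'http://example.org/label',
               literal('Rotterdam harbour')).

% All literals that match a single token, collected in one list (det).
literals_with_token(Token, Literals) :-
    rdf_find_literals(Token, Literals).

% Combine a plain token with a prefix search.
harbour_literals(Prefix, Literals) :-
    rdf_find_literals(and(harbour, prefix(Prefix)), Literals).

% Enumerate stem matches one by one on backtracking (nondet).
print_stem_matches(Word) :-
    forall(rdf_find_literal(stem(Word), Literal),
           format("~q~n", [Literal])).
```

After load_demo_literals, a query such as literals_with_token(harbour, Ls) searches the token index, which is built on first use.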

rdf_token_expansions(+Spec, -Extensions)

Determine which extensions of a token contribute to finding literals.

rdf_delete_literal_index(+Type)

Fully delete the literal index of the given Type.

rdf_tokenize_literal(+Literal, -Tokens) is semidet

Tokenize a literal. We make this hookable as tokenization is generally domain dependent.
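
The sketch below shows a direct call and one way a domain-specific tokenizer might be plugged in. It assumes that the hook is the multifile predicate rdf_litindex:tokenization/2, which the text above does not name. The 'code:' convention and show_tokens/1 are invented for the example.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Assumption: the tokenization hook is rdf_litindex:tokenization/2.
:- multifile rdf_litindex:tokenization/2.

% Keep hypothetical product codes such as 'code:AB-12' as a single token.
% Other literals fall through to the default tokenizer.
rdf_litindex:tokenization(Literal, [Code]) :-
    atom(Literal),
    atom_concat('code:', Code, Literal).

% show_tokens(+Literal): print the tokens produced for Literal.
show_tokens(Literal) :-
    rdf_tokenize_literal(Literal, Tokens),
    format("~q -> ~q~n", [Literal, Tokens]).
```

Because the hook clause only accepts atoms that start with 'code:', ordinary text such as show_tokens('Amsterdam harbour') still uses the default tokenization.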

rdf_stopgap_token(-Token) is nondet

True when Token is a stopgap token. Currently, this implies one of:

exclude_from_index(token, Token) is true
default_stopgap(Token) is true
Token is an atom of length 1
Token was added to the dynamic stopgap token set because it appeared in more than stopgap_threshold literals.
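
The first condition can be used to declare domain-specific stopgaps. The sketch below assumes that exclude_from_index/2 is a multifile hook in module rdf_litindex. The chosen tokens and is_stopgap/1 are illustrative only.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Assumption: exclude_from_index/2 is a multifile hook in rdf_litindex.
:- multifile rdf_litindex:exclude_from_index/2.

% Treat some company suffixes as stopgaps (illustrative choice).
rdf_litindex:exclude_from_index(token, inc).
rdf_litindex:exclude_from_index(token, ltd).

% is_stopgap(+Token): semidet check for a known token.
is_stopgap(Token) :-
    rdf_stopgap_token(Token),
    !.
```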

rdf_literal_index(+Type, -Index) is det

True when Index is a literal map containing the index of Type. Type is one of:

token: Tokens are basically words of literal values. See rdf_tokenize_literal/2. The token map maps tokens to full literal texts.
stem: Index of stemmed tokens. If the language is available, the tokens are stemmed using the matching snowball stemmer. The stem map maps stemmed tokens to full tokens.
metaphone: Phonetic index of tokens. The metaphone map maps phonetic keys to tokens.
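
The returned maps can be inspected with the literal map predicates of library(semweb/rdf_db). The following sketch queries the token map. The helper names are hypothetical, and the results depend on the literals currently in the database.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% tokens_with_prefix(+Prefix, -Tokens): indexed tokens that start with Prefix.
tokens_with_prefix(Prefix, Tokens) :-
    rdf_literal_index(token, Map),
    rdf_keys_in_literal_map(Map, prefix(Prefix), Tokens).

% token_literals(+Token, -Literals): literal texts associated with Token.
token_literals(Token, Literals) :-
    rdf_literal_index(token, Map),
    rdf_find_literal_map(Map, [Token], Literals).
```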
