3.2.12 Memory management considerations
Storing RDF triples in main memory provides much better performance than using external databases. Unfortunately, although memory is fairly cheap these days, main memory is severely limited compared to disks. Memory usage breaks down into the following categories. Rough estimates of the memory usage are given for 64-bit systems; 32-bit systems use slightly more than half these amounts.
- Actually storing the triples. A triple is stored in a C struct of 144 bytes. This struct holds the quintuple, some bookkeeping information and the 10 next-pointers for the (at most) 10 hash tables.
- The bucket array for the hashes. Each bucket maintains a head and a tail pointer, as well as a count of the number of entries. The bucket array is allocated when a particular index is created, which happens at the first query that requires the index. Each bucket requires 24 bytes.
Bucket arrays are resized when necessary. Old triples remain at their original location, which implies that a query may need to scan multiple buckets. The garbage collector may relocate old indexed triples by copying them; the old copy is later reclaimed by GC. The space occupied by reindexed triples is reused, but many reindexed triples may result in significant memory fragmentation.
- Resources are maintained in a separate table to support rdf_resource/1. A resource requires approximately 32 bytes.
- Identical literals are shared (see rdf_current_literal/1) and stored in a skip list. A literal requires approximately 40 bytes, excluding the atom used for the lexical representation.
- Resources are stored in the Prolog atom-table. Atoms with the average length of a resource require approximately 88 bytes.
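Putting these estimates together gives a rough, back-of-the-envelope cost model. The sketch below is illustrative only: the predicate name is made up, the constants are the 64-bit figures quoted above, and real usage also depends on which indexes are actually created and on atom sharing.

    % Rough 64-bit memory estimate from the figures above: 144 bytes per
    % triple, 24 bytes per hash bucket, 32 bytes per resource and 40 bytes
    % per shared literal (excluding the atoms themselves).
    estimated_rdf_memory(Triples, Buckets, Resources, Literals, Bytes) :-
        Bytes is Triples*144 + Buckets*24 + Resources*32 + Literals*40.

For example, 10 million triples alone account for roughly 1.4 GB before buckets, resources, literals and atoms are added.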
The hash parameters can be controlled with rdf_set/1. Applications that are tight on memory and for which the query characteristics are more or less known can optimize performance and memory usage by fixing the hash-tables. By fixing the hash-tables we can tailor them to the frequent query patterns, avoid the need to check multiple hash buckets (see above) and avoid memory fragmentation due to optimizing triples for resized hashes.
    set_hash_parameters :-
        rdf_set(hash(s,   size, 1048576)),
        rdf_set(hash(p,   size, 1024)),
        rdf_set(hash(sp,  size, 2097152)),
        rdf_set(hash(o,   size, 1048576)),
        rdf_set(hash(po,  size, 2097152)),
        rdf_set(hash(spo, size, 2097152)),
        rdf_set(hash(g,   size, 1024)),
        rdf_set(hash(sg,  size, 1048576)),
        rdf_set(hash(pg,  size, 2048)).
- [det]rdf_set(+Term)
- Set properties of the RDF store. Currently defines:
- hash(+Hash, +Parameter, +Value)
- Set properties for a triple index. Hash is one of s, p, sp, o, po, spo, g, sg or pg. Parameter is one of:
- size
- Value defines the number of entries in the hash-table. Value is rounded down to a power of 2. After setting the size explicitly, auto-sizing for this table is disabled. Setting the size smaller than the current size results in a permission_error exception.
- average_chain_len
- Set the maximum average collision chain length for the hash.
- optimize_threshold
- Related to resizing hash-tables. If 0, all triples are moved to the new size by the garbage collector. If more than zero, triples from the last Value resize steps remain at their current location. Leaving triples at their current location reduces memory fragmentation but slows down access.
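As an illustration, a clause in the style of set_hash_parameters above can combine these parameters for a single index. The values below are assumptions chosen for the example, not recommendations; 2 is the documented default for optimize_threshold.

    :- use_module(library(semweb/rdf_db)).

    tune_po_hash :-
        rdf_set(hash(po, size, 2097152)),         % fix size; disables auto-sizing
        rdf_set(hash(po, average_chain_len, 3)),  % tolerate ~3 collisions per bucket
        rdf_set(hash(po, optimize_threshold, 2)). % skip reindexing for the last 2 resizes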
The garbage collector
The RDF store has a garbage collector that runs in a separate thread named __rdf_GC. The garbage collector removes the following objects:
- Triples that have died before the generation of the last still-active query.
- Entailment matrices for rdfs:subPropertyOf relations that are related to old queries.
In addition, the garbage collector reindexes triples that are still associated with hash-tables from before a resize operation. The most recent resize operation leads to the largest number of triples that require reindexing, while the oldest resize operation causes the largest slowdown. The parameter optimize_threshold controlled by rdf_set/1 determines the number of most recent resize operations for which triples are not reindexed. The default is 2.
Normally, the garbage collector does its job in the background at a low priority. The predicate rdf_gc/0 can be used to reclaim all garbage and optimize all indexes.
Warming up the database
The RDF store performs many operations lazily or in background threads. For maximum performance, perform the following steps:
- Load all the data without doing queries or retracting data in between. This avoids creating the indexes and therefore the need to resize them.
- Perform each of the indexed queries. The following call performs
this. Note that it is irrelevant whether or not the query succeeds.
    warm_indexes :-
        ignore(rdf(s, _, _)),
        ignore(rdf(_, p, _)),
        ignore(rdf(_, _, o)),
        ignore(rdf(s, p, _)),
        ignore(rdf(_, p, o)),
        ignore(rdf(s, p, o)),
        ignore(rdf(_, _, _, g)),
        ignore(rdf(s, _, _, g)),
        ignore(rdf(_, p, _, g)).
- Duplicate administration is initialized in the background after the first call that returns a significant number of duplicates. Creating the administration can be forced by calling rdf_update_duplicates/0.
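Taken together, a complete warm-up sequence might look like the sketch below, where warm_indexes/0 is the helper defined above and the file argument and predicate name are placeholders.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: bulk-load first, then touch every index, then build the
    % duplicate administration eagerly rather than in the background.
    warm_up(File) :-
        rdf_load(File),
        warm_indexes,
        rdf_update_duplicates.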
Predicates:
- [det]rdf_gc
- Run the RDF-DB garbage collector until no garbage is left and all tables are fully optimized. Under normal operation a separate thread with identifier __rdf_GC performs garbage collection as long as it is considered useful.
Using rdf_gc/0 should only be needed to ensure a fully clean database for analysis purposes such as leak detection.
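For instance, a small analysis helper might force a fully collected store before reading statistics. rdf_statistics/1 is provided by library(semweb/rdf_db); the predicate name here is illustrative.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: collect all garbage, then report the number of live triples.
    report_store_size :-
        rdf_gc,
        rdf_statistics(triples(Count)),
        format('Live triples after GC: ~D~n', [Count]).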
- [det]rdf_update_duplicates
- Update the duplicate administration of the RDF store. This marks every triple that is potentially a duplicate of another as duplicate. Being potentially a duplicate means that subject, predicate and object are equivalent and the lifetimes of the two triples overlap.
The duplicate marks are used to reduce the administrative load of avoiding duplicate answers. Normally, the duplicates are marked by a background thread that is started by the first query that produces a substantial number of duplicates.
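To illustrate, asserting the same subject, predicate and object in two graphs creates two triples with overlapping lifetimes, i.e. potential duplicates. The atoms below are placeholders rather than proper IRIs.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: both triples are marked as duplicates once the duplicate
    % administration is updated (forced here instead of waiting for the
    % background thread).
    duplicate_example :-
        rdf_assert(s, p, o, g1),
        rdf_assert(s, p, o, g2),
        rdf_update_duplicates.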