3.2.12 Memory management considerations
Storing RDF triples in main memory provides much better performance than using external databases. Unfortunately, although memory is fairly cheap these days, main memory is severely limited compared to disks. Memory usage breaks down into the following categories. Rough estimates of the memory usage are given for 64-bit systems; 32-bit systems use slightly more than half these amounts.
- Actually storing the triples. A triple is stored in a C struct of 144 bytes. This struct holds the quintuple, some bookkeeping information and the 10 next-pointers for the (at most) 10 hash tables.
- The bucket array for the hashes. Each bucket maintains a head and a tail pointer, as well as a count of the number of entries. The bucket array is allocated when a particular index is created, which happens at the first query that requires the index. Each bucket requires 24 bytes.
Bucket arrays are resized when necessary. Old triples remain at their original location, which implies that a query may need to scan multiple buckets. The garbage collector may relocate old indexed triples by copying them; the old copy is later reclaimed by GC. The space occupied by reindexed triples is reused, but many reindexed triples may result in significant memory fragmentation.
- Resources are maintained in a separate table to support rdf_resource/1. A resource requires approximately 32 bytes.
- Identical literals are shared (see rdf_current_literal/1) and stored in a skip list. A literal requires approximately 40 bytes, excluding the atom used for the lexical representation.
- Resources are stored in the Prolog atom-table. Atoms with the average length of a resource require approximately 88 bytes.
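Putting these estimates together gives a rough, back-of-the-envelope cost model. The sketch below is illustrative only: the predicate name is made up, the constants are the 64-bit figures quoted above, and real usage also depends on which indexes are actually created and on atom sharing.

    % Rough 64-bit memory estimate from the figures above: 144 bytes per
    % triple, 24 bytes per hash bucket, 32 bytes per resource and 40 bytes
    % per shared literal (excluding the atoms themselves).
    estimated_rdf_memory(Triples, Buckets, Resources, Literals, Bytes) :-
        Bytes is Triples*144 + Buckets*24 + Resources*32 + Literals*40.

For example, 10 million triples alone account for roughly 1.4 GB before buckets, resources, literals and atoms are added.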
The hash parameters can be controlled with rdf_set/1. Applications that are tight on memory and for which the query characteristics are more or less known can optimize performance and memory usage by fixing the hash-tables. By fixing the hash-tables we can tailor them to the frequent query patterns, avoid the need to check multiple hash buckets (see above) and avoid memory fragmentation due to optimizing triples for resized hashes.
    set_hash_parameters :-
        rdf_set(hash(s,   size, 1048576)),
        rdf_set(hash(p,   size, 1024)),
        rdf_set(hash(sp,  size, 2097152)),
        rdf_set(hash(o,   size, 1048576)),
        rdf_set(hash(po,  size, 2097152)),
        rdf_set(hash(spo, size, 2097152)),
        rdf_set(hash(g,   size, 1024)),
        rdf_set(hash(sg,  size, 1048576)),
        rdf_set(hash(pg,  size, 2048)).
- [det]rdf_set(+Term)
- Set properties of the RDF store. Currently defines:
- hash(+Hash, +Parameter, +Value)
- Set properties for a triple index. Hash is one of s, p, sp, o, po, spo, g, sg or pg. Parameter is one of:
- size
- Value defines the number of entries in the hash-table. Value is rounded down to a power of 2. After setting the size explicitly, auto-sizing for this table is disabled. Setting the size smaller than the current size results in a permission_error exception.
- average_chain_len
- Set the maximum average collision chain length for the hash.
- optimize_threshold
- Related to resizing hash-tables. If 0, all triples are moved to the new size by the garbage collector. If more than zero, triples from the last Value resize steps remain at their current location. Leaving triples at their current location reduces memory fragmentation but slows down access.
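As an illustration, a clause in the style of set_hash_parameters above can combine these parameters for a single index. The values below are assumptions chosen for the example, not recommendations; 2 is the documented default for optimize_threshold.

    :- use_module(library(semweb/rdf_db)).

    tune_po_hash :-
        rdf_set(hash(po, size, 2097152)),         % fix size; disables auto-sizing
        rdf_set(hash(po, average_chain_len, 3)),  % tolerate ~3 collisions per bucket
        rdf_set(hash(po, optimize_threshold, 2)). % skip reindexing for the last 2 resizes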
The garbage collector
The RDF store has a garbage collector that runs in a separate thread named __rdf_GC. The garbage collector removes the following objects:
- Triples that have died before the generation of the last still-active query.
- Entailment matrices for rdfs:subPropertyOf relations that are related to old queries.
In addition, the garbage collector reindexes triples that are still associated with hash-tables from before a resize operation. The most recent resize operation leads to the largest number of triples that require reindexing, while the oldest resize operation causes the largest slowdown. The parameter optimize_threshold controlled by rdf_set/1 determines the number of most recent resize operations for which triples are not reindexed. The default is 2.
Normally, the garbage collector does its job in the background at a low priority. The predicate rdf_gc/0 can be used to reclaim all garbage and optimize all indexes.
Warming up the database
The RDF store performs many operations lazily or in background threads. For maximum performance, perform the following steps:
- Load all the data without doing queries or retracting data in between. This avoids creating the indexes and therefore the need to resize them.
- Perform each of the indexed queries. The following call performs
this. Note that it is irrelevant whether or not the query succeeds.
    warm_indexes :-
        ignore(rdf(s, _, _)),
        ignore(rdf(_, p, _)),
        ignore(rdf(_, _, o)),
        ignore(rdf(s, p, _)),
        ignore(rdf(_, p, o)),
        ignore(rdf(s, p, o)),
        ignore(rdf(_, _, _, g)),
        ignore(rdf(s, _, _, g)),
        ignore(rdf(_, p, _, g)).
- Duplicate administration is initialized in the background after the first call that returns a significant number of duplicates. Creating the administration can be forced by calling rdf_update_duplicates/0.
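Taken together, a complete warm-up sequence might look like the sketch below, where warm_indexes/0 is the helper defined above and the file argument and predicate name are placeholders.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: bulk-load first, then touch every index, then build the
    % duplicate administration eagerly rather than in the background.
    warm_up(File) :-
        rdf_load(File),
        warm_indexes,
        rdf_update_duplicates.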
Predicates:
- [det]rdf_gc
- Run the RDF-DB garbage collector until no garbage is left and all tables are fully optimized. Under normal operation a separate thread with identifier __rdf_GC performs garbage collection as long as it is considered useful.
Using rdf_gc/0 should only be needed to ensure a fully clean database for analysis purposes such as leak detection.
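For instance, a small analysis helper might force a fully collected store before reading statistics. rdf_statistics/1 is provided by library(semweb/rdf_db); the predicate name here is illustrative.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: collect all garbage, then report the number of live triples.
    report_store_size :-
        rdf_gc,
        rdf_statistics(triples(Count)),
        format('Live triples after GC: ~D~n', [Count]).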
- [det]rdf_update_duplicates
- Update the duplicate administration of the RDF store. This marks every triple that is potentially a duplicate of another as duplicate. Being potentially a duplicate means that subject, predicate and object are equivalent and the lifetimes of the two triples overlap.
The duplicate marks are used to reduce the administrative load of avoiding duplicate answers. Normally, the duplicates are marked by a background thread that is started by the first query that produces a substantial number of duplicates.
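To illustrate, asserting the same subject, predicate and object in two graphs creates two triples with overlapping lifetimes, i.e. potential duplicates. The atoms below are placeholders rather than proper IRIs.

    :- use_module(library(semweb/rdf_db)).

    % Sketch: both triples are marked as duplicates once the duplicate
    % administration is updated (forced here instead of waiting for the
    % background thread).
    duplicate_example :-
        rdf_assert(s, p, o, g1),
        rdf_assert(s, p, o, g2),
        rdf_update_duplicates.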