5.2.4 Why has the representation of double quoted text changed?
Prolog defines two forms of quoted text. Traditionally, single quoted text is mapped to atoms while double quoted text is mapped to a list of character codes (integers) or characters represented as 1-character atoms. Representing text using atoms is often considered inadequate for several reasons:
- It hides the conceptual difference between text and program symbols.
Where content of text often matters because it is used in I/O, program
symbols are merely identifiers that match with the same symbol
elsewhere. Program symbols can often be consistently replaced, for
example to obfuscate or compact a program.
- Atoms are globally unique identifiers. They are stored in a shared
table. Volatile strings represented as atoms come at a significant price
due to the required cooperation between threads for creating atoms.
Reclaiming temporary atoms using Atom garbage collection is a
costly process that requires significant synchronisation.
- Many Prolog systems (not SWI-Prolog) put severe restrictions on the length of atoms or the maximum number of atoms.
Representing text as a list of character codes or 1-character atoms also comes at a price:
- It is not possible to distinguish (at runtime) a list of integers or
atoms from a string. Sometimes this information can be derived from
(implicit) typing. In other cases the list must be embedded in a
compound term to distinguish the two types. For example,
s("hello world")
could be used to indicate that we are dealing with a string.Lacking runtime information, debuggers and the toplevel can only use heuristics to decide whether to print a list of integers as such or as a string (see portray_text/1).
While experienced Prolog programmers have learned to cope with this, we still consider this an unfortunate situation.
- Lists are expensive structures, taking 2 cells per character (3 for SWI-Prolog in its current form). This stresses memory consumption on the stacks while pushing them on the stack and dealing with them during garbage collection is unnecessarilly expensive.
We observe that in many programs, most strings are only handled as a single unit during their lifetime. Examining real code tells us that double quoted strings typically appear in one of the following roles:
- A DCG literal
- Although represented as a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.
- A format string
- This is a typical example of text that is conceptually not a program identifier. Format is designed to deal with alternative representations of the format string. Such code need not be modified.
- Getting a character code
- The construct
[X] = "a"
is a commonly used template for getting the character code of the letter 'a'. ISO Prolog defines the syntax0'a
for this purpose. Code using this must be modified. The modified code will run on any ISO compliant processor. - As argument to list predicates to operate on strings
- Here, we see code such as
append("name:", Rest, Codes)
. Such code needs to be modified. In this particular example, the following is a good portable alternative:phrase("name:", Codes, Rest)
- Checks for a character to be in a set
- Such tests are often performed with code such as this:
memberchk(C, "~!@#$")
. This is a rather inefficient check in a traditional Prolog system because it pushes a list of character codes cell-by-cell the Prolog stack and then traverses this list cell-by-cell to see whether one of the cells unifies with C. If the test is successful, the string will eventually be subject to garbage collection. The best code for this is to write a predicate as below, which pushes noting on the stack and performs an indexed lookup to see whether the character code is in `my_class'.my_class(0'~). my_class(0'!). ...
An alternative to reach the same effect is to use term expansion to create the clauses:
term_expansion(my_class(_), Clauses) :- findall(my_class(C), string_code(_, "~!@#$", C), Clauses). my_class(_).
Finally, the predicate string_code/3 can be exploited directly as a replacement for the memberchk/2 on a list of codes. Although the string is still pushed onto the stack, it is more compact and only a single entity.
We offer the predicate list_strings/0 to help porting your program.