2.19.1 Wide character encodings on streams
Although characters are uniquely coded using the UCS standard internally, streams and files are byte (8-bit) oriented and there are a variety of ways to represent the larger UCS codes in an 8-bit octet stream. The most popular one, especially in the context of the web, is UTF-8. Bytes 0 ... 127 represent simply the corresponding US-ASCII character, while bytes 128 ... 255 are used for multi-byte encoding of characters placed higher in the UCS space. Especially on MS-Windows the 16-bit Unicode standard, represented by pairs of bytes, is also popular.
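The byte ranges described above can be inspected from Prolog. The following is a minimal sketch that assumes library(utf8) and its utf8_codes//1 grammar are available, as they are in current SWI-Prolog distributions. The helper name codes_utf8_bytes/2 is hypothetical and introduced only for illustration.

```prolog
:- use_module(library(utf8)).   % provides the DCG utf8_codes//1

%!  codes_utf8_bytes(+Codes, -Bytes) is det.
%
%   Bytes is the list of octets that encodes the UCS code points in
%   Codes as UTF-8. Hypothetical helper, not part of the system.
codes_utf8_bytes(Codes, Bytes) :-
    phrase(utf8_codes(Codes), Bytes).
```

For example, the code points for a (US-ASCII), U+00E9 and U+20AC take one, two and three bytes respectively:

```prolog
?- codes_utf8_bytes([0'a, 0xE9, 0x20AC], Bytes).
Bytes = [97, 195, 169, 226, 130, 172].
```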
Prolog I/O streams have a property called encoding, which specifies the encoding used by the stream. It affects get_code/2 and put_code/2 as well as all other text I/O predicates.
The default encoding for files is derived from the Prolog flag encoding,
which is initialised from the environment. If the environment variable
LANG ends in "UTF-8", this encoding is assumed. Otherwise the default is
text and the translation is left to the wide-character functions of the
C library. (The Prolog native UTF-8 mode is considerably faster than the
generic mbrtowc() one.)
The encoding can be specified explicitly in load_files/2 for loading
Prolog source with an alternative encoding, in open/4 when opening
files, or using set_stream/2 on any open stream. For Prolog source files
we also provide the encoding/1 directive that can be used to switch
between encodings that are compatible with US-ASCII (ascii, iso_latin_1,
utf8 and many locales). See also section 3.1.3 for writing Prolog files
with non-US-ASCII characters and section 2.16.1.8 for syntax issues. For
additional information and Unicode resources, please visit
http://www.unicode.org/.
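The ways of selecting an encoding mentioned above can be sketched as follows. The predicate names write_utf8_file/2 and read_file_as/3 are hypothetical helpers introduced for this illustration. The calls they use (open/4 with the encoding option, set_stream/2 and read_stream_to_codes/2 from library(readutil)) are existing SWI-Prolog predicates.

```prolog
:- use_module(library(readutil)).   % read_stream_to_codes/2

% In a source file, the encoding/1 directive would normally be the
% first directive, for example:
%
%   :- encoding(utf8).

%!  write_utf8_file(+File, +Text) is det.
%
%   Write Text to File as UTF-8, independent of the value of the
%   Prolog flag encoding. Hypothetical helper.
write_utf8_file(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        write(Out, Text),
        close(Out)).

%!  read_file_as(+File, +Encoding, -Codes) is det.
%
%   Open File with the default encoding, switch the open stream to
%   Encoding using set_stream/2 and read all character codes.
%   Hypothetical helper.
read_file_as(File, Encoding, Codes) :-
    setup_call_cleanup(
        open(File, read, In),
        ( set_stream(In, encoding(Encoding)),
          read_stream_to_codes(In, Codes)
        ),
        close(In)).
```

Reading the same file under two encodings shows that the encoding only changes how the bytes are interpreted. The two-byte UTF-8 sequence for U+00E9 is read as one code under utf8 and as two codes under iso_latin_1:

```prolog
?- tmp_file(demo, F),
   write_utf8_file(F, 'h\u00E9'),
   read_file_as(F, utf8, AsUTF8),
   read_file_as(F, iso_latin_1, AsLatin1).
AsUTF8 = [104, 233],
AsLatin1 = [104, 195, 169].
```

The current default can be inspected with current_prolog_flag(encoding, Enc).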
SWI-Prolog currently defines and supports the following encodings:
- octet
  Default encoding for binary streams. This causes the stream to be read and written fully untranslated.
- ascii
  7-bit encoding in 8-bit bytes. Equivalent to iso_latin_1, but generates errors and warnings on encountering values above 127.
- iso_latin_1
  8-bit encoding supporting many Western languages. This causes the stream to be read and written fully untranslated.
- text
  C library default locale encoding for text files. Files are read and written using the C library functions mbrtowc() and wcrtomb(). This may be the same as one of the other locales, notably it may be the same as iso_latin_1 for Western languages and utf8 in a UTF-8 context.
- utf8
  Multi-byte encoding of full UCS, compatible with ascii. See above.
- unicode_be
  Unicode Big Endian. Reads input in pairs of bytes, most significant byte first. Can only represent 16-bit characters.
- unicode_le
  Unicode Little Endian. Reads input in pairs of bytes, least significant byte first. Can only represent 16-bit characters.
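To make the difference between these encodings concrete, the sketch below writes the same characters under a given encoding and reports the resulting file size in bytes. The predicate encoded_size/3 is a hypothetical helper for illustration. It passes bom(false) explicitly so that no Byte Order Mark is counted.

```prolog
%!  encoded_size(+Encoding, +Codes, -Bytes) is det.
%
%   Bytes is the size of a temporary file to which the characters in
%   Codes were written using Encoding. Hypothetical helper.
encoded_size(Encoding, Codes, Bytes) :-
    tmp_file(enc, File),
    setup_call_cleanup(
        open(File, write, Out, [encoding(Encoding), bom(false)]),
        format(Out, "~s", [Codes]),
        close(Out)),
    size_file(File, Bytes),
    delete_file(File).
```

For the two characters h and U+00E9, iso_latin_1 uses one byte per character while utf8 needs two bytes for the second one:

```prolog
?- encoded_size(iso_latin_1, [0'h, 0xE9], N).
N = 2.

?- encoded_size(utf8, [0'h, 0xE9], N).
N = 3.
```

Under unicode_be or unicode_le the same text should occupy two bytes per character, assuming the encoding name is accepted by the running version.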
Note that not all encodings can represent all characters. This implies that writing text to a stream may cause errors because the stream cannot represent these characters. The behaviour of a stream on these errors can be controlled using set_stream/2. Initially the terminal stream writes the characters using Prolog escape sequences while other streams generate an I/O exception.
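As an illustration of controlling this behaviour, the sketch below writes text to an iso_latin_1 file after selecting a mode with the representation_errors property of set_stream/2. This assumes the modes error, prolog and xml as documented for set_stream/2. The predicate write_latin1/3 and the result terms written/1 and failed/1 are hypothetical and introduced only for this example.

```prolog
:- use_module(library(readutil)).   % read_file_to_codes/3

%!  write_latin1(+Mode, +Codes, -Result) is det.
%
%   Write Codes to a temporary iso_latin_1 file using Mode for
%   characters the encoding cannot represent. Result is written(Text)
%   holding what ended up in the file, or failed(Error) if writing
%   raised an exception. Hypothetical helper.
write_latin1(Mode, Codes, Result) :-
    tmp_file(rep, File),
    catch(
        setup_call_cleanup(
            open(File, write, Out, [encoding(iso_latin_1)]),
            ( set_stream(Out, representation_errors(Mode)),
              format(Out, "~s", [Codes])
            ),
            close(Out, [force(true)])),   % ignore errors while closing
        Error,
        true),
    (   var(Error)
    ->  read_file_to_codes(File, Read, [encoding(iso_latin_1)]),
        atom_codes(Text, Read),
        Result = written(Text)
    ;   Result = failed(Error)
    ),
    catch(delete_file(File), _, true).
```

With Mode set to xml, U+20AC (the euro sign, which iso_latin_1 cannot represent) should be written as a numeric XML character entity. With prolog it should appear as a Prolog escape sequence, and with error the call is expected to yield failed(Error):

```prolog
?- write_latin1(xml, [0'a, 0x20AC], Result).
Result = written('a&#8364;').
```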
2.19.1.1 BOM: Byte Order Mark
From section 2.19.1, you may have got the impression that text files are
complicated. This section deals with a related topic, making life often
easier for the user, but providing another worry to the programmer.
BOM or Byte Order Mark is a technique for identifying Unicode text files
as well as the encoding they use. Such files start with the Unicode
character 0xFEFF, a non-breaking, zero-width space character. This is a
pretty unique sequence that is not likely to be the start of a
non-Unicode file and uniquely distinguishes the various Unicode file
formats. As it is a zero-width blank, it does not even produce any
output. This solves all problems, or does it? Some formats start off as
US-ASCII and may contain some encoding mark to switch to UTF-8, such as
the encoding="UTF-8" in an XML header. Such formats often explicitly
forbid the use of a UTF-8 BOM. In other cases there is additional
information revealing the encoding, making the use of a BOM redundant or
even illegal.
The BOM is handled by the SWI-Prolog open/4 predicate. By default, text
files are probed for the BOM when opened for reading. If a BOM is found,
the encoding is set accordingly and the property bom(true) is available
through stream_property/2. When opening a file for writing, writing a
BOM can be requested using the option bom(true) with open/4.
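The following sketch shows both directions: requesting a BOM when writing and observing the result of the probe when reading. The predicates write_with_bom/2 and probe_bom/3 are hypothetical helpers. They rely only on open/3, open/4 and stream_property/2 as described above.

```prolog
%!  write_with_bom(+File, +Text) is det.
%
%   Write Text to File as UTF-8, preceded by a BOM.
%   Hypothetical helper.
write_with_bom(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8), bom(true)]),
        write(Out, Text),
        close(Out)).

%!  probe_bom(+File, -Encoding, -HasBom) is det.
%
%   Open File for reading with default options, so the BOM is probed,
%   and report the resulting stream encoding and whether the stream
%   has the property bom(true). Hypothetical helper.
probe_bom(File, Encoding, HasBom) :-
    setup_call_cleanup(
        open(File, read, In),
        ( stream_property(In, encoding(Encoding)),
          (   stream_property(In, bom(true))
          ->  HasBom = true
          ;   HasBom = false
          )
        ),
        close(In)).
```

A file written with a UTF-8 BOM is three bytes longer than its text and is recognised as utf8 on reading, whatever the default encoding is:

```prolog
?- tmp_file(bom, F),
   write_with_bom(F, hello),
   size_file(F, Size),
   probe_bom(F, Enc, Bom).
Size = 8,
Enc = utf8,
Bom = true.
```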