UTF-8 encoder and decoder, simple wide-string type, and stream I/O
==================================================================

This library contains a Standard ML implementation of a wide-string
type and an I/O type, with a fast UTF-8 encoder and decoder.

(Although the encoder and decoder are provided as separate structures,
they are given only minimal signatures and aren't really intended to
be used separately. The general-purpose interface is through the
string structure and its signature, and the I/O structure.)

The decoder is designed for safe interoperability: it identifies
invalid and overlong encodings and substitutes the replacement
character (U+FFFD) for each such sequence as soon as it is recognised.
It does the same for codepoints above the 17-plane Unicode limit
(U+10FFFF).
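
To make the replacement behaviour concrete, here is a minimal,
self-contained sketch of a decoder that follows the rules described
above. It is not this library's code and does not use its API: the
names `decode`, `replacement`, `isCont`, `conts`, `multi` and `go` are
hypothetical, and the choice to resume decoding at the first byte that
breaks a sequence is an assumption about what "as soon as it is
recognised" means. The sketch also rejects encoded surrogates
(U+D800 to U+DFFF), which are not valid in UTF-8.

```sml
(* Illustrative sketch only; not the library's implementation or API. *)
val replacement : word = 0wxFFFD

(* A continuation byte has the bit pattern 10xxxxxx. *)
fun isCont (b : word) = Word.andb (b, 0wxC0) = 0wx80

(* Decode a UTF-8 string into a list of codepoints, substituting
   U+FFFD for each malformed, overlong or out-of-range sequence. *)
fun decode (s : string) : word list =
    let
        val n = String.size s
        fun byte i = Word.fromInt (Char.ord (String.sub (s, i)))

        (* Read k continuation bytes starting at index i. Returns the
           accumulated value (or NONE on failure) and the index at
           which decoding should resume. *)
        fun conts (i, 0, acc) = (SOME acc, i)
          | conts (i, k, acc) =
            if i < n andalso isCont (byte i)
            then conts (i + 1, k - 1,
                        Word.orb (Word.<< (acc, 0w6),
                                  Word.andb (byte i, 0wx3F)))
            else (NONE, i)

        (* Decode a multi-byte sequence whose lead byte is at index i.
           min is the smallest codepoint that legitimately needs this
           length; anything smaller is an overlong encoding. *)
        fun multi (i, k, bits, min : word, acc) =
            let
                val (r, j) = conts (i + 1, k, bits)
                val cp =
                    case r of
                        SOME c =>
                        if c < min                   (* overlong *)
                           orelse c > 0wx10FFFF      (* beyond plane 16 *)
                           orelse (c >= 0wxD800 andalso c <= 0wxDFFF)
                        then replacement
                        else c
                      | NONE => replacement
            in
                go (j, cp :: acc)
            end

        and go (i, acc) =
            if i >= n then rev acc
            else
                let val b = byte i
                in
                    if b < 0wx80 then go (i + 1, b :: acc)
                    else if Word.andb (b, 0wxE0) = 0wxC0
                    then multi (i, 1, Word.andb (b, 0wx1F), 0wx80, acc)
                    else if Word.andb (b, 0wxF0) = 0wxE0
                    then multi (i, 2, Word.andb (b, 0wx0F), 0wx800, acc)
                    else if Word.andb (b, 0wxF8) = 0wxF0
                    then multi (i, 3, Word.andb (b, 0wx07), 0wx10000, acc)
                    else go (i + 1, replacement :: acc) (* bad lead byte *)
                end
    in
        go (0, [])
    end
```

For example, the bytes C0 80 are an overlong encoding of U+0000, so
`decode "A\192\128B"` evaluates to `[0wx41, 0wxFFFD, 0wx42]` under the
assumptions above.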

Copyright 2015-2018 Chris Cannam. Decoder inspired by Utf8.sml by
Martin Elsman (https://github.com/melsman/unicode). Encoder influenced
by utf8.sml by John Reppy.

MIT/X11 licence. See the file COPYING for details.