Let's say we want to check that a Unicode code point is within the valid range before working out how many bytes its UTF-8 representation is going to be:

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>

    void assert_valid(uint32_t cp) {
        if ( cp >= 0xd800 && cp <= 0xdbff ) {
            throw std::domain_error("UTF32 code point is in the leading UTF16 surrogate pair range");
        } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
            throw std::domain_error("UTF32 code point is in the trailing UTF16 surrogate pair range");
        } else if ( cp == 0xfffe || cp == 0xffff ) {
            throw std::domain_error("UTF32 code point is invalid");
        } else if ( cp > 0x10ffff ) {
            throw std::domain_error("UTF32 code point is beyond the allowable range");
        }
    }
And we can now use it like this:
    std::size_t u8length(uint32_t cp) {
        assert_valid(cp);
        if ( cp < 0x00080 ) return 1u;
        else if ( cp < 0x00800 ) return 2u;
        else if ( cp < 0x10000 ) return 3u;
        else return 4u;
    }

We now get an exception thrown if the code point is invalid. But what if we don't like the exception that is thrown? Maybe we want to throw our own exception class because it has some extra facilities?
We can refactor both functions to take the exception class to throw as a parameter:
    template<typename E = std::domain_error>
    inline void assert_valid(uint32_t cp) {
        if ( cp >= 0xd800 && cp <= 0xdbff ) {
            throw E("UTF32 code point is in the leading UTF16 surrogate pair range");
        } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
            throw E("UTF32 code point is in the trailing UTF16 surrogate pair range");
        } else if ( cp == 0xfffe || cp == 0xffff ) {
            throw E("UTF32 code point is invalid");
        } else if ( cp > 0x10ffff ) {
            throw E("UTF32 code point is beyond the allowable range");
        }
    }

    template<typename E = std::domain_error>
    inline std::size_t u8length(uint32_t cp) {
        assert_valid<E>(cp);
        if ( cp < 0x00080 ) return 1u;
        else if ( cp < 0x00800 ) return 2u;
        else if ( cp < 0x10000 ) return 3u;
        else return 4u;
    }
Most users of u8length can still write u8length(cp) and get the default exception, but in my code I can now use it as u8length<fostlib::exceptions::unicode_error>(cp) and I get my exception class instead.
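As a rough sketch of what "extra facilities" might mean, here is a hypothetical unicode_error class. It is a stand-in invented for illustration, not the real fostlib::exceptions::unicode_error, and the error_category helper is likewise made up. The only thing the templates above require of E is that it can be constructed from a const char *.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical exception class, for illustration only
class unicode_error : public std::runtime_error {
  public:
    explicit unicode_error(const char *what) : std::runtime_error(what) {}
    // A made-up "extra facility": a fixed category name
    const char *category() const noexcept { return "unicode"; }
};

template<typename E = std::domain_error>
inline void assert_valid(std::uint32_t cp) {
    if ( cp >= 0xd800 && cp <= 0xdbff ) {
        throw E("UTF32 code point is in the leading UTF16 surrogate pair range");
    } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
        throw E("UTF32 code point is in the trailing UTF16 surrogate pair range");
    } else if ( cp == 0xfffe || cp == 0xffff ) {
        throw E("UTF32 code point is invalid");
    } else if ( cp > 0x10ffff ) {
        throw E("UTF32 code point is beyond the allowable range");
    }
}

template<typename E = std::domain_error>
inline std::size_t u8length(std::uint32_t cp) {
    assert_valid<E>(cp);
    if ( cp < 0x00080 ) return 1u;
    else if ( cp < 0x00800 ) return 2u;
    else if ( cp < 0x10000 ) return 3u;
    else return 4u;
}

// Hypothetical helper: the category of the error raised for cp,
// or an empty string if cp is valid
inline std::string error_category(std::uint32_t cp) {
    try {
        (void)u8length<unicode_error>(cp);
        return "";
    } catch ( const unicode_error &e ) {
        return e.category();
    }
}

inline void usage() {
    assert(error_category(0xd800) == "unicode"); // surrogate, so it throws
    assert(error_category(0x41).empty()); // 'A' is fine
}
```

Callers who stick with the default never see unicode_error at all, since the choice of exception is made per call site.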
If we stopped there it would be a little bit interesting, but hardly worthy of much note. With another couple of refactorings though we can have this same code return errors for those who want that rather than exceptions. First of all let's factor out the throwing of the exception:

    template<typename E>
    inline void raise(const char *what) {
        throw E(what);
    }

    template<typename E = std::domain_error>
    inline void assert_valid(uint32_t cp) {
        if ( cp >= 0xd800 && cp <= 0xdbff ) {
            raise<E>("UTF32 code point is in the leading UTF16 surrogate pair range");
        } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
            raise<E>("UTF32 code point is in the trailing UTF16 surrogate pair range");
        } else if ( cp == 0xfffe || cp == 0xffff ) {
            raise<E>("UTF32 code point is invalid");
        } else if ( cp > 0x10ffff ) {
            raise<E>("UTF32 code point is beyond the allowable range");
        }
    }
The next thing to note is that our return types have room to signal errors: assert_valid can return a bool, and u8length can return zero (which is never a valid length) if the code point is invalid. This would then look like this:
    template<typename E = std::domain_error>
    inline bool assert_valid(uint32_t cp) {
        if ( cp >= 0xd800 && cp <= 0xdbff ) {
            raise<E>("UTF32 code point is in the leading UTF16 surrogate pair range");
        } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
            raise<E>("UTF32 code point is in the trailing UTF16 surrogate pair range");
        } else if ( cp == 0xfffe || cp == 0xffff ) {
            raise<E>("UTF32 code point is invalid");
        } else if ( cp > 0x10ffff ) {
            raise<E>("UTF32 code point is beyond the allowable range");
        } else {
            return true;
        }
        return false;
    }

    template<typename E = std::domain_error>
    inline std::size_t u8length(uint32_t cp) {
        if ( !assert_valid<E>(cp) ) return 0u;
        else if ( cp < 0x00080 ) return 1u;
        else if ( cp < 0x00800 ) return 2u;
        else if ( cp < 0x10000 ) return 3u;
        else return 4u;
    }
Of course we'll still always get the exception because raise throws. What we can do though is to choose a type that means “don't throw” and then specialise raise on that type. void seems like it's probably a good choice:
template<> inline void raise<void>(const char *) {}
Now we have two choices about how to use u8length:
    auto bytes = u8length(cp);
    dosomething(bytes, cp);

The use of exceptions by default is important from a security perspective. Unicode errors are a common attack vector, and throwing an exception if something is wrong simply makes the software a good deal safer, as error handling code is just too easy to get wrong. But of course, sometimes we will want to do this sort of thing:

    if ( auto bytes = u8length<void>(cp); bytes ) { // C++17
        dosomething(bytes, cp);
    } else {
        // Handle error
    }
Most API uses get the safety of the exception being thrown, but you can also opt out of that and just get an error return for the cases where that is preferable.
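To make the error-return path a little more concrete, here is a minimal sketch that adds up the UTF-8 lengths of a run of code points without any exception handling. The u8total helper is hypothetical (it isn't part of f5-cord as far as this post goes), and the two surrogate checks are condensed into one to keep the listing short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

template<typename E>
inline void raise(const char *what) {
    throw E(what);
}
// void means "don't throw"
template<>
inline void raise<void>(const char *) {}

template<typename E = std::domain_error>
inline bool assert_valid(std::uint32_t cp) {
    if ( cp >= 0xd800 && cp <= 0xdfff ) { // both surrogate ranges, condensed
        raise<E>("UTF32 code point is in the UTF16 surrogate pair range");
    } else if ( cp == 0xfffe || cp == 0xffff ) {
        raise<E>("UTF32 code point is invalid");
    } else if ( cp > 0x10ffff ) {
        raise<E>("UTF32 code point is beyond the allowable range");
    } else {
        return true;
    }
    return false;
}

template<typename E = std::domain_error>
inline std::size_t u8length(std::uint32_t cp) {
    if ( !assert_valid<E>(cp) ) return 0u;
    else if ( cp < 0x00080 ) return 1u;
    else if ( cp < 0x00800 ) return 2u;
    else if ( cp < 0x10000 ) return 3u;
    else return 4u;
}

// Hypothetical helper: sum the UTF-8 lengths of a run of code points.
// Returns false at the first invalid one and leaves total untouched.
inline bool u8total(const std::vector<std::uint32_t> &cps, std::size_t &total) {
    std::size_t sum = 0u;
    for ( auto cp : cps ) {
        const auto bytes = u8length<void>(cp);
        if ( bytes == 0u ) return false;
        sum += bytes;
    }
    total = sum;
    return true;
}

// 'A' (1 byte), pound sign (2), Thai chicken (3), treble clef (4)
inline void usage() {
    std::size_t total = 0u;
    assert(u8total({0x41, 0xa3, 0xe01, 0x1d11e}, total));
    assert(total == 10u);
    assert(!u8total({0x41, 0xd800}, total)); // surrogate: error return
}
```

The zero return works as an error value here only because no valid code point has a zero-byte encoding.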
Real code for raise and the UTF encodings can be found in the f5-cord library (this code isn't in master yet; check out the previously linked commits instead). Note that this version is also constexpr, which is also pretty cool.
The (approximate) final versions are reproduced below. And yes, raise is probably a stupid name now.
    namespace f5 {

        /// Raise an error of type E giving it the specified error text
        template<typename E>
        constexpr inline void raise(f5::cord::lstring error) {
            throw E(error.c_str());
        }
        /// Specialisation for when we want an error return
        template<>
        constexpr inline void raise<void>(f5::cord::lstring) {
        }

        /// A UTF-8 code point
        typedef unsigned char utf8;
        /// A UTF-32 code point
        typedef uint32_t utf32;

        /// Check that the UTF32 code point is valid. Throw an exception if not.
        template<typename E = std::domain_error>
        constexpr inline bool check_valid(utf32 cp) {
            if ( cp >= 0xd800 && cp <= 0xdbff ) {
                raise<E>("UTF32 code point is in the leading UTF16 surrogate pair range");
            } else if ( cp >= 0xdc00 && cp <= 0xdfff ) {
                raise<E>("UTF32 code point is in the trailing UTF16 surrogate pair range");
            } else if ( cp == 0xfffe || cp == 0xffff ) {
                raise<E>("UTF32 code point is invalid");
            } else if ( cp > 0x10ffff ) {
                raise<E>("UTF32 code point is beyond the allowable range");
            } else {
                return true;
            }
            return false;
        }

        /// Return the number of UTF8 values that this code point will
        /// occupy. If the code point falls in an invalid range then an
        /// exception will be thrown.
        template<typename E = std::domain_error>
        constexpr inline std::size_t u8length(utf32 cp) {
            if ( not check_valid<E>(cp) ) return 0u;
            else if ( cp < 0x00080 ) return 1u;
            else if ( cp < 0x00800 ) return 2u;
            else if ( cp < 0x10000 ) return 3u;
            else return 4u;
        }

    }
And some tests to show that it is also usable at compile time:
    static_assert(f5::check_valid(' '), "Space is a valid UTF32 code point");
    static_assert(not f5::check_valid<void>(0xff'ff'ff), "0xff'ff'ff is not a valid code point");

    static_assert(f5::u8length(0) == 1, "Zero is still 1 byte long");
    static_assert(f5::u8length(' ') == 1, "Space is 1 byte long");
    static_assert(f5::u8length(0xa3) == 2, "Pounds are 2 bytes long");
    static_assert(f5::u8length(0xe01) == 3, "Thai chickens are 3 bytes long");
    static_assert(f5::u8length(0x1d11e) == 4, "The treble clef is 4 bytes long");
The most obvious improvement is for raise to perfectly forward any arguments on to the exception constructor, but we still can't partially specialise functions, so the implementation will be a bit more complex as the specialisation will have to be done in a class.
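One way this might look is sketched below, assuming C++14 or later. The raiser class name, the code_point_error exception and the quiet function are all invented for the illustration; this is not what f5-cord actually does. The free raise function just forwards to a static member of a class template, and it is the class that gets the full specialisation for void.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>

// Primary template: build an E from whatever arguments arrive and throw it
template<typename E>
struct raiser {
    template<typename... A>
    [[noreturn]] static void raise(A &&... a) {
        throw E(std::forward<A>(a)...);
    }
};

// Specialising the class on void covers every argument list at once,
// which a function template specialisation could not do
template<>
struct raiser<void> {
    template<typename... A>
    static constexpr void raise(A &&...) noexcept {}
};

// The function callers use. E is named explicitly, A is deduced.
template<typename E, typename... A>
constexpr void raise(A &&... a) {
    raiser<E>::raise(std::forward<A>(a)...);
}

// Hypothetical exception whose constructor takes more than a message
struct code_point_error : std::runtime_error {
    std::uint32_t code_point;
    code_point_error(const char *what, std::uint32_t cp)
    : std::runtime_error(what), code_point(cp) {}
};

// The void form is still usable at compile time
constexpr bool quiet() {
    raise<void>("ignored", 42);
    return true;
}
static_assert(quiet(), "raise<void> does nothing, even in a constant expression");

inline void usage() {
    try {
        raise<code_point_error>("UTF32 code point is invalid", 0xfffeu);
    } catch ( const code_point_error &e ) {
        assert(e.code_point == 0xfffeu); // the extra argument made it through
    }
}
```

The cost is one more level of indirection in the source, but call sites look the same as before apart from the extra arguments.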