Thoughts on TinyJSON

Created 31st March, 2008 05:37 (UTC), last edited 31st March, 2008 08:27 (UTC)

Via the Boost development list I came across Thomas Jensen's TinyJSON parser. I've also been spending time writing a JSON parser using the Boost tools, so I figure we might be able to learn something from each other's approaches.

Firstly though, I think our goals are slightly different. I'm writing a JSON parser to fit in with the requirements of using JSON within FOST.3™, whereas his is a more general header-only library. It would be hard to take my JSON parser without also taking a lot of the FOST.3™ foundation classes; there are good reasons for that, which I'll get to in a moment.

The JSON object

In terms of what comes out of the parser, the biggest difference is that I produce a JSON object based on Boost.Variant and he produces one based on Boost.Any. He's right that using Boost.Variant introduces some extra complexity, but I think the simpler access to the final structure and the better type safety are both well worth it. I'm just not sure that it's compatible with his aims.

I split the JSON object itself into two parts. The first is a variant structure which is able to handle the simple values and is based on this Boost.Variant¹ [¹ t_null is simply a type representing the empty value, called Null, in FOST.3™.]:

boost::variant< t_null, bool, int64_t, double, wstring >

The complete class looks like this (I've cut some members for brevity):

class F3UTIL_DECLSPEC Variant {
    boost::variant< t_null, bool, int64_t, double, wstring > m_v;
public:
    Variant() : m_v( Null ) {}
    explicit Variant( bool b ) : m_v( b ) {}
    explicit Variant( char c ) : m_v( int64_t( c ) ) {}
    explicit Variant( int i ) : m_v( int64_t( i ) ) {}
    explicit Variant( unsigned int i ) : m_v( int64_t( i ) ) {}
    explicit Variant( long l ) : m_v( int64_t( l ) ) {}
    explicit Variant( unsigned long l ) : m_v( int64_t( l ) ) {}
    explicit Variant( int64_t i ) : m_v( i ) {}
    explicit Variant( float f ) : m_v( double( f ) ) {}
    explicit Variant( double d ) : m_v( d ) {}
    explicit Variant( const char *s ) : m_v( widen( s ) ) {}
    explicit Variant( const wchar_t *s ) : m_v( wstring( s ) ) {}
    explicit Variant( const wstring &s ) : m_v( s ) {}

    bool isnull() const;

    template< typename T >
    Nullable< T > get() const {
        const T *p = boost::get< T >( &m_v );
        if ( p )
            return *p;
        else
            return Null;
    }

    bool operator ==( const Variant &v ) const;
    bool operator !=( const Variant &v ) const { return !( *this == v ); }

    template< typename T > Variant &operator =( T t ) { m_v = Variant( t ).m_v; return *this; }

    template< typename T >
    typename T::result_type apply_visitor( T &t ) const {
        return boost::apply_visitor( t, m_v );
    }
};

This includes a number of type promoting constructors and forwarders for Boost's get (the use of Nullable is a standard FOST.3™ idiom) and the static visitor.
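The get-as-nullable idiom is worth spelling out, since it is what makes the variant pleasant to consume. The FOST.3™ types (t_null, Null, Nullable) aren't shown in this post, so the sketch below is my own approximation using C++17's std::variant and std::optional in their place; the names here are stand-ins, not the FOST.3™ API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

// Stand-in for the FOST.3 null type: an empty struct as the first alternative.
struct t_null {};
using variant_t = std::variant<t_null, bool, std::int64_t, double, std::wstring>;

// The idiom: ask the variant for a T*, and turn a failed lookup into an
// empty optional instead of a thrown exception.
template<typename T>
std::optional<T> get_value(const variant_t& v) {
    if (const T* p = std::get_if<T>(&v))
        return *p;
    return std::nullopt;
}
```

The point of returning an optional (Nullable in FOST.3™ terms) rather than using the throwing form of get is that callers can probe for the type they expect without wrapping every access in a try block.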

The actual JSON object is created from this base (again I've cut some members):

class F3UTIL_DECLSPEC Json {
public:
    typedef FSLib::Variant atom_t;
    typedef std::vector< boost::shared_ptr< Json > > array_t;
    typedef FSLib::wstring key_t;
    typedef std::map< key_t, boost::shared_ptr< Json > > object_t;
    typedef boost::variant< atom_t, array_t, object_t > element_t;
    BOOST_STATIC_ASSERT( sizeof( array_t::size_type ) == sizeof( object_t::size_type ) );

    Json();
    template< typename T > explicit
    Json( const T &t ) : m_element( atom_t( t ) ) {
    }
    explicit Json( const atom_t &a ) : m_element( a ) {
    }
    Json( const array_t &a ) : m_element( a ) {
    }
    Json( const object_t &o ) : m_element( o ) {
    }
    explicit Json( const element_t &e ) : m_element( e ) {
    }

    template< typename T >
    Nullable< T > get() const {
        const atom_t *p = boost::get< atom_t >( &m_element );
        if ( p )
            return ( *p ).get< T >();
        else
            return Null;
    }

    template< typename T >
    Json &operator =( const T &t ) { m_element = atom_t( t ); return *this; }
    Json &operator =( const array_t &a ) { m_element = a; return *this; }
    Json &operator =( const object_t &o ) { m_element = o; return *this; }

    bool operator ==( const Json &r ) const;
    bool operator !=( const Json &r ) const { return !( *this == r ); }

    template< typename T >
    typename T::result_type apply_visitor( T &t ) const {
        return boost::apply_visitor( t, m_element );
    }

private:
    element_t m_element;
};

I wouldn't be at all surprised if all of this machinery were far too much for Thomas. The problem here is that it moves TinyJSON away from being just a JSON parser to being a full-blown JSON API — not quite so tiny any more.

Neither can I see a way of making this more lightweight by avoiding the wrapper class because you can't do this:

typedef boost::variant< t_null, int, double, std::vector< Json* >, std::map< string, Json* > > Json;

This sort of recursion is only possible with a full-blown struct or class, which means a load of constructors and forwarders and, realistically, a whole load of other machinery.
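To see why the wrapper class makes the recursion work where the typedef cannot, here is a minimal sketch along the lines of the Json class above, with std::variant standing in for boost::variant and only the structural members kept. The class name exists before its members are declared, so the array and object typedefs can refer back to it through a smart pointer; a typedef cannot name itself on its own right-hand side.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <variant>
#include <vector>

class Json {
public:
    using atom_t    = std::variant<std::monostate, bool, std::int64_t,
                                   double, std::wstring>;
    // These two can mention Json because the class name is already in
    // scope here, even though the class is still incomplete.
    using array_t   = std::vector<std::shared_ptr<Json>>;
    using object_t  = std::map<std::wstring, std::shared_ptr<Json>>;
    using element_t = std::variant<atom_t, array_t, object_t>;

    Json() : m_element(atom_t{}) {}
    explicit Json(const atom_t& a) : m_element(a) {}
    explicit Json(const array_t& a) : m_element(a) {}
    explicit Json(const object_t& o) : m_element(o) {}

    bool is_array() const { return std::holds_alternative<array_t>(m_element); }
    bool is_object() const { return std::holds_alternative<object_t>(m_element); }

private:
    element_t m_element;
};
```

The shared_ptr indirection is what breaks the otherwise infinite size of the type; the cost, as noted above, is all the forwarding constructors and assignment operators the full class then needs.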

Unicode

Even harder is correct Unicode support. The first thing to realise about Unicode in JSON is that its \u escape sequences are UTF-16 code units. If you're on Windows this isn't such a big deal, but for various other platforms this is likely to cause some difficulties :/

Here is the string parser that I use:

struct string_closure : boost::spirit::closure< string_closure, FSLib::wstring, std::vector< utf16 >, utf16 > {
    member1 text;
    member2 buffer;
    member3 character;
};
const struct json_string_parser : public grammar< json_string_parser, string_closure::context_t > {
    template< typename scanner_t >
    struct definition {
        definition( json_string_parser const& self ) {
            top = string[ self.text = arg1 ];
            string =
                    chlit< wchar_t >( L'"' )
                    >> *(
                        ( chlit< wchar_t >( L'\\' ) >> L'\"' )[ push_back( string.buffer, L'"' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'\\' )[ push_back( string.buffer, L'\\' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'/' )[ push_back( string.buffer, L'/' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'b' )[ push_back( string.buffer, utf16( 0x08 ) ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'f' )[ push_back( string.buffer, utf16( 0x0c ) ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'n' )[ push_back( string.buffer, L'\n' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'r' )[ push_back( string.buffer, L'\r' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L't' )[ push_back( string.buffer, L'\t' ) ]
                        | ( chlit< wchar_t >( L'\\' ) >> L'u' >> uint_parser< utf16, 16, 4, 4 >()[ push_back( string.buffer, arg1 ) ] )
                        | ( anychar_p[ string.character = arg1 ]
                                - ( chlit< wchar_t >( L'"' ) | chlit< wchar_t >( L'\\' ) )
                            )[ push_back( string.buffer, string.character ) ]
                    ) >> chlit< wchar_t >( L'"' )[ string.text = string.buffer /* this is hard */ ];
        }
        rule< scanner_t, string_closure::context_t > string;
        rule< scanner_t > top;

        rule< scanner_t > const &start() const { return top; }
    };
} json_string_p;

This parser uses Boost.Phoenix and closures, which I think makes it a little easier to follow — but of course I would say that :)

There are a couple of things to notice:

  1. The parsing is done into an explicit UTF-16 character buffer, std::vector< utf16 >.
  2. How to do the assignment from the buffer to the string type is not obvious.

Because JSON strings decode to UTF-16 code units, the second point becomes even harder to deal with if you try to mix the buffer character type with a different final string character type. I'm lucky because I have access to all of FOST.3™'s Unicode support, and FSLib::wstring is a std::wstring-like class which has explicit Unicode support and can be constructed and assigned to directly from a UTF-16 buffer.

Consider the following JSON strings:

"\u2014"
"\u5b6b\u5b50"
"\xd834\xdd1e"

Here they are decoded:

"—"
"孫子"
"𝄞"

The first one is just an em dash and the second is Sun Tzu's name in Chinese, but the last is hard. If you're not using a good browser you probably won't even see it. It's a treble clef: a single Unicode code point which has to be represented as two UTF-16 code units (a surrogate pair). It needs to be converted to four bytes in UTF-8 (F0 9D 84 9E), not six — the UTF-16 to UTF-8 converter has to go via UTF-32 to get this right.
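The arithmetic behind that conversion is mechanical but easy to get wrong, so here is a small self-contained sketch of the step the converter has to take: combine the surrogate pair into one code point (effectively going via UTF-32), then encode that code point as UTF-8. The function names are mine, not from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Combine a UTF-16 surrogate pair into a single Unicode code point.
// 0xD834 0xDD1E -> U+1D11E (the treble clef).
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000u + ((std::uint32_t(high) - 0xD800u) << 10)
                    + (std::uint32_t(low)  - 0xDC00u);
}

// Encode one code point as UTF-8. Encoding the code point, rather than
// each UTF-16 unit separately, is what yields four bytes instead of six.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Encoding the two surrogates individually as if they were ordinary code points produces the six-byte sequence (sometimes called CESU-8), which is exactly the mistake a naive converter makes.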

Whether Thomas wants to deal with this, or how it should be dealt with in a lightweight library is really an open question.

What I think Thomas can do is use my string parser above and parameterise it on a conversion function that converts from the UTF-16 buffer to the required string type. He can keep the library light by providing a fairly simple implementation that throws an exception on anything non-ASCII, but also allow for better Unicode handling when users supply a more capable (and heavier weight) implementation.
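One possible shape for that suggestion, sketched with standard types only (the names here are illustrative, not TinyJSON's API): the parser fills a UTF-16 buffer and hands it to a caller-supplied converter, with a deliberately light ASCII-only converter as the default.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using utf16_buffer = std::vector<std::uint16_t>;

// The lightweight default: accept ASCII only, throw on anything else.
struct ascii_converter {
    std::string operator()(const utf16_buffer& buf) const {
        std::string out;
        for (std::uint16_t u : buf) {
            if (u > 0x7F)
                throw std::runtime_error(
                    "non-ASCII character: supply a Unicode-aware converter");
            out += char(u);
        }
        return out;
    }
};

// The parser would hand its buffer off like this, so a heavier
// Unicode-aware converter can be dropped in without touching the grammar.
template<typename Converter>
auto convert_buffer(const utf16_buffer& buf, Converter conv)
        -> decltype(conv(buf)) {
    return conv(buf);
}
```

Because the converter also chooses the return type, the same parameterisation lets users pick their string type (std::string, std::wstring, or something like FSLib::wstring) along with their Unicode policy.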

