Thursday, December 27, 2007

Pushing Haskell's type system to the limits: A reflective JSON serializer

Since it looks like Haskell web development might start to take off soon, I've been thinking about how to write a good JSON serializer library. The obvious way to start is to define a simple type to represent JSON data structures:

import Data.Ratio
import qualified Data.Map as M

data JsonData = JDString String |
JDNumber Rational |
JDList [JsonData] |
JDObject (M.Map String JsonData)

instance Show JsonData where
...

It's easy enough to convert other data structures to the JsonData type, but serializing more complex types takes a lot of boilerplate code. So, maybe the the solution is to create a new class containing a ‘toJson’ function:

class Json a where
toJson :: a -> JsonData

instance Json Integer where
toJson i = show i

instance Json [a] where
toJson [] = []
toJson l@(x:xs) = "[" ++ (join "," l) ++ "]"
where join _ [] = []
join _ l@[x] = l
join k (x:ys) = x : k : (join k ys)

Once instances have been added for other basic datatypes, this works nicely enough. However, there are two significant problems. First, since Haskell's String type is just a synonym for [Char], there is no way to handle Strings any differently from normal lists. Second, if you use this approach for serializing simple record types, you end up with a lot of boilerplate instance declarations like the following:

import qualified Data.Map as M

data MyRecord1 = MyRecord1 { _x :: Int,
_y :: Int,
_z :: Int }
data MyRecord2 = MyRecord2 { _rec :: MyRecord1,
_a :: Int,
_b :: Int,
_c :: Int }

instance Json MyRecord1 where
toJson r = JDObject (M.fromList [("x", _x r),
("y", _y r),
("z", _z r)])

instance Json MyRecord2 where
toJson r = JDObject (M.fromList [
("rec", toJson (_rec r)),
("a", _a r),
("b", _b r),
("c", _c r)
])

Ideally, we would like to be able to give some sort of default procedure for serializing a record, so that unless we specify otherwise, the record will be converted into a JSON object with appropriate field names. The default should work correctly for nested records, and using it should require at most a single line of code per record type. This ideal can certainly be achieved in a dynamic, reflective language such as Common Lisp, Python or Ruby, but it is absolutely out of the question in Haskell 98. Surprisingly, with the addition of a number of Generic programming extensions to GHC, it is now possible to implement a generic JSON serializer for Haskell too. As you might expect, this involves some pretty hairy manipulation of the type system, so I'd like to set the bait before getting into the details. Here goes. Once we're done, the following code works:

import Json -- Generic Json serialization library (to be written)

data OuterRecord { x :: Int,
y :: String,
r :: InnerRecord }

data InnerRecord { z :: Int }

$(derive[''OuterRecord])
$(derive[''InnerRecord])

> toJson (OuterRecord { x=1,
y="foo",
r=(InnerRecord { z=7 })})

=> {"x":1,"y":"foo","r":{"z":7}} :: JsonData

If you have the luxury of designing your record types with JSON in mind, you just need to add one such line of code for every type. If you do need to customize serialization of a particular type, it's just a matter of creating an instance:

data CustomRecord { dontSerializeMe :: Int,
serializeMe :: Int }
$(derive[''CustomRecord])

instance Json CustomRecord where
toJson r = show (serializeMe r)

> toJson (CustomRecord { dontSerializeMe=1,
serializeMe=2 })

=> 2 :: JsonData

(In fact, the finished implementation of the serializer provides an easier way of excluding fields from serialization.) The code uses the Scrap Your Boilerplate approach to generics. Although GHC comes with the Data.Generics module, which implements the system described in the first SYB paper, my code is based on the third SYB paper (Scrap Your Boilerplate With Class). For this reason, you will need to install the syb-with-class package from Hackage in order to compile it. The current version (0.3) does not compile on ghc-6.8 out of the box, but only because of a trivial build config issue that's easily fixed (see below).

SYBWC makes it possible to define extensible generic functions. An ordinary generic function operates over instances of GHC's Data.Generics.Basics.Data class. For example, the ‘gshow’ function has the following type:

gshow :: (Data a) => a -> String

‘gshow’ works for the most common Haskell datatypes, but there is no way to customize it for your own datatypes. With SYBWC, generic functions can easily be extended using instance declarations (hence “with class”). Suppose we want to define our own generic show function using SYBWC. We begin, ironically enough, with a small amount of boilerplate code:

import Data.Generics.SYB.WithClass.Basic

class MyShow a where
myShow :: a -> String

-- Note that the type of the myShowD field is identical
-- to the type of 'myShow' in the MyShow class.
data MyShowD a = MyShowD { myShowD :: a -> String }

myShowProxy :: Proxy myShowD
myShowProxy = error "This value should never be evaluated!"

instance MyShow a => Sat (MyShowD a) where
dict = MyShowD { myShowD = myShow }

This is all rather involved, and I refer you to the SYBWC paper for the gorey details. The good thing is that this code follows exactly the same template whatever the type signature of your generic function, so you don't really need to understand any of it.

With the boilerplate out of the way, the rest is pretty straightforward. First, we need to define a generic implementation of myShow that will be called if there isn't a more specific implementation for a given type. Focusing on the instance declaration alone, we need the following:

instance Data JsonD a => MyShow a where
...

Note that the ‘Data’ class in Data.Generics.SYB.WithClass takes two type arguments. This is crucial to the increased flexibility of the SYBWC implementation of generics. In effect, the declaration above adds a new superclass to the Data class, telling GHC that every value that is an instance of ‘Data’ is also an instance of ‘Json’. This is very powerful technique, since it lets us treat any field of a record passed to ‘toJson’ as an instance of ‘Json’ — that's what's going to help us to recursively serialize record types. Defining a good generic show function is a project in itself, so I'll move on from this example and get straight to the JSON serializer. Here's the JSON boilerplate:

class Json a where
toJson :: a -> JsonData
exclude :: a -> [String]

data JsonD a = JsonD { toJsonD :: a -> JsonData }

jsonProxy :: Proxy JsonD
jsonProxy = error "This value should never be evaluated!"

instance Json t => Sat (JsonD t) where
dict = JsonD { toJsonD = toJson }

Now we have to decide how our generic serialization function is going to work. For record types, it should recursively serialize each value in the record and then create a JSON object with the appropriate field names. For other types, it's less obvious what should be done; I've chosen the following behavior. For algebraic types, the serializer outputs a list containing the name of the type followed by all of the arguments to the type constructor (if any). For primitive types (e.g. ‘Int’), a runtime error is signaled, since no sensible generic behavior can be defined. Here's the definition for the generic version of ‘toJson’:

instance Data JsonD t => Json t where
toJson x
| isAlgType (dataTypeOf jsonProxy x) =
case (map stripInitialUnderscores (getFields x)) of
[] -> JDList (JDString (dataTypeName (dataTypeOf jsonProxy x)) :
(gmapQ jsonProxy (toJsonD dict) x))
fs -> JDObject (M.fromList (zip fs (gmapQ jsonProxy (toJsonD dict) x)))
| True =
error ("Unable to serialize the primitive type '" ++
dataTypeName (dataTypeOf jsonProxy x) ++ "'")

getFields :: Data JsonD a => a -> [String]
getFields = constrFields . (toConstr jsonProxy)

The ‘dataTypeOf’ function returns a DataType value representing a Haskell data type. Its first argument is a “proxy”, which fixes the superclass of the Data class that we're dealing with (in our case, the ‘Json’ class). Once the DataType of the value has been obtained, we check that it is an algebraic type, signaling an error if it is not. Now we call the ‘getFields’ function, which returns [] for ordinary algebraic data types and a list of field names for record types. In either case, we use the ‘gmapQ’ function (which also takes a proxy argument) to map ‘toJson’ over all the constructor arguments. Crucially, however, there is no direct recursive call to ‘toJson’. Instead, the following expression is used to obtain a suitable function:

(toJsonD dict)

This allows the most specialized instance of ‘toJson’ to be selected for each field of the record. If ‘toJson’ were called directly, the code would still type-check and compile, but the specialized instances of ‘toJson’ would not be called, leading to the wrong behavior at runtime. Again, the SYBWC paper gives the details of how this works.

At this point, we have a working generic serialization function for record types. All we need to do now is add instances of the ‘Json’ class for common Haskell types. For example:

instance Json Int where
toJson i = show i

With this definition, we can now serialize nested records containing integers:

data OuterRecord = { _x :: Int,
_y :: Int,
_rec :: InnerRecord }
data InnerRecord = { _a :: Int,
_b :: Int }

$(derive[''OuterRecord])
$(derive[''InnerRecord])

> toJson (OuterRecord { _x=1,
_y=2,
_rec=(InnerRecord { _a=3, _b=4 }) })

=> {"x":1,"y":2,"rec":{"a":3,"b":4}} :: JsonData

The ‘$(derive ...)’ lines invoke some Template Haskell code that automatically derives an instance of SYBWC's ‘Data’ class for the given datatype. That's basically it. We have a generic, extensible JSON serializer, and for the most part, we haven't sacrificed any type safety.1 The only problem that remains is that of treating strings differently from other lists. This is easily accomplished in the instance declaration for lists with the help of GHC's ‘cast’ function:

instance (Typeable a, Json a) => Json [a] where
toJson l =
case (cast l) :: Maybe String of
Just s -> JDString s
Nothing -> JDList (map toJson l)

[It has been pointed out to me that the use of a cast here is not required. An alternative is to use the same trick that the Prelude uses to ensure that ‘show’ treats strings properly (namely, addition of a ‘showList’ method to the Show class).]

The finished library adds an ‘exclude’ member to the ‘Json’ class. This can be used to exclude fields of a record from serialization without having to write a full custom instances. For example:

data CustomRecord { _dontSerializeMe :: Int,
_serializeMe :: Int }
$(derive[''CustomRecord])

instance Json CustomRecord where
excluded _ = ["_dontSerializeMe"]
toJson = genericToJson

> toJson (CustomRecord { _dontSerializeMe=1, _serializeMe=2 })

=> {"serializeMe":2} :: JsonData

I have not found a way of providing a default implementation for ‘toJson’ which is compatible with Haskell's type system, so if you want the default serialization behavior, you must manually set ‘toJson’ to ‘genericToJson’ (a function provided by the library).

Installing syb-with-class


In order to compile syb-with-class successfully for ghc-6.8, you need to make the following changes to the file ‘syb-with-class.cabal’. (Disclaimer: this is just a quick hack, not a proper fix.)

  • In the line beginning ‘Build-Depends:’, add ‘array’ to the dependencies.

  • In the line beginning ‘GHC-Options:’, add the option ‘-fglasgow-exts’.

You can now use the following commands to build and install the package:

runghc Setup.hs configure
runghc Setup.hs build
runghc Setup.hs install

Code

{-# OPTIONS_GHC
-XFlexibleInstances
-XOverlappingInstances
-XMultiParamTypeClasses
-XFlexibleContexts
-XUndecidableInstances
-XTemplateHaskell
#-}

module Json (Json, toJson, genericToJson, JsonData(..)) where

import Data.Generics.SYB.WithClass.Basics
import Data.Generics.SYB.WithClass.Instances
import Data.Generics.SYB.WithClass.Context
import Data.Generics.SYB.WithClass.Derive
import Language.Haskell.TH.Syntax
import qualified Data.Map as M
import qualified Text.Printf as P
import Data.Char
import Data.Ratio
import Data.Array

-- | The 'JsonData' type is a Haskell representation of a JSON
-- data structure.
data JsonData = JDString String |
JDNumber Rational |
JDList [JsonData] |
JDObject (M.Map String JsonData)

join :: a -> [a] -> [a]
join _ [] = []
join _ l@[x] = l
join k (x:ys) = x : k : (join k ys)

concatJoin :: String -> [String] -> String
concatJoin k l = concat (join k l)

alistToJsonDict :: [(String, String)] -> String
alistToJsonDict l =
"{" ++
concatJoin "," (map (\(k,v) -> (escapeString k) ++ ":" ++ v) l)
++ "}"

-- Special characters which will be pretty printed.
escapeMap :: M.Map Char String
escapeMap = M.fromList [
('\\', "\\"), ('"', "\""), ('\'', "'"), ('\n', "n"),
('\r', "r"), ('\f', "f"), ('\t', "t"), ('\b', "\b")]
escape :: Char -> Maybe String
escape c = M.lookup c escapeMap

-- Characters which can safely be printed as literals.
allowedMap = M.fromList (zip " !@#$%^&*()_-+={[}]|:;<>,.?/~`" (repeat True))
allowed' c o
| o >= ord 'a' && o <= ord 'z' = True
| o >= ord 'A' && o <= ord 'Z' = True
| o >= ord '0' && o <= ord '9' = True
| True = M.member c allowedMap
allowed c = allowed' c (ord c)

hexEscape :: Char -> String
hexEscape c = "\\u" ++ P.printf "%04x" (ord c)

escapeString' :: String -> String
escapeString' [] = "\""
escapeString' (c:cs)
| allowed c =
c : (escapeString' cs)
| True =
(maybe (hexEscape c) (\s -> "\\" ++ s) (escape c)) ++
(escapeString' cs)

escapeString s = '"' : escapeString' s

instance Show JsonData where
show (JDString s) = escapeString s
show (JDNumber n)
-- Show as an integer if possible, otherwise as double precision float.
| denominator n == 1 =
show (numerator n)
| True =
show ((fromIntegral (numerator n) :: Double) /
(fromIntegral (denominator n) :: Double))
show (JDList l) = "[" ++ concatJoin "," (map show l) ++ "]"
show (JDObject o) = alistToJsonDict (map (\(k,v) -> (k, show v)) (M.toList o))


--
-- Json class plus SYB boilerplate.
--
-- | New instances can be added for the 'Json' class to customize
-- JSON serialization.
class Json a where
toJson :: a -> JsonData
exclude :: a -> [String]
exclude _ = []

data JsonD a = JsonD { toJsonD :: a -> JsonData }

jsonProxy :: Proxy JsonD
jsonProxy = error "This value should never be evaluated!"

instance Json t => Sat (JsonD t) where
dict = JsonD { toJsonD = toJson }


--
-- Implementations of toJson for different data types.
--
instance Json Int where
toJson i = JDNumber (fromIntegral i)
instance Json Integer where
toJson i = JDNumber (fromIntegral i)
instance Json Float where
toJson i = JDNumber (toRational i)
instance Json Double where
toJson i = JDNumber (toRational i)
instance Integral a => Json (Ratio a) where
toJson i = JDNumber ((fromIntegral (numerator i)) %
(fromIntegral (denominator i)))
instance Json Char where
toJson c = JDString [c]

instance Json a => Json (M.Map String a) where
toJson x = JDObject (M.map toJson x)

-- Handles lists, with special case for [Char].
instance (Typeable a, Json a) => Json [a] where
toJson l =
case (cast l) :: Maybe String of
Just s -> JDString s
Nothing -> JDList (map toJson l)

-- TODO: Add instances for the other array types supported by GHC.
instance (Typeable a, Json a, Ix i) => Json (Array i a) where
toJson a = toJson (elems a)

stripInitialUnderscores "" = ""
stripInitialUnderscores ('_':s) = stripInitialUnderscores s
stripInitialUnderscores s = s

getFields :: Data JsonD a => a -> [String]
getFields = constrFields . (toConstr jsonProxy)

-- | This is the implementation of 'toJson' for the generic instance declaration,
-- but it's useful to be able to use the same implentation for
-- other instance declarations which override the default implementation
-- of 'exclude'.
genericToJson :: (Data JsonD a, Json a) => a -> JsonData
genericToJson x
| isAlgType (dataTypeOf jsonProxy x) =
case (map stripInitialUnderscores (filter (not . (flip elem) (exclude x)) (getFields x))) of
[] -> JDList (JDString (dataTypeName (dataTypeOf jsonProxy x)) :
(gmapQ jsonProxy (toJsonD dict) x))
fs -> JDObject (M.fromList (zip fs (gmapQ jsonProxy (toJsonD dict) x)))
| True =
error ("Unable to serialize the primitive type '" ++ dataTypeName (dataTypeOf jsonProxy x) ++ "'")

instance Data JsonD t => Json t where
toJson = genericToJson

-- Instances for tuples up to n=5.
-- Tuples are converted to (hetrogenous) JSON lists.
instance (Json a, Json b) => Json (a, b) where
toJson (a,b) = JDList [toJson a, toJson b]
instance (Json a, Json b, Json c) => Json (a,b,c) where
toJson (a,b,c) = JDList [toJson a, toJson b, toJson c]
instance (Json a, Json b, Json c, Json d) => Json (a,b,c,d) where
toJson (a,b,c,d) = JDList [toJson a, toJson b, toJson c, toJson d]
instance (Json a, Json b, Json c, Json d, Json e) => Json (a,b,c,d,e) where
toJson (a,b,c,d,e) = JDList [toJson a, toJson b, toJson c, toJson d, toJson e]




1. The one exception is the possibility of a runtime error if we try to serialize a primitive type for which no instance has been defined. In practice, given instances for all the basic Haskell String/numeric types, this sort of error is very unlikely to occur. [^]