lingnerd: 2008

Sunday, August 10, 2008

RJson update

I've been meaning to write a little more about RJson for a while. The library has come quite a long way since the post describing the original version. One feature suggested to me by Jeremy Shaw, which should go in the next release, is automatic boxing and unboxing of types with single unary constructors. For example, suppose we have the following type:


data Counter = Counter Int deriving Show

We derive instances of ‘ToJson’ and ‘FromJson’ for this type using the following template Haskell code:


$(derive[''Counter])

Now, if we don't define custom ‘ToJson’ or ‘FromJson’ instances, the default behavior is this:


toJsonString (Counter 7)
--> "7"

fromJsonString (undefined :: Counter) "7"
--> Counter 7

This is a big improvement over the default behavior in older versions, which would have converted (Counter 7) to a single-element JSON array.

Thursday, April 24, 2008

Pattern matching and generic functions

I've been working some more on Shlisp. The interpreter is becoming reasonably complete in terms of implementing the core language functionality. It now uses Boehm GC to implement garbage collection, and setjmp, longjmp and stack copying to implement continuations (effectively by adding continuations to C++ itself, using the technique described here).

Now that the interpreter is fairly reliable, I've been writing some actual Shlisp code. This post is about the pattern-matching and generic function features. Shlisp makes use of the same pattern matcher for three purposes: ordinary pattern matching, lambda argument lists and generic function dispatch. Although this means that the pattern-matching syntax isn't perfectly optimized for any one of these domains, I think the reduction in complexity is adequate compensation. The basic pattern matcher does more or less what you would expect. Here's an example of destructuring a sequence:


(match '(1 2 3 4)
  (() ())
  ((x y | zs) zs))

=> (3 4)

The ‘|’ is used to introduce a pattern which must match each element of the rest of a sequence. This pattern can itself be complex, so it is easy to define a function which accepts (say) a single argument followed by a list of pairs:


(defun foo args
  (match args
    ((x | (fst snd)) (some-more-code ...))))

By convention, pattern-matching derivatives of definition macros have a colon appended to their names. For example the previous function can be defined more concisely as follows:


(defun: foo
  ((x | (fst snd)) (some-more-code ...)))

The ‘match’ macro returns no value if none of the specified patterns match, but an error will be raised if none of the patterns for a ‘defun:’ match. The following syntax is used to match on the type of a value:


(defun: as-integer
  (((x : integer) x))
  (((x : double) (floor x))))

This syntax extends to the definition of simple methods for generic functions. A generic function is defined like this:


(defgeneric generic-add)

And methods are added like this:


(defmethod generic-add ((x : number) (y : number))
  (+ x y))
(defmethod generic-add ((x : string) (y : string))
  (concat x y))

The ‘number’ type is a supertype of all numeric types. As befits a generic function system, we can add an additional method for a subtype of the ‘number’ type:


(defmethod generic-add ((x : integer) (y : integer))
  (+ 10 x y)) ; Silly (but different) behavior.

In the case of generic functions with overlapping argument types, the dispatch algorithm selects the method with the most specific argument list, going from left to right along the arguments. The ‘defmethod’ form is only able to dispatch on fixed numbers of arguments, but there is also a ‘defmethod:’ form which integrates pattern matching with generic dispatch. This form accepts an arbitrary pattern, but still ensures that methods with overlapping argument types are dispatched appropriately. In the case of ‘generic-add’, we can allow each method to receive any number of arguments:


(defmethod: generic-add ((| (xs : number)))
  (apply + xs))

As I mentioned in the previous post, it is possible to add generic dispatch to a builtin, non-generic function. For example, if we wanted to extend ‘+’ so that it could be used to concatenate symbols, the following code would do the trick:


(add-generic +)

(defmethod: + ((| (xs : symbol)))
  (apply concat xs))

The generic dispatch mechanism is activated only if the builtin addition operation receives bad arguments, so the performance of ordinary arithmetic operations needn't suffer too much.

Thursday, March 27, 2008

Shlisp

I've been thinking recently about how to go about writing a simple self-hosting lisp compiler targeting LLVM. The basic plan of action is this:

Write a quick and dirty interpreter for the language (“shlisp”).

Write a compiler for shlisp in shlisp itself.

Use the interpreter to bootstrap the compiler.

So far, I've made good progress in writing an interpreter in C++ for shlisp. It differs in some interesting respects from Common Lisp and Scheme:

No cons

The basic data structure is the ‘sequence’, which is supposed to be implemented as some sort of general-purpose extensible array, of the kind you find in Perl, Python, etc. Having linked lists as a default data structure is arguably a bad idea. Lisp code is littered with explicit manipulation of lists, which makes it hard to reuse list-manipulating code on other kinds of sequences. Furthermore, the use of cons pairs to create lists allows for “improper lists,” creating all sorts of complications regarding which functions will and will not accept improper lists. Eventually, I want to implement some sort of STL-like generic sequence library. For now, I've made the interface to sequences relatively neutral w.r.t. their underlying implementation (most of the sequence-manipulating functions could be efficiently implemented for linked lists or other sequential structures). I have found that it is very easy to work with sequences given a decent library of utility functions — possibly even easier than with linked lists. The one exception is that it is not possible to construct sequences recursively using recursive function calls. However, most code of this sort can be written just as easily and declaratively using unfold-style combinators. For example, here's the code for the ‘range’ function, which generates a list of consecutive integers in a given range:


    (defun range (start end)
      (generate
        (\(x stop yield)
          (if (>= x end)
            (stop)
            (yield (+ 1 x))))
        start))

That's just as concise as the directly recursive version. The ‘generate’ combinator has a simple definition (though the lack of tail-call elimination in the interpreter makes this version hugely inefficient):


    (defun generate (f start)
      (let ((accum (seq start)))
        (letf loop ()
          (f
            (last accum)
            (\() accum)
            (\(x)
              (push accum x)
              (loop))))))

Shlisp uses ‘\’ instead of ‘lambda’. ‘letf’ is the equivalent of Scheme's named let.

A final advantage of sequences over cons-based lists is that many natural imperative operations (e.g. pushing an item onto a sequence) can be implemented as functions instead of macros.

No distinction between strings and symbols

There doesn't seem to be much point in making this distinction these days (and indeed, there are many historic lisps that didn't make it). Interning strings at runtime isn't a sufficiently useful operation to be built into the language, and constant strings can always be interned by an optimizing compiler.

No boolean values

Zero and the empty sequence are considered false; everything else is true. I don't see much point in having boolean values in a dynamically typed language. In Scheme, the main use for #t and #f in practice seemed to be to provide return values for exceptional cases (e.g. a function that normally returned an integer might return #f on failure). I prefer to pass success and failure continuations to the function in these cases. For example:


    (parse-integer "123"
      (\() (write "Parsing failed."))
      (\(n) (write "Success!")))

Another benefit to eliminating #t and #f is that (come the compiler) there is more room for tagged representations of other simple types.

A pragmatic approach to macro hygiene

Shlisp is a lisp-1 (i.e. functions and variables share the same namespace), so there is potentially an issue with unhygienic macros. Hygienic macro systems are nice, but I don't feel up to implementing one at the moment. Instead, I've added the ‘global’ special form for explicitly accessing symbols in the global environment. The following expressions illustrate how it is used:


    (let ((+ -)) (+ 1 2))           => -1
    (let ((+ -)) ((global +) 1 2))  => 3

‘global’ works for functions and macros, but not special forms. The upshot of all this is that well-written macros can be guaranteed to work so long as none of the special forms they use have been shadowed by other bindings. This is an improvement over Common Lisp, where well-written macros can only be guaranteed to work so long as none of the special forms or functions or variables they use have been shadowed. Shlisp is lexically scoped, so I do not think there should be any problem implementing ‘global’ efficiently in a compiler.

Very basic support for generic functions

I wanted to find some sort of middle ground between Scheme (no support for generic functions) and Common Lisp (very elaborate support). Although I'd like to add a proper generic function system at some point, I don't want to have to have it built into the language (the amount of C++ code needs to be kept to a minimum). Currently, I've settled on the following solution. Any function may raise a BADARGS error if given an argument list which it can't handle. Functions also have a delegates property, which is a list of functions. If a BADARGS error is raised, the argument list is passed to each of the delegate functions in turn, until one of the functions executes without raising BADARGS. This alone is enough to implement simple generic functions. I expect that in the future, the delegates list for a generic function will contain a single delegate that implements a more sophisticated generic dispatch algorithm.

...and that's it

The interpreter's currently about 2500 lines of C++. I'm hoping that once all of the features I want have been implemented it won't go over 4000 or so. With luck, it won't be necessary to write a garbage collector (since as long as the interpreter can bootstrap the compiler without running out of memory, there's no need).

Tuesday, January 29, 2008

Deserializing JSON

I've improved the code from the last post so that it is now capable of deserializing as well as serializing. The improved library is on hackage. I'll post some example code soon.

lingnerd