Thursday, March 27, 2008

Shlisp

I've been thinking recently about how to go about writing a simple self-hosting lisp compiler targeting LLVM. The basic plan of action is this:

  • Write a quick and dirty interpreter for the language (“shlisp”).

  • Write a compiler for shlisp in shlisp itself.

  • Use the interpreter to bootstrap the compiler.


So far, I've made good progress in writing an interpreter in C++ for shlisp. It differs in some interesting respects from Common Lisp and Scheme:

No cons


The basic data structure is the ‘sequence’, which is supposed to be implemented as some sort of general-purpose extensible array, of the kind you find in Perl, Python, etc. Having linked lists as a default data structure is arguably a bad idea. Lisp code is littered with explicit manipulation of lists, which makes it hard to reuse list-manipulating code on other kinds of sequences. Furthermore, the use of cons pairs to create lists allows for “improper lists,” creating all sorts of complications regarding which functions will and will not accept improper lists. Eventually, I want to implement some sort of STL-like generic sequence library. For now, I've made the interface to sequences relatively neutral w.r.t. their underlying implementation (most of the sequence-manipulating functions could be efficiently implemented for linked lists or other sequential structures). I have found that it is very easy to work with sequences given a decent library of utility functions — possibly even easier than with linked lists. The one exception is that it is not possible to construct sequences recursively using recursive function calls. However, most code of this sort can be written just as easily and declaratively using unfold-style combinators. For example, here's the code for the ‘range’ function, which generates a list of consecutive integers in a given range:

(defun range (start end)
(generate
(\(x stop yield)
(if (>= x end)
(stop)
(yield (+ 1 x))))
start))

That's just as concise as the directly recursive version. The ‘generate’ combinator has a simple definition (though the lack of tail-call elimination in the interpreter makes this version hugely inefficient):

(defun generate (f start)
(let ((accum (seq start)))
(letf loop ()
(f
(last accum)
(\() accum)
(\(x)
(push accum x)
(loop))))))

Shlisp uses ‘\’ instead of ‘lambda’. ‘letf’ is the equivalent of Scheme's named let.

A final advantage of sequences over cons-based lists is that many natural imperative operations (e.g. pushing an item onto a sequence) can be implemented as functions instead of macros.

No distinction between strings and symbols


There doesn't seem to be much point in making this distinction these days (and indeed, there are many historic lisps that didn't make it). Interning strings at runtime isn't a sufficiently useful operation to be built into the language, and constant strings can always be interned by an optimizing compiler.

No boolean values


Zero and the empty sequence are considered false; everything else is true. I don't see much point in having boolean values in a dynamically typed language. In Scheme, the main use for #t and #f in practice seemed to be to provide return values for exceptional cases (e.g. a function that normally returned an integer might return #f on failure). I prefer to pass success and failure continuations to the function in these cases. For example:

(parse-integer "123"
(\() (write "Parsing failed."))
(\(n) (write "Success!")))

Another benefit to eliminating #t and #f is that (come the compiler) there is more room for tagged representations of other simple types.

A pragmatic approach to macro hygiene


Shlisp is a lisp-1 (i.e. functions and variables share the same namespace), so there is potentially an issue with unhygienic macros. Hygienic macro systems are nice, but I don't feel up to implementing one at the moment. Instead, I've added the ‘global’ special form for explicitly accessing symbols in the global environment. The following expressions illustrate how it is used:

(let ((+ -)) (+ 1 2)) => -1
(let ((+ -)) ((global +) 1 2)) => 3

‘global’ works for functions and macros, but not special forms. The upshot of all this is that well-written macros can be guaranteed to work so long as none of the special forms they use have been shadowed by other bindings. This is an improvement over Common Lisp, where well-written macros can only be guaranteed to work so long as none of the special forms or functions or variables they use have been shadowed. Shlisp is lexically scoped, so I do not think there should be any problem implementing ‘global’ efficiently in a compiler.

Very basic support for generic functions


I wanted to find some sort of middle ground between Scheme (no support for generic functions) and Common Lisp (very elaborate support). Although I'd like to add a proper generic function system at some point, I don't want to have to have it built into the language (the amount of C++ code needs to be kept to a minimum). Currently, I've settled on the following solution. Any function may raise a BADARGS error if given an argument list which it can't handle. Functions also have a delegates property, which is a list of functions. If a BADARGS error is raised, the argument list is passed to each of the delegate functions in turn, until one of the functions executes without raising BADARGS. This alone is enough to implement simple generic functions. I expect that in the future, the delegates list for a generic function will contain a single delegate that implements a more sophisticated generic dispatch algorithm.

...and that's it


The interpreter's currently about 2500 lines of C++. I'm hoping that once all of the features I want have been implemented it won't go over 4000 or so. With luck, it won't be necessary to write a garbage collector (since as long as the interpreter can bootstrap the compiler without running out of memory, there's no need).

4 comments:

John Connors said...

The decision to implement on top of sequences rather than lists is interesting. Does this mean you represent code simply as a sequence rather than a list? Or are you still using lists as a concrete implementation of an abastract sequence thing..?

Alex Drummond said...

Code is actually represented as a sequence in the underlying implementation (they're STL vectors in the interpreter). It seems to have surprisingly few consequences -- the big O differences don't really matter much when you're writing macros.

Anonymous said...

I agree that interning strings is not a sufficient argument for implementing symbols. However, I like the slightly more "clean" look it can give to code when doing comparisons. For example (in Ruby syntax) the difference between
if query == "post"
versus
if query == :post
Not a lot of difference, I admit. But what about a reader macro that interprets :post as a one-word string, equivalent to "post" ?

Alex Drummond said...

Currently, you can use "..." for arbitrary symbols and 'foo for symbols which don't contain any obviously inappropriate characters (e.g. spaces, quotes, etc.) So if you're OK with 'foo instead of :foo this shouldn't be too much of an issue.