\[\newcommand{\andalso}{\quad\quad} \newcommand{\infabbrev}[2]{\infax{#1 \quad\eqdef\quad #2}} \newcommand{\infrule}[2]{\displaystyle \dfrac{#1}{#2}} \newcommand{\ar}{\rightarrow} \newcommand{\Int}{\mathtt{Int}} \newcommand{\Bool}{\mathtt{Bool}} \newcommand{\becomes}{\Downarrow} \newcommand{\trule}[1]{(\textbf{#1})} \newcommand{\FV}[1]{\mathtt{fv}(#1)} \newcommand{\FTV}[1]{\mathtt{ftv}(#1)} \newcommand{\BV}[1]{\mathtt{bv}(#1)} \newcommand{\compiles}[1]{\text{C}\llbracket{#1}\rrbracket} \newcommand{\exec}[1]{\text{E}\llbracket{#1}\rrbracket} \renewcommand{\t}[1]{\mathtt{#1}} \newcommand{\ite}[3]{\text{if }#1\text{ then }#2\text{ else }#3} \]

Introduction

Goals

Off we go on our Adventure in Haskell Compilers! It will be intense, long, informative, and hopefully fun.

It's important to stress several points about the goals before we start our discussion:

  1. This is not a rigorous introduction to type systems, it is a series of informal discussions of topics structured around a reference implementation with links provided to more complete and rigorous resources on the topic at hand. The goal is to give you an overview of the concepts and terminology as well as a simple reference implementation to play around with.

  2. None of the reference implementations are industrial strength, many of them gloss over fundamental issues that are left out for simplicity reasons. Writing an industrial strength programming language involves work on the order of hundreds of person-years and is an enormous engineering effort.

  3. You should not use the reference compiler for anything serious. It is intended for study and reference only.

Throughout our discussion we will stress the importance of semantics and the construction of core calculi. The frontend language syntax will be in the ML-family syntax out of convenience rather than principle. Choice of lexical syntax is arbitrary, uninteresting, and quite often distracts from actual substance in comparative language discussion. If there is one central theme it is that the design of the core calculus should drive development, not the frontend language.

Prerequisites

An intermediate understanding at the level of the Real World Haskell book is recommended. We will avoid advanced type-level programming that is often present in modern Haskell, and instead will make heavy use of more value-level constructs. A strong familiarity with monads, monad transformers, applicatives, and the standard Haskell data structures is strongly recommended.

Some familiarity with the standard 3rd party libraries will be useful. Many of these are briefly overviewed in What I Wish I Knew When Learning Haskell.

In particular we will use:

Library Description
containers Core data structures
unordered-containers Core data structures
text Text datastructure
bytestring Text datastructure
transformers
mtl
filepath
directory
process
parsec Parser combinators
happy Parser generator
alex Lexer generator
pretty Pretty print combinators
ansi-wl-pprint Pretty print combinators
pretty-show Haskell pretty printer
graphscc Topological sorting
haskeline GNU Readline integration
repline Interactive shell builder
cereal
deepseq
uniqueid
uniplate
optparse-applicative Commandline argument
hoopl
fgl
llvm-general LLVM Codegen
smtLib
sbv

In later chapters some experience with C, LLVM and x86 Assembly will be very useful, although not strictly required.

Concepts

We are going to set out to build a statically typed functional programming language with a native code generation backend. What does all this mean?

Functional Languages

In mathematics a function is defined as a correspondence that assigns exactly one element of a set to each element in another set. If a function \(f(x) = a\) then the function evaluated at \(x\) will always have the value \(a\). Central to the notion of all mathematics is the notion of equational reasoning, where if \(a= f(x)\) then for an expression \(g(f(x), f(x))\), this is always equivalent to \(g(a, a)\). In other words, the values computed by functions can always be substituted freely at all occurrences.

The central idea of pure functional programming is to structure our programs in such a way that we can reason about them as a system of equations just like we can in mathematics. The evaluation of a pure function is one in which side effects are prohibited; a function may only return a result without altering the world in any observable way.

The implementation may perform effects, but central to this definition is the unobservability of such effects. A function is said to be referentially transparent if replacing a function with its computed value output yields the same observable behavior.

By contrast impure functions are ones which allow unrestricted and observable side effects. The invocation of an impure function always allows for the possibility of performing any functionality before yielding a value.

// impure: mutation side effects
function f() {
  x += 3;
  return 42;
}

// impure: international side effects
function f() {
  launchMissiles();
  return 42;
}

The behavior of a pure function is independent of where and when it is evaluated, whereas the behavior of an impure function is intrinsically tied to its execution order.

Static Typing

Types are a formal language integrated with a programming language that refines the space of allowable behavior and degree of expressible programs for the language. Types are the world's most popular formal method for analyzing programs.

In a language like Python all expressions have the same type at compile time, and all syntactically valid programs can be evaluated. In the case where the program is nonsensical the runtime will bubble up exceptions during evaluation. The Python interpreter makes no attempt to analyze the given program for soundness at all before running it.

>>> True & "false"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for &: 'bool' and 'str'

By comparison Haskell will do quite a bit of work to try to ensure that the program is well-defined before running it. The language that we use to prescribe and analyze static semantics of the program is that of static types.

Prelude> True && "false"

<interactive>:2:9:
    Couldn't match expected type `Bool' with actual type `[Char]'
    In the second argument of `(&&)', namely `"false"'
    In the expression: True && "false"
    In an equation for `it': it = True && "false"

Catching minor type mismatch errors is the simplest example of usage, although they occur extremely frequently as we humans are quite fallible in our reasoning about even the simplest of program constructions! Although this is just the tip of the iceberg, the gradual trend over the last 20 years goes toward more expressive types in modern type systems which are capable of guaranteeing a large variety of program correctness properties.

  • Preventing resource allocation errors.
  • Enforcing security in communication protocols.
  • Side effect management.
  • Preventing buffer overruns.
  • Ensuring cryptographic properties for network protocols.
  • Modeling and verifying theorems in mathematics and logic.
  • Preventing data races and deadlocks in concurrent systems.

Even though type systems will never be able to capture all aspects of a program, more sophisticated type systems are increasingly able to model a large space of program behavior. They are one of the most exciting areas of modern computer science research. Put most bluntly, static types let you be dumb and offload the checking that you would otherwise have to do in your head to a system that can do the reasoning for you and work with you to interactively build your program.

Functional Compilers

A compiler is a program for turning high-level representation of ideas in a human readable language into another form. A compiler is typically divided into parts, a frontend and a backend. These are loose terms but the frontend typically deals with converting the human representation of the code into some canonicalized form while the backend converts the canonicalized form into another form that is suitable for evaluation.

The high level structure of our functional compiler is described by the following block diagram. Each describes a phase which is a sequence of transformations composed to transform the input program.

Phase
Source The frontend textual source language.
Parsing Source is parsed into an abstract syntax tree.
Desugar Redundant structure from the frontend language is removed and canonicalized.
Type Checking The program is type-checked and/or type-inferred yielding an explicitly typed form.
Transformation The core language is transformed to prepare for compilation.
Compilation The core language is lowered into a form to be compiled or interpreted.
(Code Generation) Platform specific code is generated, linked into a binary.

A pass may transform the input program from one form into another or alter the internal state of the compiler context. The high level description of the forms our final compiler will go through is the following sequence:

Internal forms used during compilation are intermediate representations and typically any non-trivial language will involve several.

Parsing

The source code is simply the raw sequence of text that specifies the program. Lexing splits the text stream into a sequence of tokens. Only the presence of invalid symbols is checked; programs that are meaningless in other ways are accepted. Whitespace is either ignored or represented as a unique token in the stream.

let f x = x + 1

For instance the previous program might generate a token stream like the following:

[
  TokenLet,
  TokenSym "f",
  TokenSym "x",
  TokenEq,
  TokenSym "x",
  TokenAdd,
  TokenNum 1
]

We can then scan the token stream via dispatch on predefined patterns of tokens called productions, and recursively build up the syntax datatype for the abstract syntax tree (AST).

type Name = String

data Expr
  = Var Name
  | Lit Lit
  | Op PrimOp [Expr]
  | Let Name [Name] Expr

data Lit
  = LitInt Int

data PrimOp
  = Add

So for example the following string is parsed into the resulting Expr value.

let f x = x + 1
Let "f" ["x"] (Op Add [Var "x", Lit (LitInt 1)])

Desugaring

Desugaring is the process by which the frontend AST is transformed into a simpler form of itself by reducing the number of complex structures by expressing them in terms of a fixed set of simpler constructs.

Haskell's frontend is very large and many constructs are simplified down. For example where clauses and operator sections are the most common examples. Where clauses are effectively syntactic sugar for let bindings and operator sections are desugared into lambdas with the left or right hand side argument assigned to a fresh variable.

Type Inference

Type inference is the process by which the untyped syntax is endowed with type information by a process known as type reconstruction or type inference. The inference process may take into account explicit type annotations.

let f x = x + 1
Let "f" [] (Lam "x" (Op Add [Var "x", Lit (LitInt 1)]))

Inference will generate a system of constraints which are solved via a process known as unification to yield the type of the expression.

Int -> Int -> Int  ~  a -> b
b  ~  Int -> c
f :: Int -> Int

In some cases this type will be incorporated directly into the AST and the inference will transform the frontend language into an explicitly typed core language.

Let "f" []
  (Lam "x"
    (TArr TInt TInt)
    (App
      (App
        (Prim "primAdd") (Var "x"))
      (Lit (LitInt 1))))

Transformation

The type core representation is often suitable for evaluation, but quite often different intermediate representations are more amenable to certain optimizations and make various semantic properties of the language explicit. These kind of intermediate forms will often attach information about free variables, allocations, and usage information directly in the AST structure.

The most important form we will use is called the Spineless Tagless G-Machine ( STG ), an abstract machine that makes many of the properties of lazy evaluation explicit directly in the AST.

Code Generation

After translating an expression to the core language we will either evaluate it with a high-level interpreter written in Haskell itself, or translate it to another intermediate language (such as LLVM IR or GHC's Cmm) which can be compiled into native code. This intermediate language abstracts over the process of assigning values to, and moving values between CPU registers and main memory.

As an example, the let statement below would be compiled into some intermediate representation, like LLVM IR.

let f x = x + 1
define i32 @f(i32 %x) {
entry:
  %add = add nsw i32 %x, 1
  ret i32 %add
}

From the intermediate representation the code can be compiled into the system's assembly language. Any additional code that is required for evaluation is linked into the resulting module.

f:
        movl    %edi, -4(%rsp)
        movl    -4(%rsp), %edi
        addl    $1, %edi
        movl    %edi, %eax
        ret

And ultimately this code will be assembled into platform specific instructions by the native code generator, encoded as a predefined sequence of CPU instructions defined by the processor specification.

0000000000000000 <f>:
   0:   89 7c 24 fc             mov    %edi,-0x4(%rsp)
   4:   8b 7c 24 fc             mov    -0x4(%rsp),%edi
   8:   81 c7 01 00 00 00       add    $0x1,%edi
   e:   89 f8                   mov    %edi,%eax
  10:   c3                      retq