//! Ok I'm in the mood to talk about this, so I'm gonna do it in literate
//! programming style ig. This is an attempt to write a type checker in datalog.
//!
//! Having written a (bad) type checker or two by now, they basically seem to be
//! constraint and inference systems. The algorithms implementing Hindley-Milner
//! typing seem to come down to "collect facts that may have unknowns in them,
//! put them into a big table, and fill in the unknowns by following them to other
//! facts until you have no unknowns left". This is the same sort of "unification"
//! process that Prolog uses, so why not just straight up write the type checker
//! in a logic lang to begin with? Rust's `polonius` research project is already a
//! borrow checker written in Datalog, so it doesn't seem a *crazy* approach, but it
//! does seem weird that no real compilers seem to do this. You're already writing
//! a rule-solving engine in your compiler to do your type checking, why not just
//! use a *real* rule-solving engine instead of half-assing it? Hell, given that an
//! incremental compiler is already kinda a database, adding a rule-based database
//! engine to your compiler seems even more useful.
//!
//! idk if this is actually a good idea of course, but so far it doesn't seem to be a
//! *terrible* idea. So we'll see how far we can push this.
//!
//! # What and why Datalog?
//!
//! Datalog is a logic programming language that is a subset of Prolog, though
//! it's not really quite a programming language since it's not Turing complete.
//! It seems to be more or less what you get when you take Horn clauses
//! (`rule(thing) :- foo(thing), bar(thing).` meaning "rule(thing) is true if
//! foo(thing) is true and bar(thing) is true") and stick them together with the
//! minimal possible amount of scaffolding around them. The result is a lot more like
//! a database query language than a programming language. Production databases
//! have been written using Datalog as their query language, probably most notably
//! Datomic, and it seems to work pretty well. Datalog also tends to be evaluated
//! bottom-up while Prolog programs are evaluated top-down, but I'm not deep enough
//! into the math to have a good understanding of how that affects the solving
//! algorithms. Datalog rules form a graph that can recurse and form loops, as
//! long as the amount of data it's processing strictly decreases each time around
//! the loop, but they are otherwise pretty constrained. It's dreadfully easy to
//! become Turing-complete by accident.
//!
//! The trick with Datalog is that the variables in the rules can be both input and
//! output, same as Prolog, so you can express a lot more than you could with just
//! boolean algebra.
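//!
//! The classic example is transitive closure over a graph (written here in
//! generic textbook Datalog syntax rather than any particular dialect):
//!
//! ```text
//! path(X, Y) :- edge(X, Y).
//! path(X, Z) :- edge(X, Y), path(Y, Z).
//! ```
//!
//! Given facts like `edge(1, 2).` and `edge(2, 3).`, these two rules derive every
//! `path` fact in the graph. Notice how in the second rule the variable `Y` acts
//! as an "output" of `edge(X, Y)` and an "input" to `path(Y, Z)` at the same time;
//! no argument position is inherently one or the other.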
//!
//! Datalog has a few nice properties. Since it's not Turing-complete, Datalog
//! programs are guaranteed to terminate. The worst case time to evaluate a Datalog
//! program is exponential in the size of the program (though only polynomial in
//! the amount of data), but there are pretty good optimizations and evaluation
//! strategies out there that seem to result in most Datalog programs running in
//! polynomial time or better. (I'm not savvy enough to know how to optimize a
//! Datalog program for a particular evaluation strategy though, and my programs
//! are thus far not large, so it hasn't really come up in practice.) And Datalog
//! programs are never ambiguous; if you put in a set of rules, you always get a
//! single set of results, never two different sets of results that are equally
//! valid.
//!
//! There's a million different variants and syntaxes for writing Datalog
//! out there, but what they're all expressing in the end is:
//! `rule(thing) :- foo(thing), bar(thing).`
//! So it's not too hard to translate between them.
//!
//! Datalog is also one of those things that is rife with extensions, because
//! the core of it is so minimal that adding useful tools on to it is trivial and
//! alleviates a lot of pain. These have to be somewhat carefully chosen to make
//! sure the result still follows the rules of Datalog. For example, generalized
//! negation of the form `rule(thing) :- not foo(thing).` doesn't work in Datalog;
//! it would make it possible to write contradictions and inconsistent programs that
//! require more powerful solving logic than non-Turing-complete Datalog can handle.
//! However, most Datalogs I've seen out there implement "stratified negation",
//! where you can get away with negating rules in some *very* poorly-explained
//! circumstances. In practice it seems to be "the body including the negation must
//! include a non-negated rule, and circular or recursive definitions that include a
//! negation aren't allowed". So you can't write `rule(thing) :- not foo(thing).`
//! but you can write `rule(thing) :- bar(thing), not foo(thing).` as long as
//! `foo(thing)` doesn't rely on `rule(thing)`. In practice, I'm not sophisticated
//! enough to care a whole lot about the details of every possible extension out
//! there, just know they're very common and often pretty useful.
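//!
//! To make the stratification rule concrete, here's the textbook example
//! (in generic Datalog syntax again):
//!
//! ```text
//! % Not allowed: `p` recurses through its own negation, so there's no
//! % consistent answer ("p(X) is true iff p(X) is false").
//! p(X) :- q(X), not p(X).
//!
//! % Allowed: `unreachable` depends on `reachable`, but `reachable` never
//! % depends on `unreachable`. So we can fully evaluate `reachable` first
//! % (stratum 1), then evaluate the negation against that finished result
//! % (stratum 2).
//! reachable(X, Y) :- edge(X, Y).
//! reachable(X, Z) :- edge(X, Y), reachable(Y, Z).
//! unreachable(X, Y) :- node(X), node(Y), not reachable(X, Y).
//! ```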
//!
//! # Why not Prolog?
//!
//! Basically, performance and complexity. Prolog is apparently a little
//! infamously slow when its programs become large and complex. Learning how
//! to write fast Prolog programs thus becomes something of an art, one which I
//! don't really want to pursue. Plus Prolog is kinda large and complex itself,
//! so embedding it into another large and complex program (a compiler) is less
//! appealing.
//!
//! The real reason though is I wanted to learn Datalog.
//!
//! # Datalog implementations
//!
//! I wanted something simple, embeddable and reasonably quick, so I'm using
//! `crepe`. `crepe` is a smol but reasonable Datalog implementation written as
//! a Rust macro, with pretty good interop with Rust, so you write your Datalog
//! program in the macro, write some Rust code to feed it info and print out
//! results, and off you go. Not the smoothest possible toolchain, but not bad
//! for me since I write so much Rust anyway. `souffle` seems to be the (academic)
//! industry standard out there, but I don't feel like compiling it and managing all
//! the attendant fluff and data files and so on, though I now understand that they
//! exist for pretty good reasons. If I were making an actual database or something
//! I'd look at Datomic, but currently I'm just learning how to use Datalog as a solver
//! for relatively small rule sets.
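//!
//! To give a flavor of it before the real program below, a minimal crepe program
//! looks roughly like this (a sketch along the lines of crepe's transitive-closure
//! example; exact API details may vary between versions):
//!
//! ```ignore
//! use crepe::crepe;
//!
//! crepe! {
//!     // Facts we feed in from Rust.
//!     @input
//!     struct Edge(i32, i32);
//!
//!     // Facts crepe derives and hands back to us.
//!     @output
//!     struct Reachable(i32, i32);
//!
//!     Reachable(x, y) <- Edge(x, y);
//!     Reachable(x, z) <- Edge(x, y), Reachable(y, z);
//! }
//!
//! fn main() {
//!     let mut runtime = Crepe::new();
//!     runtime.extend([Edge(1, 2), Edge(2, 3), Edge(3, 4)]);
//!     // run() consumes the runtime and returns one set of facts per @output.
//!     let (reachable,) = runtime.run();
//!     for Reachable(from, to) in reachable {
//!         println!("{} can reach {}", from, to);
//!     }
//! }
//! ```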
//!
//! I'd kinda prefer something with a REPL, but that seems to be uncommon for
//! Datalog for some reason. While with Prolog you tend to interactively write
//! a query and investigate its results, Datalog systems usually seem to take in
//! a fixed set of data and then spit out all the results they can derive from it.
//! Since Datalog is not Turing complete, this result is always finite.
//!
//! If I had to do it again, I'd probably look for a Datalog impl in a language
//! with better symbolic manipulation: Erlang, Elixir or Lisp. But `crepe` works
//! fine.
//!
//! # whew!
//!
//! Let's actually get going. We're going to write a typechecker for a simply-typed
//! lambda calculus, with a few extensions like boolean types and equality and addition
//! and stuff.
//!
//! Sadly this is Rust and not Lisp, so there's about 180 lines of scaffolding we have to
//! get through before we get to the actual datalog. It's actually kinda important
//! scaffolding though, so here goes.
use std::fmt;
// We're gonna be using interned symbols a lot, so here's a smol lib that implements them.
use pancake::*;
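/// Types are just interned symbols for now: "int", "bool", "int -> int", and so on.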
type Type = Sym;
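/// The binary operators our toy language supports.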
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
enum BinOp {
Add,
Eq,
}
/// Now, the thing is that Datalog, or at least crepe, doesn't really handle nested structures
/// nicely. So we have to flatten our AST down to non-recursive exprs, and tie those exprs
/// together with "pointers" of integer IDs. That way we can turn each of those statements into a
/// separate Horn clause to feed into our Datalog program.
///
/// Flattening our AST has some other benefits too: http://www.cs.cornell.edu/~asampson/blog/flattening.html
/// So it's not really wasted work. We define some builder functions that make it a bit less painful to
/// write our syntax tree as methods on `ExprPool` below.
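/// For example, `(+ 1 2)` flattens into three pool entries:
/// `0: Literal(Int(1))`, `1: Literal(Int(2))`, `2: Op(Add, Eid(0), Eid(1))`.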
struct ExprPool {
// The refcell is a little ugly, but lets us build nested structures via builder methods
// without confusing the borrow checker.
exprs: std::cell::RefCell<Vec<FlAst>>,
}
/// Expression ID, our integer index "pointer" type to tie our nested expressions together.
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Eid(usize);
impl fmt::Display for Eid {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
self.0.fmt(f)
}
}
/// The actual values of our literals.
#[derive(Debug, Copy, Clone)]
enum Lit {
Int(i64),
Bool(bool),
}
/// Here's the AST for our lambda calculus:
#[derive(Debug, Copy, Clone)]
enum FlAst {
/// Our literals/constants are just numbers and bools for now
Literal(Lit),
/// We'll add some basic math operators to keep life interesting
Op(BinOp, Eid, Eid),
/// A simple named variable
Var(Sym),
/// Declare a function: lambda(arg, signature, body)
/// The signature is the type of the function.
Lambda(Sym, Type, Eid),
/// Apply the given function with the given arg.
Apply(Eid, Eid),
}
impl ExprPool {
fn new() -> Self {
Self {
exprs: std::cell::RefCell::new(vec![]),
}
}
fn insert(&self, expr: FlAst) -> Eid {
let mut exprs = self.exprs.borrow_mut();
exprs.push(expr);
Eid(exprs.len() - 1)
}
fn get(&self, id: Eid) -> FlAst {
self.exprs.borrow()[id.0]
}
/// Recursively constructs a readable-ish string representing the given expr.
fn format(&self, id: Eid) -> String {
let expr = self.get(id);
use FlAst::*;
match expr {
Literal(Lit::Int(i)) => format!("{}", i),
Literal(Lit::Bool(b)) => format!("{}", b),
Op(op, lhs, rhs) => {
let op_str = match op {
BinOp::Add => "+",
BinOp::Eq => "==",
};
let lhs_str = self.format(lhs);
let rhs_str = self.format(rhs);
format!("({} {} {})", op_str, lhs_str, rhs_str)
}
Var(sym) => format!("{}", sym),
Lambda(arg, sig, body) => format!("(λ{}:({}) = {})", arg, sig, self.format(body)),
Apply(f, arg) => format!("(apply {} {})", self.format(f), self.format(arg)),
}
}
/// Shortcuts to let us build AST nodes easily.
fn int(&self, i: i64) -> Eid {
self.insert(FlAst::Literal(Lit::Int(i)))
}
fn bool(&self, b: bool) -> Eid {
self.insert(FlAst::Literal(Lit::Bool(b)))
}
fn op(&self, op: BinOp, lhs: Eid, rhs: Eid) -> Eid {
self.insert(FlAst::Op(op, lhs, rhs))
}
fn var(&self, v: impl AsRef<str>) -> Eid {
self.insert(FlAst::Var(Sym::new(v)))
}
fn lambda(&self, arg: impl AsRef<str>, sig: impl AsRef<str>, body: Eid) -> Eid {
self.insert(FlAst::Lambda(Sym::new(arg), Type::new(sig), body))
}
fn apply(&self, fun: Eid, arg: Eid) -> Eid {
self.insert(FlAst::Apply(fun, arg))
}
}
fn lambda_calc(ast: &ExprPool) {
use crepe::crepe;
// Ok this is our actual Datalog program. Crepe's macro generates a bunch of types
// in whatever module it's called in, so it's best to isolate it in its own function.
// It generates a runtime type called `Crepe`, but doesn't make it public so it's
// annoying to get in and out of functions. So we kinda just end up with one mega function
// that does all our processing and prints the results.
//
// Crepe takes typed inputs and produces typed outputs. Crepe's input and output types must be
// Copy, which is one reason we flattened our AST. We could probably fight with lifetimes and stuff
// instead, but this is easier IMO. So each type of node in our AST gets turned into a Datalog
// input relation, with an expression ID attached to each fact so they can refer to each other.
crepe! {
// We don't care about the actual value in our literals, just the type.
// Since literals can be integers or booleans, "flatten" those out more
// to be their own rules. We could get the same effect with Yet Another
// Sum Type, but this feels a little easier? It doesn't really make much
// difference either way.
@input
#[derive(Debug)]
struct ELitInt(Eid);
@input
#[derive(Debug)]
struct ELitBool(Eid);
@input
#[derive(Debug)]
struct EBinOp(Eid, BinOp, Eid, Eid);
@input
#[derive(Debug)]
struct EVar(Eid, Sym);
@input
#[derive(Debug)]
struct ELambda(Eid, Sym, Type, Eid);
@input
#[derive(Debug)]
struct EApply(Eid, Eid, Eid);
// Now we define our output structures. We can have multiple outputs,
// crepe just returns a tuple with one element per output type. We
// also make these Ord so we can sort them for nicer output; crepe
// uses hash sets internally.
@output
#[derive(Debug, PartialOrd, Ord)]
struct TypeOf(Eid, Type);
@output
#[derive(Debug, PartialOrd, Ord)]
struct TypeError(Eid);
// Since crepe is strongly typed, we also have to provide type declarations
// for the intermediate relations we want to define, the ones that are
// neither inputs nor outputs.
struct IsNumber(Type);
IsNumber(Sym::new("int"));
struct IsBool(Type);
IsBool(Sym::new("bool"));
// We have to declare what function types can exist in the program explicitly, for now.
// Can we instantiate these implicitly? Maybe, but we'd need some way
// to generate new identifiers for them, which might be possible. Something for me to
// explore in the future.
struct FunctionType(Type, Type, Type);
FunctionType(Sym::new("int -> int"), Sym::new("int"), Sym::new("int"));
FunctionType(Sym::new("bool -> bool"), Sym::new("bool"), Sym::new("bool"));
FunctionType(Sym::new("int -> bool"), Sym::new("int"), Sym::new("bool"));
FunctionType(Sym::new("bool -> int "), Sym::new("bool"), Sym::new("int"));
// The TypeOf rule is the meat of our type checker. It takes an expr id
// and outputs a type... or rather, it finds values of expr id and type
// where a rule is true.
//
// When reading Datalog/Prolog rules, something like:
//
// a(val) :- rule1(val), rule2(val).
// a(val) :- rule3(val).
//
// is read as "a(val) is true IF rule1(val) is true AND rule2(val) is true,
// OR a(val) is true IF rule3(val) is true."
//
// The *mind-bending trick* is that there are no inputs or outputs; every parameter
// can be an input or an output depending on how you use it. So here we define
// that the type of any `ELitInt` expression is "int" just by writing that rule.
// First off, we define that literals have type "int" or "bool".
TypeOf(id, Sym::new("int")) <-
ELitInt(id);
TypeOf(id, Sym::new("bool")) <-
ELitBool(id);
// Now to find the types of more interesting things:
TypeOf(id, t) <-
// The type of an add operation is...
EBinOp(id, BinOp::Add, lhs, rhs),
// ..."t" where "t" is the type of each of its operands which must be the same type,
TypeOf(lhs, t),
TypeOf(rhs, t),
// ...and if "t" is a number. (ie, *assert* that T is a number.)
IsNumber(t);
TypeOf(id, t) <-
// The type of an eq operation is "t"...
EBinOp(id, BinOp::Eq, lhs, rhs),
// ...if both of its operands have the same type "optype",
TypeOf(lhs, optype),
TypeOf(rhs, optype),
// ...and "t" is a bool. (IsBool *binds* t to the "bool" type here.)
IsBool(t);
// Now we do variables. We haven't written any kind of scope rules, so all variables
// are global; they are valid as long as there is a lambda somewhere that declares them.
// We could define scope by writing a `ContainsExpr(haystack, needle)` rule and adding
// `ContainsExpr(body, id)` here, saying "this var usage has the type of the var the
// lambda declares, as long as the var usage is inside the lambda's body."
// I did it, it works, but it's pretty tedious so I took it out.
TypeOf(id, t) <-
// The type of a variable is "t"...
EVar(id, name),
// where "sig" is the type of the lambda that declared it,
ELambda(_lambda_id, name, sig, _body),
// and there is a function type "sig" with the arg type "t".
FunctionType(sig, t, _rettype);
TypeOf(id, t) <-
// The type of a lambda expr with signature "t" is "t"...
ELambda(id, _name, t, body),
// ...where "t" is a function type...
FunctionType(t, _argtype, rettype),
// oh and also make sure the body type matches the declared rettype.
TypeOf(body, rettype);
TypeOf(id, t) <-
// The type of a function application (f arg) is "t"...
EApply(id, f, arg),
// Where "ftype" is the function expr's type,
TypeOf(f, ftype),
// and "argtype" is the arg expr's type,
TypeOf(arg, argtype),
// and there exists a function type "ftype" of type argtype -> t
// (aka, the type of f is actually a function which returns type t)
FunctionType(ftype, argtype, t);
// Now, the TypeOf() rule will find all expressions that typecheck,
// but won't give us the ones that don't typecheck. We can't just do `not TypeOf(...)`
// because that doesn't follow the stratified negation rules, but if we
// have an "expr" rule as well we can formulate it like this:
TypeError(id) <-
// This expression is a type error if it is an expression
Expr(id),
// and we can't find a type for it.
!TypeOf(id, _);
// And an input is an expression precisely when it's one of our declared expression types.
// So we kinda weirdly build a sum type for our expression variants here, which seems
// enough to fulfill our stratified negation invariants. In a fancier type checker
// this is useful for other things as well, so I don't feel too bad having to add it.
struct Expr(Eid);
Expr(id) <-
ELitInt(id);
Expr(id) <-
ELitBool(id);
Expr(id) <-
EBinOp(id, _, _, _);
Expr(id) <-
EVar(id, _);
Expr(id) <-
ELambda(id, _, _, _);
Expr(id) <-
EApply(id, _, _);
}
// And that's it! Back to Rust.
//
// Let's make a loop that takes our AST and turns it into crepe's data types,
// stuffing them into the evaluation engine.
let mut runtime = Crepe::new();
// We don't have to walk the expr tree recursively because we already know it's well
// connected and complete, so we can just walk through it as a list and stuff
// everything into crepe. The generated `Crepe` structure has an `extend()` method
// that takes an iterator of any type we've defined as an `@input` type.
for (id, expr) in ast.exprs.borrow().iter().cloned().enumerate() {
use FlAst::*;
let eid = Eid(id);
match expr {
Literal(Lit::Int(_)) => runtime.extend(Some(ELitInt(eid))),
Literal(Lit::Bool(_)) => runtime.extend(Some(ELitBool(eid))),
Op(b, lhs, rhs) => runtime.extend(Some(EBinOp(eid, b, lhs, rhs))),
Var(name) => runtime.extend(Some(EVar(eid, name))),
Lambda(name, sig, body) => runtime.extend(Some(ELambda(eid, name, sig, body))),
Apply(f, arg) => runtime.extend(Some(EApply(eid, f, arg))),
}
}
// BOOM! Evaluate our rules and collect results! The magic happens here!
let (types, errors) = runtime.run();
// Now we just need to print out the results.
// The results are hashsets, so let's turn them into
// sorted structures to display the outputs.
// We also want them to be maps instead of sets, for reasons
// we will see below...
let mut successes = std::collections::BTreeMap::new();
successes.extend(types.iter().map(|e| (e.0, e)));
let mut failures = std::collections::BTreeMap::new();
failures.extend(errors.iter().map(|e| (e.0, e)));
println!("{:<50} : type", "expr");
// just gonna abuse the formatting machinery here to make a table divider...
println!("{:-<50} : {:-<10}", "-", "-");
for (_, TypeOf(eid, t)) in &successes {
println!("{:<50} : {}", ast.format(*eid), t);
}
println!();
println!("Type errors:");
for (_, TypeError(eid)) in &failures {
println!("{}", ast.format(*eid));
}
// The thing is... if we have a bug or ambiguity in our type system,
// or add type inference but give it a program without enough information
// to infer completely, or something else like that, then it's possible for type
// checking an expression to fail without producing an error. Our "TypeError"
// rule looks for contradictions, but won't always find other bad cases.
// So here we will loop through *all* our exprs and make sure that each one
// has either a successful or unsuccessful type attached to it.
//
// This is very easy to miss: you add a test program and it silently doesn't
// show up in either the "success" or "failure" outputs, or you change a rule to
// experiment with something and one output mysteriously vanishes.
//
// I think in a bug-free implementation of an unambiguous type system this should
// never happen, but the world isn't always that nice. On the flip side, this also
// gives us a chance to investigate or resolve ambiguous cases in the type system
// in whatever ad-hoc code we find convenient, so this is a nice double-check.
// The test programs I made below don't have any cases that trigger this, but you
// could probably produce some if you tried, or if you gave it bad/malformed input.
for (id, _e) in ast.exprs.borrow().iter().enumerate() {
let eid = Eid(id);
if !(successes.contains_key(&eid) || failures.contains_key(&eid)) {
println!(
"{} neither succeeded nor failed typechecking",
ast.format(eid)
);
}
}
}
fn main() {
let pool = ExprPool::new();
// Let's write some programs!
// (+ 5 (+ 3 4))
let add_op = pool.op(
BinOp::Add,
pool.int(5),
pool.op(BinOp::Add, pool.int(3), pool.int(4)),
);
// (== 3 4)
pool.op(BinOp::Eq, pool.int(3), pool.int(4));
// (== 99 (+ 100 -1))
pool.op(
BinOp::Eq,
pool.int(99),
pool.op(BinOp::Add, pool.int(100), pool.int(-1)),
);
// λx : (int -> int) = (+ 5 (+ 3 4))
let my_fun = pool.lambda("x", "int -> int", add_op);
// (apply λmy_fun 3)
pool.apply(my_fun, pool.int(3));
// just the var x
pool.var("x");
// λy : (int -> int) = (+ y y)
let double = pool.lambda(
"y",
"int -> int",
pool.op(BinOp::Add, pool.var("y"), pool.var("y")),
);
// (apply λdouble 12)
pool.apply(double, pool.int(12));
// here's some things that should fail to typecheck:
// (apply λdouble true)
pool.apply(double, pool.bool(true));
// (== 3 false)
pool.op(BinOp::Eq, pool.int(3), pool.bool(false));
// just the var "unknownvar"
pool.var("unknownvar");
// (apply 9 8)
pool.apply(pool.int(9), pool.int(8));
// (+ true (+ 10 11))
pool.op(
BinOp::Add,
pool.bool(true),
pool.op(BinOp::Add, pool.int(10), pool.int(11)),
);
lambda_calc(&pool);
}