Using ANTLR from Clojure

My most recent project involves parsing Java source files and distributing the syntax trees to other processes. Naturally, I have no desire to write a parser myself, especially since there are tools out there that will do that part for me. Clojure has a few libraries for writing parsers, such as Parsley and Instaparse, but both require you to supply your own grammar. That's fine for small grammars, but as you might expect, Java has a horrifically complected syntax that I have no desire to distill - after all, this is a weekend project. Fortunately, there are grammars freely available for another widely-used parser generator: ANTLR.

ANTLR generates Java source files that implement a top-down parser, which works well for a Clojure project thanks to Clojure's Java interop features. Unfortunately, the latest version (4.2 at the time of this writing) isn't available in Maven Central and can't be fetched automatically via Leiningen. So, to integrate ANTLR into your project, the first thing you'll need to do is install it.

Installing ANTLR

ANTLR comes as a single-JAR distribution which packages a parser generator with a runtime library. You need the JAR available at compile-time to generate code from your grammars, and at runtime to actually use that code. I downloaded the latest version from the ANTLR website and put it in a lib/ folder in my Leiningen project. Unfortunately Leiningen 2 currently does not cleanly support the kind of classpath wrangling you'd need to be able to use it from there, so I installed it into the local Maven repository using this command:

$ mvn install:install-file -Dfile=lib/antlr-4.2-complete.jar \
-DartifactId=antlr -Dversion=4.2 -DgroupId=org.antlr -Dpackaging=jar

Generating the parser

That takes care of the runtime dependency. Now, to generate your parsers from your grammars you just need to invoke the JAR. Assuming your grammars live under myproject/grammars and that your Java source should go in myproject/java, you can run:

$ cd grammars
$ java -jar ../lib/antlr-4.2-complete.jar Java.g4 \
  -o ../java/myproject/parsers/java \
  -package myproject.parsers.java \
  -visitor -message-format gnu
$ cd ..

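This tells ANTLR to generate Java sources that parse the Java.g4 grammar, placed in the package given by -package (myproject.parsers.java, to match the output directory). The -visitor flag makes it generate a Visitor interface and base class in addition to the Listener classes it produces by default. We'll extend those base classes in Clojure to implement our parser client.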
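For Leiningen to see both the locally installed JAR and the generated sources, the project file needs a dependency and a Java source path. The fragment below is only a sketch: the project name, version and Clojure version are placeholders, the dependency coordinates are the ones passed to mvn install:install-file above, and "java" is the output directory assumed in the previous step.

```clojure
;; project.clj (sketch; name and versions are placeholders)
(defproject myproject "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 ;; matches -DgroupId=org.antlr -DartifactId=antlr -Dversion=4.2
                 [org.antlr/antlr "4.2"]]
  ;; directory holding the ANTLR-generated .java files; Leiningen
  ;; compiles these with javac before loading the Clojure code
  :java-source-paths ["java"])
```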


Now that we have a parser, we need to hook into it.

(ns myproject.parsing
  (:import [myproject.parsers.java
            JavaLexer JavaParser JavaBaseListener JavaBaseVisitor]
           [org.antlr.v4.runtime
            ANTLRInputStream CommonTokenStream]
           [org.antlr.v4.runtime.tree
            ParseTree ParseTreeWalker]))

(defn- make-listener []
  (proxy [JavaBaseListener] []
    (enterImportDeclaration [ctx]
      ;; ... do something ...
      )
    (exitImportDeclaration [ctx]
      ;; ... do something ...
      )))

(defn parse-java
  "Parse 'source using the Java parser."
  [^String source]
  (let [input (ANTLRInputStream. source)
        lexer (JavaLexer. input)
        tokens (CommonTokenStream. lexer)
        parser (JavaParser. tokens)
        ;; compilationUnit is the grammar's start rule
        tree (.compilationUnit parser)]
    (.walk (ParseTreeWalker.) (make-listener) tree)))

As you can see, we define a function which instantiates a Clojure proxy class and then use it to walk the parse tree. The proxy is a subclass of JavaBaseListener which overrides the contextual methods that we care about and want to respond to during parsing. I only show two of them here, but there are significantly more in the real parser - 232 in total. Each method corresponds to entering or exiting a rule in your grammar, and is passed a ParserRuleContext subclass which contains methods for accessing sub-rules and tokens. You can define your own operations on these nodes by overriding the corresponding listener methods.

Now, there are a couple of annoying things about this model. Because the ParseTreeWalker drives the traversal and only calls back into the listener, you can't maintain your own state on the call stack. For Clojure programmers this is a real pain, because it means that you'll need to keep an atom or a ref around to maintain state during the traversal. It also means that operations which control the traversal, such as skipping a subtree, aren't possible.
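To make the atom-based bookkeeping concrete, here is a minimal sketch that collects the text of every import declaration. It assumes the generated classes live in myproject.parsers.java (inferred from the output directory used earlier); the myproject.imports namespace and the collect-imports function are names made up for this illustration.

```clojure
(ns myproject.imports
  (:import [myproject.parsers.java JavaLexer JavaParser JavaBaseListener]
           [org.antlr.v4.runtime ANTLRInputStream CommonTokenStream]
           [org.antlr.v4.runtime.tree ParseTreeWalker]))

(defn collect-imports
  "Walk `source` and return the text of each import declaration, in order.
  Hypothetical helper, not part of ANTLR."
  [^String source]
  (let [imports  (atom [])   ; state has to live outside the call stack
        listener (proxy [JavaBaseListener] []
                   (enterImportDeclaration [ctx]
                     ;; the walker calls us; all we can do is record a side effect
                     (swap! imports conj (.getText ctx))))
        parser   (-> (ANTLRInputStream. source)
                     (JavaLexer.)
                     (CommonTokenStream.)
                     (JavaParser.))]
    (.walk (ParseTreeWalker.) listener (.compilationUnit parser))
    @imports))
```

Note that getText concatenates the text of the tokens under a node, so if the grammar's lexer skips whitespace, the strings come back without spaces.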

In order to support more advanced traversals, you can use the generated JavaBaseVisitor class (a subclass of ANTLR's AbstractParseTreeVisitor) to implement your own Visitor which performs the traversal.

(defn- make-visitor []
  (proxy [JavaBaseVisitor] []
    (visitImportDeclaration [ctx]
      ;; ... do something ...
      ;; then continue into this node's children
      (.visitChildren this ctx))))

(defn parse-java
  "Parse 'source using the Java parser."
  [^String source]
  (let [input (ANTLRInputStream. source)
        lexer (JavaLexer. input)
        tokens (CommonTokenStream. lexer)
        parser (JavaParser. tokens)
        tree (.compilationUnit parser)]
    ;; returns whatever the visitor returns for the root node
    (.visit (make-visitor) tree)))

Note: Because Clojure subclasses the base class at runtime, we don't need to worry about the generic type parameters on the base class.

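The base class will default each method to visit its children and not perform any other operations. If you plan to override every method anyway, you probably should use reify instead and create an instance of the generated Visitor interface for your grammar (JavaVisitor here). Bear in mind that you then also have to supply visit, visitChildren, visitTerminal and visitErrorNode yourself, since the base class no longer provides them.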

Now we can control the traversal and maintain state using the call stack - exactly what was missing from the Listener. This opens up a much wider variety of operations that we can perform and eliminates external state, making the system much more Clojure-friendly.
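As a rough sketch of what that buys us, the visitor below collects import declarations purely through return values and prunes a subtree it doesn't care about. It relies on two hooks that visitChildren uses in AbstractParseTreeVisitor, defaultResult and aggregateResult. The assumptions: the generated classes live in myproject.parsers.java, the grammar has a typeDeclaration rule alongside importDeclaration, and myproject.visiting and import-texts are illustrative names only.

```clojure
(ns myproject.visiting
  (:import [myproject.parsers.java JavaLexer JavaParser JavaBaseVisitor]
           [org.antlr.v4.runtime ANTLRInputStream CommonTokenStream]))

(defn- make-import-visitor []
  (proxy [JavaBaseVisitor] []
    ;; Node of interest: return a value instead of mutating an atom.
    (visitImportDeclaration [ctx]
      [(.getText ctx)])
    ;; Prune: imports never appear inside a type declaration, so we
    ;; return without calling visitChildren (rule name is an assumption).
    (visitTypeDeclaration [ctx]
      [])
    ;; Result for nodes with nothing to report, e.g. terminals.
    (defaultResult []
      [])
    ;; Called by visitChildren to merge each child's result.
    (aggregateResult [acc child-result]
      (into acc child-result))))

(defn import-texts
  "Return the text of each import declaration in `source`.
  Hypothetical helper, not part of ANTLR."
  [^String source]
  (let [parser (-> (ANTLRInputStream. source)
                   (JavaLexer.)
                   (CommonTokenStream.)
                   (JavaParser.))]
    (.visit (make-import-visitor) (.compilationUnit parser))))
```

No atom or ref is involved: every result travels back up the call stack, and skipping a subtree is just a matter of not calling visitChildren.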

Closing thoughts

Overall, working with ANTLR isn't too painful once you have the basics down. The documentation could use a lot of work - I had to buy the (overpriced) book in order to figure out how they meant for end users to control the parser. That said, the wide variety of grammars for common languages and the relative ease of using it from Clojure make it a good tool to have in the kit.