Writing a DSL with Scala Parser Combinators

Scala parser combinators provide a very easy way to construct your own domain specific languages (DSLs). Let’s look at a simple example. Suppose that you work on a large application which produces various report files on a scheduled basis. You decide that you want to give the users of your application the ability to control what directories different file types are moved to, and to send e-mail alerts when specific files are generated. You could do this by allowing a power user to type the instructions in using little language such as the following:

if ($FILENAME contains "finance") {
	MOVE "finance"
	EMAIL "finance-team@somecompany.com"
}
else if ($FILENAME endsWith "xlsx") {
	MOVE "spreadsheets"
}
else if ($FILENAME startsWith "daily") {
	MOVE "daily-reports"
	EMAIL "report-team@somecompany.com"
}
else {
	MOVE "additional"
}

In this scenario, I’m assuming that this routine is invoked once for each file that is output, and the name of the file is put into the variable $FILENAME.

So how do Scala parser combinators work and how do they compare to the traditional way of constructing a DSL? Before looking at the Scala approach, it is worth clarifying that whenever you talk about a DSL, you need to be clear whether you are talking about an internal or external DSL. An internal DSL is one that is simply written in a normal programming language, but the objects and methods provided have been created to give the appearance of a domain specific language. This tends not to be done much in Java, since Java does not have a very flexible language syntax. It works better in languages such as Ruby or Scala where you have more freedom about how you define methods, such as being able to use spaces and symbols in the names, and being able to change the order of objects and the methods you are invoking on them. An external DSL is one where the code is written in its own format, and this code is parsed to understand its meaning. Traditionally this is done by writing the rules of the language in a formal grammar, usually Backus-Naur Form (BNF). The BNF grammar is then read in by a parser generator such as ANTLR or JavaCC, which generates a parser capable of understanding the language. This approach works well, but does have problems. One problem is debugging. If something goes wrong, a developer may have to debug through auto-generated code. A second annoyance is that whenever you update your grammar, you have to run the parser generation step, rather than just compiling some code. This is where parser combinators come in, as an alternative approach.

So, how do you use Scala parser combinators to parse a small language like the one above? In short, rather than writing the BNF grammar into a grammar file, and then generating a parser, you write a Scala class in which each BNF rule becomes a method. The Scala parser combinator library provides parsers, matchers and methods that correspond to all of the elements in a grammar. Let’s look at how this would work for the example above. I’ll start off with a few code snippets, and then give the full class, so you can run it yourself. If you haven’t run any Scala code before, using Eclipse with the Scala plugin is probably the easiest way to get started. Please note that although Scala parser combinators used to be included in the standard Scala library, from 2.11 onwards they are a separate download from http://www.scala-lang.org/.

To give your class access to the parser combinator code, the easiest thing to do is extend one of the parser traits, so I’ll start with this:

import scala.util.parsing.combinator.JavaTokenParsers
 
class FileHandler extends JavaTokenParsers {
 
}

Now we need to work on the grammar rules that will define the language. The easiest statements to define first will be the MOVE and EMAIL commands. They are simply the name of the command, followed by a string. Hence we could start by defining a string, and these two rules:

def string : Parser[Any] = """".*"""".r
def email : Parser[Any] = "EMAIL"~string	
def move : Parser[Any] = "MOVE"~string

Here, I have used the .r method on a string to define a regular expression saying that in our little language, a string is a double quote, followed by any characters, followed by another double quote. The whole thing is enclosed in triple double quotes, as in Scala, this means that you are defining a string literal, and you do not want to have to escape special characters in the string. For me, this is more readable that escaping the special characters, because the triple quotes are at the start and end of the string, whereas when you escape special characters, you end up with backslashes intermingled with the regular expression you are trying to write, and it can be easy to mistype the regex. Once we have defined a string, we can define the email and move rules by saying that they are the word EMAIL or MOVE respectively, followed by a string. The tilde symbol ~ is used whenever you want combine two parsers sequentially. Note that we don’t have to worry about whitespace, the tilde will take care of that.

Now we can start to define the if else statement. The if statement uses an expression, which tests the $FILENAME variable with various operators. In a larger language, we might need to define the $FILENAME as a variable. Here, because I know this is the only “variable” in the language, I’m not going to bother doing that, I’ll just write it into the expression rule:

def operator : Parser[Any] = "contains" | "endsWith" | "startsWith"
def expr : Parser[Any] = "$FILENAME"~operator~string

Here we have used the pipe symbol to say that an operator is any one of the three strings specified, and an expression is the $FILENAME, an operator and a string.

To build up the rules for the if statement, I’m going to say that:

  • We need the concept of a block – which is a pair of curly braces, with code in between.
  • A single if clause is the word “if”, followed by an expression in brackets, followed by a block.
  • An else if clause is the word “else”, followed by an if clause.
  • An else clause is the word “else”, just followed by a block.
  • The entire if statement is an if clause, optionally followed by one or more “if-else” clauses, optionally followed by an “else”.

One way to embody this in code is:

def block : Parser[Any] = "{"~rep(email | move)~"}"
def ifClause :Parser[Any] = "if"~"("~expr~")"~block
def elseIfClause : Parser[Any] = "else"~ifClause
def elseClause : Parser[Any] = "else"~block
def ifElse : Parser[Any] = ifClause~opt(rep(elseIfClause))~opt(elseClause)

Here you can see some additional parser combinator syntax – the rep method denotes repetition, the opt method denotes an optional element. The only thing that remains now is to define the “top level” rule, which will be the starting point for parsing code written in our little language. Based on the above language sample, a sensible choice for this rule is just to say that text written using this language will be composed of a list of commands, using any of the three commands MOVE, EMAIL or an if else statement:

def commandList = rep(email | move | ifElse)

So now the entire class looks like this:

import scala.util.parsing.combinator.JavaTokenParsers
 
class FileHandler extends JavaTokenParsers {
  	def string : Parser[Any] = """".*"""".r
	def email : Parser[Any] = "EMAIL"~string
	def move : Parser[Any] = "MOVE"~string 
 
	def operator : Parser[Any] = "contains" | "endsWith" | "startsWith"
	def expr : Parser[Any] = "$FILENAME"~operator~string
	def block : Parser[Any] = "{"~rep(email | move)~"}"
	def ifClause :Parser[Any] = "if"~"("~expr~")"~block
	def elseIfClause : Parser[Any] = "else"~ifClause
	def elseClause : Parser[Any] = "else"~block
	def ifElse : Parser[Any] = ifClause~opt(rep(elseIfClause))~opt(elseClause)
 
	def commandList = rep(email | move | ifElse)
}

Now we can test it. Create a file with the original language sample in it. I’ve called mine file-handler.txt. Then we can create a parser by subclassing our parser code, then pointing it at this file as input, and invoking our top level method “commandList”:

import java.io.FileReader
import java.io.FileInputStream
 
object FileHandlerTest extends FileHandler {
 
  def main(args: Array[String]): Unit = {
    val reader = new FileReader("/path-to-your-file/file-handler.txt")
    println(parseAll(commandList, reader))
  }
 
}

The output will show the code being parsed (line breaks added for clarity):

parsed: List(( 
((((((if~()~(($FILENAME~contains)~"finance"))~))
~(({~List((MOVE~"finance"), (EMAIL~"finance-team@somecompany.com")))~}))
~Some(List((else~((((if~()~(($FILENAME~endsWith)~"xlsx"))~))
~(({~List((MOVE~"spreadsheets")))~}))), 
(else~((((if~()~(($FILENAME~startsWith)~"daily"))~))
~(({~List((MOVE~"daily-reports"), 
(EMAIL~"report-team@somecompany.com")))~}))))))
~Some((else~(({~List((MOVE~"additional")))~})))))

Note that with the parser as written above, the language does not support nesting, because a block is defined to composed of either MOVE or EMAIL statements. i.e. it cannot contain a nested “if” statement. The generic way to change your grammar rules to support nesting is to change the definition of a block to recursively point back to your top level rule, which in our case is the one called “commandList”:

def block : Parser[Any] = "{"~commandList~"}"

You might like to make this change, then update your example input to include a nested if statement, and confirm that it can be parsed correctly.

For more information on Scala parser combinators, check out chapter 31 of “Programming in Scala”, available online here:

http://www.artima.com/pins1ed/combinator-parsing.html

This entry was posted in Scala and tagged . Bookmark the permalink.

2 Responses to Writing a DSL with Scala Parser Combinators

  1. Nick Kasvosve says:

    Very interesting post.
    A useful exercise would be to auto map the result, “parsed: List(( …))” at the last step into some domain model that be used downstream., say in a persistence layer, or reporting module.
    Also, I wonder how easy it is, if at all possible, to map a fixed file format to a BNF context free grammar and then parse the instance contents using parser combinators ( thinking MID file for example) .
    Nice work.

  2. Nirmal Tej Kumar says:

    IT is really very interesting and was very useful in exploring these features for my spectroscopy imaging project.
    DSlL is very interesting and promising domain for scientists with very specific requirements in scientific isntrumemntation,inaging and informatics.
    Currently working in Cryo-EM Imaging and Informatics project.
    Well done.Good job.
    Dr.Nirmal
    Brazil/USA.

Leave a Reply

Your email address will not be published. Required fields are marked *

501,820 Spambots Blocked by Simple Comments

HTML tags are not allowed.