if ($FILENAME contains "finance") { MOVE "finance" EMAIL "finance-team@somecompany.com" } else if ($FILENAME endsWith "xlsx") { MOVE "spreadsheets" } else if ($FILENAME startsWith "daily") { MOVE "daily-reports" EMAIL "report-team@somecompany.com" } else { MOVE "additional" }In this scenario, I’m assuming that this routine is invoked once for each file that is output, and the name of the file is put into the variable $FILENAME.
So how do Scala parser combinators work and how do they compare to the traditional way of constructing a DSL? Before looking at the Scala approach, it is worth clarifying that whenever you talk about a DSL, you need to be clear whether you are talking about an internal or external DSL. An internal DSL is one that is simply written in a normal programming language, but the objects and methods provided have been created to give the appearance of a domain specific language. This tends not to be done much in Java, since Java does not have a very flexible language syntax. It works better in languages such as Ruby or Scala where you have more freedom about how you define methods, such as being able to use spaces and symbols in the names, and being able to change the order of objects and the methods you are invoking on them. An external DSL is one where the code is written in its own format, and this code is parsed to understand its meaning. Traditionally this is done by writing the rules of the language in a formal grammar, usually Backus-Naur Form (BNF). The BNF grammar is then read in by a parser generator such as ANTLR or JavaCC, which generates a parser capable of understanding the language. This approach works well, but does have problems. One problem is debugging. If something goes wrong, a developer may have to debug through auto-generated code. A second annoyance is that whenever you update your grammar, you have to run the parser generation step, rather than just compiling some code. This is where parser combinators come in, as an alternative approach.
So, how do you use Scala parser combinators to parse a small language like the one above? In short, rather than writing the BNF grammar into a grammar file, and then generating a parser, you write a Scala class in which each BNF rule becomes a method. The Scala parser combinator library provides parsers, matchers and methods that correspond to all of the elements in a grammar. Let’s look at how this would work for the example above. I’ll start off with a few code snippets, and then give the full class, so you can run it yourself. If you haven’t run any Scala code before, using Eclipse with the Scala plugin is probably the easiest way to get started. Please note that although Scala parser combinators used to be included in the standard Scala library, from 2.11 onwards they are a separate download from http://www.scala-lang.org/.
To give your class access to the parser combinator code, the easiest thing to do is extend one of the parser traits, so I’ll start with this:
import scala.util.parsing.combinator.JavaTokenParsers
class FileHandler extends JavaTokenParsers {
}
Now we need to work on the grammar rules that will define the language. The easiest statements to define first will be the MOVE and EMAIL commands. They are simply the name of the command, followed by a string. Hence we could start by defining a string, and these two rules:
def string : Parser[Any] = """".*"""".r
def email : Parser[Any] = "EMAIL"~string
def move : Parser[Any] = "MOVE"~string
Here, I have used the .r method on a string to define a regular expression saying that in our little language, a string is a double quote, followed by any characters, followed by another double quote. The whole thing is enclosed in triple double quotes, as in Scala, this means that you are defining a string literal, and you do not want to have to escape special characters in the string. For me, this is more readable that escaping the special characters, because the triple quotes are at the start and end of the string, whereas when you escape special characters, you end up with backslashes intermingled with the regular expression you are trying to write, and it can be easy to mistype the regex. Once we have defined a string, we can define the email and move rules by saying that they are the word EMAIL or MOVE respectively, followed by a string. The tilde symbol ~ is used whenever you want combine two parsers sequentially. Note that we don't have to worry about whitespace, the tilde will take care of that.
Now we can start to define the if else statement. The if statement uses an expression, which tests the $FILENAME variable with various operators. In a larger language, we might need to define the $FILENAME as a variable. Here, because I know this is the only "variable" in the language, I'm not going to bother doing that, I'll just write it into the expression rule:
def operator : Parser[Any] = "contains" | "endsWith" | "startsWith"
def expr : Parser[Any] = "$FILENAME"~operator~string
Here we have used the pipe symbol to say that an operator is any one of the three strings specified, and an expression is the $FILENAME, an operator and a string.
To build up the rules for the if statement, I'm going to say that:
- We need the concept of a block - which is a pair of curly braces, with code in between.
- A single if clause is the word "if", followed by an expression in brackets, followed by a block.
- An else if clause is the word "else", followed by an if clause.
- An else clause is the word "else", just followed by a block.
- The entire if statement is an if clause, optionally followed by one or more "if-else" clauses, optionally followed by an "else".
def block : Parser[Any] = "{"~rep(email | move)~"}"
def ifClause :Parser[Any] = "if"~"("~expr~")"~block
def elseIfClause : Parser[Any] = "else"~ifClause
def elseClause : Parser[Any] = "else"~block
def ifElse : Parser[Any] = ifClause~opt(rep(elseIfClause))~opt(elseClause)
Here you can see some additional parser combinator syntax - the rep method denotes repetition, the opt method denotes an optional element. The only thing that remains now is to define the "top level" rule, which will be the starting point for parsing code written in our little language. Based on the above language sample, a sensible choice for this rule is just to say that text written using this language will be composed of a list of commands, using any of the three commands MOVE, EMAIL or an if else statement:
def commandList = rep(email | move | ifElse)
So now the entire class looks like this:
import scala.util.parsing.combinator.JavaTokenParsers
class FileHandler extends JavaTokenParsers {
def string : Parser[Any] = """".*"""".r
def email : Parser[Any] = "EMAIL"~string
def move : Parser[Any] = "MOVE"~string
def operator : Parser[Any] = "contains" | "endsWith" | "startsWith"
def expr : Parser[Any] = "$FILENAME"~operator~string
def block : Parser[Any] = "{"~rep(email | move)~"}"
def ifClause :Parser[Any] = "if"~"("~expr~")"~block
def elseIfClause : Parser[Any] = "else"~ifClause
def elseClause : Parser[Any] = "else"~block
def ifElse : Parser[Any] = ifClause~opt(rep(elseIfClause))~opt(elseClause)
def commandList = rep(email | move | ifElse)
}
Now we can test it. Create a file with the original language sample in it. I've called mine file-handler.txt. Then we can create a parser by subclassing our parser code, then pointing it at this file as input, and invoking our top level method "commandList":
import java.io.FileReader
import java.io.FileInputStream
object FileHandlerTest extends FileHandler {
def main(args: Array[String]): Unit = {
val reader = new FileReader("/path-to-your-file/file-handler.txt")
println(parseAll(commandList, reader))
}
}
The output will show the code being parsed (line breaks added for clarity):
parsed: List((
((((((if~()~(($FILENAME~contains)~"finance"))~))
~(({~List((MOVE~"finance"), (EMAIL~"finance-team@somecompany.com")))~}))
~Some(List((else~((((if~()~(($FILENAME~endsWith)~"xlsx"))~))
~(({~List((MOVE~"spreadsheets")))~}))),
(else~((((if~()~(($FILENAME~startsWith)~"daily"))~))
~(({~List((MOVE~"daily-reports"),
(EMAIL~"report-team@somecompany.com")))~}))))))
~Some((else~(({~List((MOVE~"additional")))~})))))
Note that with the parser as written above, the language does not support nesting, because a block is defined to composed of either MOVE or EMAIL statements. i.e. it cannot contain a nested "if" statement. The generic way to change your grammar rules to support nesting is to change the definition of a block to recursively point back to your top level rule, which in our case is the one called "commandList":
def block : Parser[Any] = "{"~commandList~"}"
You might like to make this change, then update your example input to include a nested if statement, and confirm that it can be parsed correctly.
For more information on Scala parser combinators, check out chapter 31 of "Programming in Scala", available online here:
http://www.artima.com/pins1ed/combinator-parsing.html
2 Responses to Writing a DSL with Scala Parser Combinators