Practical Scala – file IO and regular expressions

Scala is a great language, but learning it can feel like battling so many new concepts that you can't get anything done. The purpose of this article is to show that even with a few lines of Scala, you can start to do productive tasks. After reading it you should be able to write small automation jobs that read and write text files and use regular expressions. Along the way, it will introduce a number of Scala concepts. (You could call this the “Karate Kid” method of teaching...)

I suggest you use the Eclipse Scala plugin for this tutorial; it's probably the easiest way to compile and run your first Scala code. Once you have it installed, go to New -> Scala object. Whoa... hang on a minute here... an object? Surely that should be a class, right? Well, actually no. Scala makes extensive use of singleton objects. A singleton object can be defined in the same file as a class of the same name, in which case it is called a “companion object”, or it can be defined without a corresponding class, in which case it is a “standalone” singleton. Scala does not have static methods, so singleton objects are generally where you will put code that would have been in a static method in Java. In this case, we want to write a main method to start our application, so we'll create a standalone singleton. Type in the following:

import scala.io.Source

object FileReader {

  def main(args: Array[String]): Unit = {  
    val file = Source.fromFile("/scala_fileio/file-to-read.txt")
    file.getLines().foreach( line => println(line))
  }

}

We can learn a lot of Scala syntax just from this example:
  • Methods are declared with the "def" keyword.
  • Unlike Java, variable names always come before their types, separated by a colon, which can be seen in the parameter to the main method - args : Array[String].
  • Type parameterization uses square brackets rather than the angle brackets seen in Java.
  • Method return types come after the parameter list, separated by a colon. In the above example, the method return type is Unit, essentially the same as Java's void.
You can see that I've created a dummy file to read and saved it as /scala_fileio/file-to-read.txt. Adjust this line to point to a dummy file on your system. Then, if you run the code from Eclipse, you should see each line of the file being printed out. In my case I get:

intro
more stuff
last line

So what's going on with the weird syntax for reading the file? The argument to foreach, line => println(line), is a function literal (an anonymous function). The foreach method calls it once for each line, passing the line in as the parameter, and in this case the function just prints it. Functions like this are often loosely called closures, although strictly speaking a closure is a function that also uses variables from the scope in which it was defined. Scala is far more functional than Java, and as you write more Scala, you'll find that closures and functions allow you to write code that is both more concise and more flexible than the Java equivalent.

The file class is called "Source" because the original Scala implementation was written alongside the Scala compiler, and when compiling, each file is a piece of source.
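Returning to the function passed to foreach: the difference between a plain function literal and a closure is easier to see side by side. The following is a minimal sketch; the names ClosureDemo, shout and withPrefix are made up for illustration and are not part of the tutorial's code.

```scala
object ClosureDemo {

  // A plain function literal: it only uses its own parameter.
  val shout: String => String = line => line.toUpperCase

  // Returns a closure: the function captures `prefix` from the enclosing scope.
  def withPrefix(prefix: String): String => String =
    line => prefix + line

  def main(args: Array[String]): Unit = {
    val lines = List("intro", "more stuff", "last line")
    lines.map(shout).foreach(println)
    lines.map(withPrefix("> ")).foreach(println)
  }
}
```

Both functions can be passed to map or foreach in exactly the same way; the only difference is that the second one carries the value of prefix along with it.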

Let's extend this example to show how to write to a file. As we iterate over this file, let's output it to a second file, with asterisks before and after the text. Update the code to:

import scala.io.Source
import java.io.File
import java.io.FileWriter
import java.io.BufferedWriter

object FileReader {

  def main(args: Array[String]): Unit = {
    val file = Source.fromFile("/scala_fileio/file-to-read.txt")
    val outputFile = new File("/scala_fileio/output.txt")
    val writer = new BufferedWriter(new FileWriter(outputFile))
    // use curly brackets {} to tell Scala that it's now a multi-line statement!
    file.getLines().foreach{ line => 
      println(line)
      writer.write("***" + line + "***")
      writer.newLine()
    }
    writer.flush()
    writer.close()
  }

}

You can see here that we're just using the Java file IO classes to write the file. This is the easiest way to write the code, although there is an add-on library called Scala IO which gives you some more Scala-ish file writing classes. You can also see that the foreach call now uses curly brackets, which tells the Scala compiler that we are passing in a multi-line block rather than a single expression.
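One thing the listing above glosses over is what happens if an exception is thrown half-way through: the writer and the source would be left open. A hedged sketch of one way to handle this with plain try/finally is shown below. The object name SafeCopy, the method copyWithStars and the use of temporary files are assumptions made for this illustration, not part of the original example.

```scala
import scala.io.Source
import java.io.{BufferedWriter, File, FileWriter}

object SafeCopy {

  // Copies each line of `in` to `out`, wrapped in asterisks.
  // Both files are closed even if an exception is thrown part-way through.
  def copyWithStars(in: File, out: File): Unit = {
    val source = Source.fromFile(in)
    try {
      val writer = new BufferedWriter(new FileWriter(out))
      try {
        source.getLines().foreach { line =>
          writer.write("***" + line + "***")
          writer.newLine()
        }
      } finally {
        writer.close() // closing a BufferedWriter also flushes it
      }
    } finally {
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Temporary files stand in for the tutorial's paths (an assumption for this sketch).
    val in = File.createTempFile("file-to-read", ".txt")
    val out = File.createTempFile("output", ".txt")
    val setup = new FileWriter(in)
    try setup.write("intro\nmore stuff\n") finally setup.close()

    copyWithStars(in, out)

    val result = Source.fromFile(out)
    try result.getLines().foreach(println) finally result.close()
  }
}
```

For a short script the simpler version is usually fine, but this shape is worth knowing once the job runs unattended.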

What if we wanted to print out the line number on each line? In Java, you'd need to maintain a separate counter to keep track of the line number. In Scala, you can use the zip method. A zip method takes two lists and steps through them together to create a new list. Each element in the output list is a pair composed of the elements at that position from the two input lists. In this scenario, we can use a variant of the zip method, called zipWithIndex. It iterates over a single collection, and for each position, it gives you both the element and the index. We'll get rid of the call to foreach and just use a normal Scala for loop that iterates over the pairs of values produced by the call to zipWithIndex:

import scala.io.Source
import java.io.File
import java.io.FileWriter
import java.io.BufferedWriter

object FileReader {

  def main(args: Array[String]): Unit = {
    val file = Source.fromFile("/scala_fileio/file-to-read.txt")
    val outputFile = new File("/scala_fileio/output.txt")
    val writer = new BufferedWriter(new FileWriter(outputFile))
    for ( (line,index) <- file.getLines().zipWithIndex){ 
      println(line)
      writer.write("Line " + (index+1) + ": " + line)
      writer.newLine()
    }
    writer.flush()
    writer.close()
  }

}

Since the index values start at zero, we add one to the index value to get each line number. This is done inside parentheses so that the addition happens first; without them, the index and then the character 1 would each be appended to the string.
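To see zip and zipWithIndex on their own, away from the file handling, here is a small sketch that works on an in-memory list. The names ZipDemo and numbered are invented for this example.

```scala
object ZipDemo {

  // Pairs each line with its 1-based line number.
  def numbered(lines: List[String]): List[String] =
    lines.zipWithIndex.map { case (line, index) =>
      "Line " + (index + 1) + ": " + line
    }

  def main(args: Array[String]): Unit = {
    val lines = List("intro", "more stuff", "last line")

    // zip pairs up the elements at the same position in two lists.
    val pairs = List(1, 2, 3).zip(lines)
    println(pairs)

    // zipWithIndex does the same job using the positions 0, 1, 2, ...
    numbered(lines).foreach(println)
  }
}
```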

Okay, let's move on to some regular expressions. Let's update the input file to have some more interesting input, similar to what you might find in a log file, with date and time at the beginning of each line:

22-08-2012 08:30:45 intro
23-09-2012 14:21:46 more stuff
24-09-2012 18:21:47 java.lang.NullPointerException, caused by java.text.ParseException, invalid date format

Let's suppose we want to find all lines that were printed in September. I'm using a British date format, so the month is the middle section of the date. Hence the pattern we want to look for is any two digits, followed by a hyphen, followed by 09 for September. Update the FileReader code to:

import scala.io.Source
import java.io.File
import java.io.FileWriter
import java.io.BufferedWriter
import scala.util.matching.Regex

object FileReader {

  def main(args: Array[String]): Unit = {
    val file = Source.fromFile("/scala_fileio/file-to-read.txt")
    // \\d matches a single digit; the backslash must be doubled inside a normal string literal
    val regex = new Regex("\\d\\d-09")
    for (line <- file.getLines()) {
      regex.findFirstIn(line) match {
        case Some(septemberDate) => println("Found a log line from September: "
            + line + " The matching part of the string was: " + septemberDate)
        case None => println("This line doesn't match")
      }
    }
  }

}

If you run this code you should get the following output:

This line doesn't match
Found a log line from September: 23-09-2012 14:21:46 more stuff The matching part of the string was: 23-09
Found a log line from September: 24-09-2012 18:21:47 java.lang.NullPointerException, caused by java.text.ParseException, invalid date format The matching part of the string was: 24-09

What is the code doing? Well, we're creating a regex to match against each line of the file. But we're also using a couple of new pieces of Scala syntax:
  1. Pattern matching (case classes)
  2. The Scala Option class, and its subclasses, Some and None
The match / case syntax is an example of a very widely used piece of Scala, called pattern matching. Don't be confused - it is a separate concept from regular expressions. You can think of it as a very advanced form of a switch statement. Whereas a classic Java switch only works on numbers, characters, enums and strings (from Java 7), in Scala you can also match against objects - matching on their class and the values of their fields.
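As a rough illustration of matching on both the class of an object and the values inside it, consider the sketch below. The LogEntry case class and the describe method are hypothetical and exist only for this example.

```scala
object MatchDemo {

  // A hypothetical log entry type, invented for this sketch.
  case class LogEntry(level: String, message: String)

  def describe(x: Any): String = x match {
    case LogEntry("ERROR", msg) => "error: " + msg            // matches the class and a field value
    case LogEntry(level, _)     => "entry at level " + level  // binds a field to a name
    case n: Int if n > 100      => "a large number"           // matches on type, with a guard
    case s: String              => "a string of length " + s.length
    case _                      => "something else"           // the default case
  }

  def main(args: Array[String]): Unit = {
    println(describe(LogEntry("ERROR", "disk full")))
    println(describe(LogEntry("INFO", "started")))
    println(describe(250))
    println(describe("intro"))
  }
}
```

The cases are tried from top to bottom, so the more specific patterns need to come first.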

In the FileReader example, the regex search can either find a match or not find one. It uses another common piece of Scala to express this - returning either Some or None. This is a mechanism within Scala to avoid NullPointerExceptions. In Java, if a method call could return null and you forget to put a null check in your code, you could get a NullPointerException. In idiomatic Scala, a method that might have no result returns an object of type Option instead. The Option class has two subtypes, called Some and None. If Some is returned, it is a container that holds the actual return object. If None is returned, you don't have a return object.

This mechanism avoids a null pointer because the return type makes the possibility of a missing value explicit, and a pattern match is the most direct way to deal with both cases. The object is only extracted from the Some container once the return has been checked and found to be a Some object. You can see from the above code that the syntax for matching against the Some object is to say Some(variableName). If the match succeeds, the returned object is bound to that variable name, and you can use it on the right-hand side of the case. In the above example, we bind it to a variable called septemberDate and print it out. As is standard with regular expressions, it only contains part of the line - the specific part that matched the regex.
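Pattern matching is not the only way to work with an Option. The sketch below shows a few alternatives, using the same September regex on in-memory strings; the names OptionDemo, septemberDate and describe are assumptions made for this illustration.

```scala
object OptionDemo {

  // The same pattern as in the tutorial, built with the .r shorthand on String.
  val septemberRegex = "\\d\\d-09".r

  // Returns Some(matched text) or None; it never returns null.
  def septemberDate(line: String): Option[String] =
    septemberRegex.findFirstIn(line)

  // Explicit pattern match, as in the FileReader code.
  def describe(line: String): String = septemberDate(line) match {
    case Some(date) => "matched " + date
    case None       => "no match"
  }

  def main(args: Array[String]): Unit = {
    val hit = septemberDate("23-09-2012 14:21:46 more stuff")
    val miss = septemberDate("22-08-2012 08:30:45 intro")

    println(hit.getOrElse("no match"))  // supply a default for the None case
    println(miss.getOrElse("no match"))
    println(hit.map(_.take(2)))         // transform the value only if it is present
    println(miss.isDefined)             // simple presence check
  }
}
```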

Let's try an example which has multiple matches on a single line. We'll extract the names of the exceptions on the third line. A basic pattern is to look for word characters, then a dot, then word characters, then a dot, then more word characters ending with "Exception". (Obviously this wouldn't work for all exception names, but it is sufficient for this example.) Update the code to:

import scala.io.Source
import java.io.File
import java.io.FileWriter
import java.io.BufferedWriter
import scala.util.matching.Regex

object FileReader {

  def main(args: Array[String]): Unit = {
    val file = Source.fromFile("/scala_fileio/file-to-read.txt")
    // \\w+ matches word characters and \\. matches a literal dot
    val regex = new Regex("\\w+\\.\\w+\\.\\w*Exception")
    for (line <- file.getLines()) {
      for (m <- regex.findAllIn(line)) {
        println("Found a log line with an exception: "
            + line + " The matching part of the string was: " + m)
      }
    }
  }

}

If you run this you should get the output:

Found a log line with an exception: 24-09-2012 18:21:47 java.lang.NullPointerException, caused by java.text.ParseException, invalid date format The matching part of the string was: java.lang.NullPointerException
Found a log line with an exception: 24-09-2012 18:21:47 java.lang.NullPointerException, caused by java.text.ParseException, invalid date format The matching part of the string was: java.text.ParseException

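Now, this code works, but we can simplify it. The Scala for construct is far more powerful than Java's. A single for loop can contain several generators, separated by semicolons, and each later generator can use the variables bound by the earlier ones. So you can replace the two nested loops with a single one (and remove one of the closing braces):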
    for ( line <- file.getLines(); m <- regex.findAllIn(line)) { 

You should be able to rerun this and find you get the same result.
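The same for construct can also build a result instead of just printing, by adding yield, and it can filter with an if guard. The following is a sketch under the assumption that the log lines are already in a list; ForDemo and exceptionsIn are names invented for the example.

```scala
object ForDemo {

  val exceptionRegex = "\\w+\\.\\w+\\.\\w*Exception".r

  // Collects every exception name found in the given lines.
  def exceptionsIn(lines: List[String]): List[String] =
    for {
      line <- lines                                 // first generator
      if line.contains("Exception")                 // guard: skip lines that cannot match
      m <- exceptionRegex.findAllIn(line).toList    // second generator, uses `line`
    } yield m

  def main(args: Array[String]): Unit = {
    val lines = List(
      "22-08-2012 08:30:45 intro",
      "24-09-2012 18:21:47 java.lang.NullPointerException, caused by java.text.ParseException, invalid date format"
    )
    exceptionsIn(lines).foreach(println)
  }
}
```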

Summary

In this article you've learnt:
  • How to write a Scala object with a main method.
  • How Scala uses singletons, and the difference between standalone singletons and companion objects.
  • How to read a file using the Scala Source class.
  • How to iterate over a file using getLines() and the foreach method.
  • How to iterate over a file with line numbers by using the zipWithIndex method.
  • How to use existing Java classes from Scala to write to a file.
  • How to write a basic for loop in Scala.
  • How to use the Regex class to create a regular expression.
  • The basics of how Scala pattern matching works.
  • How Scala avoids NullPointerExceptions with the Option, Some and None classes.

If you'd like to see some more examples of pattern matching and how the Option class works, see:
Why Java developers should be learning Scala

If you'd like to experiment with the Scala IO library, see:
http://jesseeichar.github.com/scala-io-doc/0.4.1-seq/index.html#!/overview

If you want a more detailed explanation of the concepts touched upon in this article, the first edition of "Programming in Scala" is available free online:
http://www.artima.com/pins1ed/
