Java 8 Streams Tutorial

In this tutorial, I’m going to start by explaining some of the basics of streams. Viz:
  • What streams are
  • Terminal and non-terminal operations
  • Their “lazy” nature
  • Their read-once nature
  • Why they were introduced i.e. how they enable easy parallel operations
Then I’m going to work through examples of four key stream operations:
  • Filter
  • Map
  • FlatMap
  • Collect
I’m going to include plenty of code snippets, but note that you can get all the source over on my GitHub: https://github.com/hedleyproctor/java8-examples

Introduction to streams

To obtain a stream, you call the new stream() method that has been added to the Collection interface.

Stream operations can be divided into two types:

Intermediate operations, that return a stream:

  • filter
  • skip
  • limit
  • map
  • flatMap
  • distinct
  • sorted

Terminal operations, that return some kind of result

  • anyMatch – boolean
  • noneMatch – boolean
  • allMatch – boolean
  • findAny – Optional
  • findFirst – Optional
  • forEach – void, e.g. print
  • collect
  • reduce

The idea behind streams is that you can build up a pipeline of operations by calling multiple intermediate operations, and then finally a terminal operation to obtain a result.
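To make the pipeline idea concrete, here is a minimal sketch that chains four intermediate operations and one terminal operation. The class name, method name and sample words are made up for illustration and are not from my example repository.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class PipelineDemo {
    // Lengths of the first two distinct words that are longer than three characters.
    static List<Integer> firstTwoLongWordLengths(List<String> words) {
        return words.stream()
                .filter(word -> word.length() > 3) // intermediate
                .distinct()                        // intermediate
                .limit(2)                          // intermediate
                .map(String::length)               // intermediate
                .collect(Collectors.toList());     // terminal: produces the result
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("claim", "job", "claim", "payment", "motor");
        System.out.println(firstTwoLongWordLengths(words)); // [5, 7]
    }
}
```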

Streams differ from collections in two important ways.

Firstly, unlike a collection, which is essentially a set of data in memory, stream elements are only produced one at a time, as you iterate over the stream. This is referred to as the “lazy” nature of streams. Imagine you have a large dataset of a million elements in memory and you create a stream backed by this dataset. If the entire dataset were iterated every time you called an intermediate operation, this would be hugely inefficient. Rather, you can think of the intermediate operations as recording that an operation needs to be performed, but deferring the actual execution of that operation until you call a terminal method. At this point, the stream is iterated, and each intermediate operation is evaluated.

Secondly, you can only read from a stream once. This differs from e.g. Scala, in which you can read a stream as many times as you like. There is a great Stack Overflow answer from one of the stream API designers that explains why this design choice was made, but it is a bit of a monster, so I’ll summarise it:

  • You can use streams for things other than collections, that genuinely are read once, e.g. reading a file with BufferedReader, which has a lines() method returning a Stream<String>.
  • Allowing two types of stream, one lazy and the other not, creates its own problems. e.g.
    • In Scala you can have bugs if your code attempts to read a stream twice when in fact it has been passed a once-off stream implementation.
    • Collection classes optimise some operations by storing / caching data. e.g. calling size() on a collection returns a cached size value. Calling size() on a filtered collection would take O(n) time, as it would have to apply the filter to the collection.
    • If you pass round a lazy stream and use it multiple times, each time you operate on it, the entire set of operations needs to be evaluated.
There is a link to the answer at the bottom of this article if you want to read it.
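Both properties are easy to observe for yourself. The following is a small sketch, with class and method names invented for this illustration: the first method logs when the filter actually runs, and the second tries to read the same stream twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class LazyReadOnceDemo {
    // Records events in order, so we can see that the filter does nothing
    // until the terminal operation (forEach) is called.
    static List<String> laziness() {
        List<String> log = new ArrayList<>();
        Stream<Integer> evens = Arrays.asList(1, 2, 3, 4).stream()
                .filter(n -> { log.add("filter " + n); return n % 2 == 0; });
        log.add("pipeline built");
        evens.forEach(n -> log.add("saw " + n));
        return log;
    }

    // A second terminal operation on the same stream is rejected.
    static boolean secondReadFails() {
        Stream<Integer> stream = Arrays.asList(1, 2, 3).stream();
        stream.forEach(n -> { });
        try {
            stream.forEach(n -> { });
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // [pipeline built, filter 1, filter 2, saw 2, filter 3, filter 4, saw 4]
        System.out.println(laziness());
        System.out.println(secondReadFails()); // true
    }
}
```

Note how the log interleaves the filter and forEach calls: each element travels through the whole pipeline before the next one is produced.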

Why were streams introduced?

To me, the advantages of streams can be summed up as three points:
  1. Using functional style methods is clearer. i.e. if you use a filter method, someone can see at a glance what you are doing
  2. Because streams are lazy, you can pass streams around between methods, classes or code modules, and apply operations to the stream, without anything being evaluated until you need it to be.
  3. Stream processing can be done in parallel, by calling the parallelStream() method instead of stream(). Because you aren’t manually iterating over a collection and performing arbitrary actions, but instead calling well defined methods on the stream such as map and filter, the stream classes know how to split themselves up into separate streams to be processed on different processors, and then recombine the results.
The third reason is really the driver. With the advent of “big data”, the ability to perform data processing operations on massive data sets is hugely important, and to do this efficiently, you will want your code to be able to make use of multiple cores / processors. Streams provide a way of doing that which means you don’t have to write complex code to split your input up, send it to multiple places, wait for the results and then recombine them. The stream implementation handles this for you. However, this article is meant as an introduction to streams, so I don’t want to go into too much detail as to how this works. Rather, let’s start looking at some actual stream operations.
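As a small taste of what this looks like in code, here is a sketch in which the only difference between the sequential and parallel versions is which method is used to obtain the stream. The class and method names are hypothetical, and a dataset this small is far too tiny to benefit from parallelism; it only shows that the result is the same.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelDemo {
    // The pipeline is identical either way; only the stream source differs.
    static long sumOfSquares(List<Integer> numbers, boolean parallel) {
        return (parallel ? numbers.parallelStream() : numbers.stream())
                .mapToLong(n -> n.longValue() * n)
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> numbers = IntStream.rangeClosed(1, 1000)
                .boxed()
                .collect(Collectors.toList());
        System.out.println(sumOfSquares(numbers, false)); // 333833500
        System.out.println(sumOfSquares(numbers, true));  // 333833500
    }
}
```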

Stream operations

To give my code examples, I’m going to use examples from two domains:
  1. Insurance – this is the domain I work in. Here we have insurance claims, which could be of different types (e.g. motor, household), have jobs attached to them (e.g. motor repair, solicitor, loss adjuster) and payments made.
  2. Restaurant menu – this is what Java 8 In Action uses for its examples.

Filter

I think filter is a great operation to start with: it’s a very common thing to do and a nice intro to stream syntax. In my code examples, if you open the StreamExamples class and find the filter method, you can see the syntax for filtering a collection of claims to motor claims only:
Stream<Claim> motorClaims = claims.stream().filter(claim -> claim.getProductType().equals(Claim.PRODUCT_TYPE.MOTOR));
The filter method takes a lambda expression, which accepts an object of the type used in your stream, in this case a Claim object, and returns a boolean. Any element for which this check is true is included in the filtered stream and elements for which the check returns false are excluded. In this case, we simply check if the type of the claim is MOTOR. As this is an intermediate operation, the return type is Stream<Claim>. As explained above, at this point, the filter hasn’t actually been evaluated. It will only be evaluated when a terminal operation is added. Before we do that, let’s look at a couple more simple examples of filter. We could filter on payments over 1000:
Stream<Claim> paymentsOver1000 = claims.stream().filter(claim -> claim.getTotalPayments() > 1000);
Or claims with 2 or more jobs:
Stream<Claim> twoOrMore = claims.stream().filter(claim -> claim.getJobs().size() >= 2);
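If you want something you can run straight away, here is a self-contained sketch that adds a terminal operation to the filters. The Claim class and ProductType enum below are cut-down, hypothetical stand-ins for the real classes in my repository, with just enough fields for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FilterDemo {
    enum ProductType { MOTOR, HOUSEHOLD }

    // Hypothetical, cut-down stand-in for the article's Claim class.
    static class Claim {
        private final ProductType productType;
        private final double totalPayments;

        Claim(ProductType productType, double totalPayments) {
            this.productType = productType;
            this.totalPayments = totalPayments;
        }

        ProductType getProductType() { return productType; }
        double getTotalPayments() { return totalPayments; }
    }

    static List<Claim> sampleClaims() {
        return Arrays.asList(
                new Claim(ProductType.MOTOR, 1500.0),
                new Claim(ProductType.HOUSEHOLD, 2000.0),
                new Claim(ProductType.MOTOR, 200.0));
    }

    // count() is the terminal operation that triggers the filter.
    static long countMotorClaims(List<Claim> claims) {
        return claims.stream()
                .filter(claim -> claim.getProductType() == ProductType.MOTOR)
                .count();
    }

    // Filters can be chained; only claims passing both reach the list.
    static List<Claim> motorClaimsOver1000(List<Claim> claims) {
        return claims.stream()
                .filter(claim -> claim.getProductType() == ProductType.MOTOR)
                .filter(claim -> claim.getTotalPayments() > 1000)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Claim> claims = sampleClaims();
        System.out.println(countMotorClaims(claims));           // 2
        System.out.println(motorClaimsOver1000(claims).size()); // 1
    }
}
```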

Map

The map operation means “map” in the mathematical sense – that of mapping one value to another. Suppose we have a stream of Claim objects, but what we need is claim ids? Just map the stream like this:
Stream<Long> claimIds = claims.stream().map(claim -> claim.getId());
As you can see, the map operation takes a lambda expression that accepts an object of the type used in your stream, in this case a Claim, and converts it to another type. In fact, in this example, you don’t even need to write the full lambda expression, you can use a method reference:
Stream<Long> claimIds2 = claims.stream().map(Claim::getId);
Now that we have seen two different intermediate operations, let’s look at how to build a pipeline by applying the operations one after another. If we want to get the ids of all motor claims, we can write the following:
Stream<Long> motorClaimIds = claims.stream()
                .filter(claim -> claim.getProductType().equals(Claim.PRODUCT_TYPE.MOTOR))
                .map(Claim::getId);
I’d recommend writing your pipelines with each operation on a separate line like this. Not only does it make the code more readable, but if there is a fatal exception during your stream processing, the line number will take you straight to the failing operation.

Note that you don’t just have to “extract” values during a map operation, you can also create new objects. For example, you might convert from domain objects to DTOs, like this:

Stream<ClaimDTO> claimDTOs = claims.stream().map(claim -> new ClaimDTO(claim.getId(), claim.getTotalPayments()));
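Here is a runnable sketch of both kinds of mapping, extracting a value and creating a new object. Again, Claim and ClaimDTO are simplified, hypothetical stand-ins, and the summary field exists only so the example has something to print.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapDemo {
    // Hypothetical, cut-down stand-in for the article's Claim class.
    static class Claim {
        private final Long id;
        private final double totalPayments;

        Claim(Long id, double totalPayments) {
            this.id = id;
            this.totalPayments = totalPayments;
        }

        Long getId() { return id; }
        double getTotalPayments() { return totalPayments; }
    }

    // Hypothetical DTO; the summary field is invented for this sketch.
    static class ClaimDTO {
        final String summary;

        ClaimDTO(Long id, double totalPayments) {
            this.summary = "claim " + id + ": " + totalPayments;
        }
    }

    // Extract a value from each element using a method reference.
    static List<Long> claimIds(List<Claim> claims) {
        return claims.stream()
                .map(Claim::getId)
                .collect(Collectors.toList());
    }

    // Create a new object per element, then map again to read a field from it.
    static List<String> summaries(List<Claim> claims) {
        return claims.stream()
                .map(claim -> new ClaimDTO(claim.getId(), claim.getTotalPayments()))
                .map(dto -> dto.summary)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Claim> claims = Arrays.asList(new Claim(1L, 1500.0), new Claim(2L, 80.5));
        System.out.println(claimIds(claims));  // [1, 2]
        System.out.println(summaries(claims)); // [claim 1: 1500.0, claim 2: 80.5]
    }
}
```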

FlatMap

Suppose you want to get a stream or collection of all of the jobs attached to a list of claims. You might start with a map operation, like this:
claims.stream().map(Claim::getJobs)
However, there is a problem here. Calling getJobs() on a claim returns a Set of Job objects. So we now have a stream composed of Sets, whereas we want a stream of Job objects. This is where flatMap comes in. It takes a function that turns each element into a stream, and “collapses” the resulting streams down into a single stream of their contents. Here, Set::stream turns each Set of jobs into a stream. Hence, to get a stream of all the jobs, we write:
Stream<Job> jobs = claims.stream().map(Claim::getJobs).flatMap(Set::stream);
Again, we can pipeline a number of operations here, for example by filtering the stream before mapping the values. Taking an example from the food / menu domain, here’s how to get side orders available for dishes with over 750 calories:
Stream<SideOrder> sideOrdersOver750 = menu.stream()
                .filter(dish -> dish.getCalories() > 750)
                .map(Dish::getSideOrders)
                .flatMap(Set::stream);
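The map step followed by flatMap(Set::stream) can also be written as a single flatMap call with a lambda. Here is a runnable sketch of that form. To keep it short, jobs are plain strings rather than Job objects, and the Claim class is a hypothetical stand-in.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class FlatMapDemo {
    // Hypothetical stand-in: jobs are plain strings here rather than Job objects.
    static class Claim {
        private final Set<String> jobs;

        Claim(String... jobs) {
            // LinkedHashSet keeps insertion order, so the output is predictable.
            this.jobs = new LinkedHashSet<>(Arrays.asList(jobs));
        }

        Set<String> getJobs() { return jobs; }
    }

    // Each claim becomes a stream of its jobs; flatMap concatenates those streams.
    static List<String> allJobs(List<Claim> claims) {
        return claims.stream()
                .flatMap(claim -> claim.getJobs().stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Claim> claims = Arrays.asList(
                new Claim("motor repair", "solicitor"),
                new Claim(), // a claim with no jobs contributes nothing
                new Claim("loss adjuster"));
        System.out.println(allJobs(claims)); // [motor repair, solicitor, loss adjuster]
    }
}
```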

Collect

The three operations we have covered so far are all intermediate operations. They operate on a stream and return a stream. When you want to convert your stream back into a collection, you will want to call the collect method. There are a large number of variations as to how you collect, and this choice can be a bit bewildering at first, so I want to show a good number of examples here to help you get familiar with what is available to you.

Firstly, let’s start with the simplest possible collect operations, to a set, list or map. Here is what you could do if you want a stream of motor claims collected to one of these types:

Set<Claim> motorClaimSet = claims.stream()
                .filter(claim -> claim.getProductType().equals(Claim.PRODUCT_TYPE.MOTOR))
                .collect(Collectors.toSet());

List<Claim> motorClaimList = claims.stream()
                .filter(claim -> claim.getProductType().equals(Claim.PRODUCT_TYPE.MOTOR))
                .collect(Collectors.toList());

// to a map (grouping by unique key)
Map<Long,Claim> motorClaimMap = claims.stream()
                .filter(claim -> claim.getProductType().equals(Claim.PRODUCT_TYPE.MOTOR))
                .collect(Collectors.toMap(Claim::getId, Function.<Claim>identity()));
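In the map example, the claim id key is unique. What happens if you map by a non-unique key? Collectors.toMap will throw an IllegalStateException as soon as it sees a duplicate key (unless you supply a merge function as a third argument). What you normally want instead is a map whose values aren’t individual objects, but rather lists of the objects that share that non-unique key. For example: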
Map<Claim.PRODUCT_TYPE,List<Claim>> claimsByType = claims.stream().collect(groupingBy(Claim::getProductType));
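The difference between toMap and groupingBy on a non-unique key is easy to check. This sketch uses plain strings keyed by their length rather than claims, purely to keep it short; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ToMapVsGroupingDemo {
    // groupingBy collects every element sharing a key into a list.
    static Map<Integer, List<String>> byLength(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length));
    }

    // Two-argument toMap has nowhere to put a second value for the same key.
    static boolean toMapRejectsDuplicateKeys(List<String> words) {
        try {
            words.stream()
                    .collect(Collectors.toMap(String::length, Function.identity()));
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("motor", "house", "pet");
        System.out.println(byLength(words).get(5));           // [motor, house]
        System.out.println(byLength(words).get(3));           // [pet]
        System.out.println(toMapRejectsDuplicateKeys(words)); // true
    }
}
```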
You can see here that we are using the groupingBy method, statically imported from Collectors. Grouping can be multi-level, however. Not only that, but the grouping keys don’t have to be attributes of the objects: you can dynamically create the key values as part of the grouping. Consider grouping by product type, and then by whether the total payments are over £1000:
Map<Claim.PRODUCT_TYPE,Map<String,List<Claim>>> claimsByTypeAndPayment = 
 claims.stream()
 .collect(
   groupingBy(Claim::getProductType,
      groupingBy(claim -> {
         if (claim.getTotalPayments() > 1000) {
              return "HIGH";
         }
         else {
              return "LOW";
         }
        })
      ));
Note that the result of your grouping doesn’t have to be the objects in your stream. You may want to extract a value from them. In the menu domain, suppose I want to group side orders by type, and get a list of the calories for each of the side orders in each type. In this case you will want to operate on a stream of SideOrder objects, but use the two parameter groupingBy method to specify to extract the calorie value, rather than collecting the SideOrder objects themselves:
Map<SideOrder.Type,List<Integer>> sideOrderCalories = 
    menu.stream()
    .map(Dish::getSideOrders)
    .flatMap(Set::stream)
    .collect(groupingBy(SideOrder::getType, mapping(SideOrder::getCalories, toList())));
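The second argument to groupingBy is itself a collector, so you aren’t limited to mapping. The sketch below shows the mapping version alongside Collectors.counting(), which counts the elements in each group. The SideOrder class here is a hypothetical stand-in that uses a String for the type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DownstreamCollectorDemo {
    // Hypothetical, cut-down stand-in for the SideOrder class.
    static class SideOrder {
        private final String type;
        private final int calories;

        SideOrder(String type, int calories) {
            this.type = type;
            this.calories = calories;
        }

        String getType() { return type; }
        int getCalories() { return calories; }
    }

    static List<SideOrder> sampleSideOrders() {
        return Arrays.asList(
                new SideOrder("potato", 400),
                new SideOrder("potato", 250),
                new SideOrder("veg", 50));
    }

    // Group by type, keeping only the calorie value of each side order.
    static Map<String, List<Integer>> caloriesByType(List<SideOrder> sideOrders) {
        return sideOrders.stream()
                .collect(Collectors.groupingBy(SideOrder::getType,
                        Collectors.mapping(SideOrder::getCalories, Collectors.toList())));
    }

    // Group by type, keeping only the number of side orders in each group.
    static Map<String, Long> countByType(List<SideOrder> sideOrders) {
        return sideOrders.stream()
                .collect(Collectors.groupingBy(SideOrder::getType, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<SideOrder> sideOrders = sampleSideOrders();
        System.out.println(caloriesByType(sideOrders).get("potato")); // [400, 250]
        System.out.println(countByType(sideOrders).get("veg"));       // 1
    }
}
```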
Sometimes you want to group into only two groups. Because this is a common operation, it has a special convenience collector called partitioningBy:
Map<Boolean,List<Dish>> veggieAndNonVeggie = menu.stream().collect(partitioningBy(Dish::isVegetarian));
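The predicate passed to partitioningBy doesn’t have to be a method reference. Here is a runnable sketch that splits a list of payment amounts at a threshold; the names and numbers are made up for illustration. One handy property is that the resulting map always contains both the true and false keys, even if one of the lists is empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitionDemo {
    // true -> payments above the threshold, false -> the rest.
    static Map<Boolean, List<Integer>> splitAt(List<Integer> payments, int threshold) {
        return payments.stream()
                .collect(Collectors.partitioningBy(payment -> payment > threshold));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Integer>> split = splitAt(Arrays.asList(200, 1500, 800, 3000), 1000);
        System.out.println(split.get(true));  // [1500, 3000]
        System.out.println(split.get(false)); // [200, 800]

        // Nothing exceeds the threshold, but the true key is still present.
        System.out.println(splitAt(Arrays.asList(1, 2), 1000).get(true)); // []
    }
}
```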
Sometimes you want to sum or average numerical values from your stream:
int totalCalories = menu.stream().collect(summingInt(Dish::getCalories));
double totalPayments = claims.stream().collect(summingDouble(Claim::getTotalPayments));
double averagePayment = claims.stream().collect(averagingDouble(Claim::getTotalPayments));
The above syntax is fine if you are only obtaining one value. However, if you want both a sum and an average, say, you shouldn’t evaluate each one separately, as this means creating a new stream and iterating over the data once per value. Instead, you should use a summarizing collector:
DoubleSummaryStatistics paymentStats = 
  claims.stream().collect(summarizingDouble(Claim::getTotalPayments));
totalPayments = paymentStats.getSum();
averagePayment = paymentStats.getAverage();
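The statistics object holds more than the sum and average. This sketch, which uses a plain list of doubles in place of claims, shows the count and maximum as well, all gathered in one pass. The class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class SummaryStatsDemo {
    // One pass over the data yields count, sum, min, max and average.
    static DoubleSummaryStatistics summarise(List<Double> payments) {
        return payments.stream()
                .collect(Collectors.summarizingDouble(Double::doubleValue));
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarise(Arrays.asList(100.0, 250.0, 1250.0));
        System.out.println(stats.getCount());   // 3
        System.out.println(stats.getSum());     // 1600.0
        System.out.println(stats.getMax());     // 1250.0
        System.out.println(stats.getAverage()); // roughly 533.33
    }
}
```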
My final example is something that has been missing from Java for a while. How often have you needed to concatenate a collection of strings, only to have to resort to using Apache Commons to do it! No more. Now you can use the joining() collector:
String claimIdListAsCommaSeparatedString = claims.stream().map(claim -> claim.getId().toString()).collect(joining(","));
Note that if you don’t specify a separator, the default is that none will be used.
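For completeness, here is a runnable sketch of the three overloads of Collectors.joining: no separator, a separator, and a separator with a prefix and suffix. The class and method names are hypothetical, and the ids are just sample values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoiningDemo {
    static String noSeparator(List<Long> ids) {
        return ids.stream().map(Object::toString).collect(Collectors.joining());
    }

    static String commaSeparated(List<Long> ids) {
        return ids.stream().map(Object::toString).collect(Collectors.joining(","));
    }

    // The three-argument overload also takes a prefix and a suffix.
    static String bracketed(List<Long> ids) {
        return ids.stream().map(Object::toString).collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 3L);
        System.out.println(noSeparator(ids));    // 123
        System.out.println(commaSeparated(ids)); // 1,2,3
        System.out.println(bracketed(ids));      // [1, 2, 3]
    }
}
```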

Summary

I hope this has been a useful introduction to streams and how to use them. We’ve covered what streams are, their lazy nature, their once-off nature and why they enable easier parallel processing. Then we have looked at the most common stream operations: filter, map, flatMap and collect.

For collecting, if you want to know how to write your own custom collector, see my example: “Yet another Java 8 custom collector example”. If you are interested in the background to the stream API design choices, see “Why are Java streams once off?” and “Why doesn’t java.util.Collection implement the new stream interface?”. Finally, for more details on both Java 8 in general, and functional programming, I’d strongly recommend Java 8 In Action.