Creating custom Java 8 Stream collectors

The streams feature (which makes heavy use of Functional Interfaces) is arguably the strongest feature in Java 8 (and above). The introduction of lambdas in Java in the same release to target Functional Interfaces also meant that creating chained operations in a functional style has never been easier in Java.

That being said, there are plenty of examples in the official docs that show how streams can be “collected” (effectively reduced/folded depending on your preference of the terminology) into Collection types – List, Set, or even Map. Now, I must make it absolutely clear at this stage that I love Java streams, and barring the absence (or rather, deprecation) of the zip iterator, they are almost comprehensive. However, there are no examples in the docs to show how we might collect the result of a series of intermediate operations into a custom type. Of course, the helper class, Collectors, has several helper methods such as groupingBy, partitioningBy, filtering, and reducing, but they either return a Map, or expect a reducible expression, which may not always be what we have, as explained next.

Recently, I did a project in which I needed to process a stream of integers acting as indices into a simple wrapper around a List of integers, and then ultimately collect the updated values of the list into a new instance of the custom type. (The lack of zip forced me to take quite a peripatetic approach to finally make things work; perhaps more on that in a later blog post.) It was quite an interesting experience that sparked my interest in exploring how much further we could push the collect mechanism. (If you are interested in checking out the code for the mentioned example, you can find it here – Functional Nim.)

For some more examples of custom Collector implementations, you can check out my Github page.

Use Case

For the purposes of this blog, to keep things simple, let us consider a hypothetical example. Suppose we have a Point class with the following structure:

package com.z0ltan.custom.collectors.types;

public class Point {
	private int x;
	private int y;

	public Point(final int x, final int y) {
		this.x = x;
		this.y = y;
	}

	public int x() {
		return this.x;
	}

	public int y() {
		return this.y;
	}

	@Override
	public int hashCode() {
		return (this.x + this.y) % 31;
	}

	@Override
	public boolean equals(Object other) {
		if (other == null || !(other instanceof Point)) {
			return false;
		}

		Point p = (Point) other;

		return p.x == this.x && p.y == this.y;
	}

	@Override
	public String toString() {
		return "Point { x = " + x + ", y = " + y + " }";
	}
}

and we have a List of such points. Now suppose we wish to collate the information into a custom Points object which has the following structure:

package com.z0ltan.custom.collectors.types;

import java.util.Set;

public class Points {
	private Set<Integer> xs;
	private Set<Integer> ys;

	public Points(final Set<Integer> xs, final Set<Integer> ys) {
		this.xs = xs;
		this.ys = ys;
	}

	@Override
	public int hashCode() {
		return (this.xs.hashCode() + this.ys.hashCode()) % 31;
	}

	@Override
	public boolean equals(Object other) {
		if (other == null || !(other instanceof Points)) {
			return false;
		}

		Points p = (Points) other;

		return p.xs.equals(this.xs) && p.ys.equals(this.ys);
	}

	@Override
	public String toString() {
		return "Points { xs = " + xs + ", ys = " + ys + " }";
	}
}

As can be seen, we wish to retain only the unique x and y coordinate values in our final object.

A simple and logical way would be to collect the items from this stream (collect being a “terminal operation” defined in the Stream interface) into our Points object using a custom collector. Before we can do that, let us first understand what is involved in implementing a custom collector.

How to implement a custom collector

In brief, there are two forms of operations involved in Java streams – non-terminal or intermediate operations, which produce streams of their own, and terminal operations, which effectively stop the pipeline resulting in a final result of some sort.

As mentioned before, in most cases, the helper class, Collectors, provides enough functionality to cater to almost any requirement imaginable. However, in cases such as these, where we want to collect data into a custom type, we might be better off defining our own custom collector.

To do that, let us examine the signature of the collect method in the Stream interface. In fact, we will find that there are two versions of this method available to us:

 <R, A> R collect(Collector<? super T, A, R> collector)

and

 <R> R collect(Supplier<R> supplier,
               BiConsumer<R, ? super T> accumulator,
               BiConsumer<R, R> combiner)
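
As a rough illustration of the three-argument form, here is a minimal sketch that gathers the distinct values of a list into a Set. The class and method names (ThreeArgCollectDemo, distinct) are made up for this example, and the three functions are pulled out into named variables purely for readability.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

public class ThreeArgCollectDemo {
    // Collects the distinct values of a list using the three-argument collect.
    public static Set<Integer> distinct(List<Integer> values) {
        // Creates a fresh, empty result container.
        Supplier<Set<Integer>> supplier = HashSet::new;
        // Folds a single stream element into a container.
        BiConsumer<Set<Integer>, Integer> accumulator = Set::add;
        // Merges the second container into the first (used for parallel runs).
        BiConsumer<Set<Integer>, Set<Integer>> combiner = Set::addAll;

        return values.stream().collect(supplier, accumulator, combiner);
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList(1, 2, 2, 3, 3, 3)));
    }
}
```

Note that there is no finishing step here: whatever container the supplier creates is also what comes back from collect.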

So which one do we use? Well, we can actually use either one for our purposes. In fact, the former is preferable if we have some involved logic, and we wish to encapsulate all of that in a nice class. The latter covers much the same ground, except that it has no finishing step: the container created by the supplier is itself the result. This can be further clarified by examining the Collector interface (showing only the abstract methods):

public interface Collector<T, A, R> {
        Supplier<A> supplier();
        BiConsumer<A, T> accumulator();
        BinaryOperator<A> combiner();
        Function<A, R> finisher();
        Set<Collector.Characteristics> characteristics();
}

As can be seen, if we do implement the Collector interface, we will have to implement essentially the same methods as that used by the second version of the collect method. In addition, we have a couple of extra methods which are not only interesting, but quite vital if we wish to implement the interface:

  • finisher: This is the main method that we need to implement as per our collection logic. This is the actual part where the accumulated values of the stream are massaged together into the final return value. The type parameters give a big hint in this regard – the return type, R, is the same as that returned by the overall collect method.
  • characteristics: This is where we need to be careful. The enum has three variants – CONCURRENT, IDENTITY_FINISH, and UNORDERED. The bottom line is this – use UNORDERED if the final value does not depend on the order in which the elements of the stream are encountered, and use CONCURRENT only if the accumulator can safely be called from multiple threads on the same result container (which means the container itself must be thread-safe). In the case of collecting values into a custom non-collection type, I don’t see any scenario where you would want to use IDENTITY_FINISH (unless you are a big fan of unsolicited ClassCastExceptions).

    In short, this variant indicates that the finisher function is essentially an identity function, meaning that it can be skipped, and the currently accumulated value returned as the overall result (which is precisely what we wish to avoid).
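
If you are curious which flags the built-in collectors declare, a quick sketch like the following prints them. The class name and the helper method are hypothetical, and the exact sets are an implementation detail of the JDK you run on, so treat the output as informative rather than guaranteed.

```java
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class CharacteristicsDemo {
    // Returns the characteristics that a given collector reports about itself.
    public static Set<Collector.Characteristics> of(Collector<?, ?, ?> collector) {
        return collector.characteristics();
    }

    public static void main(String[] args) {
        // toList() returns its accumulation container as-is, so it can skip the finisher.
        System.out.println("toList:  " + of(Collectors.toList()));
        // toSet() additionally does not care about encounter order.
        System.out.println("toSet:   " + of(Collectors.toSet()));
        // joining() accumulates into one type and finishes into a String.
        System.out.println("joining: " + of(Collectors.joining()));
    }
}
```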

One final comment to understand the collect method once and for all – what all those terms mean!

  • Supplier: This creates a new, empty result container (the running accumulator) into which the elements of the stream will be collected.
  • Accumulator: This is where the elements of the stream are combined with a running accumulator (which may be of a different type from the elements themselves), “reduced”, or “folded” in Functional terms.
  • Combiner: Similar to the accumulator, but the two values being combined are of the same type: it merges two partially filled result containers into one, which is what happens when a stream is processed in parallel. In most cases, this type would be a collection type, and finally,
  • Finisher: This is the meat of the whole collector. This is where the actual custom logic goes to turn the fully accumulated container into the final result of the given return type.
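
To make these four roles concrete, here is a small sketch that drives a collector by hand, roughly the way a stream split into two chunks might. The collector itself (commaJoiner) and the class name are invented for illustration; only the Collector methods being called are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;

public class CollectorStepsDemo {
    // A hypothetical collector: gathers strings in a List, then joins them with commas.
    public static Collector<String, List<String>, String> commaJoiner() {
        return Collector.of(
                ArrayList::new,                                        // supplier
                List::add,                                             // accumulator
                (left, right) -> { left.addAll(right); return left; }, // combiner
                parts -> String.join(",", parts));                     // finisher
    }

    // Calls each function manually instead of going through Stream.collect.
    public static String runByHand() {
        Collector<String, List<String>, String> c = commaJoiner();

        // First chunk: a fresh container that receives "a" and "b".
        List<String> first = c.supplier().get();
        c.accumulator().accept(first, "a");
        c.accumulator().accept(first, "b");

        // Second chunk: another fresh container that receives "c".
        List<String> second = c.supplier().get();
        c.accumulator().accept(second, "c");

        // Merge the two partial containers, then produce the final value.
        List<String> merged = c.combiner().apply(first, second);
        return c.finisher().apply(merged);
    }

    public static void main(String[] args) {
        System.out.println(runByHand()); // prints a,b,c
    }
}
```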

Now that we’ve analysed the signature of the collect method, we must be in a position to realise that we can actually create custom collectors in multiple ways:

  • Using the static Collector.of methods in the Collector interface by supplying the correct supplier, accumulator, combiner, finisher, and Collector characteristics,
  • By creating a class that implements the Collector interface itself, and thereby providing implementations of the same supplier, accumulator, combiner, finisher, and Collector characteristics,
  • By simply creating an anonymous class conforming to the Collector interface, and providing the same inputs as in the previous two cases, or
  • Using any combination of the above.
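
As a sketch of the first option, the same kind of collector can be put together with Collector.of and no dedicated class at all. The nested Point and Points classes below are trimmed-down stand-ins for the ones in the use case, and the names CollectorOfDemo, toPoints, and demo are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collector;

public class CollectorOfDemo {
    // Minimal stand-in for the Point class from the use case.
    public static final class Point {
        public final int x, y;
        public Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Minimal stand-in for the Points class from the use case.
    public static final class Points {
        public final Set<Integer> xs, ys;
        public Points(Set<Integer> xs, Set<Integer> ys) { this.xs = xs; this.ys = ys; }
    }

    // Builds the collector from its parts instead of implementing the interface.
    public static Collector<Point, List<Point>, Points> toPoints() {
        return Collector.of(
                ArrayList::new,
                List::add,
                (left, right) -> { left.addAll(right); return left; },
                points -> {
                    // Keep only the unique coordinate values.
                    Set<Integer> xs = new HashSet<>();
                    Set<Integer> ys = new HashSet<>();
                    for (Point p : points) {
                        xs.add(p.x);
                        ys.add(p.y);
                    }
                    return new Points(xs, ys);
                },
                Collector.Characteristics.UNORDERED);
    }

    // Runs the collector over a small fixed list of points.
    public static Points demo() {
        List<Point> points = Arrays.asList(new Point(1, 2), new Point(1, 2), new Point(3, 4));
        return points.stream().collect(toPoints());
    }

    public static void main(String[] args) {
        Points result = demo();
        System.out.println("xs = " + result.xs + ", ys = " + result.ys);
    }
}
```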

To keep matters simple, let us create a custom class that implements the Collector interface. This will not only make things easier to understand, but also allow us to maintain code cleanliness.

Now let’s proceed with the implementation of the given use case to solidify these concepts.

Implementation and Demo

Let’s create a simple Maven project called custom-collector:

Macushla:Blog z0ltan$ mvn archetype:generate -DgroupId=com.z0ltan.custom.collectors -DartifactId=custom-collector -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] ------------------------------------------------------------------------

               <elided>

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30.967 s
[INFO] Finished at: 2017-07-11T20:57:45+05:30
[INFO] Final Memory: 18M/62M
[INFO] ------------------------------------------------------------------------

After customising the project to our heart’s desire, let’s fill in our custom collector class:

package com.z0ltan.custom.collectors.collectors;

import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

import com.z0ltan.custom.collectors.types.Point;
import com.z0ltan.custom.collectors.types.Points;

public class PointToPointsCollector implements Collector<Point, List<Point>, Points> {
	@Override
	public Supplier<List<Point>> supplier() {
		return ArrayList::new;
	}

	@Override
	public BiConsumer<List<Point>, Point> accumulator() {
		return List::add;
	}

	@Override
	public BinaryOperator<List<Point>> combiner() {
		return (acc, ps) -> {
			acc.addAll(ps);
			return acc;
		};
	}

	@Override
	public Function<List<Point>, Points> finisher() {
		return (points) -> {
			final Set<Integer> xs = new HashSet<>();
			final Set<Integer> ys = new HashSet<>();
			
			for (Point p : points) {
				xs.add(p.x());
				ys.add(p.y());
			}
			
			return new Points(xs, ys);
		};
	}

	@Override
	public Set<java.util.stream.Collector.Characteristics> characteristics() {
		return Collections.unmodifiableSet(EnumSet.of(Collector.Characteristics.UNORDERED));
	}
}

We use ArrayList::new (method references are another excellent feature in Java 8 and beyond) for our Supplier since we start off with a blank slate, and for the Accumulator, we use List::add since the last section made it clear that the accumulator’s only job is to keep collecting items into a running value of another type (a List in this case).

Then we have the Combiner which is implemented by the little lambda expression:

   (acc, ps) -> { acc.addAll(ps); return acc; }

As mentioned in the previous section, the combiner simply flattens the collections together into a single collection. In case of confusion, always look to the type signature for clarity.

And finally, we have the Finisher:

return (points) -> {
	final Set<Integer> xs = new HashSet<>();
	final Set<Integer> ys = new HashSet<>();

	for (Point p : points) {
		xs.add(p.x());
		ys.add(p.y());
	}

	return new Points(xs, ys);
};

At this point of the stream pipeline, the points variable holds the list of accumulated Point objects. All we do then is to create an instance of the Points class by using the data available in the points variable. The whole point (if you will forgive the pun) is that this method will have logic peculiar to your specific use case, so the implementation will vary tremendously (which is more than can be said about the others – supplier, accumulator, and combiner).

And finally, here is our main class:

package com.z0ltan.custom.collectors;

import java.util.Arrays;
import java.util.List;

import com.z0ltan.custom.collectors.collectors.PointToPointsCollector;
import com.z0ltan.custom.collectors.types.Point;
import com.z0ltan.custom.collectors.types.Points;

public class Main {
	public static void main(String[] args) {
		final List<Point> points = Arrays.asList(new Point(1, 2), new Point(1, 2), new Point(3, 4), new Point(4, 3),
				new Point(2, 5), new Point(2, 5));

		// the result of our custom collector
		final Points pointsData = points.stream().collect(new PointToPointsCollector());

		System.out.printf("\npoints = %s\n", points);
		System.out.printf("\npoints data = %s\n", pointsData);
	}
}

Well, let’s run it and see the output!

Macushla:custom-collector z0ltan$ mvn package && java -jar target/custom-collector-1.0-SNAPSHOT.jar
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building custom-collector 1.0-SNAPSHOT
        <elided>
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.627 s
[INFO] Finished at: 2017-07-11T21:31:19+05:30
[INFO] Final Memory: 16M/55M
[INFO] ------------------------------------------------------------------------

points = [Point { x = 1, y = 2 }, Point { x = 1, y = 2 }, Point { x = 3, y = 4 }, Point { x = 4, y = 3 }, Point { x = 2, y = 5 }, Point { x = 2, y = 5 }]

points data = Points { xs = [1, 2, 3, 4], ys = [2, 3, 4, 5] }

Success!

My favourite feature in Java 9 – JShell (and other ruminations)

It used to be the case that Java releases were few and far between. Back in the olden days, releases used to happen so infrequently that many users were forced to write libraries to compensate for the lack of features in Java. I still recall how Generics itself was a big feature that was included only in Java 5 after Java 1.4 had been around for a long time. The problem with that was that a lot of code had already been written which made use of “raw” (in Java Generics parlance) types, and the Java folks could not afford to break all that code! Thus started this incongruous (if understandable) tradition in Java – introduce features but never break any existing code. The issue with this approach is that while it saves hundreds of thousands of man-hours of effort, it does introduce inherent limitations in the feature being introduced. This is the reason why Java Generics are such a mess, and now why Lambdas in Java are less than ideal (in fact I’d argue they’re pretty much non-features and we’ll see why in future posts). Notice also that Lambdas in Java were introduced as late as Java 8, and arguably primarily in order to support Streams. Thankfully, major Java releases have been occurring with greater frequency in recent years, and surely that is a good thing. Of course, I don’t want a situation where releases happen with such frequency that codebases get broken on a regular basis – that’d be even worse.

Java 9 has been in the offing for some time now, and I believe that its planning had already begun even before Java 8 was released. Java 8 introduced some much needed features – basic Lambda support, Streams (my favourite feature in Java 8!), and default methods in interfaces. The interesting bit about default methods in interfaces is that that feature was introduced to support the new Streams feature in Java 8. How do you figure? Let’s take a simple example – the List interface in the java.util package. Till Java 8, this interface did not have a stream() method. This means that if stream() had been added as an ordinary abstract method of the List interface in JDK 8, every existing class implementing List (and practically all codebases have some) would no longer compile until it supplied an implementation. If, on the other hand, this method was not part of List, then streams could not be supported on List! To solve this conundrum, JDK 8 made ‘stream’ a default method, which is to say an interface method that ships with its own implementation. This means that legacy implementations keep working since they simply inherit the default body of the stream method (which is, to be pedantic, actually defined in the base Collection interface with the signature: default Stream<E> stream()), whereas new code that makes use of this new feature hums along nicely as well. Hackish, but it works.
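
The same trick can be reproduced in miniature. In the sketch below, Greeter and LegacyGreeter are hypothetical names: the interface gains a new default method after the fact, and the older implementation, which knows nothing about it, carries on working.

```java
public class DefaultMethodDemo {
    // A hypothetical interface that originally declared only name().
    public interface Greeter {
        String name();

        // Added later as a default method: existing implementers inherit this body.
        default String greet() {
            return "Hello, " + name() + "!";
        }
    }

    // A "legacy" implementation written before greet() existed; it needs no changes.
    public static class LegacyGreeter implements Greeter {
        @Override
        public String name() {
            return "world";
        }
    }

    public static void main(String[] args) {
        System.out.println(new LegacyGreeter().greet()); // prints Hello, world!
    }
}
```

Had greet() been declared without the default keyword and body, LegacyGreeter would fail to compile until it implemented the method itself.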

Anyway, to get on with it, Java 9 introduces some very interesting new features as well. The full list can be seen here: JDK 9 features. However, a couple of major features stand out in this list – JShell (and the JShell API), and support for Modules. The latter feature is technically a separate project that will be bundled with the main Java release, and is far more complicated than I would have liked. I reserve further comments on that till such time as Java 9 itself is released. However, I absolutely love JShell and feel it’s arguably the best productivity-boosting feature ever released in Java. Of course, I’m talking about the JShell tool itself (which comes bundled with Java 9. Download EA (Early Access) releases of Java 9 here – JDK 9 Downloads).

Anyone who has ever worked with languages with ecosystems that include at least a REPL (Read-Eval-Print-Loop for the Luddites) will agree that once you get used to that mode of working, it feels severely constraining to get back to the traditional Code-Compile-run cycle. The best environment in this regard is provided by Lisps – Common Lisp in particular ensures a very interactive image-based ecosystem that is in a league of its own. Scheme and Racket also provide interactive development environments (DrRacket, for instance), but they’re still inferior (in my opinion) to Common Lisp environments (SBCL + Emacs + SLIME is what I use myself). Python also provides a decent REPL system, and is arguably the best of the mainstream languages in that regard. Traditionally, dynamic languages have had it easy when it comes to REPLs by their very nature while statically-typed languages have not been able to provide something comparable. Haskell is a very good example of a language that defies this rule. Haskell is a very strongly-statically-typed language and yet it has a very nice REPL. Plus, of course, the wonderful Hindley-Milner type inference system in place ensures that you hardly ever have to type in the types yourself (pun intended). Scala is also a strongly-typed static language that has a decent REPL (on par with Haskell). More traditional static languages like C++ and Java haven’t had a REPL in years, apart from some projects that have attempted to provide one in the form of libraries – that never does work as well though as coming bundled with the language, of course. With the introduction of JShell in Java 9, I believe Java has at least one feature that’s clearly superior to C++. In terms of the other “functional” features such as lambdas and closures, not at all. More on this comparison in later posts. Now, let’s jump into JShell and play around a bit!

When you install the JDK 9 EA bits, the “jshell” executable comes bundled in the “bin/” folder by default. JShell also comes with a new set of JShell APIs that can be used to essentially create our own custom shells, but I’m more interested in the interactive JShell tool for now. To run jshell, simply run it in a shell, and you should see something like the following:

Timmys-MacBook-Pro:Blogs z0ltan$ jshell
|  Welcome to JShell -- Version 9-ea
|  For an introduction type: /help intro

jshell>

Now, let’s see. As far as I have evaluated it, some points to be noted while developing with this REPL are:

  • No semi-colons required (well, almost none).
  • No mandatory class wrappers to execute code – plain Java code works fine.
  • Ability to use the generated variable names ($N, where N is an integer) to refer to objects (a la Scala).
  • Ability to import custom packages (if the JARs are in the classpath).
  • Ability to save code snippets to file.
  • Ability to persist state between sessions (very important for interactive development).
  • Ability to open external files and load them.
  • Ability to list all session-declared variables, methods, classes, and history.
  • And most importantly, code completion using the tab key (eat that, Python!).

Of course, that is not an exhaustive list. There are a lot more features that are available, and you can see all the options using the “/help” command.

Now let’s get down to some hacking! (For those unfamiliar with Lambdas and Streams in Java, I’ll post a series of introductory blogs on those topics in the near future. For now, bear with me as the main point of this blog is to demonstrate JShell’s support for interactive development.)

0). First let us import the classes that we are interested in:

jshell> import java.util.*

jshell> import java.util.stream.*

jshell> import java.util.function.*

1). Let us now create a list of names, convert them to uppercase, and then print them out:

jshell> List<String> names =
            Arrays.asList("Peter", "Timmy", "Gennady", "Petr", "Slava")
names ==> [Peter, Timmy, Gennady, Petr, Slava]

jshell> names.stream()
             .map((name) -> name.toUpperCase())
             .forEach(System.out::println)
PETER
TIMMY
GENNADY
PETR
SLAVA

jshell> names.forEach(System.out::println)
Peter
Timmy
Gennady
Petr
Slava

Note that the original list is not modified at all since streams are functional (in most respects). Stream operations do not change their source: intermediate operations such as map return a new stream, and terminal operations such as forEach or collect produce a result from it. Also note how the “names” variable is printed out in human-readable form.

2). With the same list of names, let’s sort them out in non-decreasing order (lexicographically speaking):

jshell> Collections.sort(names, (f, s) -> f.compareTo(s))

jshell> names
names ==> [Gennady, Peter, Petr, Slava, Timmy]

Note that Collections.sort() is a mutating operation.

3). Now that this list has been sorted, let us do a series of operations: let’s filter out the names that start with G, convert the rest to uppercase, and concatenate them to a single string, and return that value:

jshell> names.stream()
             .filter((s) -> !(s.startsWith("G") || s.startsWith("g")))
             .map(String::toUpperCase)
             .collect(Collectors.joining())
$9 ==> "PETERPETRSLAVATIMMY"

Observe how map(String::toUpperCase) is essentially the same as map((r) -> r.toUpperCase()), and also how the return type, which is a string, is automatically assigned to a generated variable, $9, which can now be used in further processing. For instance, if we wished to find the length of this concatenated string, we could do something like (it’s obviously a contrived example):

jshell> Function<String, Void> func =
                 (s) -> { System.out.println(s.length()); return null; }
func ==> $Lambda$16/1427810650@35d176f7

jshell> func.apply($9)
19
$11 ==> null

A couple of interesting observations here: first off, note how we use semi-colons inside the body of the lambda expression – this is required in JShell when you have multiple statements in the body or when you use an explicit return statement. Moreover, if a class or function is being defined explicitly, a semi-colon is also required. If you want to play safe, you might as well use semi-colons everywhere.

Secondly, we assign the lambda expression to a Function variable. Function<T, R> is a “functional interface”, which basically means that it is an interface with a single abstract (non-default) method (such interfaces are also called SAM types). Any interface that conforms to this convention is a functional interface. However, the problem is that we can’t just invoke the function object. We have to know that this interface’s method name is “apply”, and thus use that explicitly. More on this in the series of planned posts on Lambdas and Streams in Java 8.
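
Here is a small sketch of both points outside of JShell. Shouter is a made-up functional interface; a lambda can be assigned to it just as to Function, and in each case the call has to go through the interface’s single abstract method by name (shout and apply respectively).

```java
import java.util.function.Function;

public class FunctionalInterfaceDemo {
    // A hypothetical interface with exactly one abstract method.
    @FunctionalInterface
    public interface Shouter {
        String shout(String s);
    }

    // Assigns a lambda to the custom interface and invokes it via shout().
    public static String viaCustom(String s) {
        Shouter shouter = str -> str.toUpperCase() + "!";
        return shouter.shout(s);
    }

    // Assigns a method reference to Function and invokes it via apply().
    public static int viaFunction(String s) {
        Function<String, Integer> length = String::length;
        return length.apply(s);
    }

    public static void main(String[] args) {
        System.out.println(viaCustom("petr"));                  // prints PETR!
        System.out.println(viaFunction("PETERPETRSLAVATIMMY")); // prints 19
    }
}
```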

4). That’s about strings. Now let’s see some more examples with other types. Let’s generate an infinite stream of positive integers, take the first hundred, keep only the evens, and then report their sum and maximum:

jshell> IntStream.iterate(1, (n) -> n+1)
                  .limit(100).filter((d) -> d%2 == 0)
                  .summaryStatistics()
$12 ==> IntSummaryStatistics{count=50, sum=2550, min=2, average=51.000000, max=100}

jshell> System.out.format("Max: %d, Sum: %d\n", $12.getMax(), $12.getSum())
Max: 100, Sum: 2550
$13 ==> java.io.PrintStream@612679d6

The “iterate” method takes an initial starting value and a lambda expression that acts as a generator function for the series.

5). Finally, let’s observe some of the aforementioned JShell features in action:

  • Viewing all the imports:
    jshell> /imports
    |    import java.io.*
    |    import java.math.*
    |    import java.net.*
    |    import java.util.concurrent.*
    |    import java.util.prefs.*
    |    import java.util.regex.*
    |    import java.util.*
    |    import java.util.stream.*
    |    import java.util.function.*
    
  • Open and load a source file:
    jshell> /open /Users/z0ltan/HelloWorld.java
    |  Warning:
    |  Modifier 'public'  not permitted in top-level declarations, ignored
    |  public class HelloWorld {
    |  ^----^
    
    jshell> HelloWorld main = new HelloWorld()
    main ==> HelloWorld@757942a1
    
    jshell> main.main(null)
    Hello, world!
    
  • Save the current session:
    jshell> /save -all /Users/z0ltan/session1
    
    jshell>
    

    This session file can then be opened using the “/open” command after launching a fresh jshell session.

  • Finally, to exit the current session:
    jshell> /exit
    |  Goodbye
    

Well, that’s all for a very basic introduction to JShell in Java 9! I didn’t go into much detail about why such interactivity is useful. I can list out a few reasons why (without going deeper, which I defer to future posts):

  • Small snippets of code can be tested without the whole Code->Save->Compile->Run->Debug cycle that one would normally have to do.
  • Top down and Bottom up programming can be done in a REPL. For instance:
    jshell> void printFactorial(int n) {
       ...> System.out.format("factorial(%d) = %d\n", n, factorial(n));
       ...> }
    |  created method printFactorial(int), however, it cannot be invoked until method factorial(int) is declared
    
    jshell> long factorial(int num) {
       ...>   long f = 1L;
       ...>   for (int i = 1; i <= num; i++) 
       ...>      f *= i;
       ...>   return f;
       ...> }
    |  created method factorial(int)
    
    jshell> printFactorial(10)
    factorial(10) = 3628800
    

    In this case, we could define functions (or methods, if you want to be pedantic about it) that reference functions that haven’t been defined as yet. This is an example of top-down programming.

  • There is much less thought impedance when developing in a REPL. Of course, Java as a whole is not as suited to interactive development as, say, Common Lisp is, but it is nevertheless invaluable. In other words, it doesn’t hamper flow as much as the traditional compile cycle, and this definitely helps boost creativity and productivity.

Next up, I will discuss lambda support in Java (8 and above) and how it compares to similar support in other languages – C++, Python, Common Lisp, and Haskell.
