TL;DR: Reducers are all about maximising performance.

The Clojure Reducers library (“reducers”) was announced by Clojure’s author Rich Hickey in May 2012 in this blog post. He followed up a week later with a second post.

These posts were very well written and densely packed with information, and both really benefit from being pored over and re-read several times. Their approach was one of trying to convey an understanding of reducers together with some examples of how to use them.

Both posts covered a lot of ground, introducing the (new) reducers way of thinking about map, filter & reduce for collections, aided by new terminology to explain the new ideas.

Reducers also make heavy use of Clojure’s support for dynamically generating functions (in fact, a function generating a function that in turn generates a third function), so you need to have a clear grasp of how that works as well.

The second post also explained how reduce had been changed (in v1.4) to allow collections to support reduction directly using the CollReduce protocol.

Quite a lot to take in, and the mental jigsaw puzzle can be quite hard to piece together at times.

There have been other posts explaining reducers (especially good is this slideshare from Leonardo Borges) but all have followed the approach of, and examples from, Rich’s original posts i.e. understand and then use.

Sometimes, though, it is easier to consolidate your understanding after having used something successfully. As I’ve been poring over Rich’s and other posts, I’ve been writing some examples for precisely that reason and thought I’d present a “narrative” of some trivial examples to illustrate the use of reducers.

I will be deliberately avoiding as much “theory” and new terminology as I can and just concentrating on some practical, if trivial, examples, labouring points sometimes to ensure clarity.

The Examples Collection

Many of the following examples will use the same collection, which describes the population of a very (very!) small village. The village has four families; some families have two parents, others one; and each family has between one and three children. One family lives in the North, and one each in the South, East and West.

This is stylised, artificial and contrived data designed to be easily understandable to most and in no way intended to suggest any family structures, conventions or arrangements as socially preferable. Just saying.

;; The Families in the Village

(def village
  [
   {:home :north :family "smith" :name "sue" :age 37 :sex :f :role :parent}
   {:home :north :family "smith" :name "stan" :age 35 :sex :m :role :parent}
   {:home :north :family "smith" :name "simon" :age 7 :sex :m :role :child}
   {:home :north :family "smith" :name "sadie" :age 5 :sex :f :role :child}
   
   {:home :south :family "jones" :name "jill" :age 45 :sex :f :role :parent}
   {:home :south :family "jones" :name "jeff" :age 45 :sex :m :role :parent}
   {:home :south :family "jones" :name "jackie" :age 19 :sex :f :role :child}
   {:home :south :family "jones" :name "jason" :age 16 :sex :f :role :child}
   {:home :south :family "jones" :name "june" :age 14 :sex :f :role :child}

   {:home :west :family "brown" :name "billie" :age 55 :sex :f :role :parent}
   {:home :west :family "brown" :name "brian" :age 23 :sex :m :role :child}
   {:home :west :family "brown" :name "bettie" :age 29 :sex :f :role :child}
   
   {:home :east :family "williams" :name "walter" :age 23 :sex :m :role :parent}
   {:home :east :family "williams" :name "wanda" :age 3 :sex :f :role :child}
   ])

Code is on GitHub

The code can be found here.

Note the reducers library is referred to as r:

  [:require [clojure.core.reducers :as r]
            [clojure.string :as string]]

Example 1 - how many children in the village?

An obvious way to do this would be to map each child to the value 1, and everyone else to 0. The total number of children is then simply the sum of the 1s and 0s produced by the map operation.

Example 1 - create the reducers map function to return 1 if a child, else 0

But, with reducers, you don’t use map in the same way as core map because the reducers map function (i.e. r/map below) returns another function to be used together with a collection and reduce.

;; Example 1 - create the reducers map function to return 1 if a child, else 0
(def ex1-map-children-to-value-1 (r/map #(if (= :child (:role %)) 1 0)))

Example 1 - use reduce to add up all the mapped values

The reduce function has to be used to perform (action) the mapping and then add up the mapped 1 & 0 values.

;; Example 1 - use reduce to add up all the mapped values
(r/reduce + 0 (ex1-map-children-to-value-1 village))
;;=>
8

An important point to emphasise here is that the reducers map function only “prepares” the collection to be consumed by the reduce; the map itself doesn’t do anything to the contents of the collection off its own bat and needs reduce to “drive” the mapping.
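
A quick way to see this at the REPL (a small sketch, not from the original posts): calling the mapper on its own returns a reducible “recipe” rather than a realised sequence, and no work happens until reduce runs it.

;; calling the mapper alone produces a reducible wrapper, not a seq
(def ex1-prepared (ex1-map-children-to-value-1 village))
(seq? ex1-prepared)          ;;=> false - nothing has been mapped yet
(r/reduce + 0 ex1-prepared)  ;;=> 8 - the mapping only happens here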

Example 2 - how many children in the Brown family?

Ok, how about finding how many children there are just in the Brown family? Again, an obvious way to do this would be to select (filter) the members of the Brown family, and use the same map function as Example 1 (ex1-map-children-to-value-1) to count the children.

Example 2 - select the members of the Brown family

Just like map, reducers has a filter function that returns another function that can be used with reduce:

;; Example 2 - select the members of the Brown family
(def ex2-select-the-brown-family (r/filter #(= "brown" (string/lower-case (:family %)))))

Example 2 - compose a composite function to select the Brown family and map children to 1

The functions returned by reducers map and filter can be composed in the usual Clojure way to create a multi-step transformation. So we can create a transformation pipeline function that performs first ex2-select-the-brown-family and then ex1-map-children-to-value-1:

;; compose a composite function to select the Brown family and map children to 1
(def ex2-pipeline (comp ex1-map-children-to-value-1 ex2-select-the-brown-family))

Example 2 - using reduce to add up all the Brown children

The pipeline function (ex2-pipeline) can then be used with reduce to find (add up) the number of children in the Brown family:

;; Example 2 - using reduce to add up all the  Brown children
(r/reduce + 0 (ex2-pipeline village))
;;=>
2

It’s worth noting that although this is a multi-step transformation pipeline (i.e. filter then map), the reduce does not need to create any intermediate collections, as would be the case when using core map and filter. This is a useful performance boost.
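
For comparison (a sketch, not from the original posts), the core equivalent of the same calculation threads the data through intermediate lazy sequences before reduce consumes them:

;; core filter and map each build an intermediate lazy sequence
(reduce + 0
        (map #(if (= :child (:role %)) 1 0)
             (filter #(= "brown" (string/lower-case (:family %))) village)))
;;=> 2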

Example 3 - how many children’s names start with J?

We already know the answer: just the 3 children in the Jones family.

Algorithmically, this is a three-step pipeline: a filter on children, a filter on names beginning with “j”, and then a count of how many children are in the result.

Example 3 - selecting (filtering) just the children

A very simple filter will select just the children:

;; Example 3 - selecting (filtering) just the children
(def ex3-select-children (r/filter #(= :child (:role %))))

Example 3 - selecting names beginning with “j”

Again, this is a very simple filter:

;; Example 3 - selecting names beginning with "j"
(def ex3-select-names-beginning-with-j (r/filter #(= \j (first (string/lower-case (:name %))))))

Example 3 - mapping the entries in a collection to 1

In Example 1 we created the map function ex1-map-children-to-value-1 to enable reduce to count the number of children.

But the need to count the number of entries in a collection using reduce, after a pipeline possibly involving many filters and mappers, is a common one. This is straightforward to do: in the final stage of the pipeline, use a map function to transform each entry to the value 1:

;; Example 3 - mapping the  entries in a collection to 1
(def ex0-map-to-value-1 (r/map (fn [v] 1)))

Example 3 - create the three step pipeline function

As in Example 2, the three-step pipeline can be composed simply:

;; Example 3 - create the three step pipeline function
(def ex3-pipeline (comp ex0-map-to-value-1
                        ex3-select-names-beginning-with-j
                        ex3-select-children))

It’s worth labouring the point that composing custom pipelines from simple individual filters and mappers is very powerful (a small variation is sketched after the next example).

Example 3 - reduce the village with the pipeline function

The final reduction should now be familiar and the result will be the number of children whose names begin with “j”:

;; Example 3 - reduce the village with the pipeline function
(r/reduce + 0 (ex3-pipeline village))
;; =>
3
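
To labour the composability point, here is a hypothetical variation (not one of the original examples): swapping in a different name filter re-uses the other two stages unchanged to count the children whose names begin with “s”:

;; a new filter for names beginning with "s"
(def ex3b-select-names-beginning-with-s (r/filter #(= \s (first (string/lower-case (:name %))))))

;; re-use ex0-map-to-value-1 and ex3-select-children as before
(r/reduce + 0 ((comp ex0-map-to-value-1
                     ex3b-select-names-beginning-with-s
                     ex3-select-children) village))
;;=> 2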

Example 4 - making a collection of the children whose names start with J

Sometimes, though, you will want the actual collection itself.

Example 4 - a pipeline to just filter children with names starting with “j”

Since we want the actual entries, rather than to count them, we need a pure filter pipeline, similar to Example 3 but one that doesn’t use the ex0-map-to-value-1 mapper:

;; Example 4 - a pipeline to just filter children with names starting with "j"
(def ex4-pipeline (comp ex3-select-names-beginning-with-j
                        ex3-select-children))

Example 4 - create a vector with the “j” children

Creating a vector of the children whose names start with J can be done simply using into.

Under the covers into uses reduce so making the vector is just a matter of applying the ex4-pipeline function to village and then using into to build the vector.

;; Example 4 - create a vector with the "j" children
(into [] (ex4-pipeline village))
;; =>
[{:age 19, :home :south, :name "jackie", :sex :f, :family "jones", :role :child}
 {:age 16, :home :south, :name "jason", :sex :f, :family "jones", :role :child}
 {:age 14, :home :south, :name "june", :sex :f, :family "jones", :role :child}]

Example 5 - average age of children on or below the equator

A more involved, but still straightforward, example to finish this section: what is the average age of the children who live on or below the equator? By equator I mean where home is East, South or West.

To do this, the value of home will be mapped to a latitude and longitude. For example West will be :lat 0 :lng -180 and South is :lat -90 :lng 0.

Example 5 - select the children

We can re-use the ex3-select-children filter again.

Example 5 - map home to latitude and longitude

A simple mapper to add latitude and longitude depending on the value of :home:

;; Example 5 - map :home to latitude and longitude
(def ex5-map-home-to-latitude-and-longitude
  (r/map
   (fn [v]
     (condp = (:home v)
       :north (assoc v :lat 90 :lng 0)
       :south (assoc v :lat -90 :lng 0)
       :west (assoc v :lat 0 :lng -180)
       :east (assoc v :lat 0 :lng 180)))))

Example 5 - select people on or below the equator i.e. latitude <= 0

People on or below the equator have a latitude with maximum value of 0:

;; Example 5 - select people on or below the equator i.e. latitude <= 0
(def ex5-select-people-on-or-below-equator (r/filter #(>= 0 (:lat %))))

Example 5 - count the number of children on or below the equator

To find the average age, we need to know the number of children.

Note, rather than creating a composite pipeline function, in this example the individual stages of the pipeline have been used explicitly:

;; Example 5 - count the number of children on or below the equator
(def ex5-no-children-on-or-below-the-equator
  (r/reduce + 0
          (ex0-map-to-value-1
           (ex5-select-people-on-or-below-equator
            (ex5-map-home-to-latitude-and-longitude
             (ex3-select-children village))))))

Example 5 - sum the ages of children

To sum the ages requires the ex0-map-to-value-1 mapper to be replaced by a function that selects the age value (ex5-select-age):

;; Example 5 - sum the ages of children
(def ex5-select-age (r/map #(:age %)))

(def ex5-sum-of-ages-of-children-on-or-below-the-equator
  (r/reduce + 0
          (ex5-select-age
           (ex5-select-people-on-or-below-equator
            (ex5-map-home-to-latitude-and-longitude
             (ex3-select-children village))))))

Example 5 - calculate the average age of children on or below the equator

For completeness, the final step is the division to calculate the average age:

;; Example 5 - calculate the average age of children on or below the equator
(def ex5-average-age-of-children-on-or-below-the-equator
  (float (/ ex5-sum-of-ages-of-children-on-or-below-the-equator
            ex5-no-children-on-or-below-the-equator)))
;; =>
17.3

That was fun but why bother?

The elimination of intermediate collections in reducers multi-step pipelines is a very useful performance boost, and the reducers way of thinking about and processing collections is an interesting mindset change, but are they worth the bother?

The answer lies in a peer function of reduce called fold. When fold can be used, it will automatically parallelise the processing of the collection, taking advantage of multiple CPU cores.

Where reduce has been used above, fold can be used instead. Note though that fold takes slightly different arguments; at its simplest it does not have (or need) the initial value.
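
For instance (a small sketch re-using Example 3’s pipeline, not from the original posts), the earlier count of “j” children could be written with fold instead of reduce:

;; fold with just a reducing function - no initial value is supplied
(r/fold + (ex3-pipeline village))
;;=> 3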

If needed, fold will drop down to reduce if it can’t parallelise.

In fact, fold breaks the collection down into chunks (currently defaulting to 512) and processes (reduces) each chunk in parallel. When a chunk has been processed, its results have to be combined with the results of the other processed chunks.
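
To make the chunking idea concrete, here is a rough conceptual sketch of my own (a simplification, not the real implementation, which uses the fork/join framework to process the pieces in parallel):

;; conceptual sketch only: split the vector in half until the pieces are <= n,
;; reduce each piece with reducef (seeded by (combinef)), then combine the
;; partial results pairwise with combinef
(defn naive-fold [n combinef reducef v]
  (if (<= (count v) n)
    (reduce reducef (combinef) v)
    (let [half (quot (count v) 2)]
      (combinef (naive-fold n combinef reducef (subvec v 0 half))
                (naive-fold n combinef reducef (subvec v half))))))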

Bit of unavoidable theory: for the chunked, parallel computation to give the same answer as a sequential reduce, the function fold uses to combine the chunk results (the + in the examples above, which also acts as the reducing function) must be associative. Associativity means simply that how results are “grouped” together is unimportant, e.g. for arithmetic addition (+) then 1 + (2 + 3) === (1 + 2) + 3 === (1 + 2 + 3). Not all operators are associative though.
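
A quick REPL check of what associativity means in practice (addition qualifies, subtraction does not):

;; + is associative: the grouping of partial results doesn't matter
(= (+ 1 (+ 2 3)) (+ (+ 1 2) 3) (+ 1 2 3))  ;;=> true

;; - is not associative, so it would not be a safe combiner for fold
(= (- 1 (- 2 3)) (- (- 1 2) 3))            ;;=> false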

Example 6 - comparing the performance of reduce and fold

Let’s time the addition of the ages from Example 5 using reduce and fold.

Example 6 - time reduce adding up Example 5’s ages

Let’s time reduce first:

;; Example 6 - time reduce adding up Example 5's ages
(time (r/reduce +
                (ex5-select-age
                 (ex5-select-people-on-or-below-equator
                  (ex5-map-home-to-latitude-and-longitude
                   (ex3-select-children village))))))
;; =>
"Elapsed time: 0.091714 msecs"

Example 6 - time fold adding up Example 5’s ages

Now fold:

;; Example 6 - time fold adding up Example 5's ages
(time (r/fold +
              (ex5-select-age
               (ex5-select-people-on-or-below-equator
                (ex5-map-home-to-latitude-and-longitude
                 (ex3-select-children village))))))
;; =>
"Elapsed time: 0.185448 msecs"

So fold took twice as long as reduce in this example because (I guess) the fold “red tape” (e.g. chunking and combining) outweighed the benefit gained from performing the computation in parallel. No free lunch.

Example 7 - all the relatives visit the village!

Every year all the relatives of the families in the village visit and the population of the village swells enormously. Stick with me on this :-)

Example 7 - make some visitors

Let’s define some functions to create an influx of visitors. Note, no attempt has been made to ensure this randomly generated data makes any sort of real-world sense - it could include e.g. a child of age 100.

;; Example 7 - make some visitors

(def ex7-fn-random-name (fn [] (rand-nth ["chris" "jim" "mark" "jon" "lisa" "kate" "jay" "june" "julie" "laura"])))
(def ex7-fn-random-family (fn [] (rand-nth ["smith" "jones" "brown" "williams" "taylor" "davies"])))
(def ex7-fn-random-home (fn [] (rand-nth [:north :south :east :west])))
(def ex7-fn-random-sex (fn [] (rand-nth [:m :f])))
(def ex7-fn-random-role (fn [] (rand-nth [:child :parent])))
(def ex7-fn-random-age (fn [] (rand-int 100)))

(def ex7-visitor-template
  {:home ex7-fn-random-home
   :family ex7-fn-random-family
   :name ex7-fn-random-name
   :age ex7-fn-random-age
   :sex ex7-fn-random-sex
   :role ex7-fn-random-role})

(defn ex7-make-visitor [] (into {} (for [[k v] ex7-visitor-template] [k (v)])))

(defn ex7-make-visitors [n] (take n (repeatedly ex7-make-visitor)))

(def ex7-visitors (into [] (ex7-make-visitors 1000000)))

Example 7 - count the visiting Brown children using reduce

Let’s reprise Example 2 and time how long it takes to count the visiting Brown children using the reducers reduce:

;; Example 7 - count the visiting Brown children using reduce
(time (r/reduce + 0 (ex2-pipeline ex7-visitors)))
;; =>
"Elapsed time: 238.448041 msecs"

Example 7 - count the visiting Brown children using fold

Ok, let’s try the same using fold:

;; Example 7 - count the visiting Brown children using fold
(time (r/fold + (ex2-pipeline ex7-visitors)))
;; =>
"Elapsed time: 59.6259 msecs"

Nearly four times faster! (My workstation has four cores.) The exact timing values should be taken with a pinch of salt though as this wasn’t a rigorous benchmark. But the improvement is undeniable.

Example 7 - count the visiting Brown children using core map, filter and reduce

How do the reducers fare against core map, filter and reduce?

;; Example 7 - count the visiting Brown children using core map, filter and reduce
(time (reduce + 0
              (map #(if (= :child (:role %)) 1 0)
                   (filter #(= "brown" (string/lower-case (:family %))) ex7-visitors))))

;; =>
"Elapsed time: 223.55717 msecs"

The core reduce and reducers reduce timings are pretty much the same, as you might expect, since only one intermediate collection would be created by the core versions and very little else has changed.

What else can you do with fold?

Bit more theory.

I mentioned above that fold breaks the collection into chunks (the default chunk size is 512) and processes each chunk in parallel.

But this raises a question: how does fold know how to combine the processed chunks to produce the final results?

The answer is that fold can also take the chunk size and a combiner function as arguments, as well as the usual reducing function and collection.

To demonstrate, here is the fold from Example 7:

;; A simple fold where both the reducer and the combiner are +
(r/fold + (ex2-pipeline ex7-visitors))

In the above the + operator (function) was both the reducer and combiner. No chunk size was given.

A completely specified call to fold could look like this:

(r/fold the-chunk-size (r/monoid the-combiner-function the-init-function) the-reducer-function the-collection)

The reducer function could be + but is more likely to be some custom logic.

The call to the monoid function is a convenient way of creating a properly formed combining function that fold can use. monoid takes two arguments, the first being your combiner function, and the second being a function that will create the init value when it is called with no arguments. (In the previous examples the init value to reduce was 0.)
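
For instance (a small sketch of my own, not from the original posts), a combiner equivalent to the implicit one used earlier can be built with monoid and passed to a fully specified fold:

;; monoid wraps + so that calling the result with no arguments yields the
;; init value 0, and calling it with two arguments simply adds them
(def ex-plus-with-init (r/monoid + (constantly 0)))

(ex-plus-with-init)      ;;=> 0  (the init value for each chunk)
(ex-plus-with-init 1 2)  ;;=> 3

(r/fold 512 ex-plus-with-init + (ex3-pipeline village))
;;=> 3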

Or, if your combiner function can itself generate the init value when called with no arguments, it can be used directly:

(r/fold the-chunk-size the-combiner-function-with-init the-reducer-function the-collection)

An example may help to clarify.

Example 8 - find the average age of visiting children with names beginning “j”

In this example, we’ll calculate the average age of all visiting children with names beginning with “j”. But this time in just one fold, unlike the two-step calculation of the average age in Example 5.

Example 8 - create a collection of visiting children with names beginning with “j”

To select visiting children with names starting with “j” we can use exactly the same pipeline from Example 4, apply it to the visitors and store the collection into a vector:

;; Example 8 - create a collection of visiting children with names beginning with "j"
(def ex8-visitors-collection (into [] (ex4-pipeline (ex7-make-visitors 1000000))))

I don’t understand exactly why, but fold seems to need a “real” collection (hence the into [] above) else it doesn’t seem to call the combiner - presumably because fold only parallelises vectors and maps, and for anything else (such as a lazy sequence) it quietly falls back to a plain reduce, so the two-argument combiner is never needed. Answers on a postcard please.

Example 8 - create the reducer to sum the ages and number of people in that chunk

The reducer function adds up how many people have been seen in that chunk and also sums their ages:

;; Example 8 - create the reducer to sum the ages and number of people in that chunk
(def ex8-reducer
  (fn [s v]
    {:total-people (+ (get s :total-people 0) 1)          ;; add 1 to number of people seen
     :total-age (+ (get s :total-age 0) (get v :age))}))  ;; and add their ages

Example 8 - create the combiner to calculate the average age

The combiner takes the results of two processed chunks and calculates the average age.

;; Example 8 - create the combiner to calculate the average age
(defn ex8-combiner
  ([] {}) ;; note this is the init value for each reduced chunk
  ([a b]
     ;; print each combination step to show when (and how often) the combiner runs
     (println (format "EX8 combiner a >%s< b >%s<" a b))
     (let [total-people (+ (get a :total-people) (get b :total-people))
           total-age (+ (get a :total-age) (get b :total-age))
           average-age (float (/ total-age total-people))]
       {:total-people total-people
        :total-age total-age
        :average-age average-age})))

Note, when called with no arguments, it returns the init value for the reduction of each chunk (an empty hash-map here).

Example 8 - run the fold to perform the average calculation

The average age can now be calculated by running the fold:

;; Example 8 - run the fold to perform the average calculation
(r/fold ex8-combiner ex8-reducer ex8-visitors-collection)
;; =>
{:total-people 28, :total-age 1314, :average-age 46.92857}

Note, calculating the average age in the combiner rather than the reducer was an artificial distinction for illustration; the processing could have been performed in one function. This is one reason why, when doing simple arithmetic, a separate combiner often isn’t necessary. However, one can conceive of some “final” logic that could and should only be performed in a combiner.

Final Words

Some very trivial examples that only scratch the surface of the potential of reducers.

Once you get reducers you realise they are simple if perhaps not easy at first. But once you get the hang of them, they feel as comfortable and accessible to use as their core equivalents.

The performance benefits from being able to easily parallelise the processing of collections are undeniable; for very little reworking of the code, performance can be improved significantly.

I particularly like how sophisticated, multi-step transformation pipelines can be composed from individual mappers and filters and used with very little overhead (no need to create intermediate collections).