A BST implementation in Clojure

My new workplace is heavily invested in Clojure, and while other languages are used as the need arises (Golang, Elixir, Haskell, PureScript, TypeScript just to name a few), it is primarily a Clojure shop.

This was rather interesting for me since I have dabbled with Common Lisp, and on the surface, a lot of the basic concepts carried over immediately (especially macros, modulo minor syntactic differences and the hygienic aspects of Clojure macros). That being said, a lot of the core concepts/philosophies/approaches are rather different, and that’s where dabbling with Haskell was probably more useful in getting me up to speed on those concepts.

I’m enjoying learning it so far, but I believe that real learning (especially learning how to write idiomatic Clojure) will happen only when I start doing real projects.

Nevertheless, it has been fun playing around on the REPL (even though CIDER is no SLIME), and I thought of posting a small BST example that I did for fun – the traversals are especially inspired by the samples in “Joy of Clojure” (a really good book in my opinion, barring the chapter on macros, which I thought was atrocious).

Without any further ceremony, here is the code:

(ns bst)

;;;
;;; define the node type
;;;

(defrecord BSTNode [data left right])

;;;
;;; define the operations on the bst
;;;

(defn bst-add
  "adds a new node to the tree - creating the root node if the root is nil"
  [root data]
  (cond
    (nil? root) (BSTNode. data nil nil)
    (<= data (:data root)) (BSTNode. (:data root) (bst-add (:left root) data) (:right root))
    :else (BSTNode. (:data root) (:left root) (bst-add (:right root) data))))

(defn bst-size
  "returns the number of nodes in the BST"
  [root]
  (cond
    (nil? root) 0
    :else (+ 1 (bst-size (:left root)) (bst-size (:right root)))))

(defn bst-height
  "returns the height of the BST"
  [root]
  (cond
    (nil? root) 0
    :else (inc (max (bst-height (:left root)) (bst-height (:right root))))))

(defn bst-preorder
  "returns the values of the BST traversing in pre-order fashion"
  [root]
  (when root
    (concat [(:data root)] (bst-preorder (:left root)) (bst-preorder (:right root)))))

(defn bst-inorder
  "returns the values of the BST traversing in in-order fashion"
  [root]
  (when root
    (concat (bst-inorder (:left root)) [(:data root)] (bst-inorder (:right root)))))

(defn bst-postorder
  "returns the values of the BST traversing in post-order fashion"
  [root]
  (when root
    (concat (bst-postorder (:left root)) (bst-postorder (:right root)) [(:data root)])))

;;;
;;; sample randomized BST
;;;
(def my-bst (reduce bst-add nil (for [_ (range (+ (rand-int 20) 10))] (rand-int 100))))

(printf "size = %d, height = %d\n" (bst-size my-bst) (bst-height my-bst))

(println "pre-order: " (bst-preorder my-bst))

(println "in-order: " (bst-inorder my-bst))

(println "post-order: " (bst-postorder my-bst))

And here is a small run-through on the REPL (I thought using Leiningen would be overkill for such a small “project”):

user> (load-file "bst.clj")
size = 15, height = 6
pre-order:  (42 9 3 18 13 31 28 30 33 38 75 56 56 92 92)
in-order:  (3 9 13 18 28 30 31 33 38 42 56 56 75 92 92)
post-order:  (3 13 30 28 38 33 31 18 9 56 56 92 92 75 42)
nil
user> (in-ns 'bst)
#namespace[bst]
bst> my-bst
#bst.BSTNode{:data 42, :left #bst.BSTNode{:data 9, :left #bst.BSTNode{:data 3, :left nil, :right nil}, :right #bst.BSTNode{:data 18, :left #bst.BSTNode{:data 13, :left nil, :right nil}, :right #bst.BSTNode{:data 31, :left #bst.BSTNode{:data 28, :left nil, :right #bst.BSTNode{:data 30, :left nil, :right nil}}, :right #bst.BSTNode{:data 33, :left nil, :right #bst.BSTNode{:data 38, :left nil, :right nil}}}}}, :right #bst.BSTNode{:data 75, :left #bst.BSTNode{:data 56, :left #bst.BSTNode{:data 56, :left nil, :right nil}, :right nil}, :right #bst.BSTNode{:data 92, :left #bst.BSTNode{:data 92, :left nil, :right nil}, :right nil}}}
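Keeping with the cross-language comparisons elsewhere on this blog, here is a rough Rust sketch of the same insertion and traversal logic (the names `Node`, `add`, `size`, and `inorder` are my own, and this is an ownership-based rebuild rather than the persistent structure Clojure gives for free):

```rust
// A BST mirroring the recursive shape of bst-add: duplicates go left,
// and None plays the role of nil as the empty tree.
#[derive(Debug)]
struct Node {
    data: i32,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn add(root: Option<Box<Node>>, data: i32) -> Option<Box<Node>> {
    match root {
        None => Some(Box::new(Node { data, left: None, right: None })),
        Some(mut n) => {
            if data <= n.data {
                n.left = add(n.left.take(), data);
            } else {
                n.right = add(n.right.take(), data);
            }
            Some(n)
        }
    }
}

fn size(root: &Option<Box<Node>>) -> usize {
    match root {
        None => 0,
        Some(n) => 1 + size(&n.left) + size(&n.right),
    }
}

// In-order traversal, collecting values into a vector.
fn inorder(root: &Option<Box<Node>>, out: &mut Vec<i32>) {
    if let Some(n) = root {
        inorder(&n.left, out);
        out.push(n.data);
        inorder(&n.right, out);
    }
}

fn main() {
    // Build the tree with a fold, just like (reduce bst-add nil ...) above.
    let tree = [42, 9, 75, 18, 3].iter().fold(None, |t, &x| add(t, x));
    let mut v = Vec::new();
    inorder(&tree, &mut v);
    println!("size = {}, in-order = {:?}", size(&tree), v); // size = 5, in-order = [3, 9, 18, 42, 75]
}
```

The `reduce`/`fold` correspondence carries over directly, which is pleasant given how different the two languages otherwise feel.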

Enjoyable, but there’s still the most interesting bits of Clojure to come in this learning path – concurrency and designing programs the idiomatic way!

Simple expression evaluator comparison between Haskell, Rust, and Common Lisp

Consider a simple expression language that consists only of the four basic mathematical operations – addition, subtraction, multiplication, and division. The idea of this exercise is to implement an evaluator for such a language in Haskell, and then compare it with literal translations (as far as possible) into Rust and Common Lisp. This should be interesting.

First off, the Haskell version:

module Main where

data Expr = Val Int
          | App Op Expr Expr
          deriving (Eq, Show, Ord)

data Op = Add | Sub | Mul | Div deriving (Eq, Show, Ord)


eval :: Expr -> Maybe Int
eval (Val n) = Just n
eval (App o e1 e2) = do m <- eval e1
                        n <- eval e2
                        case o of
                          Add -> Just (m + n)
                          Sub -> Just (m - n)
                          Mul -> Just (m * n)
                          Div -> safediv m n

safediv :: Int -> Int -> Maybe Int
safediv m 0 = Nothing
safediv m n = Just $ m `div` n


e1 :: Expr 
e1 = App Add (Val 2) (App Mul (Val 3) (Val 6))


e2 :: Expr
e2 = App Mul (App Add (Val 1) (Val 3)) (App Div (Val 10) (Val 0))


main :: IO ()
main = do putStrLn $ show (eval e1)
          putStrLn $ show (eval e2)        

As simple as it gets! The `safediv` idea is directly implemented from Professor Graham Hutton’s book, “Programming in Haskell” (2nd Edition) from the chapter on “Functors, Applicatives, and Monads”. We simply use the `Maybe` monad to indicate potential for failure.

Let’s run it:

$ ghc -O2 --make *.hs -o main -threaded -rtsopts
[1 of 1] Compiling Main             ( main.hs, main.o )
Linking main ...
$ ./main
Just 20
Nothing

Excellent. For the expression `2 + 3 * 6`, we get `Just 20`, and for `(1 + 3) * (10 / 0)`, we get `Nothing`. As expected.

Now let’s see the Rust version:

#[derive(Debug, PartialEq, PartialOrd)]
pub enum Op {
    Add,
    Sub,
    Mul,
    Div
}

use self::Op::*;


#[derive(Debug, PartialEq, PartialOrd)]
pub enum Expr {
    Val(isize),
    App(Op, Box<Expr>, Box<Expr>)
}

use self::Expr::*;

impl Expr {
    fn eval(&self) -> Option<isize> {
        match *self {
            Val(ref n) => Some(*n),
            App(ref o, ref e1, ref e2) => {
                if let Some(m) = e1.eval() {
                    if let Some(n) = e2.eval() {
                        match *o {
                            Add => return Some(m + n),
                            Sub => return Some(m - n),
                            Mul => return Some(m * n),
                            Div => if n == 0 { 
                                        return None; 
                                    } else { 
                                        return Some (m / n) 
                                   }
                        }
                    } 
                }   
                None
            }
        }
    }
}

fn main() {
    let e1 = Box::new(App(Add, Box::new(Val(2)), 
                      Box::new(App(Mul, Box::new(Val(3)), Box::new(Val(6))))));
    
    let e2 = Box::new(App(Div, Box::new(Val(10)), Box::new(Val(0))));
    
    println!("{:?}, {:?}", e1.eval(), e2.eval());
}

As can be seen, it is an almost identical translation from Haskell into Rust. Algebraic Data Type (ADT) support makes Rust a very powerful language, and pattern matching in Rust is as powerful as in Haskell (if not more so). The only real difference is the syntax, and I must say that the Haskell version has a lot less noise than the Rust version, even though they are conceptually identical.

Running it:

   Compiling playground v0.0.1 (file:///playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.02s
     Running `target/debug/playground`

Some(20), None

Excellent!

And finally, for the fun part – Common Lisp. Of course, this is not idiomatic Common Lisp by any stretch of the imagination; the idea is to try and preserve the essence of the Haskell version whilst still being runnable Lisp. Common Lisp is, of course, a dynamically-typed language, and it doesn’t have a real equivalent of ADTs. However, the bigger discomfort, in my opinion, is that Common Lisp’s built-in pattern matching is almost non-existent (unlike Racket’s).

Nevertheless, this is a simple enough example that it does not cause any real problems. The interesting part is how the representation of the expression changes here – due to the lack of any sort of “schema” for the structure of the data, I am forced to include a dummy `nil` argument for the `Val` constructor. Of course, I could have used varargs to handle that, but that would have led to more verbose code for little ROI. In any case, here is the code:

(defun evaluate (e)
    (let ((op (car e)))
        (cond ((eql op 'Val) (cadr e))
              (t (let ((e1 (evaluate (cadr e)))
                       (e2 (evaluate (caddr e))))
                (cond ((eql op 'Add) (+ e1 e2))
                    ((eql op 'Sub) (- e1 e2))
                    ((eql op 'Mul) (* e1 e2))
                    ((eql op 'Div) (if (= e2 0)
                                    nil
                                    (/ e1 e2)))
                    (t (error "invalid operation"))))))))
(defun main ()
    (format t "~s~%" (evaluate '(Add (Val 2 nil) (Mul (Val 3 nil) (Val 6 nil)))))
    (format t "~s~%" (evaluate '(Div (Val 10 nil) (Val 0 nil)))))
    

(main)    

And running it:

CL-USER> (main)
20
NIL
NIL

Playtime with Lisp!

Since it has been a while that I have updated my blog, I thought of resurrecting it (in a manner of speaking) by playing around with some Common Lisp.

My first exposure to Functional Programming was, strangely enough, not through Haskell or even Scheme, but through Common Lisp. I know that many purists (on both sides) will claim that Common Lisp is a rather bad example of a Functional language, but I argue that Common Lisp is indeed one of the most performant Functional languages out there. Mutability alone does not dictate whether a language is Functional or not.

Working through the great book, “Programming in Haskell” (2nd Edition) by Professor Graham Hutton, I came across the following functions which were, as expected, rather niftily expressed in Haskell:

subs – the usual recursive way of generating all subsequences of a list.

subs :: [a] -> [[a]]
subs [] = [[]]
subs (x:xs) = yss ++ map (x:) yss
             where
                 yss = subs xs

*Misc> subs [1,2,3]
[[],[3],[2],[2,3],[1],[1,3],[1,2],[1,2,3]]

interleave – insert an element into every possible position of the target list.


interleave :: a -> [a] -> [[a]]
interleave x [] = [[x]]
interleave x (y:ys) = (x:y:ys) : map (y:) (interleave x ys)

*Misc> interleave 'a' "bcd"
["abcd","bacd","bcad","bcda"]

and finally,

perms – generate all the permutations of a given list/sequence.


perms :: [a] -> [[a]]
perms [] = [[]]
perms (x:xs) = concat $ map (interleave x) (perms xs)

*Misc> perms [1,2,3]
[[1,2,3],[2,1,3],[2,3,1],[1,3,2],[3,1,2],[3,2,1]]

Of course, this version uses plain recursion. The usual way I would generate permutations for practical purposes would be with something like Heap’s Algorithm as shown below (Rust implementation):


use std::fmt::Debug;

fn heap<T: Debug>(a: &mut Vec<T>, n: usize) {
    if n == 0 {
        println!("{:?}", a);
    } else {
        for i in 0..n - 1 {
            heap(a, n - 1);

            if n % 2 == 0 {
                a.swap(i, n - 1);
            } else {
                a.swap(0, n - 1);
            }
         }

        heap(a, n - 1);
    }
}

fn main() {
    let mut v = vec![1, 2, 3];
    let vlen = v.len();

    heap(&mut v, vlen);

    let mut vv = vec!['a', 'b', 'c', 'd'];
    let vvlen = vv.len();

    heap(&mut vv, vvlen);
}

bash-3.2$ rustc heap.rs && ./heap
[1, 2, 3]
[2, 1, 3]
[3, 1, 2]
[1, 3, 2]
[2, 3, 1]
[3, 2, 1]
['a', 'b', 'c', 'd']
['b', 'a', 'c', 'd']
['c', 'a', 'b', 'd']
['a', 'c', 'b', 'd']
['b', 'c', 'a', 'd']
['c', 'b', 'a', 'd']
['d', 'b', 'a', 'c']
['b', 'd', 'a', 'c']
['a', 'd', 'b', 'c']
['d', 'a', 'b', 'c']
['b', 'a', 'd', 'c']
['a', 'b', 'd', 'c']
['a', 'c', 'd', 'b']
['c', 'a', 'd', 'b']
['d', 'a', 'c', 'b']
['a', 'd', 'c', 'b']
['c', 'd', 'a', 'b']
['d', 'c', 'a', 'b']
['d', 'c', 'b', 'a']
['c', 'd', 'b', 'a']
['b', 'd', 'c', 'a']
['d', 'b', 'c', 'a']
['c', 'b', 'd', 'a']
['b', 'c', 'd', 'a']


However, the purpose of this exercise is to simply translate the Haskell examples into Common Lisp (as idiomatic as possible). So here are the equivalent Common Lisp versions:

subs


(defun subs (lst)
    "generate the subsequences of the given list"
    (cond ((null lst) '(()))
           (t (let ((first (car lst))
                    (subs-lst (subs (cdr lst))))
                (append subs-lst 
                        (mapcar #'(lambda (rest) (cons first rest)) 
                                subs-lst))))))

CL-USER> (subs '(a b c))

(NIL (C) (B) (B C) (A) (A C) (A B) (A B C))

interleave


(defun interleave (elem lst)
    "generate a list of lists by inserting elem into every possible slot"
    (cond ((null lst) `((,elem)))
           (t (let ((fst (car lst))
                    (rst (cdr lst)))
                (cons (cons elem lst)
                      (mapcar #'(lambda (lst1) (cons fst lst1)) 
                              (interleave elem rst)))))))

CL-USER> (interleave #\c (coerce "ab" 'list))
((#\c #\a #\b) (#\a #\c #\b) (#\a #\b #\c))

The interesting thing is that there appears to be no direct equivalent in Common Lisp of the Haskell concat function. So let’s write our own!

concat – flatten a list of lists.


(defun concat (lsts)
    "flatten the given list of lists"
    (cond ((null lsts) '())
          ((null (cdr lsts)) (car lsts))
           (t (append (car lsts) (concat (cdr lsts))))))

Now we can finally implement perms:


(defun perms (lst)
    "generate the permutations of the given list"
    (cond ((null lst) '(()))
          (t (let ((fst (car lst))
                   (rst (cdr lst)))
              (concat (mapcar #'(lambda (lst1) 
                              (interleave fst lst1)) (perms rst)))))))

CL-USER> (perms '(1 2 3))
((1 2 3) (2 1 3) (2 3 1) (1 3 2) (3 1 2) (3 2 1))

Nice!
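Since interleave and perms have already been translated once, a Rust rendition (a sketch of my own under the same recursive scheme; the function names are mine) shows that the recursion carries over there as well, preserving even the order of the generated permutations:

```rust
// Insert `x` into every possible position of `xs` -- mirrors the Haskell interleave.
fn interleave(x: i32, xs: &[i32]) -> Vec<Vec<i32>> {
    if xs.is_empty() {
        return vec![vec![x]];
    }
    let mut out = Vec::new();
    // x at the front: (x:y:ys)
    let mut front = vec![x];
    front.extend_from_slice(xs);
    out.push(front);
    // x somewhere in the tail, with the head re-attached: map (y:) (interleave x ys)
    for mut rest in interleave(x, &xs[1..]) {
        rest.insert(0, xs[0]);
        out.push(rest);
    }
    out
}

// perms via interleave, mirroring the Haskell version.
fn perms(xs: &[i32]) -> Vec<Vec<i32>> {
    if xs.is_empty() {
        return vec![vec![]];
    }
    perms(&xs[1..])
        .into_iter()
        .flat_map(|p| interleave(xs[0], &p))
        .collect()
}

fn main() {
    println!("{:?}", perms(&[1, 2, 3])); // [[1, 2, 3], [2, 1, 3], [2, 3, 1], [1, 3, 2], [3, 1, 2], [3, 2, 1]]
}
```

Note that `flat_map` plays exactly the role of `concat $ map ...` in the Haskell original.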


An inefficient Haskell implementation of the Integer Partitioning problem

This is my first shot at implementing a solution to the integer partitioning problem in Haskell. It is a very inefficient algorithm (it would probably either hang forever or crash beyond, say, n = 20), but it is a first step towards an efficient implementation of a problem that appears to be egregiously difficult to express in idiomatic Haskell compared to, say, Java.

This implementation also generates the partitions in lexicographic order.

Here is the code:

module Partitions where

isort :: Ord a => (a -> a -> Bool) -> [a] -> [a]
isort _ [] = []
isort f (x:xs) = insert f x (isort f xs)
                   where
                     insert _ x [] = [x]
                     insert f x (y:ys) | f x y = x : (y:ys)
                                       | otherwise = y : insert f x ys
lexical :: Ord a => [a] -> [a] -> Bool
lexical [] [] = True
lexical [] ys = True
lexical xs [] = True
lexical (x:xs) (y:ys) | x > y = False
                      | x < y = True
                      | otherwise = lexical xs ys
clean :: Ord a => [[a]] -> [[a]]
clean xss = h (map (isort (<)) xss) []
      where
        h [] acc = acc
        h [xs] acc = if elem xs acc then acc else (xs:acc)
        h (xs:xss) acc = if elem xs acc then h xss acc
                         else h xss (xs:acc)
partitions :: Int -> [[Int]]
partitions 0 = []
partitions n = isort lexical $ clean $ f start [start]
               where
                 start = replicate n 1
                 f [_] acc = acc
                 f xs acc = f (normalise xs) (acc ++ cs) ++ clean (concat (map (\ys -> f ys []) cs))
                   where
                     normalise [] = []
                     normalise [x, y] = [x+y]
                     normalise (x:xs) = x : normalise xs
                     combinations [] = []
                     combinations [x] = [[x]]
                      combinations xs = g xs (length xs - 2)
                       where
                         g xs s | s < 0 = []
                                | otherwise = [take s xs ++ [xs!!s + xs!!(s+1)] ++ drop (s+2) xs] ++
                                  g xs (s-1)
                     cs = clean $ combinations xs


Running some sample tests:

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/
Prelude> :l Partitions.hs
[1 of 1] Compiling Partitions       (Partitions.hs, interpreted )
Ok, modules loaded: Partitions.
*Partitions> partitions 0
[]
*Partitions> partitions 4
[[1,1,1,1],[1,1,2],[1,3],[2,2],[4]]
*Partitions> partitions 9
[[1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,2],[1,1,1,1,1,1,3],[1,1,1,1,1,2,2],[1,1,1,1,1,4],[1,1,1,1,2,3],[1,1,1,1,5],[1,1,1,2,2,2],[1,1,1,2,4],[1,1,1,3,3],[1,1,1,6],[1,1,2,2,3],[1,1,2,5],[1,1,3,4],[1,1,7],[1,2,2,2,2],[1,2,2,4],[1,2,3,3],[1,2,6],[1,3,5],[1,4,4],[1,8],[2,2,2,3],[2,2,5],[2,3,4],[2,7],[3,3,3],[3,6],[4,5],[9]]
*Partitions> partitions 12
[[1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1,1,1,2],[1,1,1,1,1,1,1,1,1,3],[1,1,1,1,1,1,1,1,2,2],[1,1,1,1,1,1,1,1,4],[1,1,1,1,1,1,1,2,3],[1,1,1,1,1,1,1,5],[1,1,1,1,1,1,2,2,2],[1,1,1,1,1,1,2,4],[1,1,1,1,1,1,3,3],[1,1,1,1,1,1,6],[1,1,1,1,1,2,2,3],[1,1,1,1,1,2,5],[1,1,1,1,1,3,4],[1,1,1,1,1,7],[1,1,1,1,2,2,2,2],[1,1,1,1,2,2,4],[1,1,1,1,2,3,3],[1,1,1,1,2,6],[1,1,1,1,3,5],[1,1,1,1,4,4],[1,1,1,1,8],[1,1,1,2,2,2,3],[1,1,1,2,2,5],[1,1,1,2,3,4],[1,1,1,2,7],[1,1,1,3,3,3],[1,1,1,3,6],[1,1,1,4,5],[1,1,1,9],[1,1,2,2,2,2,2],[1,1,2,2,2,4],[1,1,2,2,3,3],[1,1,2,2,6],[1,1,2,3,5],[1,1,2,4,4],[1,1,2,8],[1,1,3,3,4],[1,1,3,7],[1,1,4,6],[1,1,5,5],[1,1,10],[1,2,2,2,2,3],[1,2,2,2,5],[1,2,2,3,4],[1,2,2,7],[1,2,3,3,3],[1,2,3,6],[1,2,4,5],[1,2,9],[1,3,3,5],[1,3,4,4],[1,3,8],[1,4,7],[1,5,6],[1,11],[2,2,2,2,2,2],[2,2,2,2,4],[2,2,2,3,3],[2,2,2,6],[2,2,3,5],[2,2,4,4],[2,2,8],[2,3,3,4],[2,3,7],[2,4,6],[2,5,5],[2,10],[3,3,3,3],[3,3,6],[3,4,5],[3,9],[4,4,4],[4,8],[5,7],[6,6],[12]]

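Until the efficient version arrives, the output above can be cross-checked with a standard recursive generator (this Rust sketch is my own, not a translation of the Haskell code): emitting non-decreasing parts, smallest first, produces exactly the lexicographic order shown in the GHCi session.

```rust
// Generate all partitions of n with parts >= min, each partition non-decreasing.
// Trying the smallest allowed part first yields the partitions in
// lexicographic order.
fn partitions_from(n: u32, min: u32, prefix: &mut Vec<u32>, out: &mut Vec<Vec<u32>>) {
    if n == 0 {
        out.push(prefix.clone());
        return;
    }
    for part in min..=n {
        prefix.push(part);
        partitions_from(n - part, part, prefix, out);
        prefix.pop();
    }
}

fn partitions(n: u32) -> Vec<Vec<u32>> {
    let mut out = Vec::new();
    if n > 0 {
        partitions_from(n, 1, &mut Vec::new(), &mut out);
    }
    out
}

fn main() {
    println!("{:?}", partitions(4)); // [[1, 1, 1, 1], [1, 1, 2], [1, 3], [2, 2], [4]]
}
```

For `partitions 4` this matches the GHCi output exactly, and it produces the expected 30 partitions for n = 9.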

An explanation will be provided when the efficient solution has been implemented!

An absolute beginner’s guide to folding in Haskell

(This is a small write-up I did on a forum for a Haskell MOOC).

foldr and foldl are simply means of reducing/accumulating/folding over values of a sequence into a single value. That’s basically it!

To make it more concrete, let’s analyse a specific implementation of foldr and foldl for Lists:

The easier one to understand is actually foldl since it is more intuitive and natural:

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f acc [] = acc
foldl f acc (x:xs) = foldl f (f acc x) xs

The first thing to do is look at the type. How does it read? In plain English, you can read that as “foldl is a function that takes three arguments: the first argument is a function that takes a value of type b and a value of type a and produces a value of type b, the second argument is a value of type b (this is the accumulator), and the final argument is a list of values of type a. The overall result is a single value of type b”. Makes sense?

With that intuition, let’s look at an example and map that to the definition itself:

Suppose you have a list, xs = [1, 2, 3, 4, 5] and you want to find the sum of all the elements of this list. Then you basically want to reduce the list into a single element under addition, right? So you can define the whole operation as: foldl (+) 0 xs.

Now see how that maps to the definition. So the first argument to foldl, which is a function, is +, and this makes sense since + is a binary function that takes two numerical values and produces their sum. The second argument, which is the “accumulator” (basically the one which keeps accumulating the results as we traverse the list) and we want to start off with 0 here since that is the identity for addition. Finally, the final argument is the list of values itself, xs.

Now look at the body of foldl. It has two patterns to match the inputs against:

1). foldl f acc [] = acc

So when we run out of values in our input list, we simply return the accumulator, which had been dutifully collecting the running sum of the elements of the list. So in our case, this would be analogous to something like foldl (+) acc [] = acc.

2). foldl f acc (x:xs) = foldl f (f acc x) xs

This is the more interesting case. Translating it to our example, we might have something like foldl (+) acc (x:xs) = foldl (+) (acc + x) xs. This is the crucial part – note the acc + x part. In the recursive call, the next value of acc will be acc + x, right? So we are collecting, in this example, the sums of the elements of the list in the variable acc. Now, the most important bit – note that acc is always the left operand of the function, as in acc + x (which, in function application form, would be (+) acc x). This is the reason why we call it “foldl” or “fold left” – we simply reduce the list starting from its left end!

So, for example:
foldl (+) 0 [1, 2, 3, 4, 5] can be expanded (conceptually) as:

foldl (+) 0 [1, 2, 3, 4, 5]
= foldl (+) (0+1) [2, 3, 4, 5]
= foldl (+) ((0+1)+2) [3, 4, 5]
= foldl (+) (((0+1)+2)+3) [4, 5]
= foldl (+) ((((0+1)+2)+3)+4) [5]
= foldl (+) (((((0+1)+2)+3)+4)+5) []
= (((((0+1)+2)+3)+4)+5) -- from the first pattern in the definition of `foldl`
= 15

As you can see, we move from left to right across the list, accumulating values as we go along. This is also why a left fold can run in constant space (in Haskell it is the strict variant foldl' that actually delivers this, since the lazy foldl shown here can itself build up unevaluated thunks), whereas foldr cannot, as we will see.

Now, onto foldr. foldr is very similar to foldl, but whereas foldl folds from left to right, foldr folds from right to left! First, again, let’s look at the definition:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f acc [] = acc
foldr f acc (x:xs) = f x (foldr f acc xs)

The type in plain English: “foldr is a function that takes three arguments – the first argument is a binary function that takes a value of type a and a value of type b, and produces a value of type b, the second argument is the “accumulator” of type b, and the final argument is the list of values of type a. The overall result is a single value of type b”.

Now taking the same example, let’s analyse the body:

1). foldr f acc [] = acc

This is identical to the first pattern in the body of foldl. This makes sense since once we have exhausted the list, we simply need to return the final result, which is stored in the running accumulator, acc.

2). foldr f acc (x:xs) = f x (foldr f acc xs)

Now, this is the tricky part and needs to be analysed carefully. The analogue of this definition for our sum example would be foldr (+) acc (x:xs) = x + (foldr (+) acc xs). What does this mean?

Well, since we want to process the list starting from its rightmost end, we are forced to update the accumulator from the right and move towards the beginning of the list. That is what this definition means. If you observe carefully, the value of the accumulator is actually given by the expression (foldr f acc xs) (or (foldr (+) acc xs) in our example). Remember that in the case of foldl, the accumulator was always the left operand; likewise, in the case of foldr, the accumulator is always the right operand. This is why in the expression f x (foldr f acc xs), the second operand is the running state of the accumulator. Note that the full call cannot be evaluated until the (foldr f acc xs) part has been evaluated. This means that we keep building up a stack of pending applications, and only when the entire list has been consumed can we begin actually updating the accumulator. This is why foldr can be much slower and more memory-intensive than a (strict) left fold, which does not suffer from this deficiency.

To use the same example, let’s evaluate the call foldr (+) 0 [1, 2, 3, 4, 5]:

foldr (+) 0 [1, 2, 3, 4, 5]
= 1 + (foldr (+) 0 [2, 3, 4, 5])
= 1 + (2 + (foldr (+) 0 [3, 4, 5]))
= 1 + (2 + (3 + (foldr (+) 0 [4, 5])))
= 1 + (2 + (3 + (4 + (foldr (+) 0 [5]))))
= 1 + (2 + (3 + (4 + (5 + (foldr (+) 0 []))))) -- we now match the first pattern in the definition of `foldr`
= 1 + (2 + (3 + (4 + (5 + 0)))) -- we now keep updating the accumulator all the way from right to left
= 1 + (2 + (3 + (4 + 5)))
= 1 + (2 + (3 + 9))
= 1 + (2 + 12)
= 1 + 14
= 15

If you observe carefully, the expressions are parenthesised from right to left whereas in the case of foldl, they were from left to right.

Now, you see that both foldl and foldr gave the same result, 15, for our example. However, this is only because addition is associative and commutative (just like multiplication), so it doesn’t matter whether we fold from the left or from the right. Multiplication would likewise give the same result for both foldl and foldr, but subtraction and division would not, since they are neither associative nor commutative.

To confirm this assertion, just compare the outputs of the following expressions:

Prelude> foldr (+) 0 [1..5]
15
Prelude> foldl (+) 0 [1..5]
15

Prelude> foldr (*) 1 [1..5]
120
Prelude> foldl (*) 1 [1..5]
120

Prelude> foldr (-) 0 [1..5]
3
Prelude> foldl (-) 0 [1..5]
-15

Prelude> foldr (/) 1.0 [1.0, 2.0, 3.0, 4.0, 5.0]
1.875
Prelude> foldl (/) 1.0 [1.0, 2.0, 3.0, 4.0, 5.0]
8.333333333333333e-3
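The same left-versus-right behaviour can be reproduced outside Haskell; for instance, Rust’s iterators offer fold (a left fold) and rfold (a right fold), and a quick sketch shows the identical split for subtraction:

```rust
fn main() {
    // Commutative, associative op: both directions agree.
    assert_eq!((1..=5).fold(0, |acc, x| acc + x), 15);
    assert_eq!((1..=5).rfold(0, |acc, x| x + acc), 15);

    // Subtraction: the direction matters.
    // rfold visits items back-to-front, so the accumulator sits on the right.
    let l = (1..=5).fold(0, |acc, x| acc - x);  // ((((0-1)-2)-3)-4)-5 = -15
    let r = (1..=5).rfold(0, |acc, x| x - acc); // 1-(2-(3-(4-(5-0)))) = 3
    println!("foldl: {}, foldr: {}", l, r);     // foldl: -15, foldr: 3
}
```

Note that rfold still passes the accumulator as the closure’s first argument; the right-operand placement is expressed by writing `x - acc` in the closure body.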

Let’s just analyse the subtraction example:

foldr (-) 0 [1..5]
= 1 – (foldr (-) [2, 3, 4, 5])
= 1 – (2 – (foldr (-) [3, 4, 5]))
= 1 – (2 – (3 – (foldr (-) [4, 5])))
= 1 – (2 – (3 – (4 – (foldr (-) [5]))))
= 1 – (2- (3 – (4 – (5 – (foldr (-) [])))))
= 1 – (2 – (3 – (4 – (5 – 0))))
= 1 – (2 – (3 – (4 – 5)))
= 1 – (2 – (3 – (-1)))
= 1 – (2 – 4)
= 1 – (-2)
= 3.

and

foldl (-) 0 [1..5]
= foldl (-) (0-1) [2, 3, 4, 5]
= foldl (-) ((0-1)-2) [3, 4, 5]
= foldl (-) (((0-1)-2)-3) [4, 5]
= foldl (-) ((((0-1)-2)-3)-4) [5]
= foldl (-) (((((0-1)-2)-3)-4)-5) []
= (((((0-1)-2)-3)-4)-5)
= -15.

Et voila!

Optional

In case you are familiar with C, the functions may be rendered thus, for instance:

// The generic foldr function
void *foldr(void *(*fptr)(void*, void*), void *acc, void *arr, size_t n, size_t delta)
{
    if (!n) {
        return acc;
    } else {
        return fptr(arr, foldr(fptr, acc, arr+delta, n-1, delta));
    }
}

// The generic foldl function
void *foldl(void *(*fptr)(void*, void*), void *acc, void *arr, size_t n, size_t delta)
{
    if (!n) {
        return acc;
    } else {
        return foldl(fptr, fptr(acc, arr), arr+delta, n-1, delta);
    }
}

Note how they map to the Haskell definitions directly.

If you are interested, here is a full program (note that this is only to simulate Haskell behaviour as closely as possible – it is not the idiomatic way to do the operations in C, and it leaks memory as well which we do not care to consider for this demo) that you can run to observe the behaviour and compare it with Haskell:

The program:

#include <stdio.h>
#include <stdlib.h>

void *add_int(void *x, void *y)
{
    int *res = malloc(sizeof(int));
    *res = *((int*) x) + *((int*) y);

    return (void*) res;
}

void *multiply_int(void *x, void *y)
{
    int *res = malloc(sizeof(int));
    *res = *((int*) x) * *((int*) y);

    return (void*)res;
}

void *subtract_int(void *x, void *y)
{
    int *res = malloc(sizeof(int));
    *res = *((int*) x) - *((int*) y);

    return (void*) res;
}

void *divide_double(void *x, void *y)
{
    double *res = malloc(sizeof(double));
    *res = *((double*) x) / *((double*) y);

    return (void*) res;
}

// The generic foldr function
void* foldr(void *(*fptr)(void*, void*), void *acc, void *arr, size_t n, size_t delta)
{
    if (!n) {
        return acc;
    } else {
        return fptr(arr, foldr(fptr, acc, arr+delta, n-1, delta));
    }
}

// The generic foldl function
void *foldl(void *(*fptr)(void*, void*), void *acc, void *arr, size_t n, size_t delta)
{
    if (!n) {
        return acc;
    } else {
        return foldl(fptr, fptr(acc, arr), arr+delta, n-1, delta);
    }
}

int sum_foldr(int a[], size_t n)
{
    int sum = 0, acc = 0;
    void *ret = foldr(add_int, &acc, a, n, sizeof(int));
    if (ret) {
        sum = *((int*) ret);
        free(ret);
    }

    return sum;
}

int product_foldr(int a[], size_t n)
{
    int prod = 1, acc = 1;
    void *ret = foldr(multiply_int, &acc, a, n, sizeof(int));
    if (ret) {
        prod = *((int*) ret);
        free(ret);
    }

    return prod;
}

int sum_foldl(int a[], size_t n)
{
    int sum = 0, acc = 0;
    void *ret = foldl(add_int, &acc, a, n, sizeof(int));
    if (ret) {
        sum = *((int*) ret);
        free(ret);
    }

    return sum;
}

int product_foldl(int a[], size_t n)
{
    int prod = 1, acc = 1;
    void *ret = foldl(multiply_int, &acc, a, n, sizeof(int));
    if (ret) {
        prod = *((int*) ret);
        free(ret);
    }

    return prod;
}

int sub_foldr(int a[], size_t n)
{
    int diff = 0, acc = 0;
    void *ret = foldr(subtract_int, &acc, a, n, sizeof(int));
    if (ret) {
        diff = *((int*) ret);
        free(ret);
    }

    return diff;
}

int sub_foldl(int a[], size_t n)
{
    int diff = 0, acc = 0;
    void *ret = foldl(subtract_int, &acc, a, n, sizeof(int));
    if (ret) {
        diff = *((int*) ret);
        free(ret);
    }

    return diff;
}

double div_foldr(double a[], size_t n)
{
    double div = 1.0, acc = 1.0;
    void *ret = foldr(divide_double, &acc, a, n, sizeof(double));
    if (ret) {
        div = *((double*) ret);
        free(ret);
    }

    return div;
}

double div_foldl(double a[], size_t n)
{
    double div = 1.0, acc = 1.0;
    void *ret = foldl(divide_double, &acc, a, n, sizeof(double));
    if (ret) {
        div = *((double*) ret);
        free(ret);
    }

    return div;
}

int main(int argc, char **argv)
{
    int a[] = { 1, 2, 3, 4, 5 };
    size_t n = sizeof(a)/sizeof(a[0]);

    double b[] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    size_t m = sizeof(b)/sizeof(b[0]);

    printf("sum_foldr = %d\n", sum_foldr(a, n));
    printf("product_foldr = %d\n", product_foldr(a, n));

    printf("sum_foldl = %d\n", sum_foldl(a, n));
    printf("product_foldl = %d\n", product_foldl(a, n));

    printf("sub_foldr = %d\n", sub_foldr(a, n));
    printf("div_foldr = %lf\n", div_foldr(b, m));

    printf("sub_foldl = %d\n", sub_foldl(a, n));
    printf("div_foldl = %lf\n", div_foldl(b, m));

    return 0;
}

Running it:

$ gcc -Wall -O2 -o fold fold.c && ./fold
sum_foldr = 15
product_foldr = 120
sum_foldl = 15
product_foldl = 120
sub_foldr = 3
div_foldr = 1.875000
sub_foldl = -15
div_foldl = 0.008333

Exactly the same result as in the case of Haskell.

The basic idea is this – take a simple example, and work through it with pen and paper to really understand how it works!

Implementing Monad for a custom List type in Haskell

As an extension to the previous blog, we finally reach the highest level of abstraction yet – the Monad. Let’s see how we can implement our custom List type as an instance of the Monad typeclass. Here is the definition of the Monad typeclass as given in the book:

class Applicative m => Monad m where
    return :: a -> m a

    (>>=) :: m a -> (a -> m b) -> m b

    return = pure

This means that in order to make our List type an instance of Monad, we will have to make it an instance of Functor as well as Applicative.

Here is a simple but comprehensive implementation:

data List a = Nil | Cons a (List a) deriving Show

append :: List a -> List a -> List a
append Nil ys = ys
append (Cons x xs) ys = Cons x (append xs ys)

instance Functor List where
    -- fmap :: (a -> b) -> List a -> List b
    fmap _ Nil = Nil
    fmap f (Cons x xs) = Cons (f x) (fmap f xs)

instance Applicative List where
    -- pure :: a -> List a
    pure x = Cons x Nil

    -- (<*>) :: List (a -> b) -> List a -> List b
    Nil <*> _ = Nil
    (Cons g gs) <*> xs = fmap g xs `append` (gs <*> xs)

instance Monad List where
    -- (>>=) :: List a -> (a -> List b) -> List b
    Nil >>= _ = Nil
    (Cons x xs) >>= f = (f x) `append` (xs >>= f)

We also define some test data to work with, and a function called pairs that simply returns the cross-product of the two input lists.


-- test data
l1 :: List Int
l1 = Cons 1 (Cons 2 (Cons 3 (Cons 4 (Cons 5 Nil))))

l2 :: List String
l2 = Cons "Hello" (Cons "World" (Cons "we" (Cons "meet" (Cons "again" Nil))))

pairs :: Monad m => m a -> m b -> m (a, b)
pairs xs ys = xs >>= \x ->
                ys >>= \y ->
                    return (x, y)

Note that the pairs function could also have been written using the do notation as follows:

pairs :: Monad m => m a -> m b -> m (a, b)
pairs xs ys = do x <- xs
                 y <- ys
                 return (x, y)

This is identical to the previous version since the do notation is simply syntactic sugar for the bind operator.
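Since pairs is polymorphic in the monad, one quick sanity check is to run the very same definition against the built-in list monad, where the cross-product behaviour is easy to verify by hand (a standalone sketch):

```haskell
-- The same pairs definition as above, checked against the built-in
-- list monad: bind for lists is concatMap, so we get the cross-product.
pairs :: Monad m => m a -> m b -> m (a, b)
pairs xs ys = xs >>= \x ->
                ys >>= \y ->
                    return (x, y)

main :: IO ()
main = print (pairs [1, 2 :: Int] "ab")
-- prints [(1,'a'),(1,'b'),(2,'a'),(2,'b')]
```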

Sample test run:

*Main> :t l1
l1 :: List Int
*Main> l2
Cons "Hello" (Cons "World" (Cons "we" (Cons "meet" (Cons "again" Nil))))
*Main> pairs l1 l2
Cons (1,"Hello") (Cons (1,"World") (Cons (1,"we") (Cons (1,"meet") (Cons (1,"again") (Cons (2,"Hello") (Cons (2,"World") (Cons (2,"we") (Cons (2,"meet") (Cons (2,"again") (Cons (3,"Hello") (Cons (3,"World") (Cons (3,"we") (Cons (3,"meet") (Cons (3,"again") (Cons (4,"Hello") (Cons (4,"World") (Cons (4,"we") (Cons (4,"meet") (Cons (4,"again") (Cons (5,"Hello") (Cons (5,"World") (Cons (5,"we") (Cons (5,"meet") (Cons (5,"again") Nil))))))))))))))))))))))))

Implementing Applicative for a custom List type in Haskell

Continuing with the Haskell theme, here is a small implementation of Applicative for a custom List type.

Here is the general definition of Applicative as given in Professor Graham Hutton’s exemplary book, “Programming in Haskell” (2nd Edition) – one of the best books on Haskell:

class Functor f => Applicative f where
   pure :: a -> f a

   (<*>) :: f (a -> b) -> f a -> f b

This means that we have to make our custom List type into an instance of Functor before we can make it an instance of Applicative. Here is the definition of Functor from the same book:

class Functor f where
    fmap :: (a -> b) -> f a -> f b

Okay, enough preamble, let’s get started with it. First, we define our custom List type:

data List a = Nil | Cons a (List a) deriving (Eq, Ord, Show, Read)

Now we define a function called append. This is our analogue of Haskell’s built-in ++ append operator. This will be required when implementing the Applicative code:

-- append function for our custom List type
append :: List a -> List a -> List a
append Nil ys = ys
append (Cons x xs) ys = Cons x (append xs ys)

Almost there! Let’s make our List type a Functor type:

-- make it an instance of the Functor typeclass 
instance Functor List where
    -- fmap :: (a -> b) -> List a -> List b
    fmap _ Nil = Nil
    fmap g (Cons x xs) = Cons (g x) (fmap g xs)

Simple enough – we simply implement the fmap function using pattern matching. Now, finally, let’s make our type an Applicative Functor!

-- now make the List type an instance of Applicative
instance Applicative List where
    -- pure :: a -> List a
    pure x = Cons x Nil

    -- (<*>) :: List (a -> b) -> List a -> List b
    Nil <*> _ = Nil
    (Cons g gs) <*> xs = append (fmap g xs) (gs <*> xs)

That’s it! We’re all done. Again, we use pattern matching in the implementation of <*>, the applicative operator. As promised earlier, we make use of both fmap and append in this definition.
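To see exactly how <*> unfolds, here is a small self-contained trace (restating the definitions from above; Eq is derived here only so the result can be compared, and the fs/xs10 names are illustrative):

```haskell
data List a = Nil | Cons a (List a) deriving (Eq, Show)

append :: List a -> List a -> List a
append Nil ys = ys
append (Cons x xs) ys = Cons x (append xs ys)

instance Functor List where
  fmap _ Nil = Nil
  fmap g (Cons x xs) = Cons (g x) (fmap g xs)

instance Applicative List where
  pure x = Cons x Nil
  Nil <*> _ = Nil
  (Cons g gs) <*> xs = append (fmap g xs) (gs <*> xs)

-- Hand expansion:
--   (Cons (+1) (Cons (*2) Nil)) <*> (Cons 10 (Cons 20 Nil))
-- = fmap (+1) [10,20] `append` (fmap (*2) [10,20] `append` Nil)
-- = [11,21] ++ [20,40]
fs :: List (Int -> Int)
fs = Cons (+1) (Cons (*2) Nil)

xs10 :: List Int
xs10 = Cons 10 (Cons 20 Nil)

result :: List Int
result = fs <*> xs10

main :: IO ()
main = print result
-- prints Cons 11 (Cons 21 (Cons 20 (Cons 40 Nil)))
```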

Finally, let’s define some test data:

-- test data
l1 :: List Int
l1 = Cons 1 (Cons 2 (Cons 2 Nil)) -- [1, 2, 2]

l2 :: List Int
l2 = Cons 10 (Cons 20 Nil) -- [10, 20]

l3 :: List Int
l3 = Nil -- []    

Let’s try out a few tests to make sure that it works correctly!

*Main> l1
Cons 1 (Cons 2 (Cons 2 Nil))

*Main> l2
Cons 10 (Cons 20 Nil)

*Main> l3
Nil

*Main> pure (+1) <*> l1
Cons 2 (Cons 3 (Cons 3 Nil))

*Main> pure (+) <*> l1 <*> l2
Cons 11 (Cons 21 (Cons 12 (Cons 22 (Cons 12 (Cons 22 Nil)))))

*Main> pure (\x -> \y -> \z -> x * y * z) <*> l1 <*> l2 <*> l1
Cons 10 (Cons 20 (Cons 20 (Cons 20 (Cons 40 (Cons 40 (Cons 20 (Cons 40 (Cons 40 (Cons 40 (Cons 80 (Cons 80 (Cons 20 (Cons 40 (Cons 40 (Cons 40 (Cons 80 (Cons 80 Nil)))))))))))))))))

*Main> pure (\x -> \y -> \z -> x * y * z) <*> l1 <*> l2 <*> Nil
Nil

*Main> pure (^200) <*> l3
Nil

*Main> pure (\x y z w -> (x+y) * (z+w)) <*> l1 <*> l2 <*> l1 <*> l2
Cons 121 (Cons 231 (Cons 132 (Cons 242 (Cons 132 (Cons 242 (Cons 231 (Cons 441 (Cons 252 (Cons 462 (Cons 252 (Cons 462 (Cons 132 (Cons 252 (Cons 144 (Cons 264 (Cons 144 (Cons 264 (Cons 242 (Cons 462 (Cons 264 (Cons 484 (Cons 264 (Cons 484 (Cons 132 (Cons 252 (Cons 144 (Cons 264 (Cons 144 (Cons 264 (Cons 242 (Cons 462 (Cons 264 (Cons 484 (Cons 264 (Cons 484 Nil)))))))))))))))))))))))))))))))))))

Most satisfying indeed!

A simple Guessing Game implementation in Haskell

Just a quick program since it’s been some time since I updated my blog!

module Main where

import System.Random

main :: IO ()
main = do putStrLn "Try to guess the secret number ([1-100])..."
          secret <- randomRIO (1, 100)
          play secret


play :: Int -> IO ()
play secret = do guesses <- playGame secret 0
                 putStrLn $ "You win! You took " ++ show guesses ++ " guesses!"


playGame :: Int -> Int -> IO Int
playGame secret guesses = do putStr "? "
                             input <- getLine
                             let guess = read input :: Int
                             if guess == secret then
                                do return (guesses + 1)
                             else if guess < secret then
                                do putStrLn "Too small!"
                                   playGame secret (guesses + 1)
                             else do putStrLn "Too big!"
                                     playGame secret (guesses + 1)   

The code is pretty straightforward – the only tricky part I faced was implementing the random number generation (quite a complicated topic in Haskell).

Running it,

$ ghc guessingGame.hs
[1 of 1] Compiling Main             ( guessingGame.hs, guessingGame.o )
Linking guessingGame ...

$ ./guessingGame
Try to guess the secret number ([1-100])...
50
? Too small!
75
? Too big!
63
? Too big!
56
? Too small!
59
? Too small!
61
? You win! You took 6 guesses!

Exciting, eh? 🙂

Changing the Java language using OpenJDK – a small experiment

Of late, I have been fascinated with languages, compilers, and specifically parsing and code generation. To that end, I started looking into how the Java language itself might be modified. It turns out that Java itself uses a handwritten lexer and parser with a very well-defined tree structure for the AST. That being said, good documentation is hard to come by.

I chanced upon this gem of a blog post – Making Java esoteric – which shows a simple example of how to add a maybe keyword to Java so that it may randomly execute statements and/or blocks.

I was inspired enough to try out my first hack on OpenJDK. To that end, I decided to implement a when statement that would essentially act as its Common Lisp equivalent.

My attempt at implementing the when keyword didn’t go down so well since it appears that the word has been used in umpteen places in the JDK itself. All right, let’s just use the Russian equivalent, kogda, which means essentially the same thing.

My basic logic was to define the new keyword, kogda that would work with block statements or simple statements, and then offload the actual work to the if statement. So here were the code changes in order:

First off, the lexer (langtools/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/Tokens.java):

 public enum TokenKind implements Formattable, Filter<TokenKind> {
    ...
    KOGDA("kogda"),
    ...
}

And then the parser (langtools/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/JavacParser.java):

 List<JCStatement> blockStatement() {
    ...
    case KOGDA:
    ...
    case ASSERT:
            return List.of(parseSimpleStatement());
    ...
}

and

 public JCStatement parseSimpleStatement() {
   ...
     case KOGDA: {
                nextToken();
                JCExpression cond = parExpression();
                JCStatement body = parseStatementAsBlock();
                return F.at(pos).Kogda(cond, body);
            }
   ...
}

Nothing special here – we simply read in the next token, get the conditional part, and then get the body of the kogda statement.

And then finally, langtools/src/jdk.compiler/share/classes/com/sun/tools/javac/tree/TreeMaker.java:

public JCIf Kogda(JCExpression cond, JCStatement body) {
        JCIf tree = new JCIf(cond, body, null);
        tree.pos = pos;

        return tree;
    }

This is the actual meat of the logic. In this case, I simply offload the work to the reliable if statement, inserting null for the else part. That’s all there is to it!

Then we build the new version of the JDK:

Macushla:9dev z0ltan$ pwd
/Users/z0ltan/Projects/Resources/Languages/Java/9dev

Macushla:9dev z0ltan$ make clean && make images
<output elided>

Macushla:9dev z0ltan$ cd build/macosx-x86_64-normal-server-release/jdk/bin/

Macushla:bin z0ltan$ ./javac -version
javac 9-internal

Here’s our little test program to test out:

Macushla:bin z0ltan$ cat HelloKogda.java
public class HelloKogda {
    public static void main(String[] args) {
        kogda (1 < 2) {
            System.out.println("Yes, 1 < 2");
        }

        kogda (2 > 100) {
            System.out.println("2 > 100");
        }

        kogda (1 < 10 && 1 == 1) System.out.println("Booyah!");
    }
}

And finally, compiling and running the test code using our custom-built JDK:

Macushla:bin z0ltan$ ./javac HelloKogda.java
Macushla:bin z0ltan$ ./java -cp . HelloKogda
Yes, 1 < 2
Booyah!

Nice! There was just one thing that I was not completely happy with – JShell. In theory, it should be trivial to modify JShell to reflect these same changes, but I had immense trouble trying to get the jdk.jshell project to pick up the changes from my modified jdk.compiler project instead of the default JDK (which, of course, does not contain my changes). Maybe once I wrap my head around the internals of the whole ecosystem, I will post an update here. For now, this was a fun experiment!

A parser using Regular Expressions?

The oft-quoted joke is that regular expressions are such a useful tool that they are inevitably abused beyond their abilities. Case in point – Dmitry Olshansky’s DConf 2017 talk, where he mentions the tonnes of questions on StackOverflow about how to write a full-fledged HTML parser using only regular expressions (not quite possible!).

That amused me to such an extent that I decided to experiment a bit with regular expressions, and to see if I could use a regex engine (which, of course, does not conform to strict regular languages – think backreferences, etc.) as a parser. A modern regex engine (such as Java’s java.util.regex package) is powerful enough to be a perfect candidate for such wanton abuse!

All right, so what’s the scope here? For the purpose of this experiment, I will restrict the scope of the parser to the following:

  1. Allow List initialisation of the form
              List<Double> specialNos = { 3.14519, 2.71828, 1.818181 }
             
  2. Allow similar syntactic sugar for Sets:
              Set<String> languages = { "Java", "Rust", "Haskell", "Common Lisp", "C++" }
             
  3. A Rubyesque Map initialisation syntax:
              Map<Integer, String> nos = { 1 => "One"; 2 => "Two"; 3 => "Three" }
              

So here’s what a sample file might look like:

Macushla:basic-regex-parser z0ltan$ cat samples/com/z0ltan/regexparser/test/HelloParser.javax
package com.z0ltan.regexparser.test;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HelloParser {
    public static void main(String[] args) {
        listDemo();

        setDemo();

        mapDemo();
    }

    private static void listDemo() {
        // normal list
        List<String> greetings = Arrays.asList("Hello", "Hallo", "Hola");
        System.out.printf("greetings = %s\n", greetings);

        // custom list syntax
        List<Integer> myNums =
                    { 1, 3, 5, 7, 9, 11 }

        System.out.printf("myNums = %s\n", myNums);

    }

    private static void setDemo() {
        // normal set
        Set<Integer> evensBelow10 = new HashSet<>();
        evensBelow10.addAll(Arrays.asList(2, 4, 6, 8));

        System.out.printf("evensBelow10 = %s\n", evensBelow10);

        // custom set syntax
        Set<String> mammals = { "Cat", "Dog", "Lion", "Beaver", "Raccoon" }

        System.out.printf("mammals = %s\n", mammals);
    }

    private static void mapDemo() {
        // normal map
        Map<Integer, String> numberToString = new HashMap<>();
        numberToString.put(1,  "One");
        numberToString.put(2, "Two");
        numberToString.put(3,  "Three");

        System.out.printf("numberToString = %s\n",  numberToString);

        // custom map syntax
        Map<Integer, Foo> numToFoo =
                {
                    1 => new Foo(1, "Ein");

                    2 => new Foo(2, "Zwei");

                    3 => new Foo(3, "Drei");

                    4 => new Foo(4, "Vier");

                    5 => new Foo(5, "Fuenf");
                }

        System.out.printf("numToFoo = %s\n", numToFoo);
    }

    static class Foo {
        private int id;
        private String name;

        public Foo(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Foo { id = " + id + ", name = " + name + " }";
        }
    }
}

This sample input file includes examples for all three variants. So how would the parser look? Let’s split the parser into three parts as well:

The parser for List:

/**
	 * Handle custom List syntax.
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseListSyntax(String contents) {
		Matcher m = Patterns.LIST_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "List<" + type + ">" + var + " = new ArrayList<" + type + ">() {{ ";
				String[] vals = val.split(",");

				for (String rep : vals) {
					String v = rep.trim();

					if (!v.isEmpty()) {
						replacement += "add(" + v + ");\n\t\t\t\t\t ";
					}
				}
				replacement += "}};";

				contents = m.replaceFirst(replacement);

				m = Patterns.LIST_PATTERN.matcher(contents);
			}
		}
		return contents;
	}

The parser for Set:

/**
	 * Handle custom Set syntax
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseSetSyntax(String contents) {
		Matcher m = Patterns.SET_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "Set<" + type + "> " + var + " = new HashSet<" + type + ">() {{ ";
				String[] vals = val.split(",");

				for (String rep : vals) {
					String v = rep.trim();

					if (!v.isEmpty()) {
						replacement += "add(" + v + ");\n\t\t\t\t\t";
					}
				}

				replacement += " }};";

				contents = m.replaceFirst(replacement);

				m = Patterns.SET_PATTERN.matcher(contents);
			}
		}

		return contents;
	}

And finally, the parser for Map:

/**
	 * Handle custom Map syntax
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseMapSyntax(String contents) {
		Matcher m = Patterns.MAP_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "Map<" + type + "> " + var + " = new HashMap<" + type + ">() {{";
				String[] vals = val.split(";");

				for (String rep : vals) {
					String v = rep.trim();

					if (!rep.isEmpty()) {
						String[] keyVal = v.split("=>");

						if (keyVal == null || keyVal.length != 2 || keyVal[0] == null || keyVal[1] == null) {
							throw new RuntimeException("invalid map syntax");
						}
						replacement += "put(" + keyVal[0].trim() + ", " + keyVal[1].trim() + ");\n\t\t\t\t\t";
					}
				}

				replacement += " }};";

				contents = m.replaceFirst(replacement);

				m = Patterns.MAP_PATTERN.matcher(contents);
			}
		}

		return contents;
	}

The code is quite straightforward and similar for all three variants – we simply check whether the relevant pattern is found in the current code contents (the whole source file is read in one go), and if so, the relevant bit is rewritten, and the updated String is then passed to the next stage of the pipeline for further processing.

One thing to observe here is that in every parsing method, the find operation is performed in a loop. This is so that every occurrence can be updated correctly, and that is also the reason why the Matcher is rebuilt at the end of every iteration – m = Patterns.LIST_PATTERN.matcher(contents), for instance. If this were not done, the result would be an infinite loop: replaceFirst resets the Matcher, so while (m.find()) would keep rediscovering the same first occurrence in the stale contents and never fail!

Here’s what the main parsing method looks like:

public static void parse(final String fileName) {
		try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
			String ext = fileName.substring(fileName.lastIndexOf('.') + 1);

			if (!ext.equalsIgnoreCase("javax")) {
				throw new RuntimeException("input files must have a .javax extension");
			}

			File srcFile = new File(fileName);

			if (!srcFile.exists()) {
				throw new RuntimeException("source file " + fileName + " does not exist!");
			}

			final String basePath = srcFile.getAbsolutePath().substring(0,
					srcFile.getAbsolutePath().lastIndexOf(File.separator));
			final String baseFileName = srcFile.getName().substring(0, srcFile.getName().lastIndexOf('.'));
			final String outputFile = basePath + File.separator + baseFileName
					+ ".java";

			String contents = reader.lines().reduce("", (acc, s) -> {
				acc += s;
				acc += "\n";
				return acc;
			});

			contents = insertImports(contents);
			contents = parseListSyntax(contents);
			contents = parseSetSyntax(contents);
			contents = parseMapSyntax(contents);

			try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile))) {
				writer.write(contents);
			} catch (Exception ex) {
				throw ex;
			}

		} catch (Exception ex) {
			throw new RuntimeException("Error during parsing: " + ex.getLocalizedMessage());
		}
	}

A final subtlety is in the way the imports are handled. For the full source code, as well as more explanation of the way the whole tool works, visit the GitHub page.

Let’s take it for a spin:

Macushla:basic-regex-parser z0ltan$ ls
README.md pom.xml   samples   src       target

Macushla:basic-regex-parser z0ltan$ mvn package 1> /dev/null

Macushla:basic-regex-parser z0ltan$ java -jar target/basic-regex-parser-1.0-SNAPSHOT.jar samples/Default.javax samples/com/z0ltan/test/Package.javax samples/com/z0ltan/regexparser/test/HelloParser.javax

Let’s see how the sample file shown earlier has been rewritten:

Macushla:basic-regex-parser z0ltan$ cat samples/com/z0ltan/regexparser/test/HelloParser.java
package com.z0ltan.regexparser.test;

import java.util.*;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HelloParser {
    public static void main(String[] args) {
        listDemo();

        setDemo();

        mapDemo();
    }

    private static void listDemo() {
        // normal list
        List<String> greetings = Arrays.asList("Hello", "Hallo", "Hola");
        System.out.printf("greetings = %s\n", greetings);

        // custom list syntax
        List<Integer>myNums = new ArrayList<Integer>() {{ add(1);
					 add(3);
					 add(5);
					 add(7);
					 add(9);
					 add(11);
					 }};

        System.out.printf("myNums = %s\n", myNums);

    }

    private static void setDemo() {
        // normal set
        Set<Integer> evensBelow10 = new HashSet<>();
        evensBelow10.addAll(Arrays.asList(2, 4, 6, 8));

        System.out.printf("evensBelow10 = %s\n", evensBelow10);

        // custom set syntax
        Set<String> mammals = new HashSet<String>() {{ add("Cat");
					add("Dog");
					add("Lion");
					add("Beaver");
					add("Raccoon");
					 }};

        System.out.printf("mammals = %s\n", mammals);
    }

    private static void mapDemo() {
        // normal map
        Map<Integer, String> numberToString = new HashMap<>();
        numberToString.put(1,  "One");
        numberToString.put(2, "Two");
        numberToString.put(3,  "Three");

        System.out.printf("numberToString = %s\n",  numberToString);

        // custom map syntax
        Map<Integer, Foo> numToFoo = new HashMap<Integer, Foo>() {{put(1, new Foo(1, "Ein"));
					put(2, new Foo(2, "Zwei"));
					put(3, new Foo(3, "Drei"));
					put(4, new Foo(4, "Vier"));
					put(5, new Foo(5, "Fuenf"));
					 }};

        System.out.printf("numToFoo = %s\n", numToFoo);
    }

    static class Foo {
        private int id;
        private String name;

        public Foo(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Foo { id = " + id + ", name = " + name + " }";
        }
    }
}

Excellent! Now let’s try and run all the parsed files to make sure they are all working as expected:

Macushla:basic-regex-parser z0ltan$ cd samples/

Macushla:samples z0ltan$ javac Default.java
Macushla:samples z0ltan$ java -cp . Default
users = [Cornelius, Marcus, Linnaeus, Gaius, Augustus]

Macushla:samples z0ltan$ javac com/z0ltan/test/Package.java
Macushla:samples z0ltan$ java -cp . com.z0ltan.test.Package
map = {One=1, Two=2, Three=3}
normalMap = {Hello=100, Again=300, World=200}

Macushla:samples z0ltan$ javac com/z0ltan/regexparser/test/HelloParser.java
Macushla:samples z0ltan$ java -cp . com.z0ltan.regexparser.test.HelloParser
greetings = [Hello, Hallo, Hola]
myNums = [1, 3, 5, 7, 9, 11]
evensBelow10 = [2, 4, 6, 8]
mammals = [Cat, Lion, Beaver, Raccoon, Dog]
numberToString = {1=One, 2=Two, 3=Three}
numToFoo = {1=Foo { id = 1, name = Ein }, 2=Foo { id = 2, name = Zwei }, 3=Foo { id = 3, name = Drei }, 4=Foo { id = 4, name = Vier }, 5=Foo { id = 5, name = Fuenf }}

Nice! Of course, this is just a toy tool that does trivial rewriting. If Java had a proper macro system (or even any macro system to begin with!), all of this could have been done trivially, without so much redundant code. Nevertheless, it does illustrate that regular expression engines are indeed quite powerful, well beyond their original intent.

EDIT: Just in case you are curious as to what the patterns look like, here they are:

public class Patterns {
	public static final Pattern LIST_PATTERN = Pattern
			.compile("List<([a-zA-Z]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9,\" ]+)\\s*\\}");

	public static final Pattern SET_PATTERN = Pattern
			.compile("Set<([a-zA-Z]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9,\" ]+)\\s*\\}");

	public static final Pattern MAP_PATTERN = Pattern
			.compile("Map<([a-zA-Z, ]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9\"\\s=>();, ]+)\\s*\\}");
	
	public static final Pattern PACKAGE_PATTERN = Pattern
			.compile("package\\s+([a-zA-Z0-9._;]+)");
	
	public static final Pattern IMPORT_PATTERN = Pattern
			.compile("import\\s+([a-zA-Z0-9.;]+)");
	
	public static final Pattern PUBLIC_CLASS_PATTERN = Pattern
			.compile("public class\\s+([a-zA-Z0-9_]+)");
	
	public static final Pattern NORMAL_CLASS_PATTERN = Pattern
			.compile("class\\s+([a-zA-Z0-9_]+)");
}

You may safely ignore the patterns after the first three – they are used to insert the import java.util.*; declaration at the first valid location. Again, for more details, visit the GitHub repository!