A parser using Regular Expressions?

The oft-quoted joke is that Regular Expressions are such a useful tool that it is inevitably abused beyond its abilities. Case in point – Dmitry Olshansky’s Dconf 2017 talk where he mentions the tonnes of questions on StackOverflow about how to write a full-fledged HTML parser using only regular expressions (not quite possible!).

That amused me to such an extent that I decided to experiment a bit with regular expressions and to see if I could use a regular expression engine (which, of course, do not conform to strict regular languages (think backreferences, etc.). A modern regex engine (such as Java’s java.util.regex package) is suitably powerful enough to be a perfect candidate for such wanton abuse!

All right, so what’s the scope here? For the purpose of this experiment, I will restrict the scope of the parser to the following:

  1. Allow List initialisation of the form
              List<Double> specialNos = { 3.14519, 2.71828, 1.818181 }
             
  2. Allow similar syntactic sugar for SetS:
              Set<String> languages = { "Java", "Rust", "Haskell", "Common Lisp", "C++" }
             
  3. A Rubyesque Map initialisation syntax:
              Map<Integer, String> nos = { 1 => "One"; 2 => "Two"; 3 => "Three" }
              

So here’s how a sample file might look like:

Macushla:basic-regex-parser z0ltan$ cat samples/com/z0ltan/regexparser/test/HelloParser.javax
package com.z0ltan.regexparser.test;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HelloParser {
    public static void main(String[] args) {
        listDemo();

        setDemo();

        mapDemo();
    }

    private static void listDemo() {
        // normal list
        List<String> greetings = Arrays.asList("Hello", "Hallo", "Hola");
        System.out.printf("greetings = %s\n", greetings);

        // custom list syntax
        List<Integer> myNums =
                    { 1, 3, 5, 7, 9, 11 }

        System.out.printf("myNums = %s\n", myNums);

    }

    private static void setDemo() {
        // normal set
        Set<Integer> evensBelow10 = new HashSet<>();
        evensBelow10.addAll(Arrays.asList(2, 4, 6, 8));

        System.out.printf("evensBelow10 = %s\n", evensBelow10);

        // custom set syntax
        Set<String> mammals = { "Cat", "Dog", "Lion", "Beaver", "Raccoon" }

        System.out.printf("mammals = %s\n", mammals);
    }

    private static void mapDemo() {
        // normal map
        Map<Integer, String> numberToString = new HashMap<>();
        numberToString.put(1,  "One");
        numberToString.put(2, "Two");
        numberToString.put(3,  "Three");

        System.out.printf("numberToString = %s\n",  numberToString);

        // custom map syntax
        Map<Integer, Foo> numToFoo =
                {
                    1 => new Foo(1, "Ein");

                    2 => new Foo(2, "Zwei");

                    3 => new Foo(3, "Drei");

                    4 => new Foo(4, "Vier");

                    5 => new Foo(5, "Fuenf");
                }

        System.out.printf("numToFoo = %s\n", numToFoo);
    }

    static class Foo {
        private int id;
        private String name;

        public Foo(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Foo { id = " + id + ", name = " + name + " }";
        }
    }
}

So this sample input file includes examples for all three variants. So how would the parser for this look like? Let’s split the parser into three variants as well:

The parser for List:

/**
	 * Handle custom List syntax.
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseListSyntax(String contents) {
		Matcher m = Patterns.LIST_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "List<" + type + ">" + var + " = new ArrayList<" + type + ">() {{ ";
				String[] vals = val.split(",");

				for (String rep : vals) {
					String v = rep.trim();

					if (!v.isEmpty()) {
						replacement += "add(" + v + ");\n\t\t\t\t\t ";
					}
				}
				replacement += "}};";

				contents = m.replaceFirst(replacement);

				m = Patterns.LIST_PATTERN.matcher(contents);
			}
		}
		return contents;
	}

The parser for Set:

/**
	 * Handle custom Set syntax
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseSetSyntax(String contents) {
		Matcher m = Patterns.SET_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "Set<" + type + "> " + var + " = new HashSet<" + type + ">() {{ ";
				String[] vals = val.split(",");

				for (String rep : vals) {
					String v = rep.trim();

					if (!v.isEmpty()) {
						replacement += "add(" + v + ");\n\t\t\t\t\t";
					}
				}

				replacement += " }};";

				contents = m.replaceFirst(replacement);

				m = Patterns.SET_PATTERN.matcher(contents);
			}
		}

		return contents;
	}

And finally, the parser for Map:

/**
	 * Handle custom Map syntax
	 * 
	 * @param contents
	 * @return
	 */
	private static String parseMapSyntax(String contents) {
		Matcher m = Patterns.MAP_PATTERN.matcher(contents);

		while (m.find()) {
			if (m.groupCount() == 3) {
				String type = m.group(1);
				String var = m.group(2);
				String val = m.group(3).trim();

				String replacement = "Map<" + type + "> " + var + " = new HashMap<" + type + ">() {{";
				String[] vals = val.split(";");

				for (String rep : vals) {
					String v = rep.trim();

					if (!rep.isEmpty()) {
						String[] keyVal = v.split("=>");

						if (keyVal == null || keyVal.length != 2 || keyVal[0] == null || keyVal[1] == null) {
							throw new RuntimeException("invalid map syntax");
						}
						replacement += "put(" + keyVal[0].trim() + ", " + keyVal[1].trim() + ");\n\t\t\t\t\t";
					}
				}

				replacement += " }};";

				contents = m.replaceFirst(replacement);

				m = Patterns.MAP_PATTERN.matcher(contents);
			}
		}

		return contents;
	}

The code is quite straightforward and similar for all three variants – we simply to see if the relevant pattern is found in the current code contents (the whole source file is read in in one go), and if so, the relevant bit is rewritten, and the updated String is then passed to the next stage of the pipeline for further processing.

One thing to observe here is that in every parsing method, the find operation is performed in a loop. This is so that every occurrence can be updated correctly, and that is also the reason why the Matcher is updated at the end of every iteration of the loop – m = Patterns.LIST_PATTERN.matcher(contents), for instance. If this is not done so, the result will be an infinite loop since the while (m.find()) will never fail!

Here’s how the main parsing method looks like:

public static void parse(final String fileName) {
		try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
			String ext = fileName.substring(fileName.lastIndexOf('.') + 1);

			if (!ext.equalsIgnoreCase("javax")) {
				throw new RuntimeException("input files must have a .javax extension");
			}

			File srcFile = new File(fileName);

			if (!srcFile.exists()) {
				throw new RuntimeException("source file " + fileName + " does not exist!");
			}

			final String basePath = srcFile.getAbsolutePath().substring(0,
					srcFile.getAbsolutePath().lastIndexOf(File.separator));
			final String baseFileName = srcFile.getName().substring(0, srcFile.getName().lastIndexOf('.'));
			final String outputFile = basePath + File.separator + baseFileName
					+ ".java";

			String contents = reader.lines().reduce("", (acc, s) -> {
				acc += s;
				acc += "\n";
				return acc;
			});

			contents = insertImports(contents);
			contents = parseListSyntax(contents);
			contents = parseSetSyntax(contents);
			contents = parseMapSyntax(contents);

			try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile))) {
				writer.write(contents);
			} catch (Exception ex) {
				throw ex;
			}

		} catch (Exception ex) {
			throw new RuntimeException("Error during parsing: " + ex.getLocalizedMessage());
		}
	}

A final subtlety is in the way the importS are handled. For the full source code as well as more explanation on the way the whole tool works, visit the Github page.

Let’s take it for a spin:

Macushla:basic-regex-parser z0ltan$ ls
README.md pom.xml   samples   src       target

Macushla:basic-regex-parser z0ltan$ mvn package 1> /dev/null

Macushla:basic-regex-parser z0ltan$ java -jar target/basic-regex-parser-1.0-SNAPSHOT.jar samples/Default.javax samples/com/z0ltan/test/Package.javax samples/com/z0ltan/regexparser/test/HelloParser.javax

Let’s see how the sample file shown earlier has been rewritten:

Macushla:basic-regex-parser z0ltan$ cat samples/com/z0ltan/regexparser/test/HelloParser.java
package com.z0ltan.regexparser.test;

import java.util.*;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HelloParser {
    public static void main(String[] args) {
        listDemo();

        setDemo();

        mapDemo();
    }

    private static void listDemo() {
        // normal list
        List<String> greetings = Arrays.asList("Hello", "Hallo", "Hola");
        System.out.printf("greetings = %s\n", greetings);

        // custom list syntax
        List<Integer>myNums = new ArrayList<Integer>() {{ add(1);
					 add(3);
					 add(5);
					 add(7);
					 add(9);
					 add(11);
					 }};

        System.out.printf("myNums = %s\n", myNums);

    }

    private static void setDemo() {
        // normal set
        Set<Integer> evensBelow10 = new HashSet<>();
        evensBelow10.addAll(Arrays.asList(2, 4, 6, 8));

        System.out.printf("evensBelow10 = %s\n", evensBelow10);

        // custom set syntax
        Set<String> mammals = new HashSet<String>() {{ add("Cat");
					add("Dog");
					add("Lion");
					add("Beaver");
					add("Raccoon");
					 }};

        System.out.printf("mammals = %s\n", mammals);
    }

    private static void mapDemo() {
        // normal map
        Map<Integer, String> numberToString = new HashMap<>();
        numberToString.put(1,  "One");
        numberToString.put(2, "Two");
        numberToString.put(3,  "Three");

        System.out.printf("numberToString = %s\n",  numberToString);

        // custom map syntax
        Map<Integer, Foo> numToFoo = new HashMap<Integer, Foo>() {{put(1, new Foo(1, "Ein"));
					put(2, new Foo(2, "Zwei"));
					put(3, new Foo(3, "Drei"));
					put(4, new Foo(4, "Vier"));
					put(5, new Foo(5, "Fuenf"));
					 }};

        System.out.printf("numToFoo = %s\n", numToFoo);
    }

    static class Foo {
        private int id;
        private String name;

        public Foo(int id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "Foo { id = " + id + ", name = " + name + " }";
        }
    }
}

Excellent! Now let’s try and run all the parsed files to make sure they are all working as expected:

Macushla:basic-regex-parser z0ltan$ cd samples/

Macushla:samples z0ltan$ javac Default.java
Macushla:samples z0ltan$ java -cp . Default
users = [Cornelius, Marcus, Linnaeus, Gaius, Augustus]

Macushla:samples z0ltan$ javac com/z0ltan/test/Package.java
Macushla:samples z0ltan$ java -cp . com.z0ltan.test.Package
map = {One=1, Two=2, Three=3}
normalMap = {Hello=100, Again=300, World=200}

Macushla:samples z0ltan$ javac com/z0ltan/regexparser/test/HelloParser.java
Macushla:samples z0ltan$ java -cp . com.z0ltan.regexparser.test.HelloParser
greetings = [Hello, Hallo, Hola]
myNums = [1, 3, 5, 7, 9, 11]
evensBelow10 = [2, 4, 6, 8]
mammals = [Cat, Lion, Beaver, Raccoon, Dog]
numberToString = {1=One, 2=Two, 3=Three}
numToFoo = {1=Foo { id = 1, name = Ein }, 2=Foo { id = 2, name = Zwei }, 3=Foo { id = 3, name = Drei }, 4=Foo { id = 4, name = Vier }, 5=Foo { id = 5, name = Fuenf }}

Nice! Of course, this is just a toy tool that does trivial rewriting. If Java had a proper macro system (or even a macro system to begin with!), all this would have been trivial at best without so much redundant code. Nevertheless, it does illustrate the regular expression engines are indeed quite powerful well beyond their original intent.

EDIT: Just in case you are curious as to what the patterns look like, here they are:

public class Patterns {
	public static final Pattern LIST_PATTERN = Pattern
			.compile("List<([a-zA-Z]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9,\" ]+)\\s*\\}");

	public static final Pattern SET_PATTERN = Pattern
			.compile("Set<([a-zA-Z]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9,\" ]+)\\s*\\}");

	public static final Pattern MAP_PATTERN = Pattern
			.compile("Map<([a-zA-Z, ]+)>\\s+([a-zA-Z0-9]+)\\s*=\\s*\\{\\s*([a-zA-Z0-9\"\\s=>();, ]+)\\s*\\}");
	
	public static final Pattern PACKAGE_PATTERN = Pattern
			.compile("package\\s+([a-zA-Z0-9._;]+)");
	
	public static final Pattern IMPORT_PATTERN = Pattern
			.compile("import\\s+([a-zA-Z0-9.;]+)");
	
	public static final Pattern PUBLIC_CLASS_PATTERN = Pattern
			.compile("public class\\s+([a-zA-Z0-9_]+)");
	
	public static final Pattern NORMAL_CLASS_PATTERN = Pattern
			.compile("class\\s+([a-zA-Z0-9_]+)");
}

You may safely ignore the patterns after the first three patterns – they are used to insert the import java.util.*; declaration at the first valid location. Again, for more details, visit the Github location!

Advertisements
A parser using Regular Expressions?