Human-readable regular expressions

Regular expressions are a powerful tool for input validation, text extraction, and find-and-replace operations. Most mainstream programming languages support them, either through the standard library or built directly into the language.

In the following example, we need to process strings in this format: two or three uppercase ASCII letters from A to Z, then a hyphen (-), then one to three digits (a number between 0 and 999), then a dot, and finally one of the lowercase letters x, y, or z. Our task is to split the string into three components. For example, if the input is AB-0.z, we want the three substrings AB, 0, and z.

Java

In Java, regular expression support lives in the java.util.regex package. The two classes you use most often are Pattern and Matcher.

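import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Native {

  public static void main(String[] args) {
    // The last input starts with lowercase letters and must be rejected
    String[] inputs = { "AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x" };

    // Three capturing groups: letters, digits, and the trailing x, y, or z
    Pattern pattern = Pattern.compile("^([A-Z]{2,3})-(\\d{1,3})\\.([xyz])$");

    for (String input : inputs) {
      Matcher matcher = pattern.matcher(input);
      // matches() succeeds only if the entire input matches the pattern
      if (matcher.matches()) {
        System.out.printf("Group 1: %s, Group 2: %s, Group 3: %s%n", matcher.group(1),
            matcher.group(2), matcher.group(3));
      }
      else {
        System.out.println(input + " does not match");
      }
    }
  }
}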

Native.java

JavaScript

In JavaScript, you can either use the RegExp constructor new RegExp("..."), which is useful when you build a pattern dynamically, or a regex literal such as /.../.

const regex = /^([A-Z]{2,3})-(\d{1,3})\.([xyz])$/;

const inputs = ["AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x"];

for (const input of inputs) {
  const match = regex.exec(input);
  if (match != null) {
    console.log(`Group 1: ${match[1]}, Group 2: ${match[2]}, Group 3: ${match[3]}`);
  }
  else {
    console.log(`${input} does not match`);
  }
}

native.js

Regular expressions are powerful, but they are also compact and easy to misread. Even this small example uses capturing groups, quantifiers, and an escaped dot. If you do not write regular expressions every day, it often takes a moment to read the expression and verify what it actually matches.
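One way to make each element of the expression easier to verify, using only the Java standard library, is the Pattern.COMMENTS flag. It lets you spread a pattern over several lines and annotate every part. The following is a minimal sketch of that idea; the class name CommentedRegex and the split() helper are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegex {

  // Same expression as in the examples above, one element per line.
  // With Pattern.COMMENTS, whitespace in the pattern is ignored and
  // everything from # to the end of the line is treated as a comment.
  static final Pattern PATTERN = Pattern.compile(
      "^              # start of input\n"
    + "([A-Z]{2,3})   # group 1: two or three uppercase letters\n"
    + "-              # literal hyphen\n"
    + "(\\d{1,3})     # group 2: one to three digits\n"
    + "\\.            # literal dot\n"
    + "([xyz])        # group 3: x, y, or z\n"
    + "$              # end of input\n",
      Pattern.COMMENTS);

  // Hypothetical helper: returns the three captured parts,
  // or null if the input does not match.
  static String[] split(String input) {
    Matcher matcher = PATTERN.matcher(input);
    if (!matcher.matches()) {
      return null;
    }
    return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
  }

  public static void main(String[] args) {
    System.out.println(String.join(", ", split("AB-0.z"))); // AB, 0, z
    System.out.println(split("ab-999.x") == null);          // true
  }
}
```

The compiled pattern behaves like the single-line version. The comments only help the reader, so you still need to know what each element means.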

Fortunately, many tools help you write, test, and visualize regular expressions.

VerbalExpressions

If you want to write regular expressions in a more readable way, VerbalExpressions provides a builder-style API for exactly that purpose.

VerbalExpressions is available for many programming languages. It builds expressions with a fluent API and still relies on each language's built-in regular expression engine under the hood.

Java

If you want to use VerbalExpressions in a Java application, add this dependency to your project.

  <dependencies>
    <dependency>
      <groupId>ru.lanwen.verbalregex</groupId>
      <artifactId>java-verbal-expressions</artifactId>
      <version>1.8</version>
    </dependency>
  </dependencies>

pom.xml

The previous example, rewritten with VerbalExpressions, looks like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import ru.lanwen.verbalregex.VerbalExpression;

public class VerbalRegex {

  public static void main(String[] args) {
    // One line per element of the input format
    VerbalExpression regex = VerbalExpression.regex()
        .startOfLine()
        .capture().range("A", "Z").count(2, 3).endCapture()
        .then("-")
        .capture().digit().count(1, 3).endCapture()
        .then(".")
        .capture().anyOf("xyz").endCapture()
        .endOfLine()
        .build();

    // Print the generated regular expression
    System.out.println(regex.toString());

    String[] inputs = { "AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x" };

    // Hand the generated expression over to the standard regex API
    Pattern pattern = Pattern.compile(regex.toString());

    for (String input : inputs) {
      Matcher matcher = pattern.matcher(input);
      if (matcher.matches()) {
        System.out.printf("Group 1: %s, Group 2: %s, Group 3: %s%n", matcher.group(1),
            matcher.group(2), matcher.group(3));
      }
      else {
        System.out.println(input + " does not match");
      }
    }
  }
}

VerbalRegex.java

You have to write more code, but the result reads almost like an English description of the pattern. VerbalExpressions does not remove the need to understand regular expressions. You still need to know the underlying regex features and their limitations. What the library does well is make the intent of the expression more explicit and self-documenting.

The toString() method returns the generated expression as a string, which you can then pass to Pattern.compile(), just like in the first example. You can still build an invalid expression. For instance, the Java builder does not check that every capture() call is matched by an endCapture() call. One detail you do not have to worry about is escaping literal characters such as the dot in our example. The then() method quotes them automatically.
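For comparison, when you assemble an expression by hand, the standard library offers Pattern.quote() for escaping literal text. The sketch below does not show how VerbalExpressions quotes its arguments internally; it only illustrates what goes wrong when the dot is left unescaped and how quoting avoids it. The class name QuotedLiterals and both constants are hypothetical.

```java
import java.util.regex.Pattern;

class QuotedLiterals {

  // Hand-written variant in which the dot was left unescaped by mistake,
  // so it matches any single character.
  static final Pattern UNESCAPED =
      Pattern.compile("^([A-Z]{2,3})-(\\d{1,3}).([xyz])$");

  // Pattern.quote() wraps the text in \Q...\E so it is matched literally.
  static final Pattern QUOTED = Pattern.compile(
      "^([A-Z]{2,3})" + Pattern.quote("-") + "(\\d{1,3})" + Pattern.quote(".") + "([xyz])$");

  public static void main(String[] args) {
    System.out.println(UNESCAPED.matcher("AB-0?z").matches()); // true, not intended
    System.out.println(QUOTED.matcher("AB-0?z").matches());    // false
    System.out.println(QUOTED.matcher("AB-0.z").matches());    // true
  }
}
```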

JavaScript

In JavaScript, you can install the package with

npm install verbal-expressions

or load it in the browser from jsDelivr with a script tag:

<script src="https://cdn.jsdelivr.net/npm/verbal-expressions@1.0.2/dist/verbalexpressions.min.js"></script>

The code looks very similar to the Java version, with a few differences: a capture group starts with beginCapture() instead of capture(), and the JavaScript package has no count() method. For this example, add("{2,3}") and add("{1,3}") are enough because add() appends raw regex text.

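The JavaScript library returns a standard RegExp object when you call VerEx(). Because of that, the code that uses the resulting expression is almost the same as in the native JavaScript example. The only addition is that the loop resets lastIndex before each exec() call. This matters whenever an expression carries the global flag, because exec() then resumes the search from the position of the previous match.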

const VerEx = require('verbal-expressions');

const regex = VerEx()
                .startOfLine()
                .beginCapture()
                  .range("A", "Z").add("{2,3}")
                .endCapture()
                .then("-")
                .beginCapture()
                  .digit().add("{1,3}")
                .endCapture()
                .then(".")
                .beginCapture()
                  .anyOf("xyz")
                .endCapture()
                .endOfLine();

const inputs = ["AB-0.z", "ABC-99.y", "BB-789.x", "ab-999.x"];

for (const input of inputs) {
  regex.lastIndex = 0;
  const match = regex.exec(input);
  if (match != null) {
    console.log(`Group 1: ${match[1]}, Group 2: ${match[2]}, Group 3: ${match[3]}`);
  }
  else {
    console.log(`${input} does not match`);
  }
}

verbalregex.js

Complex expressions

Another useful feature is extracting common parts of an expression and reusing them.

In this example, the first and last parts share the same structure: three digits from 1 to 9 followed by two lowercase letters.

    VerbalExpression regex = VerbalExpression.regex().startOfLine().range("1", "9")
        .count(3).range("a", "z").count(2).then("-").range("a", "z").count(2).then("-")
        .range("1", "9").count(3).range("a", "z").count(2).endOfLine().build();

    String input = "123xy-ab-311de";
    System.out.println(regex.test(input));

VerbalRegex2.java

Instead of repeating that fragment, we can create a builder for the shared part and then reuse it with the add() method.

    VerbalExpression.Builder part = VerbalExpression.regex().range("1", "9").count(3)
        .range("a", "z").count(2);

    regex = VerbalExpression.regex().startOfLine().add(part).then("-").range("a", "z")
        .count(2).then("-").add(part).endOfLine().build();

    System.out.println(regex.test(input));

VerbalRegex2.java

On this wiki page, you can find a larger example that shows the same refactoring technique in more detail.

You can use the same idea when you write regular expressions by hand. Move repeated fragments into separate string variables and append them to the final expression.

The last example also shows one of the convenience methods of the VerbalExpression class: test(), which checks whether a string matches the expression. The library provides a few more helper methods for common use cases. If your use case is not covered by one of them, you can always extract the pattern with toString() and continue with the standard regex API, as we did in the first VerbalExpressions example.
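The advice about moving repeated fragments into string variables when you write expressions by hand can be sketched with the standard regex API alone. The class name ReusedFragments and the constant names are hypothetical; the expression is a hand-written counterpart to the builder example above.

```java
import java.util.regex.Pattern;

class ReusedFragments {

  // Shared fragment: three digits from 1 to 9 followed by two lowercase letters
  static final String PART = "[1-9]{3}[a-z]{2}";

  // Middle fragment: two lowercase letters
  static final String MIDDLE = "[a-z]{2}";

  // The final expression is assembled from the named fragments
  static final Pattern PATTERN =
      Pattern.compile("^" + PART + "-" + MIDDLE + "-" + PART + "$");

  public static void main(String[] args) {
    System.out.println(PATTERN.matcher("123xy-ab-311de").matches()); // true
    System.out.println(PATTERN.matcher("123xy-ab-301de").matches()); // false, 0 is outside 1-9
  }
}
```

If a fragment contains an alternation such as a|b, wrap it in a non-capturing group (?:...) before concatenating, so that the alternation does not spill into the neighboring fragments.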

You can find the source code for all the examples on GitHub: https://github.com/ralscha/blog/tree/master/verbalregex