A better regex experience

Tags: #regex

If you use regular expressions on a, ehm, regular basis, then you are probably familiar with the fact that they typically result in code with a short half-life in terms of understandability. Yes, that carefully crafted string that took numerous rounds of trial and error and that, after balancing the greediness of one group against the laziness of the groups that follow, finally delivered the right result. Chances are that exact thing will look totally opaque in a week or two (probably sooner for your teammates 😬).

Add length as a factor and we can draw a nice XKCD-style graph:

[Figure: XKCD-style graph of regex understandability against time and length]

In this article I'll describe two ways to make your regular expressions clearer and thus more maintainable.

comments

Definitely the best way to make your regular expressions easier on the eyes is to insert comments. This is simply done by switching on the comment flag, (?x), which many regex engines support (including Java's, which Clojure uses).
Say we need a regex to match a simple URL like "https://google.com/search#fragment"; the difference might look something like this:

;; without comments
#"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^\#]*)(?:\#(.+))?)?"

;; commented
#"(?x) # match e.g. 'https://google.com:8000/search#fragment'
  https?://      # 'http://' or 'https://'
  ([^:/]+)       # capture 'google.com'
  (?:\:          # (optional) ':8000'
    ([^/]+))?      # capture '8000'
  (?:            # (optional) '/search#fragment'
    (/[^\#]*)      # capture '/search'
    (?:\#          # (optional) '#fragment'
      (.+))?         # capture 'fragment'
  )?
"

The comment flag instructs the regex engine to ignore any whitespace in the pattern (leading, trailing and in between!), including newlines, as well as anything from a # to the end of the line. This gives us plenty of options to make clear what's going on where: spread the groups out over multiple lines, use indentation and add comments.
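As a quick sanity check that spreading a pattern out does not change what it matches, here is a minimal sketch comparing the one-liner with a commented variant. The names url-re-plain, url-re-commented and sample are made up for this illustration, and the commented layout is slightly condensed compared to the one shown earlier.

```clojure
;; Illustrative names only: url-re-plain, url-re-commented and sample
;; are assumptions for this sketch, not part of any library.
(def url-re-plain
  #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^\#]*)(?:\#(.+))?)?")

(def url-re-commented
  #"(?x)
  https?://        # 'http://' or 'https://'
  ([^:/]+)         # capture the host
  (?:\:([^/]+))?   # (optional) capture the port
  (?:(/[^\#]*)     # (optional) capture the path
    (?:\#(.+))?    # (optional) capture the fragment
  )?")

(def sample "https://google.com:8000/search#fragment")

(re-find url-re-plain sample)
;; => ["https://google.com:8000/search#fragment" "google.com" "8000" "/search" "fragment"]

;; The commented pattern yields exactly the same match and groups.
(= (re-find url-re-plain sample)
   (re-find url-re-commented sample))
;; => true
```

Comparing the two results like this is a cheap regression test to keep around whenever an existing one-liner gets reformatted.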

Just be aware that when you want to match a literal # or whitespace, you need to escape it, as is done with \# in the example above.

Compare:

;; without comments
;; match specific tags separated by at least one whitespace
(re-find #"#foo +#bar"
  "Tags: #foo #bar") ;; => "#foo #bar"
(re-find #"#foo +#bar"
  "Tags: #foo#bar") ;; => nil

;; comments enabled
;; now it matches anything as it's one big comment
(re-find #"(?x)#foo +#bar"
  "Tags: #foo #bar") ;; => ""
;; escaping '#' is not enough though:
(re-find #"(?x)\#foo +\#bar"
  "Tags: #foo #bar") ;; => nil
;; ...because it now matches
(re-find #"(?x)\#foo +\#bar"
  "Tags: #foo#bar") ;; => "#foo#bar"
;; ...and
(re-find #"(?x)\#foo +\#bar"
  "Tags: #foooooooo#bar") ;; => "#foooooooo#bar"
;; it's as if we wrote #"#foo+#bar"

;; fix: escaping both '#' and whitespace
(re-find #"(?x)\#foo\ +\#bar"
  "Tags: #foo #bar") ;; => "#foo #bar"

Enabling comments is a big help in solving the problem that regular expressions are mostly very dense one-liners.

named captures

Another thing that makes regular expressions more readable is the use of so-called 'named capturing groups'.
This is essentially a way to self-document the capture groups. So instead of ([^:/]+) to capture the host, you'd use (?<host>[^:/]+).
Besides being self-documenting, named groups have the benefit of not relying on the position of the group when using backreferences or when extracting the captured data.

Consider the following example, where we only want URLs that contain the same env (such as staging or production) in both the subdomain and the path:

;; should match "https://staging.host.org/staging/some/path"
;; but not "https://staging.host.org/other-env/some/path"
;; without named captures: reference by position
#"^https://([^.]+)\.host\.org(/\1/.+)"

;; named captures: reference by name
#"^https://(?<env>[^.]+)\.host\.org(/\k<env>/.+)" 
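To see both variants at work, here is a small sketch that runs them against the two URLs from the comments above. The names by-position, by-name, matching and non-matching are made up for this illustration.

```clojure
;; Illustrative names only; none of these come from a library.
(def by-position #"^https://([^.]+)\.host\.org(/\1/.+)")
(def by-name     #"^https://(?<env>[^.]+)\.host\.org(/\k<env>/.+)")

(def matching     "https://staging.host.org/staging/some/path")
(def non-matching "https://staging.host.org/other-env/some/path")

;; Both patterns accept the URL whose subdomain and first path segment agree.
(re-find by-position matching)
;; => ["https://staging.host.org/staging/some/path" "staging" "/staging/some/path"]
(= (re-find by-position matching) (re-find by-name matching))
;; => true

;; Both reject the URL where the two differ.
(re-find by-position non-matching) ;; => nil
(re-find by-name non-matching)     ;; => nil
```

Note that a named group still gets a number as well, which is why re-find returns the same vector for both patterns.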

Retrieving the captured data is no longer positional either:

;; positional
(get (re-find #"^https://([^.]+)\.host\.org(/\1/.+)"
              "https://staging.host.org/staging/some/path") 1)
;; => "staging"

;; named captures
(let [matcher (re-matcher #"^https://(?<env>[^.]+)\.host\.org(/\k<env>/.+)"
                          "https://staging.host.org/staging/some/path")]
  (when (.find matcher)
    (.group matcher "env")))
;; => "staging"

Named captures do away with the magic numbers that are otherwise necessary both for backreferencing and retrieval.

Combining comments and named capturing groups for our URL matching regex:

#"(?x) # match e.g. 'https://google.com:8000/search#fragment'
  https?://             # 'http://' or 'https://'
  (?<host>[^:/]+)       # 'google.com'
  (?:\:                 # (optional) ':8000'
    (?<port>[^/]+))?      # '8000'
  (?:                   # (optional) '/search#fragment'
    (?<path>/[^\#]*)      # '/search'
    (?:\#                 # (optional) '#fragment'
      (?<fragment>.+))?     # 'fragment'
  )?
"
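To get the named parts out as data, something like the following sketch could be used. Both url-re and named-groups are hypothetical names introduced here, not existing library functions; the helper just wraps the same Matcher .find and .group calls shown earlier, and the pattern is the combined regex in a slightly condensed layout.

```clojure
;; url-re is an assumed name for the combined pattern.
(def url-re
  #"(?x)
  https?://                  # 'http://' or 'https://'
  (?<host>[^:/]+)            # 'google.com'
  (?:\:(?<port>[^/]+))?      # (optional) '8000'
  (?:(?<path>/[^\#]*)        # (optional) '/search'
    (?:\#(?<fragment>.+))?   # (optional) 'fragment'
  )?")

(defn named-groups
  "Hypothetical helper: returns a map from each of group-names (as keywords)
  to the text that group captured in s, or nil when re does not match.
  Groups that did not take part in the match map to nil."
  [re group-names s]
  (let [matcher (re-matcher re s)]
    (when (.find matcher)
      (into {}
            (map (fn [group-name]
                   [(keyword group-name)
                    (.group matcher ^String group-name)]))
            group-names))))

(named-groups url-re ["host" "port" "path" "fragment"]
              "https://google.com:8000/search#fragment")
;; => {:host "google.com", :port "8000", :path "/search", :fragment "fragment"}
```

The group names are passed in explicitly to keep the sketch simple; the calling code then reads (:host result) instead of relying on a position in a vector.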

Conclusion

By applying two simple changes we turned our naive (but nonetheless already pretty unreadable) URL regex #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^#]*)(?:#(.+))?)?" into something (hopefully) more readable and thus more maintainable. Happy regex-ing!
