Named capturing groups in Clojure
eval
The previous article, "A better regex experience", discussed some ways to make regular expressions more readable, namely commenting and using named capturing groups.
This way, a (naive but already unreadable) URL regex like #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^#]*)(?:#(.+))?)?"
can be turned into something arguably more descriptive:
#"(?x)
  #; match e.g. 'https://google.com:8000/search#fragment'
  https?://                #; 'http://' or 'https://'
  (?<host>[^:/]+)          #; 'google.com'
  (?:\:                    #; (optional) ':8000'
    (?<port>[^/]+))?       #; '8000'
  (?:                      #; (optional) '/search#fragment'
    (?<path>/[^\#]*)       #; '/search'
    (?:\#                  #; (optional) '#fragment'
      (?<fragment>.+))?    #; 'fragment'
  )?
"
While named group captures make extracting the captured data less fragile (as we're no longer depending on group positions), it's still not intuitive in Clojure to do so.
Let's see how to go about it and explore how to extract all groups at once, e.g. {"host" "google.com" "port" "443"}.
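To make the fragility of positional groups concrete, here is a minimal sketch that uses the naive regex from above. The name `naive-url-re` and the sample URL are only illustrative.

```clojure
;; The naive regex from the start of the article, bound to an illustrative name.
(def naive-url-re
  #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^#]*)(?:#(.+))?)?")

;; re-find returns the whole match followed by the groups in positional order.
(re-find naive-url-re "https://google.com:8000/search#fragment")
;; => ["https://google.com:8000/search#fragment" "google.com" "8000" "/search" "fragment"]

;; Destructuring by position works, but it breaks silently
;; as soon as a group is added, removed or moved in the regex.
(let [[_ host port] (re-find naive-url-re "https://google.com:8000/search#fragment")]
  {"host" host "port" port})
;; => {"host" "google.com", "port" "8000"}
```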
Getting captured data
Let's see the simplest way to get hold of some captured data.
(let [matcher (re-matcher url-re ;; the regex above
                          "https://staging.host.org/some/path")] ;; 1.
  (when (.find matcher)         ;; 2.
    (.group matcher "host")))   ;; 3.
;; => "staging.host.org"
1. Create an instance of java.util.regex.Matcher.
2. See if it (partially) matches the string.
3. Retrieve data by group name.
Step 2, besides returning true/false, puts the Matcher instance in the right state to obtain captured data.
Step 3 raises an error for an unknown group name (i.e. No group with name <foo>). For a group that captured nothing, e.g. group "fragment", it yields nil. The expression as a whole also yields nil for a string that doesn't match the regex.
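The outcomes described above can be checked with a small sketch. The compact `url-re` below is an assumed single-line equivalent of the commented regex, and `group-outcome` is a hypothetical helper that only exists to turn the two exception types into keywords.

```clojure
;; Assumed single-line equivalent of the commented regex from the article.
(def url-re
  #"https?://(?<host>[^:/]+)(?::(?<port>[^/]+))?(?:(?<path>/[^#]*)(?:#(?<fragment>.+))?)?")

(defn group-outcome
  "Hypothetical helper: reads group `gname`, optionally skipping the .find step."
  [find-first? gname]
  (let [^java.util.regex.Matcher m
        (re-matcher url-re "https://staging.host.org/some/path")]
    (when find-first? (.find m))
    (try
      (.group m ^String gname)
      ;; thrown when no match was attempted (or the last attempt failed)
      (catch IllegalStateException _ :no-match-state)
      ;; thrown when the regex has no group with this name
      (catch IllegalArgumentException _ :unknown-group))))

(group-outcome true "host")      ;; => "staging.host.org"
(group-outcome true "fragment")  ;; => nil, the group captured nothing
(group-outcome true "foo")       ;; => :unknown-group
(group-outcome false "host")     ;; => :no-match-state, .find was skipped
```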
The code above is very imperative: instantiate an object, get the object in the right state and finally get the data using a hardcoded group name. Just abstracting these steps would merit a helper function.
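Such a helper could look roughly like the sketch below. The name `re-group` is made up for illustration, and the compact `url-re` is an assumed single-line equivalent of the commented regex.

```clojure
;; Assumed single-line equivalent of the commented regex from the article.
(def url-re
  #"https?://(?<host>[^:/]+)(?::(?<port>[^/]+))?(?:(?<path>/[^#]*)(?:#(?<fragment>.+))?)?")

(defn re-group
  "Hypothetical helper: returns what group `gname` captured in the first
  (partial) match of `re` in `s`, or nil when there is no match or the
  group captured nothing."
  [re s gname]
  (let [^java.util.regex.Matcher matcher (re-matcher re s)]
    (when (.find matcher)
      (.group matcher ^String gname))))

(re-group url-re "https://staging.host.org/some/path" "host")
;; => "staging.host.org"
(re-group url-re "https://staging.host.org/some/path" "fragment")
;; => nil
(re-group url-re "not a url" "host")
;; => nil
```

This hides the Matcher bookkeeping, but the caller still has to know and hardcode each group name.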
But, while we're at it, let's go a step further: many other languages support extracting the data of all captured groups at once, e.g. {"host" "google.com" "path" "/search"}.
The most likely reason Clojure doesn't offer this is that Java long lacked a way to get all group names from a regular expression. Lacked, because this long-standing request was recently resolved: from Java 20 onwards (and thus in the LTS release 21) the instance method namedGroups returns the group names of a Pattern (or Matcher) instance:
;; group position by group name
(.namedGroups url-re) ;; => {"path" 3, "port" 2, "fragment" 4, "host" 1}
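Because the returned map does not preserve the order of the groups, it can help to sort the names by position. The sketch below assumes a JVM of version 20 or newer; `date-re` and `group-names-in-order` are illustrative names, not part of any library.

```clojure
;; Illustrative regex with two named groups.
(def date-re #"(?<year>\d{4})-(?<month>\d{2})")

(defn group-names-in-order
  "Hypothetical helper: group names of `re`, sorted by group position.
  Requires Java 20+ for Pattern.namedGroups."
  [^java.util.regex.Pattern re]
  (->> (.namedGroups re) ;; java.util.Map of name -> position
       (sort-by val)     ;; order the entries by position
       (mapv key)))      ;; keep only the names

(group-names-in-order date-re)
;; => ["year" "month"]
```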
re-named-captures
With that essential piece of the puzzle solved, implementing re-named-captures is trivial:
(defn re-named-captures [re s]
  (let [^java.util.regex.Matcher matcher (re-matcher re s)]
    (when (.find matcher)
      (reduce (fn [acc gname]
                (if-let [cap (.group matcher gname)]
                  (assoc acc gname cap)
                  acc))
              {} (keys (.namedGroups re))))))
In action:
(re-named-captures url-re
                   "url: https://google.com:443/path")
;; => {"path" "/path",
;;     "port" "443",
;;     "host" "google.com"}
;; NOTE fragment is absent, not nil
(re-named-captures url-re
                   "not a url")
;; => nil
See the last paragraph on how to start a REPL with the code from this article.
As all allowed group names would make valid Clojure keywords, it would be a good addition to keywordize group names by default.
Just be aware that dashes (and underscores) are not allowed in group names. So to get (idiomatic) :kebab-case keywords, you'd need something more than just clojure.core/keyword.
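One possible approach, assuming the group names are written in camelCase, is to split on case boundaries before keywordizing. The helper name `camel->kebab-keyword` is hypothetical, and this is a sketch rather than a complete case converter.

```clojure
(require '[clojure.string :as str])

(defn camel->kebab-keyword
  "Hypothetical helper: turns a camelCase group name into a kebab-case keyword."
  [gname]
  (-> gname
      ;; insert a dash between a lowercase letter or digit and an uppercase letter
      (str/replace #"([a-z0-9])([A-Z])" "$1-$2")
      str/lower-case
      keyword))

(camel->kebab-keyword "pathPrefix") ;; => :path-prefix
(camel->kebab-keyword "host")       ;; => :host
```

A function like this could then be applied to each group name where the map of captures is built.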
Babashka & ClojureScript compat
Java 21 landed in Babashka in version 1.3.85 (released 2023-09-28). So use this version as :min-bb-version in your bb.edn to warn users with older runtimes.
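A minimal bb.edn with just that setting might look like this. Any other keys your project needs are left out here.

```clojure
;; bb.edn
{:min-bb-version "1.3.85"}
```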
For ClojureScript, as with Clojure, we rely on the host platform to extract the capture groups from a regex. Support for this functionality extends further back than in Java: since June 2020, all major browsers have provided support for it.
The code (with keywordizing logic) would look something like this:
(defn re-named-captures
  ([re s] (re-named-captures re s nil))
  ([re s & {:keys [keywordize-keys] :as options}]
   (let [group->key (if (false? keywordize-keys) identity keyword)
         {:keys [group->key]} (merge {:group->key group->key} options)]
     (if (string? s)
       (when-let [matches (.exec re s)]
         (let [->clj #(when (goog/isObject %)
                        (into {}
                              (for [k (.keys js/Object %)]
                                [(group->key k) (goog.object/get % k)])))]
           (->clj (.-groups matches))))
       (throw (js/TypeError. "re-named-captures must match against a string."))))))
;; overriding the default group->key
(re-named-captures url-re "https://google.com"
                   :group->key custom-fn)
;;=> {:host "google.com"}
Try it
Hopefully this article gave you some insights on how to more easily use named capture groups in Clojure.
The code from this article is available as a recipe. Use deps-try to start a REPL with all steps from the recipe preloaded in the REPL-history.
Install deps-try (brew install eval/brew/deps-try) or run using Docker:
docker run -it ghcr.io/eval/deps-try --recipe https://gist.github.com/eval/504ebccdd784413e4f7f03369ffc97de