Named capturing groups in Clojure

eval

Tags: #Clojure, #regex

The previous article, "A better regex experience", discussed some ways to make regular expressions more readable, namely commenting and using named capturing groups.

This way, a (naive but already unreadable) url regex like #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^#]*)(?:#(.+))?)?" can be turned into something arguably more descriptive:

#"(?x)
  #; match e.g. 'https://google.com:8000/search#fragment'
  https?://             #; 'http://' or 'https://'
  (?<host>[^:/]+)       #; 'google.com'
  (?:\:                 #; (optional) ':8000'
    (?<port>[^/]+))?      #; '8000'
  (?:                   #; (optional) '/search#fragment'
    (?<path>/[^\#]*)      #; '/search'
    (?:\#                 #; (optional) '#fragment'
      (?<fragment>.+))?     #; 'fragment'
  )?
"

While named group captures make extracting the captured data less fragile (as we're no longer depending on group positions), it's still not intuitive in Clojure to do so.
Let's see how to go about it and explore how to extract all groups at once, e.g. {"host" "google.com" "port" "443"}.
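For contrast, here is a minimal sketch of positional extraction with the unnamed regex from above. The names positional-url-re and positional-parts are made up for this illustration. The destructuring order has to mirror the order of the groups, so adding or removing a group silently shifts every value after it.

```clojure
;; The naive, unnamed url regex from the introduction.
(def positional-url-re
  #"https?://([^:/]+)(?:\:([^/]+))?(?:(/[^#]*)(?:#(.+))?)?")

(defn positional-parts
  "Hypothetical helper: labels the groups of positional-url-re by position."
  [s]
  ;; re-find returns [whole-match group-1 group-2 ...] when groups are present.
  (when-let [[_ host port path fragment] (re-find positional-url-re s)]
    {:host host :port port :path path :fragment fragment}))

(positional-parts "https://google.com:8000/search#fragment")
;; => {:host "google.com", :port "8000", :path "/search", :fragment "fragment"}
```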

Getting captured data

Let's see the simplest way to get hold of some captured data.

(let [matcher (re-matcher url-re ;; the regex above
                          "https://staging.host.org/some/path")] ;; 1.
  (when (.find matcher)         ;; 2.
    (.group matcher "host")))   ;; 3.
;; => "staging.host.org"

  1. Create an instance of java.util.regex.Matcher.
  2. See if it (partially) matches the string.
  3. Retrieve data by group name.

Step 2, besides returning true/false, puts the Matcher instance in the right state to obtain captured data.
Step 3 throws an IllegalArgumentException for an unknown group name (i.e. No group with name <foo>). For a group that captured nothing (e.g. group "fragment") it yields nil. The expression as a whole also yields nil for a string that doesn't match the regex, as .find then returns false and step 3 is never reached.
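The outcomes described above can be sketched as follows. The regex demo-re and the keywords used as markers are invented for this example; only the Matcher calls come from java.util.regex.

```clojure
;; A small stand-in regex: a host plus an optional '#fragment'.
(def demo-re #"(?<host>[a-z.]+)(?:#(?<fragment>.+))?")

;; Asking for a group before any match attempt fails:
;; the Matcher is not yet in the right state.
(def before-find
  (try
    (.group (re-matcher demo-re "example.org") "host")
    (catch IllegalStateException _ :no-match-attempted)))

;; After a successful .find, the three cases behave differently.
(def after-find
  (let [m (re-matcher demo-re "example.org")]
    (.find m)
    {:known     (.group m "host")       ; group that captured something
     :unmatched (.group m "fragment")   ; group that captured nothing
     :unknown   (try
                  (.group m "foo")      ; group that doesn't exist
                  (catch IllegalArgumentException _ :no-such-group))}))

before-find ;; => :no-match-attempted
after-find  ;; => {:known "example.org", :unmatched nil, :unknown :no-such-group}
```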

The code above is very imperative: instantiate an object, get the object in the right state and finally get the data using a hardcoded group name. Just abstracting these steps would merit a helper function.
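Such a helper could look like the sketch below. The name re-find-group is hypothetical, not something from clojure.core, and it deliberately keeps the partial-match semantics of .find.

```clojure
(defn re-find-group
  "Hypothetical helper: returns what group-name captured in the first
  (partial) match of re in s, or nil when there is no match."
  [re s group-name]
  (let [m (re-matcher re s)]
    (when (.find m)
      (.group m ^String group-name))))

(re-find-group #"(?<host>[a-z]+\.[a-z]+)" "see example.org" "host")
;; => "example.org"
(re-find-group #"(?<host>[a-z]+\.[a-z]+)" "no host here" "host")
;; => nil
```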

But, while we're at it, let's go a step further: many other languages support extracting the data of all captured groups at once, e.g. {"host" "google.com" "path" "/search"}.

The most likely reason Clojure doesn't offer this is that Java long lacked a way to get all group names from a regular expression. That long-standing request was recently resolved: from Java 20 onwards (and thus in the LTS release, Java 21) the instance method namedGroups returns the group names of a Pattern (or Matcher) instance:

;; group position by group name
(.namedGroups url-re) ;; => {"path" 3, "port" 2, "fragment" 4, "host" 1}
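As the map returned by namedGroups has no meaningful order, a small sketch like the following can recover the names in the order they appear in the regex. It assumes a JVM of version 20 or newer; date-re and group-names-in-order are names invented for this example.

```clojure
;; Requires Java 20+, where Pattern gained the namedGroups method.
(def date-re #"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})")

(defn group-names-in-order
  "Hypothetical helper: group names of re, sorted by group position."
  [^java.util.regex.Pattern re]
  (->> (.namedGroups re)   ; java.util.Map of name -> position
       (sort-by val)
       (mapv key)))

(group-names-in-order date-re)
;; => ["year" "month" "day"]
```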

re-named-captures

With that essential piece of the puzzle solved, implementing re-named-captures is trivial:

(defn re-named-captures [re s]
  (let [^java.util.regex.Matcher matcher (re-matcher re s)]
    (when (.find matcher)
      (reduce (fn [acc gname]
                (if-let [cap (.group matcher gname)]
                  (assoc acc gname cap)
                  acc))
              {} (keys (.namedGroups re))))))

In action:

(re-named-captures url-re
                   "url: https://google.com:443/path")
;; => {"path" "/path",
;;     "port" "443",
;;     "host" "google.com"}
;; NOTE fragment is absent, not nil

(re-named-captures url-re
                   "not a url")
;; => nil

See the last section on how to start a REPL with the code from this article.

As all allowed group names would make valid Clojure keywords, it would be a good addition to keywordize group names by default.

Just be aware that dashes (and underscores) are not allowed in group names. So in order to get (idiomatic) :kebab-case keywords, you'd need something more than just clojure.core/keyword.
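One way to get there is to write group names in camelCase and convert them. The following is a minimal sketch under that assumption; group->kebab-keyword and keywordize-captures are hypothetical names, and runs of capitals such as "URLPath" are simply lowercased without inserting dashes.

```clojure
(require '[clojure.string :as str])

(defn group->kebab-keyword
  "Hypothetical helper: turns a camelCase group name into a kebab-case keyword."
  [group-name]
  (-> group-name
      ;; put a dash between a lowercase letter or digit and a following capital
      (str/replace #"([a-z0-9])([A-Z])" "$1-$2")
      str/lower-case
      keyword))

(defn keywordize-captures
  "Applies group->kebab-keyword to every key of a map of captures."
  [captures]
  (into {} (map (fn [[k v]] [(group->kebab-keyword k) v])) captures))

(group->kebab-keyword "hostName") ;; => :host-name
(keywordize-captures {"hostName" "google.com" "port" "443"})
;; => {:host-name "google.com", :port "443"}
```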

Babashka & ClojureScript compat

Java 21 landed in Babashka in version 1.3.185 (released 2023-09-28). So use this version as :min-bb-version in your bb.edn to warn users with older runtimes.

For ClojureScript, as with Clojure, we rely on the host platform to extract the capture groups from a regex. Support for this functionality extends further back than in Java: since June 2020, all major browsers have provided support for it.

The code (with keywordizing logic) would look something like this:

(defn re-named-captures
  ([re s] (re-named-captures re s nil))
  ([re s & {:keys [keywordize-keys] :as options}]
   (let [group->key           (if (false? keywordize-keys) identity keyword)
         {:keys [group->key]} (merge {:group->key group->key} options)]
     (if (string? s)
       (when-let [matches (.exec re s)]
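         ;; NOTE goog.object needs to be required in the ns form
         (let [->clj #(when (goog/isObject %)
                        (into {}
                              (for [k     (.keys js/Object %)
                                    :let  [v (goog.object/get % k)]
                                    ;; JS reports unmatched groups as undefined: skip them
                                    :when (some? v)]
                                [(group->key k) v])))]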
           (->clj (.-groups matches))))
       (throw (js/TypeError. "re-named-captures must match against a string."))))))

;; overriding the default group->key
(re-named-captures url-re "https://google.com"
                   :group->key custom-fn)
;;=> {:host "google.com"}

Try it

Hopefully this article gave you some insight into using named capture groups in Clojure more easily.

The code from this article is available as a recipe. Use deps-try to start a REPL with all steps from the recipe preloaded in the REPL-history.

Install deps-try (brew install eval/brew/deps-try) or run using Docker:

docker run -it ghcr.io/eval/deps-try --recipe https://gist.github.com/eval/504ebccdd784413e4f7f03369ffc97de
