19 MARCH 2022 / CODE

Regular expressions for law and court-decision recognition

In legal work, we cite federal court decisions, other decisions and legal norms in a consistent way. What lawyers recognise quickly and easily is harder for machines to deduce. One way for machines to obtain this information is presented here: so-called regular expressions (regex for short).

A regular expression (regex or RegExp) is a sequence of characters that serves as a pattern for recognising certain strings. Regular expressions can be used in various programming languages, most often to search for or replace strings. Regex is not a new technology and is used in a wide variety of situations. Certain search engines even support regex when searching the web.

But why is regex also interesting for legal tech work? A string that lawyers recognise quickly is, for example, the reference to a legal norm: Art. 13 Abs. 1 BV. The structure is always the same: the "building blocks" Art. and Abs. stay fixed and only the numbers change. In federal law, a reference to an article thus always begins with Art., followed by a number (and possibly a suffix such as "a" or "bis"). How a legal norm is referred to can therefore be summarised quite quickly. Cantonal law follows an almost identical form. A machine could thus be "taught" this system: if Art. is followed by a number and the reference ends with the abbreviation of a law, it is a reference to a legal norm. This is exactly what regex can be used for, so that a machine can also deduce from continuous text which norms or decisions are referred to.
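The rule described above ("Art., a number, optionally a paragraph, then a law abbreviation") could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the pattern from my scripts: the names NORM_PATTERN and findNorms are made up for this example, and the law abbreviation is assumed to be any word starting with a capital letter.

```javascript
// Hypothetical sketch: recognise references such as "Art. 13 Abs. 1 BV".
// Assumptions: only the suffixes bis/ter/quater or a single lowercase letter,
// at most one "Abs." part, and any capitalised word counts as a law abbreviation.
const NORM_PATTERN =
  /Art\.\s+(\d+)\s*(bis|ter|quater|[a-z])?\s+(?:Abs\.\s+(\d+)\s+)?([A-Z][A-Za-z]*)/g;

function findNorms(text) {
  // matchAll returns one match object per hit; m[1]..m[4] are the groups.
  return Array.from(text.matchAll(NORM_PATTERN), (m) => ({
    reference: m[0],
    article: m[1],
    suffix: m[2] ?? null,     // e.g. "a" or "bis", null if absent
    paragraph: m[3] ?? null,  // the number after "Abs.", null if absent
    law: m[4],
  }));
}

console.log(findNorms("According to Art. 13 Abs. 1 BV and Art. 8a ZGB, ..."));
// → two hits: article "13", paragraph "1", law "BV"
//   and article "8", suffix "a", law "ZGB"
```

Because any capitalised word is accepted as the law, a practical version would probably check the last group against a list of known abbreviations.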

How is a regex defined?

As mentioned above, regex can be used in different programming languages, each with its own variant (flavour). As a web developer, I use the ECMAScript (JavaScript) flavour.
I do not want to cover all the possibilities of regex here, but rather work through a simple example. For detailed documentation of everything regex can do, see the MDN docs (linked below). Federal court decisions are perhaps an even more narrowly defined example than the article recognition presented above, because legal norms require variations such as numbers, letters, etc. to be taken into account. In the following, for the sake of simplicity, the shortest regex is used as the example, namely the one for unpublished federal court decisions (case number). At this point I would like to refer to my open-source scripts on GitHub, which contain regex for norms and federal court decisions in DE, FR and IT: iusable_Regex on GitHub.
The case number of a federal court decision is quite simple: 5A_1000/2020

Example      Feature to be recognised
5            Department number
A            Letter for the procedure
1000/2020    Consecutive numbering

I would like to take a closer look at the regex used for this purpose and at its individual components.

The parentheses form so-called groups, which are needed when partial information is to be extracted from the element found. Without groups, the regex would simply return the whole string 5A_1000/2020 as the result. With groups, each of the features listed in the table above can be identified separately. The slashes at the beginning and end mark the start and end of the regex. The "g" after the final slash stands for a global search (i.e. the search continues after a hit). For a detailed description of all available character classes, I recommend the RegExp pages of Mozilla's MDN Web Docs.
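To make the role of the groups and the "g" flag more tangible, here is a minimal sketch of how such a case-number regex might look in JavaScript. The pattern is an assumption reconstructed from the three features in the table, not necessarily the exact regex from my scripts. The names CASE_NUMBER and findCaseNumbers are made up for this example, and a single-digit department is assumed.

```javascript
// Assumed pattern with one group per feature from the table:
//   (\d)          department number, e.g. "5"
//   ([A-Z])       letter for the procedure, e.g. "A"
//   (\d+\/\d{4})  consecutive numbering, e.g. "1000/2020"
// \b keeps the match from starting or ending in the middle of a longer token.
const CASE_NUMBER = /\b(\d)([A-Z])_(\d+\/\d{4})\b/g;

function findCaseNumbers(text) {
  // matchAll needs the "g" flag and yields every hit, not just the first.
  return Array.from(text.matchAll(CASE_NUMBER), (m) => ({
    caseNumber: m[0],  // the whole hit, what a regex without groups would give
    department: m[1],
    procedure: m[2],
    numbering: m[3],
  }));
}

console.log(findCaseNumbers("See judgments 5A_1000/2020 and 6B_12/2019."));
// → two hits: { caseNumber: "5A_1000/2020", department: "5", procedure: "A", numbering: "1000/2020" }
//   and       { caseNumber: "6B_12/2019",   department: "6", procedure: "B", numbering: "12/2019" }
```

Without the "g" flag, String.prototype.matchAll throws a TypeError, and methods such as match would stop after the first hit.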

What are the limits of Regex?

For a human being, an enumeration is easy to recognise. For a machine it is not: to recognise with a regex that the string "Art. 1 and 2 StGB" in fact refers to articles 1 and 2 of the StGB, the enumeration itself would have to be built into the regex. However, an enumeration can be of any length and may use different connecting words, paragraphs, numbers, etc. Formulating a regex for this would be extremely time-consuming, and I doubt whether all possibilities could really be covered.
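The problem can be illustrated with two deliberately simplified patterns. Both are hypothetical sketches (the names SINGLE_NORM, ENUMERATED_NORM and the two helper functions are invented for this example). The first knows only a single article number. The second additionally accepts commas and the connecting words "and" and "und", which shows that every further variation has to be added to the pattern by hand.

```javascript
// Simplified pattern: "Art.", one number, then a capitalised law abbreviation.
const SINGLE_NORM = /Art\.\s+(\d+)\s+([A-Z][A-Za-z]*)/g;

// Extended pattern: the number may be followed by further numbers that are
// joined by a comma, "and" or "und". Other connectors are not covered.
const ENUMERATED_NORM =
  /Art\.\s+(\d+(?:\s*(?:,|and|und)\s*\d+)*)\s+([A-Z][A-Za-z]*)/g;

function findSingleNorms(text) {
  return Array.from(text.matchAll(SINGLE_NORM), (m) => `${m[1]} ${m[2]}`);
}

function findEnumeratedNorms(text) {
  return Array.from(text.matchAll(ENUMERATED_NORM), (m) => ({
    articles: m[1].match(/\d+/g), // pull the individual numbers out of the list
    law: m[2],
  }));
}

console.log(findSingleNorms("Art. 1 StGB"));           // → [ "1 StGB" ]
console.log(findSingleNorms("Art. 1 and 2 StGB"));     // → [] (enumeration missed)
console.log(findEnumeratedNorms("Art. 1 and 2 StGB")); // → articles ["1", "2"], law "StGB"
console.log(findEnumeratedNorms("Art. 1 to 3 StGB"));  // → [] ("to" is not covered)
```

Each additional connector, range or paragraph reference would need another branch in the pattern, which is where this approach quickly becomes hard to maintain.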

How could these limits be circumvented?

One conceivable approach would be to use artificial intelligence, or more precisely a so-called NER (named-entity recognition) model. Such models can be used, for example, to recognise people, company names or other information such as towns in continuous text. I will pursue this approach in the future and hope to achieve even better results than with the current method. Of course, there will be a follow-up blog post 😉
NER for German legal documents has been carried out in Germany by E. Leitner, G. Rehm and J. Moreno-Schneider in their paper "Fine-Grained Named Entity Recognition in Legal Documents", DOI: https://doi.org/10.1007/978-3-030-33220-4_20