When talking about languages, we sometimes like to show that they satisfy some conditional statement. When that conditional statement makes an existential claim in its consequent, we sometimes like to take the closure of a language under that conditional; even if in our language the consequent fails to be realised, we can just add in the words which would realise it and in doing so get a larger language which does satisfy the given conditional.
For instance, we might have the conditional that if the word $w_1 w_2$ is in a language $L$, then the word $w_2 w_1$ is in that language $L$ as well. Given any language, even one that doesn’t satisfy such a conditional, we can close that language under the conditional by adding in any words required by the existential on the right. For instance, to close a language under the above conditional, we take any words of the language that can be partitioned into $w_1w_2$ and, for each of these words, add to the language the word $w_2 w_1$.
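To make this concrete, here’s a small Python sketch (treating languages as finite sets of strings; the function name is an illustrative assumption of mine):

```python
def rotation_closure(language):
    """Close a finite language under the conditional: if w1w2 is in L,
    then w2w1 is in L.  We repeat until no new word appears; this
    terminates because rotating a word never changes its length."""
    closed = set(language)
    changed = True
    while changed:
        changed = False
        for word in list(closed):
            # every split of word into w1 + w2 demands that w2 + w1 be present
            for i in range(len(word) + 1):
                rotated = word[i:] + word[:i]
                if rotated not in closed:
                    closed.add(rotated)
                    changed = True
    return closed
```

For instance, closing the language containing just $\word{abc}$ this way adds $\word{bca}$ and $\word{cab}$.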
Closures are particularly interesting when considering properties preserved under closure. Given a regular language, for instance, we’re interested in whether the regularity of that language is preserved under closure or not. Here, then, I’ll focus on this problem: is the regularity of a language preserved under a given closure? Let’s look at some examples and see what we get.
Closure under relations
One way to find conditionals to close languages under is to define some relation $\preceq$ and impose the existence of words that satisfy that relation. For instance, we might formulate the conditional that, for each word $w$ in a given language, there exists some word $v$ such that $w \preceq v$ with $v$ also in the given language. Here we’ll look at some general forms that these conditionals can take, and look at whether the regular languages are closed under these operations.
Upwards directedness
Recently I read a paper [1] where the relation of upwards directedness was discussed, where we say a language $L$ is $\preceq$-upwards directed just when, for any two words $w$ and $v$ in $L$, there exists some word $u$ in $L$ such that $w \preceq u$ and $v \preceq u$. There are a few sensible choices of $\preceq$ here, and we may say that $w \preceq v$ holds just when:
- $\length{w} \leq \length{v}$, so when $w$ is no longer than $v$,
- $w$ has no more occurrences of the letter $a$ than $v$ does, for some fixed letter $a$ of the given alphabet, or
- $w$ is a contiguous subword of $v$, so appears somewhere as a substring of $v$.
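As a quick illustration (treating words as Python strings; these helper names are my own), the three candidate relations can be written as simple predicates:

```python
def shorter_eq(w, v):
    # the length relation: |w| <= |v|
    return len(w) <= len(v)

def fewer_as(w, v, a="a"):
    # w has no more occurrences of the letter a than v does
    return w.count(a) <= v.count(a)

def contiguous_subword(w, v):
    # w appears somewhere as a substring of v
    return w in v
```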
So let’s see what sorts of relations we can use to close a language under upwards directedness.
Length as a relation
A fairly canonical relation over words is the length relation: we say $w \leq v$ just when $v$ has at least as many characters as $w$. So we can think of closing a language under $\leq$-upwards directedness.
Of course, if our language $L$ is infinite, then it’s already $\leq$-upwards directed: there are infinitely many words in $L$, and since (over a finite alphabet) there are only finitely many words of any given length, the lengths of words in $L$ must be unbounded. But then for any two words $w$ and $v$ in $L$, there must be some word in $L$ at least as long as both. So any infinite language is $\leq$-upwards directed.
Finite languages are a bit more interesting, but still easy enough to close under $\leq$-upwards directedness. In particular, if a language $L$ is finite, then it must have a maximum word length; that is, some number $n$ such that no word in $L$ is longer than $n$. That’s because, if $L$ didn’t have a maximum word length, then for any word $w$ there’d be a word $v$ longer than it. But then $L$ would be infinite, which it’s not.
So all we need to do to close $L$ under $\leq$-upwards directedness is just add to it the word $a^{n+1}$ (for any choice of letter $a$ from the alphabet), and we’re done. Every other word in $L$ is shorter than $a^{n+1}$, and so for any choice of words $w$ and $v$ in $L$ we have $w \leq a^{n+1}$ and $v \leq a^{n+1}$. So we’re done.
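A sketch of this construction in Python (assuming a finite language over an alphabet containing the letter $a$; the function name is illustrative):

```python
def close_under_length_directedness(language, a="a"):
    """Close a finite language under length-upwards directedness by
    adding one word strictly longer than every word present."""
    n = max((len(w) for w in language), default=0)  # maximum word length
    return set(language) | {a * (n + 1)}
```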
Finally, in either case, if $L$ is regular then the $\leq$-upwards directed closure of $L$ is regular. In the former case that’s trivially so, because we don’t change $L$ at all. In the latter case, we just take the union of our automaton that recognises $L$ with a small chain of states recognising exactly the word $a^{n+1}$. So regularity is preserved by $\leq$-upwards closures.
Numbers of symbols as a relation
We can also consider the number of specific symbols a word has as something that can be compared to other words. So let’s consider the relation $\leq_a$ and $\leq_a$-upwards directedness, where $w \leq_a v$ just when $v$ has at least as many occurrences of the letter $a$ in it as $w$ does.
For finite languages it’s easy to close $L$ under $\leq_a$-upwards directedness. In fact we can just apply the same construction as we did when using length as a relation, since if there are at most $n$ letter $a$s in any word of a language $L$, then the word $a^{n+1}$ will have more $a$s than any of those words. So we add that to our language, and we’ve closed it under $\leq_a$-upwards directedness.
If our language $L$ is infinite, we then consider whether or not there’s some finite $n$ such that every word in $L$ has at most $n$ occurrences of an $a$ in it. If there is such a finite $n$, we can deploy the construction we used above, again. If not, then by similar reasoning to the case where we just looked at length, there’s no need to close $L$ under $\leq_a$-upwards directedness, since it’s already $\leq_a$-upwards directed.
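The bounded case can be sketched the same way as before (again treating languages as finite sets of strings; the function name is my own):

```python
def close_under_count_directedness(language, a="a"):
    """Close a language whose words have boundedly many a's under
    count-upwards directedness, by adding a word with strictly
    more a's than any word present."""
    n = max((w.count(a) for w in language), default=0)  # max number of a's
    return set(language) | {a * (n + 1)}
```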
Contiguous subwords as a relation
Here’s a more interesting relation. Let’s say that $w \preceq v$ just when $w$ is a contiguous subword of $v$. Then we have, for instance, that $\word{slat} \preceq \word{translate}$, or that $\word{qua} \preceq \word{squash}$, but that for instance $\word{cage} \not\preceq \word{cabbage}$ (since, in that case, we have a non-contiguous subword).
Can we close the regular languages by directing them under this relation? Yes. Given a regular language $L$, consider the language $\set{w_1 w_2 \cdots w_k \mid k \geq 1, w_i \in L}$ of nonempty concatenations of words of $L$. This language contains $L$, and it’s $\preceq$-upwards directed: for any two words $w$ and $v$ in it, their concatenation $wv$ is also in it, and both $w$ and $v$ appear as contiguous subwords of $wv$. It’s also regular: we take our automaton recognising $L$, and when running a word on it we nondeterministically guess when we’ve finished reading one of the concatenated words, jump back to the start state, and continue reading the next. So every regular language extends to a regular language that is closed under $\preceq$-upwards directedness.
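We can test this concatenation idea empirically. Here’s a brute-force sketch (my own helper, bounding word length so the generated set stays finite) that builds a fragment of the concatenation closure:

```python
def kleene_plus_up_to(base, max_len):
    """All nonempty concatenations of words from base, up to max_len
    characters: a finite fragment of the concatenation closure."""
    words = set()
    frontier = {w for w in base if len(w) <= max_len}
    while frontier:
        words |= frontier
        # extend each known concatenation by one more base word
        frontier = {u + w for u in frontier for w in base
                    if len(u + w) <= max_len} - words
    return words
```

For any two words $x$ and $y$ of the closure, the word $xy$ is again in the closure and contains both as contiguous subwords.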
Of course, we might care about smaller closures. The above construction is, in some sense, wasteful; for instance, if we have the words $\word{ba}$ and $\word{ab}$, then we don’t need to introduce the word $\word{baab}$, but rather just the word $\word{bab}$, which overlaps the two. Even allowing such overlaps, we stay within the regular languages: given an automaton that recognises a regular language $L$, we can build a nondeterministic automaton that guesses, at points along the run, where each of the two subwords starts and ends, running a copy of the original automaton on each of these (possibly overlapping) substrings. We accept just when both of these sub-runs accept, giving a regular language in which every word contains two words of $L$ as contiguous subwords, as we wanted.
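For just two words, a minimal common superword can be found by brute force over overlaps; here’s a sketch (the helper names are mine, relying on the fact that for two words the best merge in one of the two orders is always optimal):

```python
def shortest_superwords(w, v):
    """All shortest words containing both w and v as contiguous subwords."""
    def merge(x, y):
        # if y already sits inside x, x itself suffices
        if y in x:
            return x
        # otherwise overlap the longest possible suffix of x with a prefix of y
        for k in range(min(len(x), len(y)), -1, -1):
            if x.endswith(y[:k]):
                return x + y[k:]
    candidates = {merge(w, v), merge(v, w)}
    best = min(len(c) for c in candidates)
    return {c for c in candidates if len(c) == best}
```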
Closure under operations on words
Another condition that can be imposed on a language is a condition on the result of some operation being in the language. For instance, we may require that if $w$ and $v$ are in a language, then $w \concat v$ is in the language as well, where $w \concat v$ is the concatenation of $w$ and $v$. More generally, given an operator $\bigstar$, we can close a language under that operator by taking any two words of the language $w$ and $v$, and adding to the language the word $w \bigstar v$.
We can generalise this even further; given an $n$-ary operator and $n$ words of a language, we can close the language under the $n$-ary operator by taking our $n$ words and applying the operator to them, and adding the result back to the language.
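One step of such a closure is easy to sketch in Python (a full closure iterates this to a fixpoint, which needn’t terminate for operators, like concatenation, that keep producing longer words):

```python
def close_once(language, op):
    """Add op(w, v) for every pair of words already in the language."""
    return set(language) | {op(w, v) for w in language for v in language}
```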
Some operators we know the regular languages are closed under are concatenation and the Kleene star. Let’s look at some more operators here, and see whether or not regularity is preserved under them.
Unary operators
First let’s look at unary operators. A unary operator takes just one word and returns just one word; for instance, the reversal operator $\reverse{\cdot}$ takes a word and reverses it. Let’s look at that and some other examples, and see whether they preserve regularity.
Reversal
Given a word $w$, its reversal $w^R$ is just $w$ read backwards. For instance, we have $\word{time}^R = \word{emit}$, or $\word{panama}^R = \word{amanap}$.
We’d expect the regular languages to be closed under reversals, and indeed they are. Given a regular language $L$, we take the automaton recognising it, flip all its transitions, nondeterministically choose our new starting state from among the old accepting states, and set the new accepting state to be the old starting state. A word is accepted by this new reversed automaton just when its reversal is accepted by the old automaton, by running the corresponding run backwards, and so we accept exactly the reversals of words in the language $L$. So the reversed language $L^R$ is also regular. That’s nice.
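Here’s a sketch of that reversal construction, with an NFA represented as a dictionary from (state, symbol) pairs to sets of successor states (this representation, which allows a set of start states, is an assumption of mine):

```python
def reverse_nfa(transitions, start, accepts):
    """Flip every transition; the old accepting states become the start
    states, and the old start state becomes the sole accepting state."""
    rev = {}
    for (p, sym), qs in transitions.items():
        for q in qs:
            rev.setdefault((q, sym), set()).add(p)
    return rev, set(accepts), {start}

def nfa_accepts(transitions, starts, accepts, word):
    """Run the NFA by tracking the set of currently reachable states."""
    current = set(starts)
    for sym in word:
        current = set().union(*(transitions.get((q, sym), set()) for q in current))
    return bool(current & accepts)
```

For example, reversing an automaton for the single word $\word{ab}$ yields one accepting exactly $\word{ba}$.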
Weak Palindromification
Given a word $w$, we say its weak palindromification $\weakpal{w}$ is just $w\reverse{w}$. That gives us a palindrome, but it may not be the shortest palindrome which contains the word $w$ as a prefix (we’ll think about those in the next section!). For instance, the weak palindromification of $\word{rats}$ is $\word{ratsstar}$, and the weak palindromification of $\word{racecar}$ is $\word{racecarracecar}$.
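In code, weak palindromification is a one-liner (treating words as Python strings):

```python
def weak_pal(w):
    # w followed by its reversal is always a palindrome with w as a prefix
    return w + w[::-1]
```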
We wouldn’t think that the regular languages are, in general, closed under weak palindromification, since intuitively any finite automaton recognising such a language would need to remember the first half of the word while reading the second half. And indeed this is the case; a regular language whose weak palindromification isn’t regular is $\set{a^nb \mid n \in \mathbb{N}}$, whose weak palindromification is $\set{a^n b b a^n \mid n \in \mathbb{N}}$. We just pick a large enough $n$, observe that the finite automaton has to loop somewhere in the first block of $a$’s, and pump that loop to accept words with mismatched blocks, which it shouldn’t be accepting. So the regular languages are not closed under weak palindromification.
Strong Palindromification
Given a word $w$, we say that its strong palindromification $\strongpal{w}$ is the shortest palindrome $w'$ such that $w$ is a prefix of $w'$.
We’d first like to know that this operation is well-defined, and so first like to know whether $\strongpal{w}$ is unique for all words $w$. We might worry that there are two words $v$ and $u$, both palindromes of equal length, such that $w$ is a prefix of both and there are no shorter palindromes containing $w$ as prefix. In short, we want $\strongpal{\cdot}$ to return a single word, not a set of words.
Thankfully $\strongpal{\cdot}$ is well-defined; we prove this in the following proposition:
$\textbf{Proposition.}$ $\strongpal{\cdot}$ is well-defined; in particular $\strongpal{\cdot}$ returns precisely one word in all cases.
/Proof./ Suppose, for contradiction, that it weren’t. Then we’d have some word $w$ of length $n$, and two distinct palindromes $v$ and $u$ of equal length $m \geq n$, each with $w$ as a prefix, such that no palindrome shorter than $m$ has $w$ as a prefix.
We know that, for any word $w$ of length $n$, the shortest palindrome containing it as a prefix is of length at most $2n$, as $w \reverse{w}$ is a palindrome containing $w$ as a prefix; so $m \leq 2n$. And since $v$ and $u$ agree on their first $n$ characters (both begin with $w$) but $v \neq u$, they must differ at some position $n + k$ with $k \geq 1$, which lies in their second halves.
Since $v$ and $u$ are palindromes of length $m$, differing at position $n + k$ means they differ at the mirrored position $m - (n + k) + 1$ as well. But $m \leq 2n$ gives $m - (n + k) + 1 \leq n - k + 1 \leq n$, so the mirrored position is a position in $w$. But $w$ can’t have two different letters in one position; that’s absurd. So we get a contradiction, and so $\strongpal{\cdot}$ really is well-defined.
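The proof also suggests how to compute $\strongpal{w}$: find the longest palindromic suffix of $w$ and mirror what precedes it onto the end. A brute-force Python sketch (the helper name is mine; a linear-time version would need more machinery):

```python
def strong_pal(w):
    """Shortest palindrome with w as a prefix: locate the longest
    palindromic suffix of w, then append the reversal of the rest."""
    for i in range(len(w) + 1):
        suffix = w[i:]
        if suffix == suffix[::-1]:
            return w + w[:i][::-1]
```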
Are the regular languages closed under strong palindromification? We wouldn’t expect them to be, since languages of palindromes aren’t typically regular, and indeed closing some regular languages under strong palindromification yields non-regular languages. For instance, take the language $\set{a^nb \mid n \in \mathbb{N}}$. That’s the language of strings of $a$’s of any length, followed by a single $b$. The strong palindromification of this language, then, is the language $\set{a^n b a^n \mid n \in \mathbb{N}}$. But that’s certainly not a regular language; no finite automaton can remember how many $a$’s there were in the first block to then read off in the second block. It’s a language ripe to be pumped. So it can’t be regular, and so in general the strong palindromification of a regular language isn’t necessarily regular.
Conclusion
It’s fun to think about how we can play around with regular languages, and what sorts of operations they’re closed under or not. There’s plenty more than what’s in this article to think about, and so there may be follow-ups to it in future; in any case, the regular languages are a rich class of languages.
Bibliography
- [1] Moses Ganardi, Irmak Sağlam, and Georg Zetzsche. Directed Regular and Context-Free Languages. Proceedings of the 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024), LIPIcs 289, 36:1–36:20, 2024.