The UTF-8 Automaton Decomposed

13 Jun 2025
Joshua Moerman

I had the idea of learning a DFA to recognise all valid UTF-8 byte sequences. In principle this should be an easy application of automata learning, and I was curious to see the resulting automaton. I am also working on decompositions of state based systems, and this provides a nice opportunity to see how we can decompose a “real-world” example.

I expected a simple automaton, maybe with 4 or 5 states. For UTF-8 you have to sometimes decode 1, 2, 3 or 4 bytes, and this will end up in different states. There is, however, one more difficulty of UTF-8, and that is that certain Unicode code points could be encoded in more than one way. For instance, a space (U+0020) is 0x20 as a single byte, or 0xC090 as two bytes. These are so-called overlong encodings and are not allowed in UTF-8. So the DFA will have to reject them.

The resulting DFA

I used LearnLib to learn an automaton of the OpenJDK (version 23) UTF-8 decoder (java.nio.charset.CharsetDecoder). The input alphabet is simply the set of all 256 bytes. The whole learning process took less than 30 seconds. After some manual cleaning up (like removing a sink state), this results in the diagram below:

Diagram of the UTF8 automaton.

Transitions which are omitted, such as 0xC0 from state s0, will immediately reject the sequence. It is reassuring that the automaton looks exactly like the automaton of Björn Höhrmann. Reassuring for two reasons:

It means that I learned the right thing: the regular language of valid UTF-8 byte sequences.
It means that the OpenJDK implementation is correct. I also learned an automaton for the validator in Google Guava, and it produced the same automaton.

It is interesting that neither implementation uses a state machine, but rather a loop with if-then-else branches.

Decomposing the automaton

The above automaton has 8 states (9 if you include a sink state), and it is a natural question whether we can reduce the number of states somehow. Automata theory has some tools to consider. For instance, we could try to find an equivalent NFA, although I don’t think it will be very useful here. Another option is the consider the DFA for the reversed language. Surprisingly, this does result in a smaller DFA with only 6 states (7 when including a sink state).

However, I am more interested in seeing whether it can be decomposed into several smaller DFAs. It seems most natural to ask whether the UTF-8 automaton can be expressed as a conjunction (intersection). The decomposition of regular languages as conjunction is considered in a paper by Kupferman and Mossheif. Unfortunately, it is not yet known how to efficiently decompose a DFA, and I had to resort to a SAT-solver. (Technical aside: the paper shows that the problem is in PSpace, which is beyond what SAT-solvers can do. But with a fixed number of components, here 2, it is in NP.)

The above automaton is the conjunction of these two automata:

Diagram of the first component of the UTF8 automaton. Diagram of the second component of the UTF8 automaton.

Some things to observe:

There are 8 states in total, just like the original automaton. However, each individual automaton is strictly smaller.
The structure of the first automaton mimics the original automaton. It simplifies the language by checking fewer exceptions it seems. The second automaton then “fixes” these simplifications.
The structure of the second automaton looks more complex. Although it only has three states, all are connected and the byte ranges are not easily expressed.

Concluding remarks

I don’t know whether this is of any use, but it is a nice exercise in automata theory. I have some further things to say:

This decomposition was found with a SAT solver, and it is possible that there are other (perhaps nicer) solutions with 5+3 states.
There are no smaller solutions (in terms of number of states).
I have also tried finding a decomposition into 3 DFAs, but there seems to be no smaller decompositions. For some SAT instances (5+5+2 and 8+2+2, including the sinks states) the SAT solver ran for hours and I had to terminate it.
I have used yEd Live to create the diagrams. It is better than graphviz (dot), but still needed a lot of manual work to clean up the diagrams.
Using the full set of 256 bytes for automata learning seemed a bit inefficient to me. But it worked really well. I wonder whether using symbolic automata would be more efficient.
I think the UTF-8 automaton could be a great resource in a course on formal languages and automata!

Links

All the code (and some more) can be found on our gitlab server.