The UTF-8 Automaton Decomposed

13 Jun 2025
Joshua Moerman

I had the idea of learning a DFA to recognise all valid UTF-8 byte sequences. In principle this should be an easy application of automata learning, and I was curious to see the resulting automaton. I am also working on decompositions of state based systems, and this provides a nice opportunity to see how we can decompose a “real-world” example.

I expected a simple automaton, maybe with 4 or 5 states. For UTF-8 you have to sometimes decode 1, 2, 3 or 4 bytes, and this will end up in different states. There is, however, one more difficulty of UTF-8, and that is that certain Unicode code points could be encoded in more than one way. For instance, a space (U+0020) is 0x20 as a single byte, or 0xC090 as two bytes. These are so-called overlong encodings and are not allowed in UTF-8. So the DFA will have to reject them.

The resulting DFA

I used LearnLib to learn an automaton of the OpenJDK (version 23) UTF-8 decoder (java.nio.charset.CharsetDecoder). The input alphabet is simply the set of all 256 bytes. The whole learning process took less than 30 seconds. After some manual cleaning up (like removing a sink state), this results in the diagram below:

Diagram of the UTF8 automaton.

Transitions which are omitted, such as 0xC0 from state s0, will immediately reject the sequence. It is reassuring that the automaton looks exactly like the automaton of Björn Höhrmann. Reassuring for two reasons:

It is interesting that neither implementation uses a state machine, but rather a loop with if-then-else branches.

Decomposing the automaton

The above automaton has 8 states (9 if you include a sink state), and it is a natural question whether we can reduce the number of states somehow. Automata theory has some tools to consider. For instance, we could try to find an equivalent NFA, although I don’t think it will be very useful here. Another option is the consider the DFA for the reversed language. Surprisingly, this does result in a smaller DFA with only 6 states (7 when including a sink state).

However, I am more interested in seeing whether it can be decomposed into several smaller DFAs. It seems most natural to ask whether the UTF-8 automaton can be expressed as a conjunction (intersection). The decomposition of regular languages as conjunction is considered in a paper by Kupferman and Mossheif. Unfortunately, it is not yet known how to efficiently decompose a DFA, and I had to resort to a SAT-solver. (Technical aside: the paper shows that the problem is in PSpace, which is beyond what SAT-solvers can do. But with a fixed number of components, here 2, it is in NP.)

The above automaton is the conjunction of these two automata:

Diagram of the first component of the UTF8 automaton. Diagram of the second component of the UTF8 automaton.

Some things to observe:

Concluding remarks

I don’t know whether this is of any use, but it is a nice exercise in automata theory. I have some further things to say:

All the code (and some more) can be found on our gitlab server.