
Behind Language Detection: Who Won the War Against ‘Tofu’


There was a period in computing, the early 2010s, when character encoding support was far from universal. Trying to install and run software, or open files, that used anything beyond standard ASCII was a bit of a game of luck. Character sets had to be installed separately to be fully supported, and not just the IME (input method editor), but actual character encoding support. Even then, nothing was guaranteed. In fact, it remained pretty common to see this on a regular basis: garbled text.

[Image: an example of garbled text]

Fast forward to today, and the situation has improved dramatically, thanks to advancements like UTF-8. But why was this happening, and why is it rare now? Brace yourselves; we’re diving into character encoding history and its evolution toward UTF-8. By the end, you’ll have a clear understanding of why this transformation occurred, and why OCR (or what we call an image-to-text analyzer) can correctly detect languages.

How Characters Work in Computing: The Basics

To understand how an OCR tool can tell one language from another, let’s first delve into how computers interpret text. And let’s start with some computing 101 reminders. Nothing hard, promise.

Computers function with electric pulses of 0 and 1. That’s a bit. By convention, we put these bits in packs of 8, which is a byte. Each bit is either ON or OFF, which makes 256 different possible combinations. Therefore your computer can count from 0 to 255 using a single byte.
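Here’s a quick Python sketch to make this concrete:

# One byte = 8 bits, so there are 2**8 = 256 possible values
print(2 ** 8)              # 256
print(format(255, '08b'))  # 11111111 -- all eight bits ON
print(format(0, '08b'))    # 00000000 -- all eight bits OFF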


What Is Hexadecimal, and Why Does It Matter?

Here is the hardest part: for certain purposes, we often split a byte in two, which gives us two groups of 4 bits. Four bits allow 16 different possibilities, which is what hexadecimal is: a base-16 way of counting where each 4-bit group becomes a single digit. To represent the six numbers we don’t have in our traditional decimal thinking, we use A, B, C, D, E and F. Which means A=10, B=11, and so on up to F=15.
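In Python, you can watch the conversion happen (a small illustrative sketch):

value = 255
print(hex(value))     # 0xff -- two hex digits, one per 4-bit half
high, low = value >> 4, value & 0x0F  # split the byte into two halves
print(high, low)      # 15 15 -- each half holds a value from 0 to 15
print(int('ff', 16))  # 255 -- and back again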

For example:

A link with the “tomato” color:
<a href="#" style="color: #ff6347;">
The color value, ff6347, is actually three hexadecimal numbers: one for red, one for green, and one for blue. “ff” is the highest value possible (255, since 16×16 = 256 combinations, counted from 0). We can then tell the red channel is at its maximum in this color, as the name “tomato” would suggest.
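Here is a small sketch of how you could pull those three channels apart in Python (purely illustrative):

color = 'ff6347'  # the 'tomato' value from the link above
red = int(color[0:2], 16)
green = int(color[2:4], 16)
blue = int(color[4:6], 16)
print(red, green, blue)  # 255 99 71 -- the red channel is maxed out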

Yes, I know, this is confusing, and the reason is not you. Rather, we are using familiar numbers and letters to represent a different way of thinking. So don’t sweat it, it’s fine.

ASCII: The Foundation of Character Encoding

Back to our text. As far as your computer is concerned, a text is a list of characters (technically, an array). In your computer, phone, or any other kind of intelligent device, each character is put in a grid, and we use the hexadecimal numbers we just mentioned as the rows and columns of this grid.

In the beginning of computing, power was scarce and memory was limited. To be as efficient as possible, it was determined that the smallest grid that would still fit as much usable data as possible could be achieved with a single byte.

Not even a single byte, actually, but 7 bits (that leaves one bit free to do other things).
This led to the first convention for language representation: the American Standard Code for Information Interchange, or ASCII, was born.
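Python exposes this grid directly through its built-in ord() and chr() functions, which is handy for illustration:

print(ord('A'))       # 65 -- the cell where ASCII stores uppercase A
print(chr(97))        # 'a' -- cell 97 holds lowercase a
print(hex(ord('A')))  # 0x41 -- the same cell, in hexadecimal
# 7 bits cover exactly 128 cells: codes 0 through 127
print(all(ord(c) < 128 for c in 'Hello, World!'))  # True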

[Image: the 7-bit ASCII code table]


Fun Fact: Not Every Character is Visible

Note that not every character is designed to be visible. For example, character 13 (or 0D in hexadecimal) is the Carriage Return. Therefore, when you press “Enter”, you are actually writing character number 13 into your document or chatbox, which your program knows to interpret as going to the next line.
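You can make this invisible character visible in Python:

print(ord('\r'))  # 13 -- the Carriage Return character
print(ord('\n'))  # 10 -- Line Feed, its frequent companion
print(repr('line one\r\nline two'))  # repr() reveals the hidden characters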

For a deeper dive into ASCII and binary systems, check out ASCII Overview on W3C.

 

A Diversity of Encodings

The Early Limitations of ASCII and the Rise of Extended-ASCII

The ASCII standard, which only covered the unaccented letters of the English alphabet (plus digits, punctuation, and control characters), was insufficient for encoding languages other than English. To address this limitation, the original 128-character slot system was quickly abandoned in favor of Extended-ASCII, which expanded the character set to 256 slots by using the eighth bit. While this was a step forward, it still didn’t provide the capability to support all the world’s languages within a single encoding.

At the time, Extended-ASCII was seen as “good enough,” especially for English-centric systems. However, the expansion was still far from enough to accommodate the complexities of global languages. This challenge, similar to those faced in modern translation technologies, highlights the importance of language detection, where an accurate identification of the language is key to ensuring the right encoding and format are applied.

Language-Specific Encoding Systems and Compatibility Issues

In the absence of a universal standard, each language began to adopt its own character encoding system, complete with unique grids and mappings. This led to the rise of various encoding formats, each designed to meet the specific needs of its language. While this approach addressed immediate issues, it also introduced significant compatibility problems.

For instance, Chinese computing was dominated for years by the GB-2312 encoding for Simplified Chinese (GB standing for 国标, guóbiāo, meaning “national standard”), a system that remains widely used today. In parallel, the Big-5 encoding system was used for Traditional Chinese characters. Together, these encoding systems created one of the most significant sources of incompatibility in the Chinese computing world.
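Python ships with codecs for both systems, so the incompatibility is easy to reproduce (an illustrative sketch):

text = '中文'  # 'Chinese', written in Simplified Chinese
gb_bytes = text.encode('gb2312')  # encode using the Simplified grid
print(gb_bytes)                   # b'\xd6\xd0\xce\xc4'
# Reading those same bytes through the Big-5 grid lands on the wrong cells:
print(gb_bytes.decode('big5', errors='replace'))  # garbled characters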

[Image: UTF-8 adoption across websites, 2001–2012]

The Role of Encoding Agreements in System Interactions

In computing, different systems must constantly communicate with each other. For example, an operating system (OS) must communicate with software, files must be read by software, and webpages must be rendered by browsers. The first step in this communication is often an agreement on which character encoding to use.

When this agreement is not reached, or when one system doesn’t support the necessary encoding format, the risk arises that the wrong encoding will be applied. As a result, characters may be mapped incorrectly, leading to distorted or unreadable content—often referred to as “garbled characters.”
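The classic case is UTF-8 bytes read through the old Latin-1 grid, which is easy to reproduce (a minimal sketch):

original = 'café'
utf8_bytes = original.encode('utf-8')  # the 'é' becomes two bytes in UTF-8
# A receiver that assumes Latin-1 maps each byte to its own character:
print(utf8_bytes.decode('latin-1'))    # cafÃ© -- the familiar garbled result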

In older systems, the OS was typically designed to understand only a specific set of character formats. If a system wasn’t built to accommodate other encodings, errors were common. When these systems tried to read data in unsupported formats, characters would often display incorrectly or not at all, creating a frustrating experience for users and developers alike.

This issue becomes especially important in Optical Character Recognition (OCR) systems. OCR technology relies on accurately interpreting and converting images of text into machine-readable text. If the encoding agreement between the OCR system and the software used to process the recognized text is misaligned, the output can be garbled or inconsistent. Ensuring that OCR systems use the correct character encoding is crucial for achieving accurate text recognition, especially when handling documents in multiple languages or specialized formats. Without proper encoding support, OCR-generated content may suffer from errors, making it difficult for users to extract meaningful information from scanned documents.

 

UTF-8: The Universal Solution

In a legitimate effort to harmonize character systems and solve this issue, the Unicode Consortium pushed for the adoption of a single universal format that would be as widespread as possible. A format that would contain all forms of characters from all languages possible.

After several tries, UTF-8 was the format that stuck and was widely adopted. It is still the most used format around the world. UTF-8 can accommodate 1,112,064 characters, encoding each one with 1, 2, 3, or 4 bytes of data. This means that it is not just one grid, but four grids coexisting within a single encoding. It is also what keeps UTF-8 compatible with ASCII: its single-byte grid is the same one ASCII uses.
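You can inspect those four grid sizes directly in Python:

for ch in ('A', 'é', '中', '🍅'):
    print(ch, len(ch.encode('utf-8')), ch.encode('utf-8'))
# A 1 b'A'  -- the single-byte, ASCII-compatible grid
# é 2 b'\xc3\xa9'
# 中 3 b'\xe4\xb8\xad'
# 🍅 4 b'\xf0\x9f\x8d\x85'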


UTF-8 handles most existing forms of written communication, including emojis, which are, as far as your computer is concerned, regular characters with their own reserved space in the grid. UTF-8 is a standard managed by the Unicode Consortium. There is still a lot of free space (meaning empty cells in the grid), which is why new emojis can be added regularly.
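For example, each emoji has its own cell number and an official name in the grid, which Python’s standard unicodedata module can show:

import unicodedata

ch = '😀'
print(hex(ord(ch)))          # 0x1f600 -- its reserved cell in the grid
print(unicodedata.name(ch))  # GRINNING FACE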

Despite its widespread adoption, display issues can still arise. These are often due to unsupported fonts rather than encoding errors. This is particularly relevant when using OCR systems. If the OCR software misinterprets or fails to apply UTF-8 encoding when processing text from images, it can result in incorrectly displayed characters. This can lead to text that appears distorted or unreadable, even when the encoding is theoretically correct. Therefore, ensuring the correct use of UTF-8 encoding in OCR systems is essential for maintaining the integrity and readability of converted text.

 

The Truth About Fonts 

The last key concept to mention is fonts. So first let’s get something out of the way:

Fonts and encoding are two different things


When special characters look ugly, don’t blame the encoding; it’s the font.

A font is a graphical representation of the character matched in the encoding grid. It’s the “picture” your computer will show to the final user.
But it’s up to each font designer to draw what they want wherever they want. Or to not draw anything.

The famous Wingdings font, which was Microsoft’s first attempt at showing emojis, has symbols instead of letters. But for all intents and purposes, it’s still a font, which means you will see regular letters again whenever you switch to a regular font.


Even though UTF-8 is widely used, not every font supports all characters in the UTF-8 encoding standard. In fact, very few fonts actually do, and the reason is easy to understand: comprehensively supporting the tens of thousands of characters across dozens of different languages is an extremely tedious task.

Font Limitations

Originally, when a computer would stumble upon a character unsupported by the current font, it would display placeholder empty squares:

□□□□□□□□□□

That is why it was so easy to confuse an encoding issue with a font issue. But the two issues are very different in nature, as you now understand. Modern software is designed to fall back to a default font when the current one can’t supply the right character. The result may not always look good, but it is still better than a tofu placeholder.
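If you want to check a particular font yourself, the open-source fontTools library can list which cells a font actually covers. A minimal sketch, assuming a hypothetical font file named MyFont.ttf:

from fontTools.ttLib import TTFont  # pip install fonttools

font = TTFont('MyFont.ttf')  # hypothetical font file path
cmap = font.getBestCmap()    # maps code points to glyph names

for ch in ('A', '中', '😀'):
    print(ch, 'supported' if ord(ch) in cmap else 'tofu')  # uncovered cells would show tofu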

The famous Google “Noto” font (the name stands for “no tofu”) is the result of Google’s effort to build a font that would never show a placeholder square.


A font might barely support more than the base ASCII characters, yet you can still use it with a UTF-8 encoding.

 

Language Detection in CAT Tools: A Game Changer for Translation

That’s it for encoding and fonts. It’s not an easy topic to tackle, but why does it matter? Mastering the fundamentals of character encoding can improve troubleshooting for text display issues and ensure smooth handling of multilingual content. It also enables more cool things, like detecting languages, for example, in CAT tools.

Language detection is a critical feature in modern translation technologies like CAT tools, automating many processes and eliminating manual input errors. Here are a few ways language detection in CAT tools enhances translation workflows:

Automatic Language Selection

When importing content, CAT tools automatically identify the source and target languages, streamlining the setup process for translators.

OCR Integration

In a CAT tool where Optical Character Recognition (OCR) is performed, language detection ensures accurate text extraction by adapting to the document’s language.

LQA and Consistency Checks

Language detection enables efficient Language Quality Assurance (LQA), often an essential feature in CAT tools, by identifying inconsistencies in terminology or syntax based on the detected language.
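To get a feel for what happens under the hood, here is a minimal sketch using the open-source langdetect library (shown purely for illustration; this is not necessarily the detector any given CAT tool uses):

from langdetect import detect, detect_langs  # pip install langdetect

print(detect('The quick brown fox jumps over the lazy dog.'))  # 'en'
print(detect('字符编码是计算机处理文本的基础。'))  # 'zh-cn'
print(detect_langs('Bonjour tout le monde'))  # e.g. [fr:0.99...], with confidence scores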

Pro Tip: Explore how our CAT tool leverages advanced language detection to optimize translation workflows and reduce manual effort.

Image-to-Text Translation

Explore how Raiverb's OCR automatically detects language in a smart way.

Read More
Automated LQA

Explore how Raiverb ensures linguistic consistency with its LQA check.

Read More

Thanks to innovations like UTF-8 and robust language detection systems, what was once a complex and error-prone process has become seamless and intuitive. Whether handling multilingual content or ensuring precise OCR results, language detection is the backbone of modern translation technology. By automating tasks like language selection and consistency checks in CAT tools, translators can now focus on crafting high-quality translations.
