December 02, 2021 | Coding

Unicode

Unicode is an international character encoding standard. Its development dates back to the 1980s, and it has evolved steadily ever since. The goal was an encoding that supports the characters of many languages and works across different platforms. To understand it better, it helps to look at the limitations of the encodings that came before it. More on that in the following sections.

What is encoding?

Before we tackle concrete encoding standards, let’s see what encoding is and why it is necessary.

Character encoding is an old problem that dates back at least to Morse code. Morse code represents characters as short and long electrical signals, which allows messages to be transmitted over wires. Today’s computers work similarly, but with zeros and ones instead of short and long signals. More generally, encoding is the process of transforming data from one form into another. The reverse operation, decoding, transforms the encoded data back into its initial form. Thanks to that, we can work with plain text and delegate the transformations, storage, and network operations to computers. For the data to survive those operations intact, both sides must follow a standard that precisely defines each of them.
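As a minimal sketch of that round trip, the Python snippet below encodes a short text into bytes, prints the zeros and ones, and decodes them back. The variable names are illustrative, and ASCII is chosen here only because it is the simplest standard to demonstrate.

```python
# Encoding turns text into bytes; decoding turns the bytes back into text.
message = "Hello"

encoded = message.encode("ascii")                    # text -> bytes
bits = " ".join(f"{byte:08b}" for byte in encoded)   # the same bytes as zeros and ones
decoded = encoded.decode("ascii")                    # bytes -> text

print(encoded)  # b'Hello'
print(bits)     # 01001000 01100101 01101100 01101100 01101111
print(decoded)  # Hello
```

Both sides have to agree on the standard: decoding the same bytes with a different encoding may produce different text or fail outright.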

What is ASCII?

ASCII (American Standard Code for Information Interchange) is one of the first widely used character encoding standards. People from the telecommunication and computing industries in America created it during the 1960s. They wanted to overcome some limitations that encodings had at that time, so they developed a new 7-bit coding system. It was widely accepted by computer manufacturers and later became an international standard for character encoding. As a 7-bit coding system, it supports 128 (i.e. 2⁷) characters: 95 printable characters (including the space) and 33 control characters. That was sufficient to encode digits, punctuation and a few other special characters, and the letters of the English alphabet.
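The short Python sketch below illustrates the 7-bit range. It counts printable and control characters under the usual convention that the space is printable and DEL (code 127) is a control character, and it shows what happens with a character outside the range. The variable names are illustrative.

```python
# Every ASCII character has a code in the range 0..127, so it fits in 7 bits.
for ch in ("A", "a", "0", " "):
    print(repr(ch), ord(ch), f"{ord(ch):07b}")

# Codes 32..126 are printable; codes 0..31 and 127 (DEL) are control characters.
printable = [chr(code) for code in range(128) if 32 <= code <= 126]
control = [code for code in range(128) if code < 32 or code == 127]
print(len(printable), len(control))  # 95 33

# A character outside that range cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")  # U+00E9 is "é"
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```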

However, the spread of computing and the Internet created a need for other characters as well. As computers used 8-bit bytes, many manufacturers decided to use the 8th bit, left unused by ASCII, and thus expand the number of characters to 256. Such 8-bit encodings are often referred to as “Extended ASCII” or “8-bit ASCII”, although there was never a single standard behind those names: different vendors and regions assigned different characters to the upper 128 codes. As the number of incompatible 8-bit encodings grew, data exchange became complicated and error-prone. That was a sign that a universal solution was needed, one that would work for all languages and cover all the special characters.

What is Unicode?

The need for a uniform encoding system that supports all languages was already evident in the 1980s, and the spread of the Internet in that period added to the pressure. In 1991, two organizations, ISO (International Organization for Standardization) and the Unicode Consortium, decided to create one universal standard for encoding multilingual text. For that purpose, they agreed to merge their two standards, ISO/IEC 10646 and Unicode. Since then, the two organizations have worked together very closely and kept their respective standards synchronized. It is also worth mentioning that Unicode adds implementation guidelines, absent from ISO/IEC 10646, that are meant to guarantee uniformity across platforms and applications.

Unicode assigns a unique number, called a code point, to every character, regardless of language, program, or platform. It enables a single document to contain text from different writing systems, which was nearly impossible with the earlier language-specific encodings. Moreover, Unicode supports emojis, which are an indispensable part of communication today.
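As a quick illustration, the Python sketch below prints the number Unicode assigns to a few characters from different writing systems, using the conventional U+ hexadecimal notation. The sample string is arbitrary and the variable names are illustrative.

```python
import unicodedata

# One string mixing Latin letters, a currency sign, a CJK ideograph, and an emoji.
# Escapes are used so the exact characters are unambiguous: A, é, €, 中, 😀.
text = "A\u00e9\u20ac\u4e2d\U0001F600"

code_points = [f"U+{ord(ch):04X}" for ch in text]
for ch, code_point in zip(text, code_points):
    print(ch, code_point, unicodedata.name(ch))

print(code_points)  # ['U+0041', 'U+00E9', 'U+20AC', 'U+4E2D', 'U+1F600']
```

Note that these numbers are abstract: they say which character is meant, not how it is stored as bytes.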

Unicode Transformation Formats

Unicode defines several transformation formats, also known as UTFs (Unicode Transformation Formats). These formats define how each code point is represented as a sequence of bytes. Below is a brief overview of the three UTFs that the Unicode Standard provides.

  • UTF-8
    • variable-length character encoding that uses from 1 to 4 bytes (from 8 to 32 bits)
    • backward compatible with ASCII
    • the most common encoding on the web (~98% of all web pages)
  • UTF-16
    • variable-length character encoding that uses 2 or 4 bytes (16 or 32 bits)
    • internally used by Microsoft Windows, Java, JavaScript, etc.
  • UTF-32
    • fixed-length character encoding that uses 4 bytes (32 bits)
    • simple to process, because every code point takes the same space, but uses more memory and bandwidth for most text
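The sizes listed above can be checked with a short Python sketch. The helper `encoded_sizes` is a hypothetical name introduced for this example, and the little-endian codec variants are used so that no byte order mark is added to the output.

```python
# Little-endian variants are used so that no byte order mark (BOM) is prepended.
CODECS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch):
    """Return the number of bytes needed for `ch` in each transformation format."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}

# A, é, €, 😀
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# Backward compatibility: ASCII text has identical bytes in ASCII and UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```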

When it comes to which transformation format is best, there is no single right answer. It depends on what you need, and each of them has its pros and cons. If you are not sure, choose UTF-8: it is the dominant transformation format on the web.
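In practice, choosing UTF-8 mostly means stating it explicitly wherever text is written or read, instead of relying on a platform default. Here is a minimal Python sketch, assuming a throwaway file in a temporary directory; the file name and the sample text are made up for this example.

```python
import tempfile
from pathlib import Path

# Sample text: "Zdravo, świecie 😀" written with escapes for the non-ASCII characters.
text = "Zdravo, \u015bwiecie \U0001F600"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "greeting.txt"   # hypothetical file name
    path.write_text(text, encoding="utf-8")        # state the encoding when writing
    restored = path.read_text(encoding="utf-8")    # and again when reading

print(restored == text)  # True
```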

Final thoughts

As you may have noticed, the rise of computing and the Internet brought some new challenges. One of those challenges was how to support characters from many languages in a way that works well on all platforms. Fortunately, some bright people devised Unicode. Today, we can hardly imagine working on a computer without it. We must also not ignore the fact that, from the very beginning, people wanted to use computers in their native language, which tells us how important localization really is. In case you need to localize your website or application, Localizely can help you with that. Create your free account today.
