|Alias(es)||Universal Coded Character Set (UCS)|
|Encoding formats||UTF-8, UTF-16, GB18030; less common: UTF-32, BOCU, SCSU, UTF-7|
|Preceded by||ISO 8859, various others|
Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of March 2020 the most recent version, Unicode 13.0, contains a repertoire of 143,924 characters (143,696 graphic characters, 163 format characters, and 65 control characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and the two are code-for-code identical.
The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties and rules for normalization, decomposition, collation, rendering, and bidirectional text display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).
Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework.
Unicode can be implemented by different character encodings. The Unicode Standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and UCS-2, a precursor of UTF-16 that lacks full support for Unicode. GB18030 is standardized in China and implements Unicode fully, although it is not an official Unicode standard.
UTF-8, the dominant encoding on the World Wide Web (used in over 94% of websites as of November 2019), uses one byte for the first 128 code points and up to four bytes for other characters. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also valid UTF-8 text.
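The byte counts described above can be checked with a short Python sketch. The helper name `utf8_length` and the sample characters are illustrative choices for this example, not part of the standard.

```python
# Illustrative sketch: how many bytes UTF-8 uses for a few sample code points.
# The helper name and the sample characters are assumptions for this example.

def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

# One sample from each UTF-8 length class: 1, 2, 3, and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

The boundary between one-byte and two-byte encodings falls between U+007F and U+0080, which is why ASCII text passes through UTF-8 unchanged.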
UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see below) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 can represent fewer than half of all encoded Unicode characters. UCS-2 is therefore outdated, though still widely used in software. UTF-16 extends UCS-2 by using the same 16-bit encoding as UCS-2 for the Basic Multilingual Plane and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text.
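The figure of 1,112,064 code points and the 4-byte encoding for planes beyond the BMP can both be illustrated with a small sketch. It assumes the standard surrogate-pair arithmetic of UTF-16; the function name `utf16_code_units` is hypothetical and chosen only for this example.

```python
# Illustrative sketch of the arithmetic behind the figures quoted above.
# Names such as utf16_code_units are assumptions made for this example.

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000      # 65,536 code points per plane
SURROGATES = 0xDFFF - 0xD800 + 1     # 2,048 reserved surrogate code points

# Code points that can correspond to characters: all planes minus surrogates.
scalar_values = PLANES * CODE_POINTS_PER_PLANE - SURROGATES
print(scalar_values)

def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units that encode code point cp in UTF-16."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        # BMP code points use a single unit, exactly as in UCS-2.
        return [cp]
    # Other planes: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

print([hex(u) for u in utf16_code_units(0x20AC)])
print([hex(u) for u in utf16_code_units(0x1F600)])
```

Because the two surrogate units are drawn from U+D800–U+DFFF, a range that encodes no characters on its own, BMP text without those values reads identically as UCS-2 and as UTF-16.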
UTF-32 (also referred to as UCS-4) uses four bytes for each character. As with UCS-2, the number of bytes per character is fixed, which facilitates character indexing; unlike UCS-2, however, UTF-32 can encode all Unicode code points. Because each character uses four bytes, UTF-32 takes significantly more space than other encodings and is not widely used.
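The trade-off between fixed-width indexing and storage size can be sketched as follows. The sample strings and the helper name `utf32_char_at` are assumptions for illustration; the little-endian codec variants are used so that no byte order mark is added to the output.

```python
# Illustrative sketch: storage size versus constant-time indexing in UTF-32.
# The sample strings and the helper name are assumptions for this example.

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of text under each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}

def utf32_char_at(data: bytes, index: int) -> str:
    """Fetch the character at position index from UTF-32 (little-endian) bytes.

    Every character occupies exactly four bytes, so the byte offset is
    simply 4 * index and no scanning of earlier characters is needed.
    """
    start = 4 * index
    return data[start:start + 4].decode("utf-32-le")

mixed = "a\u00e9\u20ac\U0001F600"   # characters of 1, 2, 3, and 4 UTF-8 bytes
print(encoded_sizes("hello"))
print(encoded_sizes(mixed))

encoded = mixed.encode("utf-32-le")
print(utf32_char_at(encoded, 3) == mixed[3])
```

For mostly ASCII text the UTF-32 form is four times the size of the UTF-8 form, which is the space cost the paragraph above refers to.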