In computer programming, a character string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and its length changed, or it may be fixed (after creation).
A string is generally considered a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence (or list) data types and structures.
Depending on the programming language and the precise data type used, a variable declared as a string may either cause memory to be statically allocated for a predetermined maximum length or employ dynamic allocation so that it can hold a variable number of elements.
When a string appears literally in source code, it is known as a string literal or an anonymous string.
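To make the distinction concrete, here is a minimal C sketch (the variable names are invented for the example) showing a string literal, a statically allocated buffer with a fixed maximum length, and a dynamically allocated copy sized at run time:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* A string literal (anonymous string) embedded in the source code. */
    const char *literal = "hello";

    /* Static allocation: room for at most 15 characters plus the
       terminating NUL byte, fixed when the program is written. */
    char fixed[16];
    strcpy(fixed, literal);

    /* Dynamic allocation: the buffer is sized at run time to hold
       however many characters are actually needed. */
    char *dynamic = malloc(strlen(literal) + 1);
    if (dynamic != NULL) {
        strcpy(dynamic, literal);
        printf("%s %s %s\n", literal, fixed, dynamic);
        free(dynamic);
    }
    return 0;
}
```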
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols chosen from a set called an alphabet.
A string data type is a data type modeled on the idea of a formal string. Strings are such an important and useful data type that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as compound types. The syntax of most high-level programming languages allows a string, usually enclosed in quotation marks, to represent an instance of a string data type; such a meta-string is called a literal or string literal.
Although formal strings can have an arbitrary finite length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two kinds of string data type: fixed-length strings, which have a fixed maximum length determined at compile time and which occupy the same amount of memory whether or not this maximum is needed, and variable-length strings, whose length is not arbitrarily fixed and which can use varying amounts of memory depending on the actual requirements at run time (see memory management).
Most strings in modern programming languages are variable-length strings. Of course, even variable-length strings are limited in length by the amount of available memory on the computer. The string length can be stored as a separate integer (which may put another artificial limit on the length) or implicitly through a terminating character, usually a character value with all bits zero, as in the C programming language.
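As a rough sketch of the variable-length approach, the following C code keeps the length in a separate integer field and grows the underlying buffer at run time; the struct and function names are hypothetical, not taken from any particular library:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable-length string: length kept in a separate field. */
struct vstring {
    size_t length;    /* number of bytes currently stored       */
    size_t capacity;  /* number of bytes the buffer can hold    */
    char  *data;      /* heap-allocated buffer, grown on demand */
};

/* Append n bytes from src, enlarging the buffer at run time as needed. */
static int vstring_append(struct vstring *s, const char *src, size_t n) {
    if (s->length + n > s->capacity) {
        size_t new_cap = (s->capacity ? s->capacity * 2 : 16);
        while (new_cap < s->length + n)
            new_cap *= 2;
        char *p = realloc(s->data, new_cap);
        if (p == NULL)
            return -1;            /* out of memory: string left unchanged */
        s->data = p;
        s->capacity = new_cap;
    }
    memcpy(s->data + s->length, src, n);
    s->length += n;
    return 0;
}
```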
In the past, string data types allocated one byte per character, and although the exact character set varied by region, the character encodings were similar enough that programmers could often get away with ignoring this, since the characters a program treated specially (such as period, space, and comma) were in the same place in all the encodings a program would encounter. These character sets were typically based on ASCII or EBCDIC. When text in one encoding was displayed on a system using a different encoding, the text was often mangled, though it was frequently somewhat readable, and some computer users learned to read the mangled text.
Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of a one-byte-per-character encoding) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed.
Some encodings such as the EUC family guarantee that a byte value in the ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters as field separators.
Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte codes unsafe. These encodings also were not “self-synchronizing”, so that locating character boundaries required backing up to the start of a string, and pasting two strings together could result in corruption of the second string.
Unicode has simplified the picture somewhat. Most programming languages now have a data type for Unicode strings. Unicode's preferred byte stream format, UTF-8, was designed to avoid the problems described above for older multibyte encodings. UTF-8, UTF-16, and UTF-32 require the programmer to know that the fixed-size code units are different from the “characters”; the main difficulty currently is incorrectly designed APIs that attempt to hide this difference (UTF-32 does make code points fixed-size, but these are not “characters” due to composing codes).
Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new string must be created if any alteration is to be made; these are termed immutable strings (some of these languages also provide another type that is mutable, such as Java and .NET's StringBuilder, the thread-safe Java StringBuffer, and the Cocoa NSMutableString).
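C itself has no immutable string type, but the distinction can still be illustrated: an array initialized from a literal may be modified freely, while the literal object itself must be treated as read-only, since writing to it is undefined behavior. A minimal sketch:

```c
#include <stdio.h>

int main(void) {
    /* Mutable: the literal's contents are copied into a writable array. */
    char mutable_copy[] = "cat";
    mutable_copy[0] = 'b';            /* fine: the array can be changed */

    /* Effectively immutable: the pointer refers to the literal itself;
       writing through it (e.g. read_only[0] = 'b') is undefined behavior. */
    const char *read_only = "cat";

    printf("%s %s\n", mutable_copy, read_only);
    return 0;
}
```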
Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to individual units or substrings, including characters when they have a fixed length.
Some languages, such as Prolog and Erlang, avoid implementing a dedicated string data type at all, instead adopting the convention of representing strings as lists of character codes.
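The list-of-character-codes convention can be imitated in C with a singly linked list of code values; this is only an illustration of the idea, not how Prolog or Erlang actually store strings internally, and the names are invented for the example:

```c
#include <stdio.h>
#include <stdlib.h>

/* One cell of a list of character codes, e.g. "hi" -> [104, 105]. */
struct code_cell {
    int code;                /* character code of this element   */
    struct code_cell *next;  /* rest of the list, NULL at the end */
};

/* Build a code list from a NUL-terminated C string. */
static struct code_cell *to_code_list(const char *s) {
    struct code_cell *head = NULL, **tail = &head;
    for (; *s != '\0'; s++) {
        struct code_cell *cell = malloc(sizeof *cell);
        if (cell == NULL)
            break;           /* allocation failure: return what we have */
        cell->code = (unsigned char)*s;
        cell->next = NULL;
        *tail = cell;
        tail = &cell->next;
    }
    return head;
}

int main(void) {
    /* Freeing the list is omitted for brevity. */
    for (struct code_cell *c = to_code_list("hi"); c != NULL; c = c->next)
        printf("%d ", c->code);   /* prints: 104 105 */
    printf("\n");
    return 0;
}
```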
String representations depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with the repertoire and encoding defined by ASCII, or more recent extensions such as the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode, along with a variety of complex encodings such as UTF-8 and UTF-16.
The term byte string usually indicates a general-purpose string of bytes, rather than a string of only (human-readable) characters, a string of bits, or the like. Byte strings often imply that bytes can take any value and that any data can be stored as-is, meaning that no value can be reserved as a terminator.
Most string implementations are very similar to variable-length arrays, with the entries storing the character codes of the corresponding characters. The principal difference is that, with certain encodings, a single logical character may occupy more than one entry in the array.
This happens, for example, with UTF-8, where a single code point (UCS code point) may take anywhere from one to four bytes, and a single character may take a variable number of code points. In these cases, the logical length of the string (number of characters) differs from the physical length of the array (number of bytes in use). UTF-32 avoids the first part of the problem.
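A short C sketch of the difference (assuming well-formed UTF-8 input; the function name is made up): it counts code points by skipping UTF-8 continuation bytes, which have the bit pattern 10xxxxxx, and compares that with the byte length reported by strlen:

```c
#include <stdio.h>
#include <string.h>

/* Count Unicode code points in a well-formed UTF-8 string by skipping
   continuation bytes, which have the bit pattern 10xxxxxx. */
static size_t utf8_code_points(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

int main(void) {
    const char *s = "caf\xC3\xA9";   /* "café": the é is two bytes in UTF-8 */
    printf("bytes: %zu, code points: %zu\n",
           strlen(s), utf8_code_points(s));   /* bytes: 5, code points: 4 */
    return 0;
}
```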
The length of a string can be stored implicitly by using a special terminating character; often this is the null character (NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language. Hence, this representation is commonly referred to as a C string.
A string of length n takes n + 1 units of space (1 for the terminator) and is thus an implicit data structure.
In terminated strings, the terminating code is not an allowed character in any string. Strings with a length field do not have this limitation and can also store arbitrary binary data.
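The contrast can be sketched in C as follows; the byte_string struct here is a hypothetical length-prefixed representation, not a standard type. The NUL-terminated string of length 5 occupies 6 bytes and could not contain a zero byte, while the length-prefixed record holds an embedded zero byte without trouble:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical length-prefixed byte string: any byte value is allowed. */
struct byte_string {
    size_t length;
    const unsigned char *data;
};

int main(void) {
    /* NUL-terminated: 5 characters plus the terminator occupy 6 bytes,
       and strlen() finds the length by scanning for the zero byte. */
    const char c_string[] = "hello";
    printf("C string length: %zu (storage: %zu bytes)\n",
           strlen(c_string), sizeof c_string);

    /* Length-prefixed: the embedded zero byte does not end the data. */
    static const unsigned char raw[] = { 'a', 0, 'b' };
    struct byte_string bs = { sizeof raw, raw };
    printf("byte string length: %zu\n", bs.length);   /* prints 3 */
    return 0;
}
```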
To sum up:
Strings are like sentences: they consist of a list of characters, which is in effect an array of characters. Strings are very useful when communicating information from the program to the user of the program. They are less useful for storing information for the computer to use.