ECMAScript identifiers are currently specified as being Unicode. However, only the first 128 Unicode characters are allowed, effectively restricting identifiers to ASCII.
Implementations of ECMAScript are currently in use around the world. Developers whose native language is not English should be able to have identifiers that make sense to them. Although arbitrary strings can be used for named property lookup, allowing ideographs and other Unicode characters in identifiers will make it easier for global developers to write scripts.
Since implementations must currently accept Unicode characters, extending the range of characters allowed to that of the Unicode identifier class should not be an undue burden.
Java guarantees that escaped Unicode characters occurring in source code (in the form \uNNNN) will be unescaped before compilation. This can lead to problems in dynamic languages, for example when a newline character is escaped:
Program 1 (note that \u000A is the newline character):
int foo = 5;\u000Aint bar = 6;
Program 2 (equivalent in Java, but not ECMAScript):
int foo = 5;
int bar = 6;
Because allowing Unicode escapes in identifiers would complicate interpreter implementations, this is forbidden. Note that Unicode escapes are still allowed in comments and string literals, but they are not decoded before the source text is tokenized.
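The contrast with Java can be observed in a present-day ECMAScript engine. The following is a minimal sketch, not part of the proposal: it assumes a runtime that provides the `Function` constructor, and the names `escapedNewlineBody` and `escapedNewlineResult` are hypothetical. It compiles source text in which the escape \u000A appears inside a line comment.

```javascript
// Function body whose line comment contains the six characters \u000A.
// The doubled backslash keeps the escape as literal text in the compiled source.
const escapedNewlineBody = "// comment \\u000A return 1;\nreturn 2;";

// If escapes were decoded before tokenizing (as in Java), the comment would end
// at the escape and `return 1;` would run. Here the escape stays inside the comment.
const escapedNewlineResult = new Function(escapedNewlineBody)();

console.log(escapedNewlineResult); // 2
```

The second line of the body is reached only because the escape in the comment is never treated as a line terminator.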
Section 5.14 of the Unicode Standard v2.0 gives implementation guidelines for identifiers. Most identifiers legal under these guidelines are legal in ECMAScript. ECMAScript differs in that it makes no provision for ignoring formatting characters, which are forbidden.
These recommendations are made against the April 22 ECMAScript draft. Specific changes to the document appear in bold type.
§6 Source Text
Amend the first section as follows:
"However, non-ASCII Unicode characters may appear
only within identifiers, comments, and string literals.
In identifiers, the exact set of Unicode characters allowed is
specified in Section 7.5 and corresponds to those Unicode
characters with the property of alphabetic, decimal digit,
combining mark, or ideographic. In string literals, any
Unicode character may also be expressed as a Unicode escape sequence
consisting of six ASCII characters, namely \u plus four hexadecimal
digits. Within a comment, such an escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode
escape sequence contributes one character to the string value of the literal."
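As an illustration of the string-literal rule, the sketch below measures both the source spelling of an escape and the value it produces. It assumes a present-day engine with `String.raw` (which postdates this draft); the variable names are hypothetical.

```javascript
// Source spelling of the escape: backslash, u, and four hexadecimal digits.
// String.raw leaves the escape undecoded so its length in source can be measured.
const escapeSourceText = String.raw`\u00C5`;
console.log(escapeSourceText.length); // 6

// Inside an ordinary string literal the same escape contributes one character.
const escapeValue = "\u00C5";
console.log(escapeValue.length); // 1
console.log(escapeValue.charCodeAt(0).toString(16)); // c5
```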
§7.5 Identifiers
Amend the first section as follows:
"An identifier is a character sequence of unlimited length, where each character
in the sequence must be a Unicode character with the property of
alphabetic (category "L"), decimal digit (category "Nd"), ideographic, or combining mark.
For historical reasons, the underscore (_) character and dollar sign ($) are also supported.
The first character may not be a Unicode decimal digit.
Two ECMAScript identifiers are the same only if they have the same sequence of Unicode characters (as defined by their Unicode code points). This means that two identifiers with the same external appearance may not be identical. Composite Unicode characters are treated as distinct from their decomposed equivalents. For example, LATIN CAPITAL LETTER A (\u0041) followed by COMBINING RING ABOVE (\u030A) is distinct from LATIN CAPITAL LETTER A WITH RING ABOVE (\u00C5)."
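The identity rule can be sketched as a code-point-by-code-point comparison with no normalization step. The function `sameIdentifier` and the variable names below are hypothetical, and `String.prototype.normalize` is a later addition to the language, used here only to show that the two spellings are canonically equivalent yet still distinct under the rule.

```javascript
// Compare two identifier spellings code point by code point, without normalizing.
function sameIdentifier(a, b) {
  const left = Array.from(a, (ch) => ch.codePointAt(0));
  const right = Array.from(b, (ch) => ch.codePointAt(0));
  return left.length === right.length && left.every((cp, i) => cp === right[i]);
}

const decomposedRing = "\u0041\u030A"; // A followed by COMBINING RING ABOVE
const precomposedRing = "\u00C5";      // LATIN CAPITAL LETTER A WITH RING ABOVE

console.log(sameIdentifier(decomposedRing, precomposedRing)); // false
console.log(decomposedRing.normalize("NFC") === precomposedRing); // true
```

Because no normalization is applied, an implementation following this rule needs only a plain sequence comparison.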
The Unicode Standard v2.0 specifies implementation guidelines for identifiers
(§5.14 Identifiers). The following significant differences between ECMAScript
and these guidelines should be noted:
Amend the BNF as follows:
"IdentifierName ::
CombiningCharacter
Extender
IdentifierLetter :: one of
[ASCII table with _ and $]
Additionally, an IdentifierLetter may be a member of the Unicode letter class (those Unicode characters in category "L"), or the Unicode character FULLWIDTH LOW LINE (U+FF3F).
IdeographicCharacter ::
DecimalDigit :: one of
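The identifier rule described in the §7.5 prose can be approximated with Unicode property escapes. This is a rough sketch under stated assumptions rather than the proposal's grammar: it assumes an engine that supports `\p{...}` with the `u` flag, treats ideographs as covered by category "L", uses category "M" for combining marks, and the name `isProposedIdentifier` is hypothetical.

```javascript
// Characters admitted anywhere in an identifier under the amended prose:
// letters (L), decimal digits (Nd), combining marks (M), _, $, and U+FF3F.
const IDENTIFIER_CHAR = /^[\p{L}\p{Nd}\p{M}_$\uFF3F]$/u;
const DECIMAL_DIGIT = /^\p{Nd}$/u;

function isProposedIdentifier(text) {
  const chars = Array.from(text); // split by code point, not by UTF-16 unit
  if (chars.length === 0) return false;
  if (DECIMAL_DIGIT.test(chars[0])) return false; // first character may not be a digit
  return chars.every((ch) => IDENTIFIER_CHAR.test(ch));
}

console.log(isProposedIdentifier("\u5909\u6570")); // true (two ideographs)
console.log(isProposedIdentifier("\u0663abc"));    // false (starts with ARABIC-INDIC DIGIT THREE)
console.log(isProposedIdentifier("a b"));          // false (space is not allowed)
```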
§15.9.1 Regular Expression Pattern Matching
The textual descriptions of the \w and \W character classes do not match the character ranges given. The ranges are what is intended (for historical reasons).
Amend the descriptions of \w and \W character classes:
\w | ASCII letters, digits, and underscore; equivalent to "[a-zA-Z0-9_]". |
\W | Any character not an ASCII letter, digit, or underscore; equivalent to "[^a-zA-Z0-9_]". |
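The stated equivalences can be checked against a present-day engine, which is assumed here to follow the same ASCII-only ranges when no flags are given. The sketch walks every 16-bit code unit and counts disagreements; the variable names are hypothetical.

```javascript
const WORD = /\w/;
const NON_WORD = /\W/;
const ASCII_WORD = /[a-zA-Z0-9_]/;

let wordClassMismatches = 0;
let wordClassSize = 0;
for (let unit = 0; unit <= 0xffff; unit++) {
  const ch = String.fromCharCode(unit);
  const isWord = WORD.test(ch);
  if (isWord) wordClassSize++;
  // \w must agree with the explicit range, and \W must be its exact complement.
  if (isWord !== ASCII_WORD.test(ch) || NON_WORD.test(ch) === isWord) {
    wordClassMismatches++;
  }
}

console.log(wordClassMismatches);      // 0
console.log(wordClassSize);            // 63 (26 + 26 letters, 10 digits, underscore)
console.log(WORD.test("\u00E9"));      // false: a non-ASCII letter is not matched by \w
```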