Subject: HP ECMAScript comments
From: MIKE_KSAR@HP-PaloAlto-om4.om.hp.com @ D06AU010
SendTo: e-tc39@ECMA.CH
PostedDate: 19.05.98 20:54:18

TC39 ECMAScript experts,

Here is the input from Tom McFarland of HP, who participated in the last TC39 editorial meeting. Most of the points he raises relate to Internationalization (I18N) issues, which he suggests can be addressed in Version 2.

I hope that this feedback is evaluated and addressed in future TC39 meetings. I am copying Tom on this email so that he can see the feedback. I also kindly request that the ECMA secretariat add Tom to the email list of TC39 (Mme Broxner, please.)

Mike

______________________________ Forward Header __________________________________
Subject: ECMAScript comments
Author: tommc-at-cnd (tommc@hptommc.cnd.hp.com) at HP-PaloAlto,mimegw3
Date: 5/19/98 12:32 PM

Hi Mike,

Attached is my list of comments for the ECMAScript v2.

Thanks!
Tom

----------------------
Comments on ECMAScript V2 for I18N
From: Tom McFarland

General:
ECMAScript needs to determine what controls/announces the locale for locale-sensitive operations. For example, the Date.toLocaleString() function generates a locale-sensitive date. But based on what locale? How does the application know? Each of the items below assumes that there is some way to control the locale behavior of the suggested operation.

7.7.5 Regular Expression Literals
At a minimum, the spec needs to define the behavior when attempting to match characters for which alternate Unicode representations might be available. For example, if searching for A-umlaut, Unicode allows this "logical character" to be represented by either a single Unicode code point, or by a sequence of two Unicode code points - one for "A" and one for "non-spacing umlaut". A better solution would be to allow the programmer to specify how "exact" or "fuzzy" the match should be. As a reference, see the four strengths that can be set on Java's Collator class.
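To illustrate the A-umlaut case, here is a minimal sketch in present-day JavaScript. It assumes an engine that provides String.prototype.normalize, which is not part of the version of the spec under discussion. The helper name matchesNormalized is hypothetical, and the sketch assumes the search text contains no regular expression metacharacters.

```javascript
// Two encodings of the same logical character, A-umlaut.
const precomposed = "\u00C4";   // single code point
const decomposed = "A\u0308";   // "A" followed by combining diaeresis

// A plain regular expression compares code units, so a pattern built
// from the precomposed form misses the decomposed form.
const naive = new RegExp(precomposed).test("B" + decomposed + "R");

// Hypothetical helper: bring the search text and the input to the same
// normalization form (NFC here) before matching.
// Assumes literalText contains no regex metacharacters.
function matchesNormalized(literalText, input) {
  const pattern = new RegExp(literalText.normalize("NFC"));
  return pattern.test(input.normalize("NFC"));
}

const normalized = matchesNormalized(precomposed, "B" + decomposed + "R");
console.log(naive, normalized);
```

Normalizing both sides gives only the "exact" end of the scale. It does not provide the graded strengths that Java's Collator offers.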
9.3.1 toNumber Applied to the String Type
String representations of numbers vary from locale to locale. In the US, the "." is used as the separator between the integer and fractional portions; in Europe, the "," is used for this separator. The character used for grouping also varies across locales.

9.8.1 The toString() function needs to handle the same situations as listed for 9.3.1 above.

11.8.5 Comparison operators
The specification notes that only a simple lexicographic ordering on sequences of Unicode characters is done. This is understandable from a performance standpoint. However, applications will need some mechanism to perform comparisons in a locale-sensitive fashion... to deal with local sorting customs. Java does this via the Collator class, a separate class for people willing to pay for the cost of doing a locale-sensitive sort.

11.9.3 Equality operators
The specification notes that only a simple lexicographic ordering on sequences of Unicode characters is done. However, since ECMAScript has decided to use Unicode, it has to provide applications some method to compare two strings for "logical equality". While Unicode eliminates the problems of different coded character sets, it adds a new bundle of problems to the mix.

For example, Unicode contains the character A-ring (0x00c5). However, it also contains the character A (0x0041) and the non-spacing character, combining ring-above (0x030A). So the logical character "A-ring" can be represented in Unicode as either 0x00c5 *or* as the sequence 0x0041 0x030A. As an application developer, I have no control as to which representation is passed to me from the input (keyboard, form, file, etc). Similarly, the user has no control over which is generated when they type the logical character at the keyboard. So ECMAScript applications must have some mechanism to ask if two strings are logically equal. In Java JDK 1.1.* and later, this is done via the Collator class.
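The comparison and equality concerns above can be sketched in present-day JavaScript. This assumes an engine with String.prototype.normalize and the Intl.Collator API, neither of which exists in the version of the spec being reviewed. The helper name logicallyEqual is hypothetical and the "de" locale tag is only an example.

```javascript
// A-ring as one code point, and as "A" + combining ring above.
const single = "\u00C5";
const sequence = "\u0041\u030A";

// The === operator compares code units, so the two forms differ.
const strictlyEqual = single === sequence;

// Hypothetical helper: treat strings as logically equal when their
// NFC normalizations are identical.
function logicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Locale-sensitive ordering, comparable in spirit to Java's Collator.
// The "de" locale tag is only an example.
const german = new Intl.Collator("de");

// a-umlaut (0x00E4) versus "z" (0x007A): code-unit order puts the
// a-umlaut after "z", while German sorting rules put it before.
const codeUnitOrder = "\u00E4" < "z";
const localeOrder = german.compare("\u00E4", "z") < 0;

console.log(strictlyEqual, logicallyEqual(single, sequence), codeUnitOrder, localeOrder);
```

As with Java's Collator, the locale-aware comparison is a separate object, so only code that needs it pays for it.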
12.11 The switch Statement
The switch statement uses the strict not-equal comparison. This will introduce the same problems described for 11.9.3 above for international software developers.

15.7.4.8 Array prototype reverse()
Because Unicode uses combining character sequences to represent "logical characters", reversing an array of Unicode characters can cause some very unexpected results. It might be worth an informative (i.e. non-normative) note in the spec about the hazards of reversing arrays of Unicode characters.

15.7.4.9 Array prototype shift()
The same problem described in 15.7.4.8 also exists for shift operations on arrays of Unicode characters. Maybe the best thing would be a non-normative note at the beginning of 15.7 mentioning the hazard of manipulating arrays of Unicode characters.

15.8.* String Object
The string object needs some mechanism to allow programmers to identify "logical character" boundaries. Armed with this, they could then do substring operations (albeit with a bit of work) and utilize the other methods of this class. As it stands, there is a good chance that a program will corrupt character data using most of the methods of this class. Java provides the BreakIterator class to perform this functionality. Unfortunately, Java stops at this point and leaves it to the developer to combine the BreakIterator class with methods of the String class to perform meaningful string operations (such as substring search, indexOf(), lastIndexOf(), etc).

15.9.* Regular Expression
Regular Expression doesn't handle Unicode - or at least the ambiguities of multiple ways to represent a single "logical character".

15.9.1 \s "white space"
Delete the words "Any white space" - and just leave it as equivalent to [ \f\n\r\t\v]. In fact, the space character should be called out as an explicit Unicode value. The reason? Unicode includes many other "space" characters, including the range 0x2000-0x200F, including such gems as 0x200B - the zero width space character.
Does \s include all these "space" characters, or only the ones enumerated? My guess is the latter; if so, it would be safer to explicitly list those characters identified by \s.

15.9.1 \w & \W
I realize that early on the spec defines that identifiers are limited to the characters a-zA-Z_ from the first 128 Unicode characters. However, this section is defining regular expressions for operating on user data, which is not constrained to the English letters a-zA-Z_. At a minimum, the character range a-z is confusing, since it raises the issue of whether or not a-umlaut is included in the range (function of linguistic customs). If you really want to constrain it to those characters from the first 128 Unicode code points, then it might be worth adding a note or being more explicit.

15.9.1 \b word boundary
The concept of "word" is very language specific. Given the definition in the spec, it would be best to simply change the phrase "word boundary" to "boundary condition"... unless \b is to be expanded to identify a word boundary based on the language of the string data being operated on.

15.11.4.2 Number.prototype.toString(radix)
Assuming number can be a floating point value, then some language needs to control formatting of the string produced... not all languages use "." as a decimal point indicator.

15.13.1.6 Day of the week number is locale-specific. In some cultures, Sunday is the first day of the week; in others, Monday is the first day of the week.

15.13.4.2 Date.parse(string)
Needs to handle locale-specific string representations of time/date. Java created the DateFormat class to perform this functionality.

15.13.5.39 toLocaleString()
Need some way to specify which locale will be used to control formatting of the time/date information.

--
Tom McFarland
Hewlett-Packard, Co.
<tommc@cnd.hp.com>

--------------------------------------------------------------------------------
FROM: tommc-at-cnd/HP-PaloAlto_mimegw3////////HPMEXT1/tommc#a#hptommc#f#cnd#f#hp#f#com@boi167
TO: MIKE_KSAR@HP-PaloAlto-om4.om.hp.com