Subject:  HP ECMAScript comments
From: @ D06AU010
SendTo:  e-tc39@ECMA.CH
PostedDate:  19.05.98 20:54:18

     TC39 ECMAScript experts,

     Here is the input from Tom McFarland of HP, who participated in the
     last TC39 editorial meeting.  Most of the points he raises are
     related to Internationalization (I18N) issues which he suggests can
     be addressed in Version 2.  I hope that this feedback is evaluated
     and addressed in future TC39 meetings.

     I am copying Tom on this email so that he can see the feedback.

     I also kindly request that the ECMA secretariat add Tom to the
     email list of TC39 (Mme Broxner, please.)


______________________________ Forward Header __________________________________
Subject: ECMAScript comments
Author:  tommc-at-cnd ( at HP-PaloAlto,mimegw3
Date:    5/19/98 12:32 PM

Hi Mike,

Attached is my list of comments for the ECMAScript v2.




Comments on ECMAScript V2 for I18N

From: Tom McFarland

General:  ECMAScript needs to determine what controls/announces the
locale for locale-sensitive operations.  For example, the
Date.toLocaleString() function generates a locale-sensitive date.  But
based on what locale?  How does the application know?  Each of the items
below assumes that there is some way to control the locale behavior of
the suggested operation.

7.7.5 Regular Expression Literals

   At a minimum, need to define the behavior when attempting to match
   characters for which alternate Unicode representations might be
   available.  For example, if searching for A-umlaut, Unicode allows
   this "logical character" to be represented by either a single Unicode
   code point, or by a sequence of two Unicode code points - one for "A"
   and one for "non-spacing umlaut".

   A better solution would be to allow the programmer to specify how
   "exact" or "fuzzy" the match should be.  As a reference, see the
   four strengths that can be set into Java's Collator class.

9.3.1 toNumber Applied to the String Type

   String representations of numbers vary from locale to locale.  In
   the US, the "." is used as the separator between the integer and
   fractional portions; in much of Europe, the "," is used for this
   separator.  The character used for grouping also varies across
   locales.

9.8.1 The toString() function needs to handle the same situations as
   listed for 9.3.1 above.
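
   As a rough sketch of the two directions described in 9.3.1 and 9.8.1
   (Python is used here only for illustration; parse_number,
   format_number and their separator parameters are hypothetical names,
   not part of any ECMAScript proposal), the decimal and grouping
   characters have to be supplied from outside the string:

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse a numeric string using caller-supplied separators (hypothetical helper)."""
    # Drop grouping characters first, then map the decimal separator to ".".
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


def format_number(value, decimal_sep=".", group_sep=",", digits=2):
    """Format a number using caller-supplied separators (hypothetical helper)."""
    us_style = f"{value:,.{digits}f}"  # e.g. "1,234.50"
    # Swap both separators in one pass so they cannot clobber each other.
    return us_style.translate(str.maketrans({",": group_sep, ".": decimal_sep}))


us_value = parse_number("1,234.56", decimal_sep=".", group_sep=",")
de_value = parse_number("1.234,56", decimal_sep=",", group_sep=".")
de_text = format_number(1234.5, decimal_sep=",", group_sep=".")
```

   Both strings parse to the same value, and the formatted result uses
   "," as the decimal point.  Where the separators come from (the
   locale question raised under "General") is left open here.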

11.8.5 Comparison operators

   The specification notes that only a simple lexicographic ordering on
   sequences of Unicode characters is done.  This is understandable from
   a performance stand-point.  However, applications will need some
   mechanism to perform comparisons in a locale sensitive fashion... to
   deal with local sorting customs.  Java does this via the Collator
   class, a separate class for people willing to pay for the cost of
   doing a locale-sensitive sort.
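
   To illustrate the gap, here is a minimal Python sketch that contrasts
   plain code-point ordering with a crude comparison key.  The key
   (rough_sort_key, a hypothetical name) merely ignores accents and
   case; it is an assumption made for illustration and is far from real
   locale-sensitive collation of the kind Java's Collator performs.

```python
import unicodedata


def rough_sort_key(s):
    """Crude stand-in for a collation key: ignores accents and case."""
    decomposed = unicodedata.normalize("NFD", s)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


words = ["zebra", "\u00c4pfel", "apple"]  # "\u00c4" is A-umlaut

by_code_point = sorted(words)                     # A-umlaut sorts after "z"
by_rough_key = sorted(words, key=rough_sort_key)  # A-umlaut sorts with "a"
```

   Under code-point ordering the A-umlaut word lands after "zebra";
   with the rough key it sorts next to "apple", which is closer to what
   most users would expect.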

11.9.3 Equality operators

   The specification notes that only a simple lexicographic ordering on
   sequences of Unicode characters is done.  However, since ECMAScript
   has decided to use Unicode, it has to provide applications some method
   to compare two strings for "logical equality".  While Unicode
   eliminates the problems of different coded character sets, it adds a
   new bundle of problems to the mix.

   For example, Unicode contains the character A-ring (0x00C5).
   However, it also contains the character A (0x0041) and the
   non-spacing character, combining ring-above (0x030A).  So the logical
   character "A-ring" can be represented in Unicode as either 0x00C5
   *or* as the sequence 0x0041 0x030A.  As an application developer, I
   have no control as to which representation is passed to me from the
   input (keyboard, form, file, etc).  Similarly, the user has no
   control over which is generated when they type the logical character
   at the keyboard.

   So ECMAScript applications must have some mechanism to ask if two
   strings are logically equal.  In Java JDK 1.1.* and later, this is
   done via the Collator class.
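
   The A-ring case can be reproduced in a few lines of Python.  The
   sketch below assumes that "logical equality" means equality after
   canonical composition (Unicode normalization form NFC); that is one
   possible definition, not the Collator approach mentioned above, and
   logically_equal is a hypothetical name.

```python
import unicodedata


def logically_equal(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00c5"   # A-ring as a single code point
decomposed = "A\u030a"   # "A" followed by combining ring above

plain_equal = precomposed == decomposed               # code-point comparison
logical = logically_equal(precomposed, decomposed)    # normalized comparison
```

   The plain comparison reports the two strings as different, while
   the normalized comparison treats them as the same logical character.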

12.11  The switch Statement

   The switch statement uses the strict not-equal comparison.  This will
   introduce the same problems described for 11.9.3 above for
   international software developers.

Array prototype reverse()

   Because Unicode uses combining character sequences to represent
   "logical characters", reversing an array of Unicode characters can
   cause some very unexpected results.  It might be worth an informative
   (i.e. non-normative) note in the spec about the hazards of reversing
   arrays of Unicode characters.

Array prototype shift()

   The same problem described for reverse() above also exists for shift
   operations on arrays of Unicode characters.  Maybe the best thing
   would be a non-normative note at the beginning of 15.7 mentioning the
   hazard of manipulating arrays of Unicode characters.
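
   A small Python sketch of the hazard, using a decomposed A-ring
   followed by "B".  The grouping helper (logical_chars, a hypothetical
   name) simply attaches combining marks to the preceding base
   character; that is a simplifying assumption and not a complete
   definition of a "logical character".

```python
import unicodedata


def logical_chars(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


text = "A\u030aB"  # A-ring (decomposed), then B

naive_reverse = "".join(reversed(text))                 # ring now follows "B"
safe_reverse = "".join(reversed(logical_chars(text)))   # ring stays with "A"
naive_shift = list(text)[1:]                            # starts with an orphaned ring
```

   The naive reversal moves the ring onto "B", and the naive shift
   leaves a combining mark with no base character in front of it.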

15.8.*  String Object

   The string object needs some mechanism to allow programmers to identify
   "logical character" boundaries.  Armed with this, they could then do
   substring operations (albeit with a bit of work) and utilize the other
   methods of this class.  As it stands, there is a good chance that a
   program will corrupt character data using most of the methods of this
   class.

   Java provides the BreakIterator class to perform this functionality.
   Unfortunately, Java stops at this point and leaves it to the developer
   to combine the BreakIterator class with methods of the String class to
   perform meaningful string operations (such as substring search,
   indexOf(), lastIndexOf(), etc).
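
   As a hedged illustration of what a boundary mechanism might offer,
   the Python sketch below computes boundary offsets and builds a
   substring operation on top of them.  It is only loosely inspired by
   the idea behind BreakIterator and does not mirror its API; the names
   boundaries and logical_substring are hypothetical, and a boundary is
   assumed to fall before every character that is not a combining mark.

```python
import unicodedata


def boundaries(text):
    """Return offsets at which a logical character starts, plus the end offset."""
    offsets = [i for i, ch in enumerate(text)
               if i == 0 or not unicodedata.combining(ch)]
    offsets.append(len(text))
    return offsets


def logical_substring(text, start, end):
    """Slice by logical-character index instead of code-point index."""
    b = boundaries(text)
    return text[b[start]:b[end]]


text = "A\u030abc"                           # A-ring (decomposed), "b", "c"

offsets = boundaries(text)                   # [0, 2, 3, 4]
broken = text[:1]                            # "A" with its ring cut off
intact = logical_substring(text, 0, 1)       # "A" plus combining ring
```

   Slicing by code-point index separates the ring from its base
   character; slicing by boundary index keeps the pair together.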

15.9.* Regular Expression

   Regular Expression doesn't handle Unicode - or at least the ambiguities
   of multiple ways to represent a single "logical character".

15.9.1 \s "white space"

   Delete the words "Any white space" - and just leave it as equivalent
   to [ \f\n\r\t\v].  In fact, the space character should be called out
   as an explicit Unicode value.  The reason?  Unicode includes many other
   "space" characters, including the range 0x2000-0x200F, including such
   gems as 0x200B - the zero width space character.

   Does \s include all these "space" characters, or only the ones
   enumerated?  My guess is the latter; if so, it would be safer to
   explicitly list those characters identified by \s.
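
   The ambiguity is easy to demonstrate in another regular expression
   engine.  The sketch below uses Python's re module, whose behavior is
   not ECMAScript's and is shown only as an analogy: its \s is
   Unicode-aware by default for text strings and is restricted to the
   ASCII set only when the re.ASCII flag is given.

```python
import re

EM_SPACE = "\u2003"  # one of the "space" characters in the 0x2000 range

unicode_ws = re.compile(r"\s")               # Unicode-aware in Python
ascii_ws = re.compile(r"\s", re.ASCII)       # limited to ASCII white space
explicit_ws = re.compile(r"[ \f\n\r\t\v]")   # the enumerated set

matches_em_space = [
    bool(p.fullmatch(EM_SPACE)) for p in (unicode_ws, ascii_ws, explicit_ws)
]
```

   Only the Unicode-aware pattern matches the em space, which is the
   kind of difference an explicit enumeration in the spec would rule
   out.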

15.9.1 \w & \W

   I realize that early on the spec defines that identifiers are limited
   to the characters a-zA-Z_ from the first 128 Unicode characters.
   However, this section is defining regular expressions for operating
   on user data, which is not constrained to the English letters
   a-zA-Z_.  At a minimum, the character range a-z is confusing, since
   it raises the issue of whether or not a-umlaut is included in the
   range (function of linguistic customs).

   If you really want to constrain it to those characters from the first
   128 Unicode code points, then it might be worth adding a note or
   being more explicit.
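
   To make the difference concrete, here is a short sketch using
   Python's re module (again only as an analogy; it says nothing about
   how ECMAScript engines behave).  With re.ASCII, \w is limited to
   [a-zA-Z0-9_]; without it, accented letters count as word characters.

```python
import re

word = "na\u00efve"  # "naive" spelled with i-diaeresis as one code point

unicode_words = re.findall(r"\w+", word)           # accented letter is a word char
ascii_words = re.findall(r"\w+", word, re.ASCII)   # accented letter splits the word
```

   The ASCII-only definition breaks the word into "na" and "ve", which
   is probably not what a user searching non-English data expects.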

15.9.1 \b word boundary

   The concept of "word" is very language specific.  Given the definition
   in the spec, it would be best to simply change the phrase "word
   boundary" to "boundary condition"... unless \b is to be expanded to
   identify a word boundary based on the language of the string data
   being operated on.

Number.prototype.toString(radix)

   Assuming number can be a floating point value, then some language needs
   to control formatting of the string produced... not all languages use
   "." as a decimal point indicator.

   Day of the week number is locale-specific.  In some cultures, Sunday
   is the first day of the week; in others, Monday is the first day
   of the week.

Date.parse(string)

   Needs to handle locale-specific string representations of time/date.
   Java created the DateFormat class to perform this functionality.

toLocaleString()

   Need some way to specify which locale will be used to control
   formatting of the time/date information.
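
   A minimal Python sketch of why both parsing and formatting need an
   explicit locale.  The DATE_PATTERNS table, its two locale names and
   the helper functions are hypothetical and cover only numeric dates;
   real locale data is far richer.

```python
from datetime import datetime

# Hypothetical per-locale numeric date patterns (illustration only).
DATE_PATTERNS = {"en_US": "%m/%d/%y", "en_GB": "%d/%m/%y"}


def parse_date(text, locale_name):
    """Parse a numeric date string according to the named locale's pattern."""
    return datetime.strptime(text, DATE_PATTERNS[locale_name]).date()


def format_date(d, locale_name):
    """Format a date according to the named locale's pattern."""
    return d.strftime(DATE_PATTERNS[locale_name])


us_date = parse_date("05/06/98", "en_US")   # read as month/day/year
gb_date = parse_date("05/06/98", "en_GB")   # read as day/month/year
round_trip = format_date(us_date, "en_GB")  # same date, other convention
```

   The same input string yields two different dates depending on the
   locale, so neither operation is well defined until the locale is
   known.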


Tom McFarland
Hewlett-Packard, Co.