Subject: HP ECMAScript comments
From: MIKE_KSAR@HP-PaloAlto-om4.om.hp.com @ D06AU010
SendTo: e-tc39@ECMA.CH
PostedDate: 19.05.98 20:54:18
TC39 ECMAScript experts,
Here is the input from Tom McFarland of HP, who participated in the
last TC39 editorial meeting. Most of the points he raises are
related to Internationalization (I18N) issues which he suggests can
be addressed in Version 2. I hope that this feedback is evaluated
and addressed in future TC39 meetings.
I am copying Tom on this email so that he can see the feedback.
I also kindly request that ECMA secretariat would add Tom to the
email list of TC39 (Mme Broxner, please.)
Mike
______________________________ Forward Header __________________________________
Subject: ECMAScript comments
Author: tommc-at-cnd (tommc@hptommc.cnd.hp.com) at HP-PaloAlto,mimegw3
Date: 5/19/98 12:32 PM
Hi Mike,
Attached is my list of comments on ECMAScript v2.
Thanks!
Tom
----------------------
Comments on ECMAScript V2 for I18N
From: Tom McFarland
General: ECMAScript needs to determine what controls/announces the
locale for locale-sensitive operations. For example, the
Date.toLocaleString() function generates a locale-sensitive date. But
based on what locale? How does the application know? Each of the items
below assumes that there is some way to control the locale behavior of
the suggested operation.
7.7.5 Regular Expression Literals
At a minimum, need to define the behavior when attempting to
match characters for which alternate Unicode representations might
be available. For example, if searching for A-umlaut, Unicode allows
this "logical character" to be represented by either a single Unicode
code point, or by a sequence of two Unicode code points - one for "A"
and one for "non-spacing umlaut".
A better solution would be to allow the programmer to specify how
"exact" or "fuzzy" the match should be. As a reference, see the
four strengths that can be set into Java's Collator class.
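A rough illustration of the matching problem, as a sketch in
present-day JavaScript. String.prototype.normalize comes from a much
later edition of the language and is used here only to show one
possible workaround; it is an assumption of this example, not
something the draft provides.

```javascript
const pattern = /\u00C4/;           // A-umlaut as a single code point
const precomposed = "\u00C4pfel";   // uses the single code point
const decomposed = "A\u0308pfel";   // "A" + non-spacing umlaut

console.log(pattern.test(precomposed)); // true
console.log(pattern.test(decomposed));  // false

// One workaround: bring the subject into composed form (NFC) first.
console.log(pattern.test(decomposed.normalize("NFC"))); // true
```

The two input strings display identically, yet only one of them
matches unless the program normalizes before searching.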
9.3.1 toNumber Applied to the String Type
String representations of numbers vary from locale to locale. In
the US, the "." separates the integer and fractional portions; in
much of Europe, the "," is used for this separator. The character
used for grouping also varies across locales.
9.8.1 The toString() function needs to handle the same situations as
listed for 9.3.1 above.
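To make the two points above concrete, here is a minimal sketch.
parseLocalizedNumber and formatLocalizedNumber are hypothetical
helpers, not part of ECMAScript; the separators are passed in by the
caller because the language offers no way to ask which ones apply.

```javascript
// Hypothetical helper: strip grouping characters, then map the
// caller-supplied decimal separator to "." before converting.
function parseLocalizedNumber(text, decimalSep, groupSep) {
  const plain = text.split(groupSep).join("").split(decimalSep).join(".");
  return Number(plain);
}

// Hypothetical helper: the built-in conversion writes "." as the
// decimal separator; swap in the one the caller asks for.
function formatLocalizedNumber(value, decimalSep) {
  return String(value).split(".").join(decimalSep);
}

console.log(Number("1.234,5"));                         // NaN
console.log(parseLocalizedNumber("1.234,5", ",", "."));   // 1234.5
console.log(parseLocalizedNumber("1,234.5", ".", ","));   // 1234.5
console.log(formatLocalizedNumber(1234.5, ","));          // 1234,5
```

The built-in conversion rejects the European form outright, so every
application ends up writing something like these helpers by hand.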
11.8.5 Comparison operators
The specification notes that only a simple lexicographic ordering on
sequences of Unicode characters is done. This is understandable from
a performance stand-point. However, applications will need some
mechanism to perform comparisons in a locale sensitive fashion... to
deal with local sorting customs. Java does this via the Collator
class, a separate class for people willing to pay for the cost of
doing a locale-sensitive sort.
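A small sketch of the difference. The first half uses only the
built-in operators; the second half uses Intl.Collator, which comes
from the later ECMA-402 internationalization API rather than from the
draft under discussion. The "de" locale is an assumption made for the
example.

```javascript
// "\u00E4" is a-umlaut. Its code unit (0xE4) is greater than that of
// "z" (0x7A), so the built-in comparison orders it after "z".
const words = ["zebra", "\u00E4rger", "apfel"];

console.log("\u00E4" < "z");                    // false
console.log(words.slice().sort().join(", "));   // apfel, zebra, ärger

// Locale-sensitive ordering through a separate collator object,
// similar in spirit to Java's Collator class.
const german = new Intl.Collator("de");
console.log(words.slice().sort(german.compare).join(", "));
```

With German collation data available, the second sort should place
the a-umlaut word next to "apfel" instead of after "zebra".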
11.9.3 Equality operators
The specification notes that only a simple lexicographic ordering on
sequences of Unicode characters is done. However, since ECMAScript
has decided to use Unicode, it has to provide applications some method
to compare two strings for "logical equality". While Unicode
eliminates the problems of different coded character sets, it adds a
new bundle of problems to the mix.
For example, Unicode contains the character A-ring (0x00C5).
However, it also contains the character A (0x0041) and the
non-spacing character, combining ring-above (0x030A). So the logical
character "A-ring" can be represented in Unicode as either 0x00C5
*or* as the sequence 0x0041 0x030A. As an application developer, I
have no control as to which representation is passed to me from the
input (keyboard, form, file, etc). Similarly, the user has no
control over which is generated when they type the logical character
at the keyboard.
So ECMAScript applications must have some mechanism to ask if two
strings are logically equal. In Java JDK 1.1.* and later, this is
done via the Collator class.
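The A-ring case can be shown in a few lines. logicallyEqual is a
hypothetical helper, and String.prototype.normalize is a later
addition to the language, so this is a sketch of what such a
mechanism might look like rather than anything the draft offers.

```javascript
const precomposed = "\u00C5";   // A-ring as one code point
const decomposed = "A\u030A";   // "A" followed by combining ring above

console.log(precomposed === decomposed);               // false
console.log(precomposed.length, decomposed.length);    // 1 2

// Hypothetical helper: compare after converting both strings to
// Unicode Normalization Form C.
function logicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(logicallyEqual(precomposed, decomposed));  // true
```

Normalization only addresses canonical equivalence; it is not a
substitute for the strength levels a collator provides.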
12.11 The switch Statement
The switch statement uses the strict not-equal comparison. This will
introduce the same problems described for 11.9.3 above for
international software developers.
15.7.4.8 Array prototype reverse()
Because Unicode uses combining character sequences to represent
"logical characters", reversing an array of Unicode characters can
cause some very unexpected results. It might be worth an informative
(i.e. non-normative) note in the spec about the hazards of reversing
arrays of Unicode characters.
15.7.4.9 Array prototype shift()
The same problem described in 15.7.4.8 also exists for shift operations
on arrays of Unicode characters. Maybe the best thing would be for
a non-normative note at the beginning of 15.7 mentioning the hazard of
manipulating arrays of Unicode characters.
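A short sketch of both hazards, using an array that holds A-ring in
its two-element form followed by "b". The data is made up for the
example.

```javascript
// Two logical characters, three array elements.
const chars = ["A", "\u030A", "b"];

// reverse(): the combining ring now follows "b" and attaches to it.
const reversed = chars.slice().reverse().join("");
console.log(reversed === "b\u030AA");          // true

// shift(): only the base letter is removed; the ring is left
// stranded at the front of the array.
const shifted = chars.slice();
const removed = shifted.shift();
console.log(removed);                          // A
console.log(shifted.join("") === "\u030Ab");   // true
```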
15.8.* String Object
The string object needs some mechanism to allow programmers to identify
"logical character" boundaries. Armed with this, they could then do
substring operations (albeit with a bit of work) and utilize the other
methods of this class. As it stands, there is a good chance that a
program will corrupt character data using most of the methods of this
class.
Java provides the BreakIterator class to perform this functionality.
Unfortunately, Java stops at this point and leaves it to the developer
to combine the BreakIterator class with methods of the String class to
perform meaningful string operations (such as substring search,
indexOf(), lastIndexOf(), etc).
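As a sketch of what a boundary mechanism buys the programmer:
logicalCharacters is a hypothetical helper, not part of ECMAScript,
and it rests on a deliberate simplification (only the Combining
Diacritical Marks block, 0x0300 to 0x036F, is treated as combining).
Real text needs the full Unicode boundary rules.

```javascript
// Hypothetical helper: group each code unit with any combining marks
// (0x0300 to 0x036F only, a simplifying assumption) that follow it.
function logicalCharacters(text) {
  const result = [];
  for (const unit of text.split("")) {
    const isCombining = unit >= "\u0300" && unit <= "\u036F";
    if (isCombining && result.length > 0) {
      result[result.length - 1] += unit;
    } else {
      result.push(unit);
    }
  }
  return result;
}

const sample = "A\u030Ab";   // A-ring (two code units) followed by "b"

console.log(sample.length);                               // 3
console.log(logicalCharacters(sample).length);            // 2
// A plain substring cuts the ring off its base letter...
console.log(sample.substring(0, 1) === "A");              // true
// ...while the boundary-aware split keeps them together.
console.log(logicalCharacters(sample)[0] === "A\u030A");  // true
```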
15.9.* Regular Expression
Regular Expression doesn't handle Unicode - or at least the ambiguities
of multiple ways to represent a single "logical character".
15.9.1 \s "white space"
Delete the words "Any white space" - and just leave it as equivalent
to [ \f\n\r\t\v]. In fact, the space character should be called out
as an explicit Unicode value. The reason? Unicode includes many other
"space" characters, such as those in the range 0x2000-0x200F, among
them such gems as 0x200B - the zero width space character.
Does \s include all these "space" characters, or only the ones
enumerated? My guess is the latter; if so, it would be safer to
explicitly list those characters identified by \s.
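The question can be put as a small test. The explicit class below
behaves the same everywhere; what \s reports for the other space
characters depends on which edition of the language an engine
implements, so no expected output is given for the last line.

```javascript
const explicit = /[ \f\n\r\t\v]/;   // the enumerated set
const emSpace = "\u2003";           // EM SPACE, in the 0x2000 range
const zeroWidth = "\u200B";         // ZERO WIDTH SPACE

console.log(explicit.test(" "));        // true
console.log(explicit.test("\t"));       // true
console.log(explicit.test(emSpace));    // false
console.log(explicit.test(zeroWidth));  // false

// Edition-dependent: prints whatever this engine's \s decides.
console.log(/\s/.test(emSpace), /\s/.test(zeroWidth));
```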
15.9.1 \w & \W
I realize that early on the spec defines that identifiers are limited
to the characters a-zA-Z_ from the first 128 Unicode characters.
However, this section is defining regular expressions for operating
on user data, which is not constrained to the English letters
a-zA-Z_. At a minimum, the character range a-z is confusing, since
it raises the issue of whether or not a-umlaut is included in the
range (function of linguistic customs).
If you really want to constrain it to those characters from the first
128 Unicode code points, then it might be worth adding a note or
being more explicit.
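A sketch of how this looks from the user-data side, with \w and
[a-z] interpreted strictly as the ASCII characters. The name used
here is made up for the example.

```javascript
const umlautA = "\u00E4";   // a-umlaut

console.log(/\w/.test("a"));            // true
console.log(/\w/.test(umlautA));        // false
console.log(/[a-z]/.test(umlautA));     // false
console.log(/\W/.test(umlautA));        // true

// A perfectly ordinary surname fails an "all word characters" check.
console.log(/^\w+$/.test("M\u00FCller"));  // false
```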
15.9.1 \b word boundary
The concept of "word" is very language specific. Given the definition
in the spec, it would be best to simply change the phrase "word
boundary" to "boundary condition"... unless \b is to be expanded to
identify a word boundary based on the language of the string data
being operated on.
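For instance, with \w limited to the ASCII characters, \b reports a
"word boundary" in the middle of a French word. A minimal sketch:

```javascript
const word = "\u00E9lan";   // "élan", with a precomposed e-acute

// Mark every position where \b matches.
console.log(word.replace(/\b/g, "|"));       // é|lan|

// Extracting "words" loses the first letter.
console.log(word.match(/\b\w+\b/g)[0]);      // lan
```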
15.11.4.2 Number.prototype.toString(radix)
Assuming number can be a floating point value, then some language needs
to control formatting of the string produced... not all languages use
"." as a decimal point indicator.
15.13.1.6
Day of the week number is locale-specific. In some cultures, Sunday
is the first day of the week; in others, Monday is the first day
of the week.
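Assuming this refers to the Sunday-based numbering that getDay()
returns, the adjustment an application has to make looks roughly
like this. dayIndexInWeek is a hypothetical helper, and the first
day of the week is supplied by the caller because the language does
not expose it.

```javascript
// Hypothetical helper: convert getDay()'s Sunday-based index into an
// index relative to a caller-chosen first day of the week.
function dayIndexInWeek(date, firstDayOfWeek) {
  return (date.getDay() - firstDayOfWeek + 7) % 7;
}

const SUNDAY = 0;
const MONDAY = 1;
const sample = new Date(1998, 4, 19);   // 19 May 1998, a Tuesday

console.log(sample.getDay());                  // 2
console.log(dayIndexInWeek(sample, SUNDAY));   // 2
console.log(dayIndexInWeek(sample, MONDAY));   // 1
```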
15.13.4.2 Date.parse(string)
Needs to handle locale-specific string representations of time/date.
Java created the DateFormat class to perform this functionality.
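A sketch of the kind of functionality meant here. parseNumericDate
is a hypothetical helper, not part of ECMAScript: the caller states
the separator and field order explicitly, which is exactly the
information a locale would otherwise supply.

```javascript
// Hypothetical helper: parse an all-numeric date whose field order
// and separator are given by the caller.
function parseNumericDate(text, separator, order) {
  const parts = text.split(separator).map(Number);
  const fields = {};
  order.forEach((name, i) => { fields[name] = parts[i]; });
  // Date months are zero-based.
  return new Date(fields.year, fields.month - 1, fields.day);
}

const us = parseNumericDate("5/19/1998", "/", ["month", "day", "year"]);
const eu = parseNumericDate("19.05.1998", ".", ["day", "month", "year"]);

console.log(us.getTime() === eu.getTime());   // true
console.log(eu.getMonth(), eu.getDate());     // 4 19
```

Both strings name the same day; only the convention for writing it
differs.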
15.13.5.39 toLocaleString()
Need some way to specify which locale will be used to control
formatting of the time/date information.
--
Tom McFarland
Hewlett-Packard, Co.
<tommc@cnd.hp.com>