Subject: Re: HP ECMAScript comments
From: MIKE_KSAR@HP-PaloAlto-om4.om.hp.com
SendTo: e-tc39@ECMA.CH
ReplyTo: MIKE_KSAR@HP-PaloAlto-om4.om.hp.com
PostedDate: 19.05.98 22:40:31

TC39 technical experts,

Here is some input from an experienced Unicode implementer, Ken Whistler of Sybase, who is also one of the technical directors of Unicode. Ken responds to Tom McFarland's feedback on I18N issues for version 2.

I recommend that TC39's technical experts review Ken's feedback, and retrieve and review the documents he refers to in his message from the Unicode website. As Ken says, the issue is not specific to ECMAScript; it applies to all programming languages.

Mike

______________________________ Forward Header __________________________________
Subject: Re: HP ECMAScript comments
Author: Non-HP-kenw (kenw@sybase.com) at HP-PaloAlto,shargw3
Date: 5/19/98 3:06 PM

Mike,

I concur with most of Tom's comments. Although I am not familiar with the details of ECMAScript, what he has stated are all valid concerns that should be addressed by any language standard that uses Unicode as its reference character set.

To this I would add that one elegant way for a programming language standard to sidestep some of the complexities introduced by alternative representations of the same text elements (particularly for Latin precomposed letters) is to define valid program text in terms of a *normalized* form of Unicode. Programming standards have been loath to use the normalized *decomposed* form of text, even though that is the most elegant way to handle the problem, partly because it expands the program text, but also for the practical reason that many systems do not yet handle the combining forms correctly. The alternative is to use a normalized *composed* form of text. That is what Mark Davis' document,

ftp://ftp.unicode.org/WorkingGroups/Properties/wdutr-Composition1.2.html

is trying to define rigorously. If a programming language standard were to specify canonical composition per that document as its normalized form for program text, then many of the complications posed by Tom's comments would drop away. Binary comparison of Unicode identifiers and strings would then be valid (a short sketch of this appears below).

The committee working on ECMAScript V2 for I18N should study the Java Collator class, of course, but also the Java resource bundles for localized data. Also, to gain perspective on the implementation guidelines for Unicode, they should look at the Unicode Technical Reports published or drafted on the Unicode website:

http://www.unicode.org/unicode/reports/techreports.html

Among these is the Draft UTR #10 on Unicode Collation, which describes in great detail the Unicode recommendations and data files for supporting multilingual, culturally correct collation.

Regarding some of Tom's other comments:

11.8.5 Comparison operators
"to pay for the cost of doing a locale-sensitive sort"

I would characterize this rather as a "culturally correct sort", so as not to beg the question of how the collation is defined. "Locale-sensitive" tends to reflect a model that presumes the existence of a "locale" that the application queries, whereas that is not how Java does it, and it might not be the way for ECMAScript to do it either (see the Collator sketch below).

11.9.3 Equality operators

Introducing normalization of program text would keep the basic equality operators simple. This does not preclude the need for definitions of levels of equality under collation as well, as Tom points out.
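A minimal sketch of binary comparison over canonically composed text, using java.text.Normalizer (a JDK API added well after this message was written, used here purely for illustration; the example strings are arbitrary):

    import java.text.Normalizer;

    public class NormalizedCompare {
        public static void main(String[] args) {
            // "cafe" spelled with a precomposed e-acute (U+00E9)
            String precomposed = "caf\u00E9";
            // the same word spelled with "e" plus a combining acute accent (U+0301)
            String decomposed = "cafe\u0301";

            // Raw binary comparison treats the two as different strings
            System.out.println(precomposed.equals(decomposed));   // false

            // After canonical composition (NFC), binary comparison is valid
            String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b));                      // true
        }
    }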
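A sketch of a culturally correct sort with the Java Collator class mentioned above: the collation rules travel with the Collator object itself rather than being looked up from a "locale" at compare time (the French word list is illustrative):

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class CulturallyCorrectSort {
        public static void main(String[] args) {
            String[] words = { "c\u00F4te", "cot\u00E9", "c\u00F4t\u00E9", "cote" };

            // The Collator carries the collation rules; no separate
            // locale query happens during the comparisons
            Collator french = Collator.getInstance(Locale.FRENCH);
            Arrays.sort(words, french);

            System.out.println(Arrays.toString(words));
        }
    }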
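And a sketch of levels of equality under collation, expressed here through Collator strength settings (the strings and locale are illustrative):

    import java.text.Collator;
    import java.util.Locale;

    public class EqualityLevels {
        public static void main(String[] args) {
            Collator c = Collator.getInstance(Locale.US);

            // PRIMARY strength: only base letters count;
            // case and accent differences are ignored
            c.setStrength(Collator.PRIMARY);
            System.out.println(c.equals("resume", "R\u00E9sum\u00E9"));  // true

            // TERTIARY strength (the default): accents and case
            // are both significant
            c.setStrength(Collator.TERTIARY);
            System.out.println(c.equals("resume", "R\u00E9sum\u00E9"));  // false
        }
    }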
15.13.5.39 toLocaleString()
"Need some way to specify which locale will be used"

Java has a mechanism for this, all spelled out, making use of the Format class (and its subclasses DateFormat, NumberFormat, etc.). It would be good for ECMAScript to follow that well-thought-out model, if possible (a sketch follows below).

--Ken
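A minimal sketch of the Format-class model described above, using java.text.DateFormat and java.text.NumberFormat (the locales and values are illustrative):

    import java.text.DateFormat;
    import java.text.NumberFormat;
    import java.util.Date;
    import java.util.Locale;

    public class LocaleFormatting {
        public static void main(String[] args) {
            Date now = new Date();
            double amount = 1234567.89;

            Locale[] locales = { Locale.US, Locale.GERMANY, Locale.FRANCE };
            for (Locale locale : locales) {
                // Each formatter is built for a specific locale up front
                DateFormat date = DateFormat.getDateInstance(DateFormat.LONG, locale);
                NumberFormat number = NumberFormat.getNumberInstance(locale);
                System.out.println(locale + ": " + date.format(now)
                        + " | " + number.format(amount));
            }
        }
    }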