ECMA 262 – TC39

Date:                       March 21, 1999

Location:                Microsoft

Next Meeting:                June 21, 1999 - 9:30am at Netscape, San Jose

 

People:

                Mike Cowlishaw

                Cedric Krumbein

                Clayton Lewis

                Herman Venter

                Dario Russi

                Rok Yu

                Mike McCabe

                Waldemar Horwat

Regular Expressions

15.10.2.13 CharacterClassEscapeCharacter

Semantics of the production CharacterClassEscapeCharacter :: d | D agreed.

For production CharacterClassEscapeCharacter :: s | S

15.10.2.8 Atom – Helper Function Canonicalize

Converts a single input character to locale-independent upper case. The one exception is when the conversion would result in multiple characters, as with the German “ß”, which upper-cases to “SS”. In that case, the original character “ß” is kept.
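A minimal sketch of the rule just described, assuming the helper works on one character at a time. The function name canonicalize is illustrative and the code is not taken from the draft algorithm.

```javascript
// Illustrative sketch of the Canonicalize helper described above.
function canonicalize(ch) {
  // String.prototype.toUpperCase is locale independent
  // (toLocaleUpperCase is the locale-sensitive variant).
  const upper = ch.toUpperCase();
  // If upper-casing expands to several characters ("ß" becomes "SS"),
  // keep the original character instead.
  return upper.length === 1 ? upper : ch;
}
```

With this sketch, canonicalize("a") gives "A" while canonicalize("ß") leaves the character unchanged.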

Atom 15.10.2.8 – Helper Function CharacterSetMatcher

The “^” operator in a character set has the effect of inverting the result of the match, not the set of characters being matched. E.g., the pattern /[^abcC]/ matches exactly the inputs that /[abcC]/ fails to match; it is not defined as matching the set of characters /[{all Unicode characters} – {a, b, c, C}]/. The two definitions differ once case-insensitive matching is involved. This is encoded in steps 8 through 11 of the algorithm.
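The difference between inverting the result and inverting the set shows up under case-insensitive matching. The sketch below is an assumption-laden illustration: characterSetMatcher and canonicalize are illustrative names, and the eight-character universe stands in for "all Unicode characters".

```javascript
// Illustrative canonicalization: upper-case only when ignoreCase is set.
function canonicalize(ch, ignoreCase) {
  if (!ignoreCase) return ch;
  const upper = ch.toUpperCase();
  return upper.length === 1 ? upper : ch;
}

// Builds a one-character matcher. When `invert` is true, the RESULT of the
// membership test is negated; the member set itself is never complemented.
function characterSetMatcher(members, invert, ignoreCase) {
  return function (ch) {
    const target = canonicalize(ch, ignoreCase);
    const found = members.some(
      (m) => canonicalize(m, ignoreCase) === target
    );
    return invert ? !found : found;
  };
}

const members = ["a", "b", "c", "C"];

// Agreed semantics: invert the result of matching against [abcC].
const invertResult = characterSetMatcher(members, true, true);

// Rejected alternative: complement the set first, then match normally.
// The tiny universe below is a stand-in for all Unicode characters.
const universe = ["a", "A", "b", "B", "c", "C", "d", "D"];
const complement = universe.filter((u) => !members.includes(u));
const invertSet = characterSetMatcher(complement, false, true);
```

Here invertResult("a") is false, because "a" is in the set, while invertSet("a") is true, because "A" survives in the complement and canonicalizes to the same character as "a".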

Atom 15.10.2.8 – Multiline and Handling “.”, “^”, “$”

Perl’s implementation is described in “Mastering Regular Expressions”, p. 232. The table of how the flags interact in Perl is as follows (LB = line break):

 

        default     m            s           sm
$       end         before LB    end         before LB
^       begin       after LB     begin       after LB
.       non-LB      non-LB       any         any

 

We will only be supporting the “m” behavior. In Perl and in existing ECMAScript implementations, “.” matches everything except LF. We agree to standardize “.” to match everything except the characters in the character class LineTerminator.
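The agreed behavior can be probed with a few patterns. This sketch assumes an engine that implements the decision as recorded here, with LineTerminator taken to be LF, CR, U+2028 and U+2029. The function name lineBoundaryDemo is illustrative.

```javascript
// Collects the results of a few matches that exercise "^", "$" and ".".
function lineBoundaryDemo() {
  const text = "a\nb\nc";
  return {
    // Without "m", "^" and "$" only anchor at the ends of the input.
    caretDefault: /^b/.test(text),
    dollarDefault: /b$/.test(text),
    // With "m", "^" also matches after a line break and "$" before one.
    caretMultiline: /^b/m.test(text),
    dollarMultiline: /b$/m.test(text),
    // "." refuses every LineTerminator character, not just LF.
    dotOverLF: /a.b/.test("a\nb"),
    dotOverCR: /a.b/.test("a\rb"),
    dotOverLS: /a.b/.test("a\u2028b"),
    dotOverPS: /a.b/.test("a\u2029b"),
    // Any other character is accepted by ".".
    dotOverTab: /a.b/.test("a\tb"),
  };
}
```

Only caretMultiline, dollarMultiline and dotOverTab come out true; every other entry is false.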

Atom 15.10.2.8 – Character Escape

The syntax requires the correct number of digits in character escape sequences. Implementations may choose how to recover from errors where the correct number of digits has not been specified. E.g., “\x7” is a syntax error (the implementation can do whatever it wants). To specify the character you need “\x07”.
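A minimal sketch of a strict reader for the “\x” escape. Throwing on a short escape is just one of the recovery choices left open to implementations, and parseHexEscape is an illustrative name.

```javascript
// Reads the two hex digits that must follow "\x" in a pattern source.
// `pos` is the index of the first character after the "x".
function parseHexEscape(source, pos) {
  const digits = source.slice(pos, pos + 2);
  if (!/^[0-9A-Fa-f]{2}$/.test(digits)) {
    // Assumption: this sketch recovers by reporting a syntax error.
    throw new SyntaxError("\\x must be followed by exactly two hex digits");
  }
  return {
    char: String.fromCharCode(parseInt(digits, 16)),
    next: pos + 2, // index at which scanning of the pattern resumes
  };
}
```

Called on the source text \x07 with pos 2, the sketch yields the character with code 7. Called on \x7, it throws.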

Atom 15.10.2.8 – Identity Escape

When a “\” is followed by a character that is not part of an identifier, the character is accepted directly. These characters are defined by the syntax class IdentityEscape. Characters not in IdentityEscape are defined to throw errors.

Atom 15.10.2.8 – Digit Escape

What happens with “\0[0-9]”?

 

Option 1 (Perl):

Current proposal captures Perl behavior where two digit octal values are allowed.

 

Option 2 (Simple):

\0 Octal (null)

\[0-3][0-7][0-7] Octal

\[1-9] Backref

\[1-9][0-9] Backref

 

Octals are always 3 digits with the one exception of “\0” which can be one digit. The first option is pretty complicated. The second option specifies legal behavior that is different from Perl.
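Option 2 can be sketched as a small classifier over the digits that follow the backslash. The minutes do not say which rule wins when both a three-digit octal and a backreference could apply (for example “\123”), so trying the octal form first is an assumption of this sketch. The name classifyDigitEscape is illustrative too.

```javascript
// Classifies the digits following a "\" according to Option 2.
// Returns the kind of escape and how many digits it consumed.
function classifyDigitEscape(digits) {
  let m;
  // Assumption: a full three-digit octal takes precedence.
  if ((m = /^[0-3][0-7][0-7]/.exec(digits))) {
    return { kind: "octal", value: parseInt(m[0], 8), length: 3 };
  }
  // One or two decimal digits not starting with 0: a backreference.
  if ((m = /^[1-9][0-9]?/.exec(digits))) {
    return { kind: "backref", index: parseInt(m[0], 10), length: m[0].length };
  }
  // A lone "0" is the one-digit octal exception (the null character).
  if (digits.charAt(0) === "0") {
    return { kind: "octal", value: 0, length: 1 };
  }
  return null; // not a digit escape
}
```

Under these assumptions, “\09” reads as the null character followed by a literal “9”, “\12” is backreference 12, and “\101” is the octal escape for “A”.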

Action: Waldemar will mail Larry Wall to see how he feels about the simplification with respect to Perl RegExp.

Atom 15.10.2.8 – Internal function BackreferenceMatcher

Line 19 should be changed as follows:

19. If s is undefined, goto 27

Action: Waldemar will make necessary change.

Atom 15.10.2.8 – Atom :: PatternCharacter

Content agreed.

Atom 15.10.2.8 – Character Classes

Backslashes should be brackets in the CharacterClass productions.

Nuke algorithm step 73 in RangeList (15.10.2.16).

Nuke algorithm steps 92 and 99 – 101 in RangeListNoDash (15.10.2.17).

 

An issue similar to the Digit Escapes described above occurs in the production RangeAtomNoDash (15.10.2.19).

Action: Waldemar will make the necessary changes.

 

Is there a customer need for regular expressions to support Unicode character classes? The following were suggested as potential applications:

  1. Grepping through user code.
  2. Generating new character classes by subtracting sets of characters from Unicode-aware sets.

 

Consensus to defer to 4th Edition.

Action: When mailing Larry Wall, Waldemar will also ask about planned Unicode support in Perl RegExp.

 

The Turkish dotless i maps to an ASCII I when mapping to upper case. This results in relatively non-intuitive behavior when matching a pattern like “/[\W]/i”. Are there any other non-ASCII characters that turn into ASCII characters? The internal function Canonicalize would be modified to handle this exception (step 13.5), much like the handling of the case conversion of the German eszett (ß).
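One way to answer the question about other such characters is to scan for them. The sketch below checks the Basic Multilingual Plane only, and the list it finds depends on the Unicode data shipped with the engine. Both function names are illustrative, and the guard in canonicalizeGuarded is one possible reading of the proposed exception.

```javascript
// Lists non-ASCII code units whose locale-independent upper case
// is a single ASCII character.
function nonAsciiToAsciiUppercase() {
  const hits = [];
  for (let code = 0x80; code <= 0xffff; code++) {
    const upper = String.fromCharCode(code).toUpperCase();
    if (upper.length === 1 && upper.charCodeAt(0) < 0x80) {
      hits.push(code);
    }
  }
  return hits;
}

// Canonicalize with the additional guard: a non-ASCII character is
// never folded onto an ASCII one.
function canonicalizeGuarded(ch) {
  const upper = ch.toUpperCase();
  if (upper.length !== 1) return ch; // multi-character result, e.g. "ß"
  if (ch.charCodeAt(0) >= 0x80 && upper.charCodeAt(0) < 0x80) {
    return ch; // e.g. dotless i (U+0131), which upper-cases to "I"
  }
  return upper;
}
```

The scan reports U+0131 among its hits, and canonicalizeGuarded leaves that character unchanged while still mapping "a" to "A".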

 

The PatternCharacter doesn’t currently allow line terminators.

Action: Waldemar will make the necessary changes to allow line terminators in PatternCharacter.

??? – Regular Expression Literals

Discussion: Does a regular expression literal introduce a new object each time it is evaluated? No. This matches existing behavior and also matches Perl. The semantics aren’t consistent with nested functions, but this did not seem like an important enough reason to change them. e.g.

                function foo() { return /a/; }

                r1 = foo();

                r2 = foo();

                r1.bar = 45;

                r2.bar          // 45, because r1 and r2 are the same object

 

A scenario where it is desirable to always return a single object is when one wants to iterate over multiple matches using a literal.

                for (i = 1; i < 4; i++)

                        print(/a*b/g.exec("aabaaab"));
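The iteration relies on a single RegExp object carrying its lastIndex from one exec call to the next. The sketch below hoists the literal into a variable, so it does not depend on whether the engine shares one object per literal. The names pattern and found are illustrative.

```javascript
// One RegExp object with the "g" flag; exec resumes from lastIndex.
const pattern = /a*b/g;
const found = [];
for (let i = 1; i < 4; i++) {
  const match = pattern.exec("aabaaab");
  // exec returns null once no further match exists (and resets lastIndex).
  found.push(match === null ? null : match[0]);
}
// found is ["aab", "aaab", null]
```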

 

The changes for the literal regular expression syntax are content agreed.

15.10.3 – 15.10.7 RegExp Methods

Some minor changes are required to 15.10.4 – can’t remember what.

 

The following changes need to be made to 15.10.6.2 RegExp.prototype.exec(string)

  1. In the algorithm, all occurrences of “string” need to be changed to “ToString(string)”.
  2. Step 6 has a clause “go to step 6”. It should be “go to step 7”.
  3. The [[match]] property name in step 6 needs to be reconciled with the [[matcher]] property described in 15.10.4.1 (new RegExp).
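The effect of the first change is that exec accepts non-string arguments by converting them. A small sketch, assuming an engine that applies ToString as described. The variable names are illustrative.

```javascript
// A number argument: ToString(12345) is "12345".
const fromNumber = /23/.exec(12345);

// An object argument: ToString goes through the object's toString method.
const fromObject = /b+/.exec({
  toString() {
    return "abbbc";
  },
});
```

In both cases the match is reported against the converted string: fromNumber[0] is "23" at index 1, and fromObject[0] is "bbb" at index 1.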

 

In section 15.10.7, the phrase “most recently specified” should be nuked.

 

Otherwise, the text is content agreed.

Action: Waldemar will make required changes.

15.5.4.10, 15.5.4.14 – 15.5.4.16 Methods on String dealing with RegExp

The “$+” needs to be removed from String.prototype.replace (15.5.4.15).

 

The section header for String.prototype.split (15.5.4.10) requires a “limit” parameter. Both the separator and limit parameters should use the syntax indicating that they are optional (square brackets).
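A short sketch of what the optional parameters mean in practice, assuming the behavior of split as it was eventually standardized. The variable names are illustrative.

```javascript
// With a limit, at most that many pieces are returned.
const limited = "a,b,c,d".split(",", 2);

// Without a limit, every piece is returned.
const unlimited = "a,b,c,d".split(",");

// Without a separator, the whole string is the only piece.
const noSeparator = "a,b,c,d".split();
```

Here limited is ["a", "b"], unlimited has four elements, and noSeparator is a one-element array holding the original string.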

 

The algorithm for split also needs the following step added:

20.5 Set p = k+m

 

Otherwise, the text is content agreed.

Action: Waldemar will make required changes. He will also write up an update to String.prototype.split (15.5.4.10) to allow for capturing parentheses.

Exceptions

The following are potential options for structuring the top levels of the exception hierarchy.

1. Single class

class Error extends Object

 

2. Two classes – no common hierarchy

class Error extends Object

class Exception extends Object

 

3. Two classes – Error parent

class Exception extends Object

class Error extends Exception

 

4. Two classes – Exception parent

class Error extends Object

class Exception extends Error

 

5. Three classes

class Signal extends Object

class Error extends Signal

class Exception extends Signal

 

General consensus is that fewer classes are better and that, if we introduce a hierarchy, it is not clear how the specific exceptions currently generated by the runtime would be grouped.

 

It is functionally agreed that there will be a single class called Error. All engine runtime errors will inherit from it. The method toString is implementation defined and will use the name and message properties.
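A sketch of what the single-class design looks like from script. The “name: message” format in describeError is only one possible choice, since toString is left implementation defined. The helper names are illustrative.

```javascript
// One possible rendering that uses the name and message properties.
function describeError(err) {
  return err.name + ": " + err.message;
}

// Triggers an engine runtime error and returns the thrown object.
function captureRuntimeError() {
  try {
    null.property; // property access on null fails at run time
  } catch (e) {
    return e;
  }
  return undefined;
}

const runtimeError = captureRuntimeError();
// Engine errors inherit from the single Error class.
const inheritsFromError = runtimeError instanceof Error;
```

For example, describeError(new Error("boom")) gives "Error: boom", and inheritsFromError is true.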

 

There will be a toLocaleString that is implementation dependent. Herman brings up the question as to whether there should be an additional toLocaleMessage method. The reasoning behind the question is as follows:

 

  1. toLocaleString in other cases is the localized version of toString.
  2. toString is defined to generate a message using both the name and message properties.
  3. From 1 and 2, it follows that the string returned by an exception’s toLocaleString is some localized string that translates the combination of the name and message properties.
  4. It may be common that one wants to get the localized message without any extra modifications associated (such as adding the name property) in which case another method is required.

Action: Mike McCabe will make the appropriate changes to the proposal spec and confer with HermanV to figure out whether toLocaleMessage is required or whether toString’s definition should be changed to not require the exception name.

Other Topics

11.1.4 – Array Initializer

Content agreed, except that ArrayLiteralHead should be nuked and the associated R.H.S. should be merged into the productions which currently use ArrayLiteralHead.

Action: Mike McCabe will make necessary change.

11.1.5 – Object Initializer

Functionally agreed. Similar changes for Object Initializer and ObjectLiteralHead as with array initializer. It is currently unclear what is valid for the member name in an object initializer.

Action: Mike McCabe will make necessary change and come back with what he thinks should be allowed for member names.

??? Array.prototype.sort

Dario desires a [LocalGet] so that members on the prototype chain are never viewed and to remove an inconsistency with the sort function that was exposed by a Netscape proposal to change the sort algorithm to require the preservation of holes in sparse arrays.

 

A counterproposal was suggested: the behavior of sort is defined only if the prototype of the array object does not have members visible as part of the array data (members whose names are integer indices); otherwise the behavior is implementation defined. This is agreed to.

 

The Netscape proposal for preserving holes is now content agreed, given the change to make it implementation defined as to what happens when the array prototype has members visible as part of the array data.

Action: Waldemar will come back with necessary wording to allow the implementation specified behavior.
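A sketch of hole preservation on a sparse array, assuming an engine that implements it and an array prototype with no index-named members (the case that stays well defined). The variable names are illustrative.

```javascript
// Index 1 is a hole: the property does not exist at all.
const sparse = [3, , 1];
sparse.sort();

const summary = {
  length: sparse.length,     // the length is unchanged
  first: sparse[0],          // defined elements are sorted to the front
  second: sparse[1],
  holeAtEnd: !(2 in sparse), // the hole survives, moved to the end
};
```

After the sort the array still has length 3, holds 1 and 3 in its first two slots, and has no property at index 2.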

15.1.2.1 – eval

A minor correction to convert the throw completion to a meta-throw is content agreed.

??? - URIencode/URIdecode

The specification will define the following functions:

encodeUTF8(String sourceString, String doNotEscapeCharacters)

The string doNotEscapeCharacters contains the characters that should not be converted into the “%XX” form.

 

decodeUTF8(String sourceString, String doNotUnescapeCharacters)

The string doNotUnescapeCharacters contains the characters that should not be converted from “%XX” back to regular characters.

 

encodeURI(String sourceString)

encodeURIComponent(String sourceString)

decodeURI(String sourceString)

decodeURIComponent(String sourceString)

These will map to calls to encodeUTF8 and decodeUTF8 with the appropriate character sets for the parameters doNotEscapeCharacters and doNotUnescapeCharacters.
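The mapping can be sketched as follows. The minutes do not list the character sets, so the sets below are an assumption taken from the URI encoding functions as later standardized. The Sketch suffixes mark illustrative wrappers, and utf8Bytes is a helper invented for this example. Unlike the built-in functions, this sketch does not reject unpaired surrogates.

```javascript
// Encodes one code point as its UTF-8 byte sequence.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Every character not listed in doNotEscapeCharacters becomes "%XX" bytes.
function encodeUTF8(sourceString, doNotEscapeCharacters) {
  let out = "";
  for (const ch of sourceString) { // iterates by code point
    if (doNotEscapeCharacters.includes(ch)) {
      out += ch;
      continue;
    }
    for (const byte of utf8Bytes(ch.codePointAt(0))) {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

// Assumed character sets (not specified in these minutes).
const ALNUM =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
const UNRESERVED = ALNUM + "-_.!~*'()";
const RESERVED = ";/?:@&=+$,";

function encodeURISketch(s) {
  return encodeUTF8(s, UNRESERVED + RESERVED + "#");
}

function encodeURIComponentSketch(s) {
  return encodeUTF8(s, UNRESERVED);
}
```

The only difference between the two wrappers is the set passed as doNotEscapeCharacters: the component form also escapes the reserved characters such as “/” and “?”.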

Action: Dario will make the necessary changes.

??? - Conformance clause

Action: Mike Cowlishaw will schedule in a slot for discussing conformance in light of the fact that we are now defining behavior for error cases that preclude the possibilities for extending the standard in the future in a compatible way.