ECMA 262 – TC39

Date: March 21, 1999

Location: Microsoft

Next Meeting: June 21, 1999 - 9:30am at Netscape, San Jose

People:

Mike Cowlishaw

Cedric Krumbein

Clayton Lewis

Herman Venter

Dario Russi

Rok Yu

Mike McCabe

Waldemar Horwat

Regular Expressions

15.10.2.13 CharacterClassEscapeCharacter

Semantics of the production CharacterClassEscapeCharacter :: d | D agreed.

For production CharacterClassEscapeCharacter :: s | S

Atom 15.10.2.8 – Helper Function Canonicalize

Converts a single input character to its locale-independent upper case. The one exception is when the conversion would result in multiple characters, as with the German “ß”, which upper-cases to “SS”. In that case, the original character “ß” is kept.
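As a rough illustration of the rule above, the following sketch models the helper in ECMAScript. The function name canonicalize and the use of String.prototype.toUpperCase as the locale-independent conversion are assumptions made for illustration; this is not the algorithm text from the draft.

```javascript
// Sketch of the Canonicalize helper described above (illustrative only).
// Assumes `ch` is a single-character string.
function canonicalize(ch) {
  const upper = ch.toUpperCase(); // locale-independent conversion
  // If upper-casing expands to several characters (e.g. "ß" -> "SS"),
  // keep the original character unchanged.
  if (upper.length !== 1) {
    return ch;
  }
  return upper;
}

console.log(canonicalize("a")); // "A"
console.log(canonicalize("ß")); // "ß"
```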

Atom 15.10.2.8 – Helper Function CharacterSetMatcher

The operation of “^” in a character set has the effect of inverting the result of the match, not the set of characters being matched. E.g. if a string A is matched by the pattern /[abcC]/, then A is not matched by /[^abcC]/, and vice versa; /[^abcC]/ is not defined as matching the set /[{all Unicode characters} – {a, b, c, C}]/. This is encoded in steps 8 through 11 of the algorithm.
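The difference matters mostly for case-insensitive matching. The sketch below models the "decide membership, then flip the result" reading. The names fold and characterSetMatches are hypothetical, and the case folding is a simplified stand-in for Canonicalize.

```javascript
// Illustrative case folding: upper-case, keeping characters that would expand.
function fold(ch) {
  const upper = ch.toUpperCase();
  return upper.length === 1 ? upper : ch;
}

// Model of a character-set match: decide membership first, then flip the
// boolean result when the set is written with "^".
function characterSetMatches(members, invert, ch, ignoreCase) {
  const norm = ignoreCase ? fold : (c) => c;
  const found = members.some((m) => norm(m) === norm(ch));
  return invert ? !found : found;
}

const members = ["a", "b", "c", "C"];
console.log(characterSetMatches(members, false, "A", true)); // true
console.log(characterSetMatches(members, true, "A", true));  // false
console.log(/[^abcC]/i.test("A"));                           // false
```

Had the set itself been inverted, "A" would belong to the complement set and a case-insensitive comparison could still find it, so /[^abcC]/i would accept "A".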

Atom 15.10.2.8 – Multiline and Handling “.”, “^”, “$”

Perl’s implementation is described in “Mastering Regular Expressions”, p. 232. The table below shows how the modes interact in Perl (LB = line break):

     default   m           s       sm
$    end       before LB   end     before LB
^    begin     after LB    begin   after LB
.    non-LB    non-LB      any     any

We will only be supporting the “m” behavior. In Perl and in existing ECMAScript implementations, “.” matches every character except LF. We agreed to standardize “.” to match every character except those in the character class LineTerminator.
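A short check of the two decisions above, as they behave in current engines. The list of line terminators used here (LF, CR, U+2028, U+2029) is my reading of the LineTerminator production and should be treated as an assumption.

```javascript
const text = "one\ntwo";

// Without "m", "^" and "$" anchor only at the start and end of the input.
console.log(/^two$/.test(text));  // false
// With "m", "^" matches after a line break and "$" before one.
console.log(/^two$/m.test(text)); // true
console.log(/^one$/m.test(text)); // true

// "." rejects every LineTerminator character, not only LF.
const lineTerminators = ["\n", "\r", "\u2028", "\u2029"];
const dotResults = lineTerminators.map((ch) => /./.test(ch));
console.log(dotResults); // [false, false, false, false]
```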

Atom 15.10.2.8 – Character Escape

The syntax requires the correct number of digits for character escape sequences. Implementations may choose how to recover from errors where the correct number of digits hasn’t been specified. E.g. “\x7” → syntax error (the implementation can do whatever it wants). To specify the character you need “\x07”.
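A small sketch of the distinction. The well-formed escape is checked directly; the malformed one is only probed, since its outcome is left to the implementation and no particular result is assumed here.

```javascript
// Well-formed: two hex digits name the character U+0007.
const bell = /\x07/;
console.log(bell.test("\u0007")); // true

// Malformed: only one hex digit. The outcome is implementation-specific,
// so this records what the engine did instead of expecting a result.
let malformedOutcome;
try {
  const re = new RegExp("\\x7");
  malformedOutcome = "accepted, source = " + re.source;
} catch (e) {
  malformedOutcome = "rejected with " + e.name;
}
console.log(malformedOutcome);
```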

Atom 15.10.2.8 – Identity Escape

When a “\” is followed by a character that is not part of an identifier, the character is accepted directly. These characters are defined by the syntax class IdentityEscape. Characters not in IdentityEscape are defined to throw errors.

Atom 15.10.2.8 – Digit Escape

What happens with “\0[0-9]”?

Option 1 (Perl):

Current proposal captures Perl behavior where two digit octal values are allowed.

Option 2 (Simple):

\0 → Octal (null)

\[0-3][0-7][0-7] → Octal

\[1-9] → Backref

\[1-9][0-9] → Backref

Octals are always 3 digits with the one exception of “\0” which can be one digit. The first option is pretty complicated. The second option specifies legal behavior that is different from Perl.
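Option 2 can be sketched as a small classifier. The function name classifyDigitEscape and the result shape are hypothetical. The minutes do not say which rule wins when an input such as “\123” fits both the three-digit octal form and the two-digit backreference form; the sketch assumes the three-digit octal form is tried first.

```javascript
// Classifies the digits that follow a backslash under Option 2.
// `rest` is the pattern text immediately after the "\".
// Assumption: the three-digit octal form takes priority over backreferences.
function classifyDigitEscape(rest) {
  let m = /^[0-3][0-7][0-7]/.exec(rest);
  if (m) {
    return { kind: "octal", value: parseInt(m[0], 8), length: 3 };
  }
  m = /^[1-9][0-9]?/.exec(rest);
  if (m) {
    return { kind: "backref", value: parseInt(m[0], 10), length: m[0].length };
  }
  if (rest.charAt(0) === "0") {
    // The single exception: "\0" alone is the null character.
    return { kind: "octal", value: 0, length: 1 };
  }
  return null; // not a digit escape
}

console.log(classifyDigitEscape("0"));   // octal, value 0, length 1
console.log(classifyDigitEscape("101")); // octal, value 65, length 3
console.log(classifyDigitEscape("12"));  // backref, value 12, length 2
```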

Action: Waldemar will mail Larry Wall to see how he feels about the simplification with respect to Perl RegExp.

Atom 15.10.2.8 – Internal function BackreferenceMatcher

Line 19 should be changed as follows:

19. If s is undefined, goto 27

Action: Waldemar will make necessary change.
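The minutes do not say what step 27 does. Assuming it is the step where matching continues successfully, the change means a backreference to a capture that never participated matches the empty string. Current engines behave this way, as the following check suggests.

```javascript
// The group (a) does not participate when the input takes the "b" branch,
// so capture 1 is undefined when \1 is reached.
const re = /(?:(a)|b)\1c/;

const m = re.exec("bc");
console.log(m !== null);         // true: the undefined backreference matched empty
console.log(m[1] === undefined); // true

console.log(re.test("aac")); // true: \1 must repeat the captured "a"
console.log(re.test("ac"));  // false: "a" was captured but not repeated
```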

Atom 15.10.2.8 – Atom :: PatternCharacter

Content agreed.

Atom 15.10.2.8 – Character Classes

Backslashes should be brackets in the CharacterClass productions.

Nuke algorithm step 73 in RangeList (15.10.2.16).

Nuke algorithm steps 92 and 99 – 101 in RangeListNoDash (15.10.2.17).

An issue similar to the Digit Escapes described above occurs in the production RangeAtomNoDash (15.10.2.19).

Action: Waldemar will make the necessary changes.

Is there a customer need for regular expressions to support Unicode character classes? The following were suggested as potential applications:

Grepping through user code.
Generating new character classes by subtracting sets of characters from Unicode-aware sets.

Consensus to defer to the 4th Edition.

Action: When mailing Larry Wall, Waldemar will also ask about planned Unicode support in Perl RegExp.

The Turkish i (the dotless “ı”) maps to an ASCII I when mapping to upper case. This results in relatively non-intuitive behavior when matching a pattern like “/[\W]/i”. Are there any other non-ASCII characters that turn into ASCII characters? The internal function Canonicalize would be modified to handle this exception (step 13.5), much like the handling of the case conversion of the German s-zet (“ß”).
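The proposed exception, and the open question about other such characters, can both be explored with a short sketch. The function name canonicalize is illustrative, the scan covers only 16-bit code units, and its result depends on the Unicode tables of the engine running it.

```javascript
// Sketch of Canonicalize with the proposed exception (illustrative only).
// Assumes `ch` is a single-character string.
function canonicalize(ch) {
  const upper = ch.toUpperCase();
  if (upper.length !== 1) return ch; // multi-character result, e.g. "ß"
  // Proposed exception: never map a non-ASCII character onto an ASCII one.
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

// Scan for non-ASCII characters whose upper case is a single ASCII character.
const nonAsciiToAscii = [];
for (let code = 128; code <= 0xffff; code++) {
  const upper = String.fromCharCode(code).toUpperCase();
  if (upper.length === 1 && upper.charCodeAt(0) < 128) {
    nonAsciiToAscii.push(code.toString(16));
  }
}
console.log(nonAsciiToAscii); // contents depend on the engine's Unicode tables

console.log(canonicalize("\u0131") === "\u0131"); // true: dotless i is kept
console.log(canonicalize("i"));                   // "I"
```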

The PatternCharacter doesn’t currently allow line terminators.

Action: Waldemar will make the necessary changes to allow line terminators in PatternCharacter.

??? – Regular Expression Literals

Discussion: Does a regular expression literal introduce a new object each time it is evaluated? No → this matches existing behavior and also matches Perl. The semantics aren’t consistent with nested functions, but this did not seem like an important enough reason to change. E.g.

function foo() { return /a/; }

r1 = foo();

r2 = foo();

r1.bar = 45;

r2.bar = ? → 45

A scenario where it is desirable to always return a single object is when one wants to iterate over multiple matches using a literal.

for(i = 1; i < 4; i++)

print(/a*b/g.exec("aabaaab"));
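The iteration scenario depends on lastIndex persisting on one object between calls. A sketch that makes the sharing explicit by hoisting the literal into a variable, so the result does not depend on whether an engine reuses literal objects (the variable names are illustrative):

```javascript
// One RegExp object shared across all iterations.
const pattern = /a*b/g;
const results = [];
for (let i = 1; i < 4; i++) {
  const match = pattern.exec("aabaaab"); // resumes from pattern.lastIndex
  results.push(match === null ? null : match[0]);
}
console.log(results); // ["aab", "aaab", null]
```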

The changes for the literal regular expression syntax are content agreed.

15.10.3 – 15.10.7 RegExp Methods

Some minor changes are required to 15.10.4 – can’t remember what.

The following changes need to be made to 15.10.6.2 RegExp.prototype.exec(string)

In the algorithm, all occurrences of “string” need to be changed to “ToString(string)”.
Step 6 has a clause “go to step 6”. It should be “go to step 7”.
The [[match]] property name in step 6 needs to be reconciled with the [[matcher]] property described in 15.10.4.1 (new RegExp)?
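The effect of the ToString(string) change can be seen when exec receives a non-string argument. A brief check against current engine behavior (the sample values are illustrative):

```javascript
const re = /2/;

// A number argument is converted to the string "123" before matching.
const fromNumber = re.exec(123);
console.log(fromNumber[0], fromNumber.index); // "2" 1

// An object argument is converted through its toString method.
const fromObject = re.exec({ toString() { return "x2y"; } });
console.log(fromObject[0], fromObject.index); // "2" 1
```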

In section 15.10.7, the phrase “most recently specified” should be nuked.

Otherwise, the text is content agreed.

Action: Waldemar will make required changes.

15.5.4.10, 15.5.4.14 – 15.5.4.16 Methods on String dealing with RegExp

The “$+” needs to be removed from String.prototype.replace (15.5.4.15).

The section header for String.prototype.split (15.5.4.10) requires a “limit” parameter. Both the separator and limit parameters should use the syntax indicating that they are optional (square brackets).

The algorithm for split also needs the following step added:

20.5 Set p = k+m

Otherwise, the text is content agreed.

Action: Waldemar will make required changes. He will also write up an update to String.prototype.split (15.5.4.10) to allow for capturing parentheses.
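For orientation, this is how split behaves in current engines with an optional separator, an optional limit, and capturing parentheses. Whether this matches the write-up planned in the action item above is an assumption.

```javascript
// limit caps the number of returned pieces.
console.log("a,b,c,d".split(",", 2)); // ["a", "b"]

// Both parameters are optional; with no separator the whole string is returned.
console.log("a,b".split()); // ["a,b"]

// Capturing parentheses in the separator add the captured text to the result.
console.log("a1b2c".split(/(\d)/)); // ["a", "1", "b", "2", "c"]
```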

Exceptions

The following are potential options for structuring the top levels of the exception hierarchy.

1. Single class

class Error extends Object

2. Two classes – no common hierarchy

class Error extends Object

class Exception extends Object

3. Two classes – Error parent

class Exception extends Object

class Error extends Exception

4. Two classes – Exception parent

class Error extends Object

class Exception extends Error

5. Three classes

class Signal extends Object

class Error extends Signal

class Exception extends Signal

General consensus is that fewer classes is better and that, if we introduce a hierarchy, it is not clear how the specific exceptions currently generated by the runtime would be grouped.

It is functionally agreed that there will be a single class called Error. All engine runtime errors will inherit from it. The method toString is implementation-defined and will use the name and message properties.
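A brief check of the two agreed points as they appear in current engines. Since toString is implementation-defined here, the sketch only checks that its result contains both properties and assumes no exact format.

```javascript
// An engine-generated runtime error inherits from Error.
let caught;
try {
  null.property; // property access on null fails at run time
} catch (e) {
  caught = e;
}
console.log(caught instanceof Error); // true

// toString draws on both the name and the message properties.
const err = new Error("disk full");
const text = err.toString();
console.log(text.includes(err.name) && text.includes(err.message)); // true
```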

There will be a toLocaleString that is implementation-dependent. Herman raised the question of whether there should be an additional toLocaleMessage method. The reasoning behind the question is as follows:

1. toLocaleString in other cases is the localized version of toString.
2. toString is defined to generate a message using both the name and message properties.
3. From 1 and 2, it follows that the string returned by an exception’s toLocaleString is some localized string that translates the combination of the name and message properties.
4. It may be common to want the localized message without any extra modifications (such as the added name property), in which case another method is required.
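The distinction can be made concrete with a purely hypothetical sketch. Neither function below exists in any proposal; the message catalog, the locale tag, and the "name: message" format are all invented for illustration.

```javascript
// Hypothetical message catalog; not part of any proposal.
const catalog = {
  de: { "File not found": "Datei nicht gefunden" },
};

// Hypothetical toLocaleMessage: localizes the message only.
function toLocaleMessage(err, locale) {
  const table = catalog[locale] || {};
  return table[err.message] || err.message;
}

// Hypothetical toLocaleString: localized combination of name and message.
function toLocaleStringSketch(err, locale) {
  return err.name + ": " + toLocaleMessage(err, locale);
}

const err = new Error("File not found");
console.log(toLocaleMessage(err, "de"));      // "Datei nicht gefunden"
console.log(toLocaleStringSketch(err, "de")); // "Error: Datei nicht gefunden"
```

A caller that wants to show only the translated message would need the first function; the second always carries the name along.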

Action: Mike McCabe will make the appropriate changes to the proposal spec and confer with Herman Venter to figure out whether toLocaleMessage is required or whether toString’s definition should be changed to not require the exception name.
