Stream: t-lang

Topic: nonascii identifiers(rfc 2457)


Charles Lew (May 11 2020 at 12:06, on Zulip):

Hello, let me create a topic to discuss and confirm the details of RFC 2457.

Charles Lew (May 11 2020 at 12:07, on Zulip):

The RFC text is at https://github.com/rust-lang/rfcs/blob/master/text/2457-non-ascii-idents.md

Charles Lew (May 11 2020 at 12:14, on Zulip):

The RFC basically contains 6 major things.

  1. Tweak to the language parser: identifier grammar, NFC normalization (implemented, feature gated)
  2. Concerns about #[no_mangle] and #[path] (not yet implemented)
  3. confusable_idents lint, detects similarity between identifiers (implemented, discussing adjusting its behavior)
  4. uncommon_codepoints lint, detects code points that are not commonly used (implemented)
  5. nonstandard_style lint tweaks, basically nothing to do (needs confirmation)
  6. mixed_script_confusables lint, detects scripts that occur in the crate but whose occurrences are all considered confusable.

Charles Lew (May 11 2020 at 12:18, on Zulip):

In the rationale section, there are two alternative mixed_script_detection lints that are NOT currently selected by this RFC.
a. single identifier mixed script detection.
b. a statistics approach of mixed script detection.

Charles Lew (May 11 2020 at 12:20, on Zulip):

Just created this summary to make sure we're on the same page =)

Charles Lew (May 11 2020 at 12:20, on Zulip):

@Manish Goregaokar @Josh Triplett

Manish Goregaokar (May 11 2020 at 14:45, on Zulip):

I agree with Josh's comment on the thread. I think making the confusables lint a mixed script AND local confusables lint would work. It would also be super efficient, since we only need to care about mixed script idents.

Manish Goregaokar (May 11 2020 at 14:47, on Zulip):

So yeah I think your assessment is correct. We can instead have the confusable_idents lint only focus on mixed script confusables, and mixed_script_confusables continues to do a global analysis.

Manish Goregaokar (May 11 2020 at 14:48, on Zulip):

either way, the list of mixed script confusable characters is much smaller, so you don't actually need the skeleton code for this.

Manish Goregaokar (May 11 2020 at 14:49, on Zulip):

Actually this could probably be a single mixed_script_confusables lint, without the percentage check

Pyfisch (May 11 2020 at 14:59, on Zulip):

Removing the skeleton code means the compiler no longer warns about two otherwise identical idents with different diacritics, right?
There are quite a few diacritics and they are not all that easy to see and tell apart: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

Charles Lew (May 11 2020 at 16:37, on Zulip):

Well, the hardest thing about confusables.txt is that it's not one-to-one. The list itself doesn't really say that one code point and another code point are confusable; instead, it maps code points into intermediate forms.

Charles Lew (May 11 2020 at 16:42, on Zulip):

https://github.com/unicode-rs/unicode-security/pull/13
This newly written table generation code covers all the one-to-one cases and finds the directly mentioned one-to-many cases, but it doesn't handle the case where A+B is confusable with C+D, which is really tricky to implement and use...

Charles Lew (May 11 2020 at 16:45, on Zulip):

If we really want mixed script confusables, i think the only feasible way is to still use the skeleton code to find the confusables, and then keep only the mixed script pairs. If this plan is acceptable, i might give the implementation a try.

Josh Triplett (May 11 2020 at 16:55, on Zulip):

That or preprocess the confusables data.

Josh Triplett (May 11 2020 at 16:55, on Zulip):

To make it easier to figure that out.

Josh Triplett (May 11 2020 at 16:55, on Zulip):

For instance, a character that's only confusable with others in the same script isn't something we need flagged.

Charles Lew (May 11 2020 at 17:12, on Zulip):

This is basically "reverse apply"-ing the confusables data. The original table is reductive, it maps random data into a normalized form to check for equivalence. To reverse apply it, the data will become generative.

Charles Lew (May 11 2020 at 17:12, on Zulip):

let me take a random example, here are two lines from the table

00F8 ;  006F 0338 ; MA  # ( ø → o̸ ) LATIN SMALL LETTER O WITH STROKE → LATIN SMALL LETTER O, COMBINING LONG SOLIDUS OVERLAY    # →o̷→
1D428 ; 006F ;  MA  # ( 𝐨 → o ) MATHEMATICAL BOLD SMALL O → LATIN SMALL LETTER O    #

Charles Lew (May 11 2020 at 17:14, on Zulip):

From these two lines, we know that:
00F8 and 1D428 0338 are confusable.

Charles Lew (May 11 2020 at 17:15, on Zulip):

Applying the first-line substitution to 00F8, and the second-line substitution to the 1D428 in 1D428 0338, folds both into the form 006F 0338.
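The folding described above can be sketched in Rust. This is a simplified illustration, not the unicode-security implementation: the table holds only the two rows quoted above, the names `prototype_table`, `skeleton` and `confusable` are made up for this sketch, and the NFD normalization passes that UTS 39 prescribes around the substitution are skipped.

```rust
use std::collections::HashMap;

/// Toy excerpt of confusables.txt holding only the two rows quoted above.
fn prototype_table() -> HashMap<char, &'static str> {
    let mut table = HashMap::new();
    table.insert('\u{00F8}', "\u{006F}\u{0338}"); // ø → o + COMBINING LONG SOLIDUS OVERLAY
    table.insert('\u{1D428}', "\u{006F}"); // 𝐨 → o
    table
}

/// Simplified skeleton: replace every code point that has a table entry
/// with its prototype and keep all other code points unchanged.
/// Assumption: the NFD passes required by UTS 39 are omitted here.
fn skeleton(ident: &str, table: &HashMap<char, &'static str>) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        match table.get(&c) {
            Some(proto) => out.push_str(proto),
            None => out.push(c),
        }
    }
    out
}

/// Two identifiers are treated as confusable when their skeletons match.
fn confusable(a: &str, b: &str, table: &HashMap<char, &'static str>) -> bool {
    skeleton(a, table) == skeleton(b, table)
}
```

With this table, `"\u{00F8}"` and `"\u{1D428}\u{0338}"` have the same skeleton, even though no single row of the table mentions both of them.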

Charles Lew (May 11 2020 at 17:18, on Zulip):

There're even chaining cases, i believe. So... it's much more complex to prepare this data. I'm not sure we should dive into this approach...
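To illustrate why preparing the "generative" data is harder than it looks, here is a rough sketch of the naive inversion: group every source code point under its prototype. The names `ROWS`, `invert` and `same_group` are hypothetical, and the table again contains only the two rows quoted above.

```rust
use std::collections::BTreeMap;

/// The two confusables.txt rows quoted above, as (source, prototype).
const ROWS: &[(char, &str)] = &[
    ('\u{00F8}', "\u{006F}\u{0338}"), // ø → o + COMBINING LONG SOLIDUS OVERLAY
    ('\u{1D428}', "\u{006F}"),        // 𝐨 → o
];

/// Naive "generative" table: group every source code point under its
/// prototype. Sources in the same group are confusable with one another.
fn invert(rows: &[(char, &'static str)]) -> BTreeMap<&'static str, Vec<char>> {
    let mut groups: BTreeMap<&'static str, Vec<char>> = BTreeMap::new();
    for &(source, proto) in rows {
        groups.entry(proto).or_default().push(source);
    }
    groups
}

/// True if the naive inverse table puts both code points in one group.
fn same_group(groups: &BTreeMap<&'static str, Vec<char>>, a: char, b: char) -> bool {
    groups.values().any(|g| g.contains(&a) && g.contains(&b))
}
```

Here 00F8 and 1D428 end up in different groups, so a per-code-point inverse table alone does not reveal that 00F8 is confusable with the sequence 1D428 0338. Capturing that would require enumerating sequences, which is where the complexity comes from.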

Charles Lew (May 11 2020 at 17:26, on Zulip):

Instead, when we've got a 00F8 identifier and a 1D428 0338 identifier at hand, and we know they're confusable using the skeleton approach, we just need some simple rules to tell whether they're same script confusables or not. This will be much easier.
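One possible shape for such a "simple rule", sketched under loud assumptions: `script_of` is a hypothetical and heavily truncated lookup (a real one would consult the Unicode Script or Script_Extensions data), and comparing the sets of scripts is only one candidate rule, not something the thread settled on.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// Hypothetical, truncated script lookup based on a few code point ranges.
/// `None` stands for Common code points, which fit any script.
fn script_of(c: char) -> Option<Script> {
    match c {
        'a'..='z' | 'A'..='Z' => Some(Script::Latin),
        '\u{0370}'..='\u{03FF}' => Some(Script::Greek),
        '\u{0400}'..='\u{04FF}' => Some(Script::Cyrillic),
        '0'..='9' | '_' => None,
        _ => Some(Script::Other),
    }
}

/// The set of scripts that appear in an identifier.
fn scripts_used(ident: &str) -> BTreeSet<Script> {
    ident.chars().filter_map(script_of).collect()
}

/// Given two identifiers already known to be confusable (for example by
/// comparing skeletons), report the pair only if their scripts differ.
fn is_mixed_script_pair(a: &str, b: &str) -> bool {
    scripts_used(a) != scripts_used(b)
}
```

For instance, Latin `scope` paired with a look-alike spelled entirely in Cyrillic letters would be reported, while two Latin-only identifiers would not.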

Charles Lew (May 12 2020 at 17:47, on Zulip):

@Pyfisch I'm not sure i understand the "script group" concept from the mixed_script_confusable lint part of the RFC.

Charles Lew (May 12 2020 at 18:03, on Zulip):

The second paragraph "We identify lists of code points which are Allowed by UTS 39 section 3.1 (i.e., code points not already linted by less_used_codepoints) and are "exact" confusables between code points from other Allowed scripts." I think this means that this is pure-preprocessing and does not depend on the actual codebase. I have created such a list.

Charles Lew (May 12 2020 at 18:07, on Zulip):

In the fourth paragraph "In a code base, if the only code points from a given script group (aside from Latin, Common, and Inherited) are such exact confusables", is this the Script property of the UCD, or should i take Script_Extensions property into account, or should i use the TR39 augmented script set concept?

Manish Goregaokar (May 12 2020 at 18:14, on Zulip):

@Charles Lew it's the augmented script set

Charles Lew (May 12 2020 at 18:20, on Zulip):

Thanks. So each code point corresponds to an augmented script set, and we need to check which augmented script sets occur in this code base, and, for each of them, whether its "source" code points consist only of confusable code points.
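The crate-wide check described here could look roughly like the following. This is a sketch under assumptions: it uses a plain per-code-point script instead of augmented script sets, and `script_of`, `is_exact_confusable` and `suspicious_scripts` are hypothetical names with tiny hard-coded stand-ins for the real Unicode data and the preprocessed confusables list.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Cyrillic,
}

/// Hypothetical, truncated script lookup.
fn script_of(c: char) -> Option<Script> {
    match c {
        'a'..='z' | 'A'..='Z' => Some(Script::Latin),
        '\u{0400}'..='\u{04FF}' => Some(Script::Cyrillic),
        _ => None, // treated as Common/Inherited in this sketch
    }
}

/// Hypothetical stand-in for the preprocessed list of "exact" confusables.
fn is_exact_confusable(c: char) -> bool {
    matches!(c, '\u{0430}' | '\u{0435}' | '\u{043E}') // Cyrillic а, е, о
}

/// Returns the scripts (other than Latin) for which every code point seen
/// in the crate is an exact confusable.
fn suspicious_scripts<'a>(idents: impl IntoIterator<Item = &'a str>) -> BTreeSet<Script> {
    let mut seen: BTreeMap<Script, BTreeSet<char>> = BTreeMap::new();
    for ident in idents {
        for c in ident.chars() {
            if let Some(script) = script_of(c) {
                seen.entry(script).or_default().insert(c);
            }
        }
    }
    seen.into_iter()
        .filter(|(script, chars)| {
            *script != Script::Latin && chars.iter().all(|&c| is_exact_confusable(c))
        })
        .map(|(script, _)| script)
        .collect()
}
```

A single non-confusable Cyrillic letter anywhere in the crate is enough to stop Cyrillic from being flagged, which matches the idea that real users of a script will use more than its look-alike letters.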

Charles Lew (May 12 2020 at 18:20, on Zulip):

Is this correct?

Pyfisch (May 13 2020 at 10:09, on Zulip):

I still don't understand why you only want to warn about confusable idents from different scripts. To me it does not matter whether two identifiers I confuse are written in the same or in different scripts. Since only warning about some confusable identifiers is harder to implement there must be some advantage to this that I am missing.

Pyfisch (May 13 2020 at 10:26, on Zulip):

@Charles Lew In general, "script groups" are scripts that are commonly used together. For example, Japanese is written in a mixture of Kanji + Hiragana + Katakana. I think the term was invented in the RFC and is not otherwise used. In the section "Alternative mixed script lints", subsection "Mixed script detection", a lint is described that restricts identifiers to characters from the same script or set of scripts, as described in https://www.unicode.org/reports/tr39/#Restriction_Level_Detection but this lint is not part of the accepted RFC.

The mixed_script_confusable lint part was written by @Manish Goregaokar, so you'd better ask him how he defines "script groups".

Manish Goregaokar (May 13 2020 at 14:47, on Zulip):

@Pyfisch so firstly the mixed_script_confusables lint already has all the implementation work necessary to warn about cross-script confusables. This isn't additional work.

The idea is that users of a script will typically not have a problem with this, and it will be solved with fonts or whatever. Now, it is true that some scripts like the perso-arabic script have confusables from different languages that both exist in the same script, and that might be tricky. But I'd rather wait for feedback on that before writing that lint.

Manish Goregaokar (May 13 2020 at 14:47, on Zulip):

Yes, by script groups I'm talking about the augmented script sets

Manish Goregaokar (May 13 2020 at 14:47, on Zulip):

@Charles Lew each code point may correspond to multiple augmented script sets

Manish Goregaokar (May 13 2020 at 14:48, on Zulip):

actually no sorry

Manish Goregaokar (May 13 2020 at 14:50, on Zulip):

@Charles Lew it's just https://github.com/unicode-rs/unicode-security/blob/master/src/mixed_script.rs#L9

Pyfisch (May 13 2020 at 17:40, on Zulip):

@Manish Goregaokar okay, I understand. Would you mind updating the RFC?

Charles Lew (May 14 2020 at 13:10, on Zulip):

From my understanding, the augmented script set is the Script_Extensions property extended to capture CJK languages, which use more than one script together. However, from the RFC text, the usage is a little, um, strange.

Let me create an example. Assume there's somehow a piece of lib.rs that is written solely using Hiragana and Katakana, both of which are scripts related to the Japanese language. So their corresponding augmented script sets are (1) Hiragana + Jpan and (2) Katakana + Jpan, and ... there are two kinds of augmented script sets? So imagine that all of the text is Hiragana except for a single Katakana code point, and that this Katakana code point is confusable with, say, a Hiragana code point. Then it's reported as mixed_script_confusable. Is this correct?

Manish Goregaokar (May 15 2020 at 14:27, on Zulip):

@Pyfisch i'm really swamped right now, i can review any changes y'all make

Manish Goregaokar (May 15 2020 at 14:28, on Zulip):

@Charles Lew no, we should use augmented script sets there, katakana should not be reported as confusable with hiragana

Charles Lew (May 15 2020 at 16:10, on Zulip):

@Manish Goregaokar Sorry, it wasn't clear to me how that works. Yes, i'm thinking in terms of augmented script sets. While i think i do understand the intention, i can't really figure out the exact rules for the implementation. It doesn't seem very straightforward to me, but maybe i'm just missing something obvious. Would you mind explaining it more? Either a brief description of the rule or an example works for me. I think i can then work out the details.

Manish Goregaokar (May 20 2020 at 15:55, on Zulip):

@Charles Lew you are correct in your understanding of augmented script sets, _however_ the point of script sets is to intersect them. A+B and A+C are not considered incompatible, because they have the non-empty intersection A
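The intersection rule can be sketched with a small bit set. This is an illustration only: the `ScriptSet` type and `augmented_script_set` function are written for this sketch and are not the unicode-security API, and only the few code point ranges needed for the Hiragana/Katakana example are modeled.

```rust
/// Augmented script set as a bit set. Only the members needed for the
/// Japanese example are modeled; the names are this sketch's own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ScriptSet(u8);

impl ScriptSet {
    const HIRA: ScriptSet = ScriptSet(1 << 0);
    const KANA: ScriptSet = ScriptSet(1 << 1);
    const JPAN: ScriptSet = ScriptSet(1 << 2);
    const LATN: ScriptSet = ScriptSet(1 << 3);

    fn union(self, other: ScriptSet) -> ScriptSet {
        ScriptSet(self.0 | other.0)
    }

    fn intersect(self, other: ScriptSet) -> ScriptSet {
        ScriptSet(self.0 & other.0)
    }

    fn is_empty(self) -> bool {
        self.0 == 0
    }
}

/// Hypothetical lookup covering a few ranges only.
fn augmented_script_set(c: char) -> ScriptSet {
    match c {
        '\u{3041}'..='\u{3096}' => ScriptSet::HIRA.union(ScriptSet::JPAN),
        '\u{30A1}'..='\u{30FA}' => ScriptSet::KANA.union(ScriptSet::JPAN),
        'a'..='z' | 'A'..='Z' => ScriptSet::LATN,
        _ => ScriptSet(0), // not modeled in this sketch
    }
}
```

Intersecting the sets for a Hiragana letter and a Katakana letter leaves Jpan, so the two are compatible. Intersecting either with a Latin letter leaves the empty set.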

Charles Lew (May 20 2020 at 15:58, on Zulip):

-

Charles Lew (May 20 2020 at 15:58, on Zulip):

Yeah, that's my understanding too.

Charles Lew (May 20 2020 at 15:59, on Zulip):

So i think we need to invent some rules for the mixed script lint here. Actually we first collect all code points in a crate.

Charles Lew (May 20 2020 at 16:00, on Zulip):

And find the corresponding augmented script set for each code point.

Charles Lew (May 20 2020 at 16:01, on Zulip):

And then, maybe we want to intersect each pair of augmented script sets to see if the intersection is empty?

Charles Lew (May 20 2020 at 16:02, on Zulip):

No, i think this will need to operate at the identifier level: check whether the augmented script sets of the identifier's code points have a non-empty intersection, which means the identifier itself is single script.
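An identifier-level check along these lines might look as follows. It follows the UTS 39 notion of a resolved script set (the intersection over all code points, with Common and Inherited treated as matching everything), but the bit flags and the `augmented_script_set` lookup are hypothetical and cover only a handful of ranges.

```rust
/// Bit flags for a few augmented script set members (names are this sketch's own).
const HIRA: u32 = 1 << 0;
const KANA: u32 = 1 << 1;
const JPAN: u32 = 1 << 2;
const LATN: u32 = 1 << 3;
const CYRL: u32 = 1 << 4;
/// Common and Inherited code points are compatible with every script.
const ALL: u32 = u32::MAX;

/// Hypothetical lookup covering a few ranges only.
fn augmented_script_set(c: char) -> u32 {
    match c {
        '\u{3041}'..='\u{3096}' => HIRA | JPAN,
        '\u{30A1}'..='\u{30FA}' => KANA | JPAN,
        'a'..='z' | 'A'..='Z' => LATN,
        '\u{0400}'..='\u{04FF}' => CYRL,
        _ => ALL, // digits, '_' and everything not modeled here
    }
}

/// Intersect the augmented script sets of all code points in the identifier.
fn resolved_script_set(ident: &str) -> u32 {
    ident
        .chars()
        .map(augmented_script_set)
        .fold(ALL, |acc, set| acc & set)
}

/// Single script in the UTS 39 sense: the resolved script set is non-empty.
fn is_single_script(ident: &str) -> bool {
    resolved_script_set(ident) != 0
}
```

Under this rule an identifier mixing Hiragana and Katakana still resolves to Jpan and counts as single script, while an identifier mixing Latin and Cyrillic letters resolves to the empty set.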

Last update: Jun 05 2020 at 23:15UTC