Hello, let me create a topic to discuss and confirm the details of RFC 2457.
The RFC text is at https://github.com/rust-lang/rfcs/blob/master/text/2457-non-ascii-idents.md
The RFC basically covers these major things:
1. #[path] handling (not yet implemented)
2. the confusable_idents lint, which detects similarity between identifiers (implemented; we're discussing adjusting its behavior)
3. the uncommon_codepoints lint, which detects code points that don't usually occur (implemented)
4. the nonstandard_style lint tweaks, where there's basically nothing to do (needs confirmation)
5. the mixed_script_confusables lint, which detects scripts that occur in the crate but whose occurrences are all considered confusable.

The rationale section also lists two alternative mixed-script-detection lints that are NOT selected by this RFC:
a. single-identifier mixed script detection
b. a statistical approach to mixed script detection
Just creating this summary to make sure we're on the same page =)
@Manish Goregaokar @Josh Triplett
I agree with Josh's comment on the thread; I think making the confusables lint a combined mixed-script AND local confusables lint would work. It would also be super efficient, since we'd only need to care about mixed-script idents.
So yeah, I think your assessment is correct. We can instead have the confusable_idents lint focus only on mixed-script confusables, while mixed_script_confusables continues to do a global analysis.
Either way, the list of mixed-script confusable characters is much smaller, so you don't actually need the skeleton code for this.
Actually, this could probably be a single mixed_script_confusables lint, without the percentage check.
Removing the skeleton code means the compiler no longer warns about two otherwise identical idents with different diacritics, right?
There are quite a few diacritics and they are not all that easy to see and tell apart: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
Well, the hardest thing about confusables.txt is that it's not one-to-one. The list doesn't really say that one code point and another code point are confusable; instead, it maps code points into intermediate forms.
The newly written table-generation code covers all the one-to-one cases and picks up the directly mentioned one-to-many cases, but it doesn't handle the case where A+B is confusable with C+D, which is really tricky to implement and use...
If we really want mixed-script confusable detection, I think the only feasible way is to still use the skeleton code to find the confusables, and then filter out the mixed-script pairs. If this plan is acceptable, I might give the implementation a try.
That or preprocess the confusables data.
To make it easier to figure that out.
For instance, a character that's only confusable with others in the same script isn't something we need flagged.
This is basically "reverse apply"-ing the confusables data. The original table is reductive: it maps arbitrary data into a normalized form to check for equivalence. To reverse apply it, the data would have to become generative.
Let me take a random example; here are two lines from the table:
00F8 ; 006F 0338 ; MA # ( ø → o̸ ) LATIN SMALL LETTER O WITH STROKE → LATIN SMALL LETTER O, COMBINING LONG SOLIDUS OVERLAY # →o̷→
1D428 ; 006F ; MA # ( 𝐨 → o ) MATHEMATICAL BOLD SMALL O → LATIN SMALL LETTER O #
From these two lines, we know that 00F8 and 1D428 0338 are confusable: applying the first-line substitution to the former and the second-line substitution to the latter, they both fold into the form 006F 0338.
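To make the folding concrete, here's a minimal Rust sketch of the skeleton idea with just these two table entries hardcoded. It's a toy under stated assumptions: the real algorithm also applies NFD first and uses the full confusables.txt data, and `prototype_map`/`skeleton` are hypothetical names, not the actual unicode-security API.

```rust
use std::collections::HashMap;

// Hypothetical miniature of the confusables.txt prototype mapping
// (the real data lives at unicode.org/Public/security/latest/confusables.txt).
fn prototype_map() -> HashMap<char, &'static str> {
    let mut m = HashMap::new();
    m.insert('\u{00F8}', "\u{006F}\u{0338}"); // ø → o + combining long solidus overlay
    m.insert('\u{1D428}', "\u{006F}");        // 𝐨 → o
    m
}

// Skeleton: substitute each char by its prototype, repeating until a fixed point,
// since substitutions can chain.
fn skeleton(s: &str) -> String {
    let map = prototype_map();
    let mut cur = s.to_string();
    loop {
        let next: String = cur
            .chars()
            .map(|c| map.get(&c).map(|p| p.to_string()).unwrap_or_else(|| c.to_string()))
            .collect();
        if next == cur {
            return next;
        }
        cur = next;
    }
}

fn main() {
    // Both 00F8 and 1D428 0338 fold into the 006F 0338 form.
    assert_eq!(skeleton("\u{00F8}"), "\u{006F}\u{0338}");
    assert_eq!(skeleton("\u{1D428}\u{0338}"), "\u{006F}\u{0338}");
}
```

Two strings are then considered confusable exactly when their skeletons compare equal.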
There are even chaining cases, I believe. So... it's much more complex to prepare this data, and I'm not sure we should dive into this approach...
Instead, when we have a 00F8 identifier and a 1D428 0338 identifier at hand, and we already know they're confusable using the skeleton approach, we just need some simple rules to tell whether they're same-script confusables or not. That will be much easier.
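A toy sketch of that post-skeleton filtering rule: the `Script` enum and `script_of` lookup below are made-up stand-ins for real UCD script data, and this is not the actual lint code.

```rust
// Hypothetical toy script lookup; a real implementation would use the
// Script / Script_Extensions data from the UCD (e.g. via a crate).
#[derive(PartialEq, Clone, Copy)]
enum Script { Latin, Cyrillic, Common }

fn script_of(c: char) -> Script {
    match c {
        'a'..='z' | 'A'..='Z' => Script::Latin,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        _ => Script::Common,
    }
}

// Once the skeleton check says two idents are confusable, decide whether the
// pair is cross-script: does either ident use a script the other one lacks?
// Common code points are ignored, since they belong to every script.
fn is_mixed_script_pair(a: &str, b: &str) -> bool {
    let scripts = |s: &str| {
        s.chars()
            .map(script_of)
            .filter(|&sc| sc != Script::Common)
            .collect::<Vec<_>>()
    };
    let (sa, sb) = (scripts(a), scripts(b));
    sa.iter().any(|s| !sb.contains(s)) || sb.iter().any(|s| !sa.contains(s))
}

fn main() {
    // Latin "a" vs Cyrillic "а" (U+0430): a cross-script confusable pair.
    assert!(is_mixed_script_pair("a", "\u{0430}"));
    // Two all-Latin identifiers: same script, so the lint would skip them.
    assert!(!is_mixed_script_pair("foo", "f0o"));
}
```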
@Pyfisch I'm not sure I understand the "script group" concept from the mixed_script_confusables lint part of the RFC.
The second paragraph says: "We identify lists of code points which are Allowed by UTS 39 section 3.1 (i.e., code points not already linted by less_used_codepoints) and are "exact" confusables between code points from other Allowed scripts." I think this means it's pure preprocessing and does not depend on the actual codebase. I have created such a list.
In the fourth paragraph, "In a code base, if the only code points from a given script group (aside from Latin, Common, and Inherited) are such exact confusables": is this the Script property of the UCD, should I take the Script_Extensions property into account, or should I use the TR39 augmented script set concept?
@Charles Lew it's the augmented script set
Thanks. So each code point corresponds to an augmented script set, and we need to check how many kinds of augmented script sets occur in this code base, and, for each of them, whether its "source" code points consist only of confusable code points. Is this correct?
I still don't understand why you only want to warn about confusable idents from different scripts. To me it does not matter whether two identifiers I confuse are written in the same or in different scripts. Since warning about only some confusable identifiers is harder to implement, there must be some advantage to this that I am missing.
@Charles Lew In general, "script groups" are scripts that are commonly used together. For example, Japanese is written in a mixture of Kanji + Hiragana + Katakana. I think the term was invented in the RFC and is not otherwise used. The section "Alternative mixed script lints", subsection "Mixed script detection", describes a lint that restricts identifiers to characters from the same script or set of scripts, as described in https://www.unicode.org/reports/tr39/#Restriction_Level_Detection. But this lint is not part of the accepted RFC.
The mixed_script_confusables lint part was written by @Manish Goregaokar, so you'd better ask him how he defines "script groups".
@Pyfisch So, firstly, the mixed_script_confusables lint already requires all the implementation work necessary to warn about cross-script confusables. This isn't additional work.
The idea is that users of a script will typically not have a problem with this; it will be solved with fonts or whatever. Now, it is true that some scripts, like the Perso-Arabic script, have confusables from different languages that both exist in the same script, and that might be tricky. But I'd rather wait for feedback on that before writing such a lint.
Yes, by script groups I'm talking about the augmented script sets
@Charles Lew each code point may correspond to multiple augmented script sets
actually no sorry
@Charles Lew it's just https://github.com/unicode-rs/unicode-security/blob/master/src/mixed_script.rs#L9
@Manish Goregaokar okay, I understand. Would you mind updating the RFC?
From my understanding, the augmented script set is the Script_Extensions property extended to capture CJK languages, which use more than one script together. However, from the RFC text, the usage is a little, um, strange.
Let me create an example. Assume there's somehow a piece of lib.rs written solely in Hiragana and Katakana, both scripts related to the Japanese language. Their corresponding augmented script sets are (1) Hiragana + Jpan and (2) Katakana + Jpan, and... there are two kinds of augmented script sets? So imagine all the text is Hiragana except a single Katakana code point; if that Katakana code point is confusable with, say, a Hiragana code point, is it then reported by mixed_script_confusables? Is this correct?
@Pyfisch I'm really swamped right now; I can review any changes y'all make.
@Charles Lew No, we should use augmented script sets there; Katakana should not be reported as confusable with Hiragana.
@Manish Goregaokar Sorry, it wasn't clear to me how that works. Yes, I'm thinking in terms of augmented script sets. While I think I understand the intention, I can't really figure out the exact rules for implementation; it doesn't seem very straightforward to me, but maybe I'm just missing something obvious. Would you mind explaining it more? Either a brief description of the rule or an example works for me; I think I can then work out the details.
@Charles Lew You are correct in your understanding of augmented script sets; _however_, the point of script sets is to intersect them. A+B and A+C are not considered incompatible, because they have the nonempty intersection A.
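A tiny sketch of that intersection rule, with augmented script sets modeled as plain string sets. The set contents here are illustrative stand-ins, not the real UTS 39 data tables.

```rust
use std::collections::HashSet;

// Two augmented script sets are compatible when their intersection is
// nonempty (the UTS 39 script-set resolution idea, simplified).
fn compatible(a: &HashSet<&str>, b: &HashSet<&str>) -> bool {
    a.intersection(b).next().is_some()
}

fn main() {
    let hiragana: HashSet<&str> = ["Hiragana", "Jpan"].into_iter().collect();
    let katakana: HashSet<&str> = ["Katakana", "Jpan"].into_iter().collect();
    let cyrillic: HashSet<&str> = ["Cyrillic"].into_iter().collect();

    // Hiragana + Jpan and Katakana + Jpan share "Jpan": compatible.
    assert!(compatible(&hiragana, &katakana));
    // Cyrillic shares nothing with the Japanese sets: incompatible.
    assert!(!compatible(&hiragana, &cyrillic));
}
```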
Yeah, that's my understanding too.
So I think we need to invent some rules for the mixed script lint here. First we collect all the code points in a crate, and find the corresponding augmented script set for each code point. Then, maybe we want to intersect each pair of augmented script sets to see whether the intersection is empty?
No, on second thought, this will need to operate on the identifier level: check whether there's a nonempty intersection, which would mean the identifier itself is single-script.
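That identifier-level check could look roughly like this. The `aug_script_set` table below is a hypothetical toy covering only a few code-point ranges, not the real augmented-script-set data, and the treatment of Common is simplified.

```rust
use std::collections::HashSet;

// Hypothetical toy augmented-script-set lookup for a few code-point ranges.
fn aug_script_set(c: char) -> HashSet<&'static str> {
    match c {
        '\u{3041}'..='\u{3096}' => ["Hiragana", "Jpan"].into_iter().collect(),
        '\u{30A1}'..='\u{30FA}' => ["Katakana", "Jpan"].into_iter().collect(),
        'a'..='z' => ["Latin"].into_iter().collect(),
        // Simplification: treat everything else as Common here.
        _ => ["Common"].into_iter().collect(),
    }
}

// An identifier is "single script" if the augmented script sets of all its
// non-Common code points have a nonempty intersection.
fn is_single_script(ident: &str) -> bool {
    let mut acc: Option<HashSet<&'static str>> = None;
    for c in ident.chars() {
        let set = aug_script_set(c);
        if set.contains("Common") {
            continue; // Common/Inherited code points match every script
        }
        acc = Some(match acc {
            None => set,
            Some(prev) => prev.intersection(&set).copied().collect(),
        });
    }
    acc.map_or(true, |s| !s.is_empty())
}

fn main() {
    // Hiragana + Katakana intersect on "Jpan": a single-script identifier.
    assert!(is_single_script("\u{3042}\u{30A2}")); // あア
    // Latin + Hiragana have an empty intersection: a mixed-script identifier.
    assert!(!is_single_script("a\u{3042}"));
}
```

Under this rule the all-Japanese lib.rs example above would not be flagged as mixed script, since every identifier's sets still intersect on Jpan.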