Stream: wg-secure-code

Topic: str-from-utf8-with-validation

Alex Gaynor (Aug 24 2019 at 00:14, on Zulip):

So here's an interesting challenge: I'm performing some validation on some characters that are in a &[u8]. Then I'm creating an &str from it.
Right now my choices are to either
a) Do 2x passes over the data, once for my validation, and once in str::from_utf8 to verify they really are utf8 (even though my validation also ensures they are utf8), or
b) Use the unsafe str::from_utf8_unchecked API to just skip the second validation

Neither of these appeals to me. What API _should_ exist to allow me to do everything in one pass, safely?

If there was from_utf8_extra_check(&[u8], fn(char) -> Result<(), Error>) -> Result<&str, Error> I think that'd work?
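
For illustration, a minimal sketch of a function with the proposed shape. No such API exists in std: `from_utf8_extra_check` is the hypothetical name from the message above, and the `Error` enum is an assumption made up for this sketch. It uses only safe code, so it still makes two passes (one in `str::from_utf8`, one over the decoded chars), which is exactly the cost the proposal is trying to avoid.

```rust
use std::str;

/// Hypothetical error type for this sketch; not part of std.
#[derive(Debug, PartialEq)]
enum Error {
    /// The input was not well-formed UTF-8.
    InvalidUtf8,
    /// The caller-supplied check rejected this character.
    Rejected(char),
}

/// Hypothetical helper with the signature shape proposed above.
/// Safe but two-pass: UTF-8 validation first, then the per-char check.
fn from_utf8_extra_check<F>(data: &[u8], mut check: F) -> Result<&str, Error>
where
    F: FnMut(char) -> Result<(), Error>,
{
    // Pass 1: std validates the encoding.
    let s = str::from_utf8(data).map_err(|_| Error::InvalidUtf8)?;
    // Pass 2: the caller's extra validation, one char at a time.
    for c in s.chars() {
        check(c)?;
    }
    Ok(s)
}
```

Because the extra check only ever sees chars that std has already decoded, a buggy check can reject too much or too little, but it cannot produce an invalid `&str`.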

Nick12 (Aug 24 2019 at 01:28, on Zulip):

That would also be unsafe, I think

Nick12 (Aug 24 2019 at 01:28, on Zulip):

If the fn didn't correctly validate, it'd be UB in safe code

Nick12 (Aug 24 2019 at 01:28, on Zulip):

You could make it safe by having an unsafe trait

Nick12 (Aug 24 2019 at 01:29, on Zulip):

But you'll still need unsafe to impl your validator
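
A rough sketch of the unsafe-trait idea, for illustration only. The names `Utf8Validator`, `from_utf8_with`, and `AsciiDigits` are hypothetical and do not exist in std. The point is that the `unsafe` moves from the call site to the `impl`: whoever writes the validator takes on the proof obligation.

```rust
use std::str;

/// Hypothetical trait for this sketch.
///
/// Safety contract: `validate` must return `true` only if `data`
/// is well-formed UTF-8. Implementors may reject more, never less.
unsafe trait Utf8Validator {
    fn validate(data: &[u8]) -> bool;
}

/// Safe to call: soundness rests on the trait's safety contract.
fn from_utf8_with<V: Utf8Validator>(data: &[u8]) -> Option<&str> {
    if V::validate(data) {
        // SAFETY: `V` promised that accepted input is valid UTF-8.
        Some(unsafe { str::from_utf8_unchecked(data) })
    } else {
        None
    }
}

/// Example validator that accepts only ASCII digits.
struct AsciiDigits;

// SAFETY: every ASCII digit is a single-byte UTF-8 sequence,
// so any accepted input is valid UTF-8.
unsafe impl Utf8Validator for AsciiDigits {
    fn validate(data: &[u8]) -> bool {
        data.iter().all(|b| b.is_ascii_digit())
    }
}
```

Callers write `from_utf8_with::<AsciiDigits>(bytes)` with no `unsafe` block, but the `unsafe impl` is still required, as noted above.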

Nick12 (Aug 24 2019 at 01:29, on Zulip):

What's wrong with option b?

Nick12 (Aug 24 2019 at 01:30, on Zulip):

Oh wait, the UTF-8 validation would happen outside your fn? So you validate only additional things? I'm not sure I see how that's different from option a

Alex Gaynor (Aug 24 2019 at 01:31, on Zulip):

You'd do for b in data { user_verify(b)?; normal_utf8_verify(b)?; } and hope the optimizer sorted it out

rkruppe (Aug 24 2019 at 07:31, on Zulip):

I don't expect that this would be a clear performance win. Sure, you're only going through the entire string start to finish once, but
1. lots of strings are short enough it doesn't matter -- if your string fits into the cache (many KB for L1, and L2/L3 might be fine too) walking it twice back-to-back is likely free, and prefetching might make it irrelevant anyway
2. utf-8 validation can be optimized a lot by validating more than one char at a time and doing less work than full decoding into char -- having to add in actual decoding and doing the other validation char-by-char will not play nicely with that, likely throttling the utf-8 validation even on an algorithmic level
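
To make point 2 concrete, here is a small sketch of checking several bytes per step. It is an illustration of the general technique only, not a copy of what `str::from_utf8` does internally; `is_ascii_wordwise` is a hypothetical name. It tests eight bytes at once for the ASCII-only case by looking at the high bit of each byte in a `u64`.

```rust
use std::convert::TryInto;

/// Hypothetical helper: returns true if every byte is ASCII,
/// testing eight bytes per loop iteration where possible.
fn is_ascii_wordwise(data: &[u8]) -> bool {
    // A byte is non-ASCII exactly when its top bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        // Each chunk is exactly 8 bytes, so the conversion cannot fail.
        let word = u64::from_ne_bytes(chunk.try_into().unwrap());
        if word & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than 8 bytes remain; check them one by one.
    chunks.remainder().iter().all(|b| b.is_ascii())
}
```

A per-char callback cannot be slotted into a loop like this without decoding each char first, which is the throttling described above.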

Thom Chiovoloni (Aug 24 2019 at 18:27, on Zulip):

IIRC str::from_utf8 tries pretty hard to avoid looking at characters individually, so with the extra check it would be quite a bit slower (edit: ah, whoops, someone beat me to saying this)

Tony Arcieri (Aug 25 2019 at 19:08, on Zulip):

what is the invariant on the characters in the &[u8] for your particular use case, @Alex Gaynor ? ASCII? something more than that?

Alex Gaynor (Aug 25 2019 at 19:09, on Zulip):

It's ASN.1's PrintableString that had me thinking about this, so it's an allow-list of particular ASCII chars
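
For reference, a minimal sketch of option (a) for this case, using only safe code. The function names are hypothetical. The allow-list is the PrintableString character set (letters, digits, space, and `' ( ) + , - . / : = ?`). Since every allowed byte is ASCII, the `str::from_utf8` call can never fail after the first check; it is the redundant second pass the thread is about.

```rust
use std::str;

/// Returns true if `b` is in the ASN.1 PrintableString character set.
fn is_printable_byte(b: u8) -> bool {
    matches!(
        b,
        b'A'..=b'Z'
            | b'a'..=b'z'
            | b'0'..=b'9'
            | b' '
            | b'\''
            | b'('
            | b')'
            | b'+'
            | b','
            | b'-'
            | b'.'
            | b'/'
            | b':'
            | b'='
            | b'?'
    )
}

/// Hypothetical helper: validates a PrintableString and returns it as `&str`.
fn parse_printable_string(data: &[u8]) -> Option<&str> {
    // Pass 1: the allow-list check. All allowed bytes are ASCII.
    if !data.iter().all(|&b| is_printable_byte(b)) {
        return None;
    }
    // Pass 2: redundant UTF-8 validation, kept to stay in safe code.
    str::from_utf8(data).ok()
}
```

Option (b) would replace the last line with `unsafe { str::from_utf8_unchecked(data) }`, relying on the allow-list being ASCII-only.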

Tony Arcieri (Aug 25 2019 at 19:09, on Zulip):


Last update: Apr 04 2020 at 03:25UTC