Stream: wg-secure-code

Topic: rust for low level security


DPC (Dec 20 2019 at 08:49, on Zulip):

Hegza posted this on the discord server:

I was talking with some of our researchers about using Rust for low-level security code (SSL?). They use C, and one concern that was raised is that they are currently fighting the compiler to stop it from being "too smart". There's a good use case, hear me out: a function that compares two long byte arrays must execute in constant time, so that user code cannot determine the point at which the comparison exited. Because the arrays are long, they need to be iterated over (e.g. for, iter), and thus the compiler can easily insert an early exit at the first mismatching byte. We don't want this, because the early exit reveals information about the mismatching arrays, which could be a security key. The question is: what's the story of Rust currently with regard to this issue? Do we have something that we can use to tell the compiler "I don't want this code/function/block optimized", or something else that fits the use case? These people sometimes fall back on assembly, and sometimes they end up "complexifying" their code until the compiler cannot optimize it. They find both of these approaches fairly unergonomic.
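For illustration, a minimal sketch of the pattern in question (function names are made up): the naive comparison can exit at the first mismatching byte, while the usual workaround accumulates the differences and checks them once at the end, hoping the compiler keeps it branch-free.

```rust
// Sketch only: the naive version may short-circuit at the first
// mismatch, leaking where the arrays differ through timing.
fn naive_eq(a: &[u8], b: &[u8]) -> bool {
    a.iter().zip(b).all(|(x, y)| x == y) // early exit on mismatch
}

// The usual workaround: accumulate the XOR of every byte pair and
// compare once at the end, with no data-dependent branch in the source.
// Nothing here stops the optimizer from reintroducing one, though.
fn accumulating_eq(a: &[u8], b: &[u8]) -> bool {
    assert_eq!(a.len(), b.len());
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}
```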

Shnatsel (Dec 20 2019 at 10:41, on Zulip):

Constant-time execution is an unsolved problem in general, not only for Rust but also for C/C++. You can only guarantee it if you write assembly, or capture the assembly output of a compiler, verify it, and store the generated assembly instead of building from source; and even that only works for a specific CPU model.

Tony Arcieri (Dec 20 2019 at 15:25, on Zulip):

@DPC specifically for the problem of constant-time equality there's the subtle crate
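A minimal sketch of what that looks like (assuming subtle's ConstantTimeEq impl for byte slices):

```rust
use subtle::ConstantTimeEq;

// ct_eq returns a Choice rather than a bool, so the comparison itself
// is performed without a data-dependent branch; convert only at the end.
fn tags_match(a: &[u8], b: &[u8]) -> bool {
    a.ct_eq(b).into()
}
```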

Tony Arcieri (Dec 20 2019 at 15:26, on Zulip):

I would love to see a more comprehensive toolkit for writing constant-time code in Rust

DPC (Dec 20 2019 at 15:34, on Zulip):

thanks

Henri Lunnikivi (Dec 20 2019 at 15:41, on Zulip):

Thank you on my part as well. I'm the original poster, hegza from discord. The person the case originally comes from also commented on it; I'll try to get him to join here as well.

Tony Arcieri (Dec 20 2019 at 15:51, on Zulip):

it'd be possible to write a portable constant time toolkit which emits platform-specific assembly using core::arch (but it'd need to be done on a target-by-target basis)
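As a rough idea of what one building block of such a toolkit could look like (a sketch, not an existing crate): a 16-byte equality check built directly from SSE2 intrinsics in core::arch, which at least pins the instruction selection, though as noted below the result still passes through the LLVM optimizer.

```rust
#[cfg(target_arch = "x86_64")]
fn ct_eq_16(a: &[u8; 16], b: &[u8; 16]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8};
    // SSE2 is part of the x86_64 baseline, so these intrinsics are always available.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // Per-byte compare yields 0xFF for equal bytes; movemask collapses
        // the 16 results into one integer that is checked exactly once.
        _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF
    }
}
```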

Tony Arcieri (Dec 20 2019 at 15:51, on Zulip):

long ago there was this macro black magic https://github.com/klutzy/nadeko

Tony Arcieri (Dec 20 2019 at 15:52, on Zulip):

one drawback of that is I think code emitted by core::arch can still go through the LLVM optimizer

bjorn3 (Dec 20 2019 at 15:55, on Zulip):

Cranelift is a compiler backend which doesn't optimize much. Using my rustc backend cg_clif, it is possible to compile Rust code with it.

Shnatsel (Dec 21 2019 at 01:14, on Zulip):

Sadly, a Rust compiler that doesn't optimize much is way less useful than a C compiler that doesn't optimize much. C is already pretty close to assembly, so you are not reliant on the optimizer. Rust is more high-level and much more reliant on the optimizer for performance. Debug mode Rust can easily be 10x slower than release mode. So a non-optimizing Rust compiler is not very useful.

Shnatsel (Dec 21 2019 at 01:18, on Zulip):

Also, I have yet to see an AES implementation for x86 that is resilient to cache timing attacks. Various attempts have been made, but AFAIK they all fall short.

Shnatsel (Dec 21 2019 at 01:24, on Zulip):

One specific case that can be made to work is constant-time comparison of two memory regions; this will likely also need protection from speculative-execution side-channel attacks, i.e. LFENCE everywhere, so you would need to doctor and store the assembly anyway. But that very specific, narrow task at least looks manageable, while the general case is still a wide-open research problem even in assembly, let alone high-level languages.

Shnatsel (Dec 21 2019 at 01:28, on Zulip):

There is a long history of people trying and failing to produce constant-time crypto in software; the most ironic is https://eprint.iacr.org/2018/747.pdf

Shnatsel (Dec 21 2019 at 01:33, on Zulip):

AFAIK the need for constant-time crypto code (or at least the padding oracle attacks) can be eliminated by using encrypt-then-authenticate crypto protocols; see e.g. the Noise protocol, or any other protocol with a security proof, because no other scheme was ever proven correct.

Tony Arcieri (Dec 21 2019 at 03:17, on Zulip):

C is already pretty close to assembly so it is not reliant on the optimizer to get the code to a reasonable level of performance.

Honestly I can reason about LLVM lowering more easily in Rust than I can in C

Tony Arcieri (Dec 21 2019 at 03:17, on Zulip):

just anecdotally, I feel like C is much more likely to surprise me

Tony Arcieri (Dec 21 2019 at 03:18, on Zulip):

I say this on a night where some of my code may have been :zip_it:

Tony Arcieri (Dec 21 2019 at 03:18, on Zulip):

(it's fine!)

Shnatsel (Dec 21 2019 at 11:17, on Zulip):

Do you mean unoptimized LLVM codegen or with optimizations enabled? I was only talking about debug mode; optimizers will mess things up in any language once they kick in.

Nicola Tuveri (Dec 21 2019 at 16:05, on Zulip):

Hi, I am the researcher that was discussing with @Henri Lunnikivi about this.

I should disclose that unfortunately I know close to nothing about Rust, and this discussion sprouted from a chat with our local Rust evangelist (@Henri Lunnikivi) when we got onto the "things I wish new systems programming languages addressed" topic.

Tony Arcieri (Dec 21 2019 at 16:06, on Zulip):

@Shnatsel with optimizations enabled. I don't generally look at debug output (in e.g. Godbolt)

Tony Arcieri (Dec 21 2019 at 16:06, on Zulip):

also Godbolt is amazing

Tony Arcieri (Dec 21 2019 at 16:06, on Zulip):

/me using it right now

Tony Arcieri (Dec 21 2019 at 16:06, on Zulip):

seems I accidentally used the wrong core::arch intrinsic in a place, but LLVM is so smart it substituted the correct one :sweat_smile:

Alex Gaynor (Dec 21 2019 at 16:08, on Zulip):

I assume "wrong" here means not-optimal, and not dysfunctional

Tony Arcieri (Dec 21 2019 at 16:08, on Zulip):

yes

Tony Arcieri (Dec 21 2019 at 16:09, on Zulip):

I was using the core::arch intrinsic for shufpd instead of pshufd; however, LLVM noticed the inputs and outputs were all integers, so it substituted pshufd for me

Tony Arcieri (Dec 21 2019 at 16:10, on Zulip):

just fixed it up to actually use the pshufd intrinsic, which also helped clean up the code
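For reference, a sketch of the distinction (the wrapper function is made up): _mm_shuffle_pd is the shufpd intrinsic and works on f64 lanes, while _mm_shuffle_epi32 is pshufd and shuffles i32 lanes.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn swap_halves(a: core::arch::x86_64::__m128i) -> core::arch::x86_64::__m128i {
    use core::arch::x86_64::_mm_shuffle_epi32;
    // 0b0100_1110 selects lanes [2, 3, 0, 1], i.e. swaps the 64-bit
    // halves of the vector; this lowers to a single pshufd.
    _mm_shuffle_epi32::<0b0100_1110>(a)
}
```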

Tony Arcieri (Dec 21 2019 at 16:12, on Zulip):

it's for this, FWIW: https://blog.quarkslab.com/reversing-a-finite-field-multiplication-optimization.html

Tony Arcieri (Dec 21 2019 at 16:12, on Zulip):

it duplicates that assembly in either case (well, sans the v)

Tony Arcieri (Dec 21 2019 at 16:20, on Zulip):

https://github.com/RustCrypto/universal-hashes/pull/43

Tony Arcieri (Dec 21 2019 at 16:20, on Zulip):

@Nicola Tuveri using techniques like ^^^ I think it's very easy to get the desired assembly, but it also means writing code in a very architecture-specific way

Shnatsel (Dec 21 2019 at 16:58, on Zulip):

Rather, it is possible to get the desired assembly on a specific compiler version built with a specific LLVM version. It's not necessarily forward-compatible, since the optimizations change with every release. It is also possible that rustc will optimize the code differently when it is built as part of a larger project, so I don't think you can guarantee in a library that your code will be optimized, or not optimized, in a specific way.

Nicola Tuveri (Dec 21 2019 at 16:59, on Zulip):

I guess my utopian goal is to collaborate with the compiler rather than fighting it by embedding platform-specific code.

I work mostly on OpenSSL, and there we have a lot of code that is very hard to read and maintain, which includes tricks in C90 or sometimes directly in ASM to try to prevent the compiler from, e.g., inserting branches where the programmer did not write (and probably does not expect) one.

Here is a made-up example of a pattern that some implementations need when fetching secret-dependent precomputed values from a lookup table.
The idea is to look at every single element in the table, derive a mask from the comparison of the current index with the secret index used for the lookup, and use bitwise logic to mask out all the values (which must be fetched and computed upon anyway to avoid leaks) except the one matching the secret index (which is masked with 0xFF..FF, retaining the result of the computation).
Here is an example of what a programmer might write in C and what compilers would do with it: https://godbolt.org/z/o8eqkh

Play with defining/undefining DO_BITWISE_EQUALITY to see the difference.

Even though neither of the two alternatives for the main loop of the lookup function contains branches in the source, one of the two solutions ends up with branches introduced during compilation.
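For comparison, here is a rough Rust rendering of the same masked-lookup idea (a sketch with made-up names; the same caveat applies, since nothing in the source prevents the optimizer from turning the mask back into a branch):

```rust
/// Fetch table[secret_index] while touching every entry, deriving an
/// all-ones/all-zeros mask from a branch-free equality test on the indices.
fn masked_lookup(table: &[u64; 16], secret_index: usize) -> u64 {
    let mut result = 0u64;
    for (i, &entry) in table.iter().enumerate() {
        // diff == 0 exactly when i == secret_index; fold that into the
        // top bit without writing a comparison the compiler could branch on.
        let diff = (i ^ secret_index) as u64;
        let is_equal = 1u64.wrapping_sub((diff | diff.wrapping_neg()) >> 63);
        let mask = is_equal.wrapping_neg(); // 0xFF..FF when equal, 0 otherwise
        result |= entry & mask;
    }
    result
}
```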

The current _hackish_ ways of dealing with this have a series of shortcomings.

My (not original) wish is to have standardized ways in the language (be it annotations or keywords) to signal to the compiler the programmer's intent about doing stuff in constant time, as the compiler is the piece of software best placed to know the exact characteristics of the target platform and to select which optimizations to apply, which to omit, and which microarch-specific tweaks are recommended to achieve the goal.

One could start with ways to let the programmer ask the compiler not to optimize a certain region of code (or to only apply optimizations that are guaranteed not to introduce branches), possibly at the block level.
Another step would be to have language builtins that the compiler can map to the best implementation for a given architecture: e.g. if the language had a reserved keyword for const-time equality comparison of basic types, it would be easier to write and maintain code that is supposed to run in constant time.

I guess what I am saying is that everybody can agree there are use cases (especially in the software security domain) where it is not fine to maintain the long-established convention that compilers can apply any transformation as long as the resulting implementation is semantically equivalent in terms of input/output. Yet languages and compilers do not provide a standard way to mark const-time sections in the code we write, and it would be nice if a language like Rust, which already addresses many other shortcomings of traditional languages to help programmers write more robust and safe code, could be at the forefront of improving this aspect too.

Shnatsel (Dec 21 2019 at 17:07, on Zulip):

While it would definitely be nice if somebody solved the constant-time execution problem at the compiler level, I do not see it happening anytime soon. It would require a huge continuous investment to develop and maintain such compiler features, largely because they complicate all the other compiler passes. On the other hand, the market for it is simply not big enough: a constant-time memory comparison routine is easy enough to hand-roll in assembly, and if you need constant-time crypto beyond that then your crypto protocol is broken and you should drop it anyway. OpenSSL needs constant-time crypto only because it supports TLS 1.1 and earlier, which are busted anyway.

Nicola Tuveri (Dec 21 2019 at 17:18, on Zulip):

On the other hand, the market for it is simply not big enough: a constant-time memory comparison routine is easy enough to hand-roll in assembly, and if you need constant-time crypto beyond that then your crypto protocol is broken and you should drop it anyway. OpenSSL needs constant-time crypto only because it supports TLS 1.1 and earlier, which are busted anyway.

I don't completely agree: writing constant-time code, with no branches and no secret-dependent memory accesses, is still a requirement for all the basic cryptographic implementations.
When you are generating a keypair or a signature, or deriving a shared secret, even for P-256 or X/Ed25519, which are supported by TLS 1.3, you must make sure that nothing is leaked about any bit of the secret (be it the long-term key or a nonce).
Fast const-time implementations of P-256 do use precomputed values stored in lookup tables, for example, and having a branch introduced by the compiler would allow a local attacker to monitor the execution of the algorithm and retrieve secret bits (with different degrees of signal-to-noise depending on the side channel used).

One might think that a "local attacker" is of no concern, but especially in the age of the cloud, where code is written without many guarantees about where it will run and what runs in parallel on the same physical processor, that might be dangerously short-sighted.

Nicola Tuveri (Dec 21 2019 at 17:20, on Zulip):

But I agree that it is a niche need: that's why we can't just say "disable all optimizations for all cryptographic software" and be done with it!

Tony Arcieri (Dec 21 2019 at 17:49, on Zulip):

I think a constant-time toolkit based on core::arch would be interesting. It certainly can get you closer to the desired assembly than Rust alone. Like @Shnatsel said, it can't guarantee the LLVM optimizer won't come along and insert a branch in code you intended to be constant time, and about all I can say is that such a toolkit is best used with algorithms that are amenable to constant-time implementations in the first place.

Tony Arcieri (Dec 21 2019 at 17:49, on Zulip):

"constant time" is as big of a rat hole as some recent Rust discussions about whether something is truly "async" if e.g. you throw mmap into the mix, heh

centril (Dec 21 2019 at 18:01, on Zulip):

Fundamentally, a language like Rust is specified in terms of an interpreter (abstract machine). That machine has no notion of time at all.

Tony Arcieri (Dec 21 2019 at 18:20, on Zulip):

constant time is a property of a combined hardware/software system, so it's probably best reasoned about when working on code with a particular CPU architecture in mind

Tony Arcieri (Dec 21 2019 at 18:20, on Zulip):

in my case: post Sandy Bridge x86(-64)

Tony Arcieri (Dec 21 2019 at 18:22, on Zulip):

here's the assembly in question https://blog.quarkslab.com/reversing-a-finite-field-multiplication-optimization.html

Tony Arcieri (Dec 21 2019 at 18:22, on Zulip):

Screen-Shot-2019-12-21-at-10.22.16-AM.png

Tony Arcieri (Dec 21 2019 at 18:23, on Zulip):

here's what I have in Godbolt

Tony Arcieri (Dec 21 2019 at 18:23, on Zulip):

Screen-Shot-2019-12-21-at-10.23.22-AM.png

Tony Arcieri (Dec 21 2019 at 18:23, on Zulip):

(note: I'm not trying to do vectorized PCLMULQDQ... yet)

Tony Arcieri (Dec 21 2019 at 18:24, on Zulip):

but otherwise it's pretty much the ideal target code for this particular routine (Shay Gueron's Montgomery fast reduction)

centril (Dec 21 2019 at 18:41, on Zulip):

constant time is a property of a combined hardware/software system, so it's probably best reasoned about when working on code with a particular CPU architecture in mind

Right, which makes the idea of keywords or whatnot at the language level that provide such guarantees quite unlikely

Tony Arcieri (Dec 21 2019 at 18:53, on Zulip):

oh hey, check this out...

Tony Arcieri (Dec 21 2019 at 18:53, on Zulip):

Screen-Shot-2019-12-21-at-10.52.33-AM.png

Tony Arcieri (Dec 21 2019 at 18:53, on Zulip):

tada

Tony Arcieri (Dec 21 2019 at 18:55, on Zulip):

uhh, wow :open_mouth:

Tony Arcieri (Dec 21 2019 at 18:56, on Zulip):

100% performance improvement :tada:

Tony Arcieri (Dec 21 2019 at 19:12, on Zulip):

https://godbolt.org/z/Zjuvwu

Last update: Apr 05 2020 at 00:25UTC