I bring this up because of https://github.com/3Hren/msgpack-rust/issues/151 but I've seen this a couple of times now. It seems to be a recurring bug in serde crates that parse a binary stream in the [length][datastream] style. I fixed a similar bug in the bson crate, but luckily there the spec defines a small maximum size. I looked at fixing the msgpack one with https://github.com/rust-lang/rust/issues/48043 but that might not be the best path, since an attacker can still use the bug to gobble RAM, just without hitting the same DoS panic. Does anyone have a better approach to this class of bugs? Should I look to the serde community?
deep rabbit hole
some would argue this is a problematic family in formal language theory
it wasn't even properly classified until a few years ago
Thanks for the link to more reading
it's a really good read if you're interested in this particular problem
lots of language theory, but they do boil it down to some practical advice about how to parse this family of languages
in general I would say making efficient zero-copy parsers for these sorts of data formats is very much a classical source of vulnerabilities
and there are tons of sharp edges that are rarely considered
Looking at this in the larger context of serde is making a lot of these sharp edges jump to the forefront too.
yeah, I think in general
serde is designed with context-free grammars in mind
or if not that, "packed structs"
I've written a serde parser/serializer for a format that's a combination thereof (which I didn't design), although that makes things a little easier: packed struct in front, with some messages having a variable length (length-prefixed) payload at the end
that avoids nested length fields, which are where things generally start to go awry
Makes sense. For most of the ones I'm looking at, the format is already defined and I just want to be able to handle it safely. I was thinking about peeking ahead to see if the length value in the protocol is longer than what is left in the stream. Last I played with this was in the winter, but IIRC serde didn't make that easy. This is all just coming up again because of that thread getting bumped.
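e.g. a rough sketch of that "check against what's left" idea, outside of serde (hypothetical names, assumes a 4-byte big-endian prefix and that you already hold the whole buffer):

```rust
// Hypothetical sketch: validate a length prefix against the bytes actually
// remaining in the buffer *before* allocating or slicing anything, so a
// crafted prefix can't trigger a huge allocation or a panic.
fn read_prefixed(input: &[u8]) -> Result<(&[u8], &[u8]), &'static str> {
    // assume a 4-byte big-endian length prefix, as many formats use
    if input.len() < 4 {
        return Err("truncated length prefix");
    }
    let (head, rest) = input.split_at(4);
    let len = u32::from_be_bytes([head[0], head[1], head[2], head[3]]) as usize;
    // reject lengths longer than what is actually left in the stream
    if len > rest.len() {
        return Err("length prefix exceeds remaining input");
    }
    // returns (payload, remainder of the stream)
    Ok(rest.split_at(len))
}
```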
I think I need to do some reading and catch back up on what I was playing with last time I looked at this. It makes me feel better that this is an active research problem.
This isn't really just a Rust problem -- fuzzing ImageMagick/GraphicsMagick turned up tons of OOMs of the shape "allocate however many X the length prefix said!"
thanks for sharing!
funny that they claim "(1) Length-prefix languages are not context-free" is a contribution of theirs; AFAIK that was well-known and I saw talks about it many years ago... but it might just be clumsy wording in the abstract
oh wait, they claim to disprove (1)? Now I am confused.^^
@RalfJ it's a little more nuanced than that. If you look at the image I pasted above (Fig 3), their novel contribution is that "Calc-Regular" languages exist outside the normal confines of the Chomsky Hierarchy
that makes sense
but then what do they mean by "disproving (1)"...
oh, they mean the "parsing is inefficient" part
got it :D
interpreting a "Calc-Regular" language requires a sort of "recursive regular" automaton
which first parses the length prefix and then uses it as a bounded Kleene star when consuming the message, rinse, repeat. but in addition to that, those sorts of recursive sub-machines let you recursively descend into nested length-prefixed data, giving you properties closer to a context-free language without needing a pushdown automaton. i.e. while it's popular to use a stack machine to handle nested length-prefixed data, the paper shows it isn't strictly necessary
Calc-regular languages are almost as easy to parse as regular languages, using finite state machines with additional accumulators.
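to make the accumulator idea concrete, here's a toy version with one-byte length prefixes (made-up example to illustrate the concept, not code from the paper): a flat state machine whose only extra memory is a countdown, no pushdown stack.

```rust
// Toy "calc-regular" recognizer: a finite state machine plus one
// accumulator (`remaining`), counting down a one-byte length prefix.
enum State {
    ExpectLen,
    InPayload { remaining: u8 },
}

/// Counts complete [len][payload] messages; None means the input ended
/// mid-payload (a truncated message).
fn count_messages(input: &[u8]) -> Option<usize> {
    let mut state = State::ExpectLen;
    let mut messages = 0usize;
    for &byte in input {
        state = match state {
            State::ExpectLen => {
                if byte == 0 {
                    // zero-length message: complete immediately
                    messages += 1;
                    State::ExpectLen
                } else {
                    State::InPayload { remaining: byte }
                }
            }
            State::InPayload { remaining } => {
                if remaining == 1 {
                    messages += 1;
                    State::ExpectLen
                } else {
                    State::InPayload { remaining: remaining - 1 }
                }
            }
        };
    }
    match state {
        State::ExpectLen => Some(messages),
        State::InPayload { .. } => None, // truncated final message
    }
}
```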
AFAIK this is basically unsolved as of right now. Every single image or audio decoder has also bumped into this, see e.g. https://libpng.sourceforge.io/decompression_bombs.html
So far the approach in C libraries is "put checks everywhere and hope you got them all and nothing ever overflows"
I've found a number of such issues in Lewton, the Vorbis decoder in Rust, and the author came up with an interesting idea: make the memory allocator keep track of memory used by the crate and panic or abort or return error if someone tries to exceed the limit. This way deserialization of crafted input fails safely and in an isolated manner, without impacting the rest of the application. Here's their implementation of that approach, but sadly it's not complete enough to be actually used in lewton: https://github.com/est31/balloc
More info on the issue and their approach: https://github.com/RustAudio/lewton/issues/34
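the core trick can be sketched in a few lines (just the idea, not balloc's actual code): wrap the global allocator and refuse allocations past a byte budget, so a crafted length prefix fails the allocation instead of eating all RAM.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch of the "budgeted allocator" idea (names made up):
// once `limit` bytes are live, further allocations return null, which
// fallible APIs like Vec::try_reserve surface as an error.
struct CappedAlloc {
    limit: usize,
    used: AtomicUsize,
}

unsafe impl GlobalAlloc for CappedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let prev = self.used.fetch_add(layout.size(), Ordering::SeqCst);
        if prev + layout.size() > self.limit {
            // over budget: undo the accounting and report failure
            self.used.fetch_sub(layout.size(), Ordering::SeqCst);
            return std::ptr::null_mut();
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        self.used.fetch_sub(layout.size(), Ordering::SeqCst);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static ALLOC: CappedAlloc = CappedAlloc {
    limit: 64 * 1024 * 1024, // 64 MiB budget for the whole process
    used: AtomicUsize::new(0),
};
```

in a real design you'd scope the budget to the deserialization call rather than the whole process, which is the hard part balloc is trying to solve.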
there's a simple solution. it's just not always applicable.
make the maximum field length a knob in the parser
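something like this (hypothetical sketch, one-byte prefixes for brevity):

```rust
// Hypothetical parser with the maximum field length as a knob:
// the caller picks the ceiling, the parser enforces it on every field.
struct Parser {
    max_field_len: usize,
}

impl Parser {
    /// Reads one [len][payload] field; returns (payload, remainder).
    fn read_field<'a>(&self, input: &'a [u8]) -> Result<(&'a [u8], &'a [u8]), &'static str> {
        let (&len, rest) = input.split_first().ok_or("empty input")?;
        let len = len as usize;
        if len > self.max_field_len {
            return Err("field exceeds configured maximum");
        }
        if len > rest.len() {
            return Err("truncated field");
        }
        Ok(rest.split_at(len))
    }
}
```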
Well, that's what they're trying to do, but actually implementing that is a game of whack-a-mole.
I'd like to rewrite that into a zero copy pull parser based on iterators
using the techniques in the Calc-Regular paper
libpng has a knob, a bunch of Rust imaging crates got a knob after I bugged the maintainers about it, but it's still not very reliable because you need to get all the math right, never overflow, and not forget a single place with a single check. That's like avoiding buffer overflows in C all over again.
https://github.com/est31/balloc is supposed to handle that for you transparently and reliably... or at least that's the idea. I don't think it's mature enough for production use.
@Shnatsel my parser neatly sidesteps all that. unfortunately the solution means it isn't applicable to those sorts of problems (max overall message size cap which defaults to 1kB :stuck_out_tongue_wink:)
I could probably turn it up just a wee bit, heh
1kB compressed or decompressed?
there is no compression :stuck_out_tongue_wink:
it's for credentials