id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
127	give up on resource-consumption defenses	Brian Warner		"One of the original motivations for building Foolscap (relative to its
Perspective Broker ancestry) was to create something that would defend your
code against resource-consumption attacks. A PB server can be trivially
damaged by sending it a long series of OPEN tokens: the receiver will blow
the Python stack limit, and !StackOverflow errors typically cause a lot of
collateral damage (unrelated code can find itself running in low-stack
environments, causing errors in weird places). Likewise, sending a dense tree
of lists can consume the target's memory at a rate faster than one byte per byte,
and can exhaust the receiver's memory (and !MemoryError is even more damaging
than a stack overflow).

So when Glyph and I first started talking about ""PB2"" (which turned into
Foolscap), one of my desires was to have code which would protect you from
this sort of thing. Since we were talking about having optional explicit
typechecking too (which is rather non-Pythonic for local code, but we felt it
would make remotely-reachable code safer), I figured we could use the same
information to do fine-grained size checking of the inbound token stream,
letting us hang up on a caller who was sending more data than the enclosing
!RemoteInterface said they should be doing.

== The Approach ==

One way to think about the receive-side deserialization code in Foolscap is
to pretend that it is holding a parse tree, something which has one
first-level child node for each Referenceable object which is remotely
accessible (indexed by the CLID: the connection-local ID). Each such object
then has a child node for each method that could be invoked on it: everything
declared in that object's !RemoteInterface. These method nodes then have
child nodes for each argument which they accept, which are either simple
atoms (foo= takes a string, bar= takes an int) with maximum sizes, or
composite types ({{{ListOf(DictOf(int,str))}}}) with upper bounds on number
of elements, etc.
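
As a concrete (and made-up) illustration, here is roughly what such a
fully-bounded schema looks like. The RIExample name and its methods are
invented for this sketch, but StringConstraint, ListOf, and DictOf are the
constraint classes Foolscap provides:

{{{
#!python
# A minimal sketch: RIExample and its method names are invented, but
# the constraint classes are Foolscap's own.
from foolscap.api import RemoteInterface
from foolscap.schema import StringConstraint, ListOf, DictOf

class RIExample(RemoteInterface):
    def set_pair(foo=StringConstraint(100), bar=int):
        # simple atoms: a string of at most 100 bytes, and an int
        return bool
    def set_table(table=ListOf(DictOf(int, str), maxLength=30)):
        # a composite type, with an explicit upper bound on its length
        return bool
}}}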

The receive code then knows, at any given moment, where in this parse tree
we're currently sitting. It knows which low-level tokens are allowed to come
next: if we just received the CLID integer, then the next token must be the
method name, so we require a string with a max length of the longest declared
methodname. The arguments stage requires a dictionary (i.e. an OPEN token
followed by the string ""dict""), which itself is limited in the number and
length of its keys. And so on.

The tokens themselves have a 65-byte header+type limit, at which point we
know what the token type is and how long the body can be, so we can test it
against the current constraint. If we're accepting a STRING token but the
length is limited to 13 bytes (because that's the longest method name that's
been declared), we know to reject a 4-billion-byte string before reading its
body.
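
The idea, in rough pseudocode (this is a sketch of the concept, not
Foolscap's actual receive loop; I am assuming a checkToken-style hook on
whichever constraint is current):

{{{
#!python
# Sketch only: the receiver always knows which constraint governs the
# next token, so an oversized token can be rejected from its header
# alone, before any of its body is buffered.
from foolscap.tokens import Violation

def token_header_received(current_constraint, typebyte, length):
    # e.g. current_constraint is the method-name slot, a string of at
    # most 13 bytes; a 4-billion-byte STRING token fails right here
    current_constraint.checkToken(typebyte, length)  # raises Violation
}}}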

One additional use of this approach is basic bug-detection. It's always a
good idea to push the explosion closer to the trigger (well, unless you're a
demolitionist). If something gets out of sync on the wire and starts sending
strings where it's supposed to be sending integers, it's nice to detect it
early. Doing this on the inbound token stream is hypothetically better than
doing it after deserialization.

== The Reality ==

This approach has two main problems. The first is that, to make it work,
there must be no holes. To start with, all objects that are ever published or
passed over the wire must have a !RemoteInterface, and their schemas must not
use the ""Any"" wildcard (which turns out to be hard to do in the face of
forward references). Every size limit that you impose in these schemas
affects the overall resource-consumption-attack exposure, but to actually
measure that exposure (which is equivalent to calculating the deepest path in
the parse tree) is a hassle. There is some half-written code to look at your
!RemoteInterface and tell you the maximum depth (or throw an !UnboundedSchema
exception if it's infinite), but it was never really completed.
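
For instance (RIFlexible is a made-up name), a single Any slot is enough to
make the whole schema unbounded:

{{{
#!python
# Sketch: the first argument is bounded, but the Any slot accepts
# arbitrarily deep structures, so the schema as a whole cannot be
# bounded no matter how careful the other constraints are.
from foolscap.api import RemoteInterface
from foolscap.schema import Any, StringConstraint

class RIFlexible(RemoteInterface):
    def set_name(name=StringConstraint(100)):
        return bool
    def set_anything(obj=Any()):
        return bool
}}}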

Moreover, every node in the parse tree must be bounded, even the ones that
aren't explicitly tied to a !RemoteInterface. The 3rd-party reference handoff
""gifting"" mechanism (one of the biggest features that Foolscap adds over PB)
relies upon sending the reference's FURL to the gift-recipient, generally in
an argument that they merely describe as expecting a reference of some sort
(not necessarily a 3rd-party one). So the parse-tree node for this sort of
slot must be able to accept a ""their-reference"" sequence. What constraints
should it impose?

Since I had resource-consumption attacks and constraints in mind as I wrote
the code, I tried to impose limits everywhere (since leaving them
unconstrained would expose the whole thing to attack). But the limits I
imposed (like the 200-byte limit on ""their-reference"" FURLs) were arbitrary,
and based upon my (narrow) expectations of how people would use the system. I
didn't consider that multi-homed Solaris users would have 5 IP addresses in
their FURLs and blow that 200-byte limit, or that people would want to send
100MB strings and be surprised by the default 2KB limit, or the 30-key limit
on dictionaries. Foolscap-0.1.7 removed the default sizelimits from
user-controllable constraints, but all the internal ones were left in place
(since otherwise it would be impossible to create something that was actually
bounded).

So it's easy to have an unbounded schema, and it's hard to really plug all
the holes. I strongly suspect that, even with perfectly bounded schemas,
there are at least a few unconstrained points in the deserialization code.

The second problem is that these limits all surprise the programmer. Hidden
limits always do. You write your code, you run your tests, everything works
fine (because your datasets are small), you deploy it, things are still fine
(because the datasets are medium), and it takes years before someone uses the
thing enough to run into the hidden limitation. Even when the original
programmer set the constraints to match their use of it, some other
programmer down the line can make a different set of assumptions, change the
use case without knowing to update the constraints, and then run into limits
that they never knew existed. A distributed system makes it worse, because
now some of the servers (but not all) are enforcing some of the constraints
too, depending upon which versions they're running.

== The Advice ==

When I first described Foolscap to Mark Miller, Mark Stiegler, and Tyler
Close, in perhaps 2005 or 2006, they discouraged me from going down the
count-every-byte route, for all of the reasons described above. Their
recommendation was instead to use a per-user proxy. The idea is to funnel all
of Bob's traffic through a single process, upon which you could impose coarse
OS-based resource limits (like setting a ulimit on memory size, or on CPU
time spent). The proxy receives the messages, deserializes them (thus
exposing itself to any attacks), then packages them back up and sends them to
the real server. The outside world never gets access to the real server, only
to a proxy-builder.

If done right, a resource-consumption attack will kill the proxy, but not the
real server, and since Bob is the only one using Bob's proxy, he's only
hurting himself, and can't deny service to Alice or Carol (who have their own
proxies).
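
On a Unix-ish platform, the coarse limits might look something like this
sketch (serve_as_proxy_for is hypothetical, and the numbers are purely
illustrative):

{{{
#!python
# Sketch of the coarse-OS-limit idea. serve_as_proxy_for is a
# hypothetical function, and the limits are illustrative numbers,
# not recommendations.
import os, resource

def spawn_proxy(user):
    pid = os.fork()
    if pid == 0:
        # child: cap address space at 100MB and CPU time at 60 seconds,
        # so a resource-consumption attack kills this proxy rather than
        # the real server
        resource.setrlimit(resource.RLIMIT_AS, (100*2**20, 100*2**20))
        resource.setrlimit(resource.RLIMIT_CPU, (60, 60))
        serve_as_proxy_for(user)  # deserialize, re-serialize, forward
        os._exit(0)
    return pid
}}}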

But this felt too hard to implement (it still does). Should the proxies be
per-remote-reference? (eek, one process per remote object?). I expected
performance to be a problem, since every incoming message gets an extra
deserialize+reserialize pass (and of course I committed the usual mistake of
letting hypothetical performance impacts drive design decisions without
actually implementing and measuring them first). Keeping the CLIDs in sync
between the proxy and the real server sounded hard, and not doing that
requires the proxy to also behave as a Membrane. Building the proxy-generator
sounded hard. And how would this work on a platform that makes it difficult
or expensive to spawn off new processes, like Windows?

Plus, I had already sunk a lot of time into the token-checking scheme, so I
was reluctant to give it up. Also a common mistake. So I stuck with it.

== The Present ==

As of May 2009, every programmer on Tahoe to date (well, Brian, Zooko, and
Rob) has run into a mysterious failure that ultimately resulted from a
Foolscap constraint being violated. (I've hit fewer of them, of course, since
I tend to program under the same assumptions that I baked into Foolscap in
the first place). In many cases, these errors were mysterious because the
Violations were not reported very well, or because they resembled remote
errors that Tahoe normally tries to work around by asking alternate servers
(""no shares found"", well because everybody we asked gave us an error, oh and
all the errors were constraint Violations: you had to dig a lot to discover
the triply-wrapped Failure, but of course ""no shares"" makes you assume that
you've got the filecap wrong so you investigate that for a while first).

The maxKeys=30 limit on dictionaries hit us once (I think, in the introducer,
when we grew to more than 30 servers). The 2MB sizelimit that I declared for
Tahoe shares bit us when we raised the max_segment_size for mutable files. We
increased it, but then it bit us again when we used older servers that had
not yet been upgraded, and the mismatch between the two sides caused
confusion for a release or two (was max size 2MB or 2MiB? oops). The internal
200-byte limit on FURL sizes bit two separate end users who happened to have
a lot of IP addresses.

The fact that schemas no longer have default sizelimits on their constraints
means that to get a bounded schema, you must set a sizelimit on absolutely
everything. The fact that nobody ever finished the depth-checking code means
that it's hard to discover where the holes are in your schemas. The fact that
forward references require the use of Any (#18) makes it less likely that you
can even try to get a bounded schema. The internal limitations (like FURL
size in Gifts) cause surprises even if the !RemoteInterface writer didn't add
any. And the likely remaining unbounded holes in the deserialization code
mean that all of this work is probably moot anyway.

== The Future ==

So, I'm going to remove deserialization-based resource-consumption-attack
mitigation from Foolscap. !RemoteInterface will still be there, and any
typelimits and sizelimits you impose will still be honored, but they'll be
applied to the deserialized arguments, rather than to the incoming token
stream. Your {{{foo=StringConstraint(1999)}}} argument will still be guaranteed to
never see a 2kB string (or a list or int), but an attacker will be able to
crash your process by sending a 2TB string (or an infinite list, etc).
Removing the token-type checkers from the deserialization code should
simplify things, and maybe help performance a little bit.
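
In other words, the checks move from token headers to deserialized values,
roughly like this sketch (I am assuming a checkObject-style hook on the
constraint; the 1999/2000 numbers just mirror the example above):

{{{
#!python
# Sketch of the new behavior: the constraint is applied to the
# fully-deserialized value, after the 2000-byte string has already
# been built in memory.
from foolscap.schema import StringConstraint
from foolscap.tokens import Violation

constraint = StringConstraint(1999)
try:
    constraint.checkObject('x' * 2000, True)  # inbound=True
except Violation:
    pass  # still rejected, but only after deserialization
}}}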

I'll also open up a ticket about implementing the proxy-based approach. It
might be feasible, and would be nice to have. It's certainly more likely to
solve the problem than the deserialization-based approach.

(note: [http://allmydata.org/trac/tahoe/ticket/690 Tahoe#690] is an
expression of the 200-byte gift-FURL limit problem)
"	task	new	major	eventually	schemas	0.4.1			
