id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
117	performance improvements: receive-side tokenization, send-side streamingness	Brian Warner		"Brian Granger asked me (in email) to identify some of the places where we
might be able to improve Foolscap's performance. Here's my list.

> What is your sense of where the bottlenecks in Foolscap are that
> would affect latency?  Am I correct that send and produce are the
> primary candidates for optimization?

Yes, I think you're right. I'm seeing 15ms operation times in a similar Tahoe
environment. I don't really know how much of that time the two endpoints are
spending themselves, but I expect that a lot of that 15ms is in Foolscap
serialization/deserialization. (We're considering moving to a simpler
HTTP-based protocol for Tahoe, because the higher-layer data structures
already have encryption and integrity checks, so we don't really need them at
the transport level.)

I haven't spent any time trying to optimize/improve performance (or really
even measuring it) yet.

First thing I'd do is to figure out whether to focus on the sending side or
on the receiving side. To measure the receiving side separately, I'd generate
a bunch of pre-serialized test cases, instantiate a Banana object, poke it a
bit to make it think that it's undergone negotiation already, attach it to a
Broker that will ignore receiveChild() (so the message-send won't actually do
anything), then time/profile banana.dataReceived(teststring). Measuring the
sending side is easier, since serialization is complete by the time
callRemote() returns. (There is additional time spent by Twisted moving the
serialized data from the Transport's send buffer into the kernel, but that's
probably cheap compared to what Foolscap is doing.)
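
To make that receive-side measurement concrete, here's a rough sketch. The
stub Broker and the exact state you have to poke into the Banana instance to
fake a completed negotiation are guesses on my part; check banana.py for the
real attribute names:

{{{
import time
from foolscap import banana

class NullBroker:
    # stand-in Broker: receiveChild() discards everything, so the
    # decoded message-send never actually executes
    def receiveChild(self, obj, ready_deferred=None):
        pass

def bench_rx(teststring, iterations=1000):
    total = 0.0
    for _ in xrange(iterations):
        b = banana.Banana()
        # ... set whatever state Banana normally acquires during
        # negotiation, and attach a NullBroker() as its broker ...
        start = time.time()
        b.dataReceived(teststring)
        total += time.time() - start
    return total / iterations  # average seconds per test string
}}}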

The lowest-hanging fruit I can think of is the receive-side tokenization
loop, in Banana.handleData(), since it uses a lot of string copies (one per
extracted token). You might try:

 * modify Banana.handleData() to use python ""buffers"" instead of doing so
   many string copies: the ""rest = buffer[pos+1:]"" is the biggest culprit,
   doing a copy for each received token (see the sketch after this list)

 * write a C function that takes a buffer of input and returns a list of
   tokens plus a leftover string, and use that to drive the loop instead of
   that big ""while"" loop (i.e. move the tokenization into C). Have that
   function return tokens of type (int, float, str, tuple), with tuples being
   used to indicate OPEN, CLOSE, ABORT, LIST, etc:

{{{
    tokens, rest = c_tokenizer(buffer)
    for t in tokens:
        t1 = type(t)
        if t1 is tuple:
            if t[0] == OPEN:
                self.handleOpen(t[1:]) ...
            elif t[0] == CLOSE:
                self.handleClose(t[1:]) ...
        elif t1 in (int, long, float, str):
            obj = t
            ...
}}}

 * inline some of the handleFOO() methods, to reduce the number of function
   calls
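
Here's a rough sketch of the offset-based loop from the first bullet. The
token format is deliberately simplified (a one-byte length prefix, which is
not the real Banana wire format); the point is that the offset just advances,
and the one leftover copy happens once per dataReceived() call instead of
once per token:

{{{
def tokenize(data):
    tokens = []
    offset = 0
    while offset < len(data):
        length = ord(data[offset])   # toy one-byte length prefix
        end = offset + 1 + length
        if end > len(data):
            break                    # incomplete token: wait for more data
        tokens.append(data[offset+1:end])
        offset = end                 # advance; no 'rest = data[...:]' copy
    return tokens, data[offset:]     # one tail copy per hunk, not per token
}}}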

The cost of the tokenizer will depend upon how many tokens arrive in a
single TCP hunk: an argument object graph with lots of nodes will do lots of
copies, and the size of each copy will depend upon how much data is in the
graph. You might benchmark

  callRemote(""foo"", [""a"" for i in range(1000)])

separately from

  callRemote(""foo"", [""a""*10 for i in range(100)])

to measure the difference. The first will create 1000 STRING(""a"") tokens
(each 3 bytes long), the second will create 100 STRING(""a""*10) tokens (each
12 bytes long). Assuming that both the 3000ish-byte message and the
1200ish-byte message make it into a single socket read() call, the first will
do 1000 copies of successively shorter leftover strings, while the second
will only do 100.
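
A quick way to compare the two shapes, assuming you already hold a
RemoteReference 'rref' to an object with a suitable remote method (getting
one via a Tub is omitted here, and the method name 'foo' is just a
placeholder):

{{{
import time
from twisted.internet import defer

@defer.inlineCallbacks
def bench(rref, payload, label, n=100):
    start = time.time()
    for _ in xrange(n):
        yield rref.callRemote('foo', payload)
    elapsed = (time.time() - start) / n
    print '%s: %.1f ms per call' % (label, elapsed * 1000)

many_small = ['a' for i in range(1000)]   # 1000 3-byte STRING tokens
few_large = ['a'*10 for i in range(100)]  # 100 12-byte STRING tokens
}}}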

If the workload includes a lot of simple graphs of primitive types (list of
ints, no cycles), you might be able to have a C-based tokenizer look at the
full list of tokens that came out of the current hunk, locate the matched
OPEN/CLOSE pairs, and create composite python objects directly. This would
probably break under more complicated object graphs, so it might require a
change to the protocol to use safely, like a new OPEN_UNSHARED token (which
promises that nobody will ever reference any of its child objects from
outside, so you don't need to keep track of openID numbers for them).
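
As a sketch of that idea in Python (the real version would live inside the C
tokenizer), using the tuple-tagged token stream from the c_tokenizer example
above, and valid only for tree-shaped graphs with no shared references or
cycles:

{{{
OPEN, CLOSE = 'OPEN', 'CLOSE'  # toy stand-ins for the real token markers

def build_composites(tokens):
    stack = [[]]                     # bottom list collects top-level objects
    for t in tokens:
        if type(t) is tuple and t[0] == OPEN:
            stack.append([])         # start a new container
        elif type(t) is tuple and t[0] == CLOSE:
            done = stack.pop()
            stack[-1].append(done)   # attach it to its parent
        else:
            stack[-1].append(t)      # primitive token
    return stack[0]
}}}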


On the transmit side, I'd guess the main bottleneck to be the overly flexible
serialization scheme. In particular, any Slicer can return a Deferred, so the
banana serialization loop must be prepared to handle that at any time, even
though this is only used for streaming serialization and there isn't any API
to enable that. It must also be prepared for a new Slicer to be pushed at
any point. And it uses a lot of delegate-to-self (which introduces extra
method calls, but was handy for testing). Banana.produce() is the method to
look at.
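
As a caricature of why that loop is expensive (this is not Banana's actual
code, and all the names here are made up): every child pulled from the
current Slicer has to be checked for a Deferred and for needing a new Slicer,
even though most calls never use either:

{{{
from twisted.internet import defer

def produce_loop(stack, emit_token, slicer_for, resume):
    # 'stack' holds iterators of children; 'slicer_for' returns a new
    # child iterator for container objects, or None for leaves
    while stack:
        try:
            child = stack[-1].next()
        except StopIteration:
            stack.pop()                       # this Slicer is finished
            continue
        if isinstance(child, defer.Deferred): # streaming: rare, but must
            child.addCallback(resume)         # be checked on every child
            return
        sub = slicer_for(child)
        if sub is not None:
            stack.append(sub)                 # new Slicer pushed mid-stream
        else:
            emit_token(child)                 # plain leaf token
}}}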

I've got one note in there which suggests an optimization to try (cache
'next'). I'd also look at inlining some of the delegate-to-self methods
(making produce() bigger). Some other options:

 * stop sending the openID numbers with the OPEN and CLOSE tokens: those are
   mainly there to detect programming errors that cause loss-of-sync, and
   aren't needed once you've got confidence in the code
 * find a way to avoid tracking references for every object we send out.
   There's a dictionary that maps openID to object (used to avoid getting
   trapped in cycles, and to allow the recipient to correctly reconstruct
   cycles in the object graph). We might be putting more objects into this
   dictionary than we really need to (but maybe not).
 * consider ripping out produce() and replacing it with a synchronous
   stack-based serializer, possibly keyed by a Schema note that says ""I
   promise that all of my arguments can be serialized synchronously, none of
   them contains a shared reference or a cycle, and the diameter of the graph
   is smaller than python's maximum stack depth"".
  * if that is true, the serialization code can use Slicers as usual, but
    drive them with a different loop that isn't quite so flexible. It could
    special-case all the primitive types (dict, list, tuple, set, int, str)
    instead of using the general-purpose Slicers; that could be a lot
    faster (sketched below).
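
That special-cased loop might look roughly like this. The 'emit' object with
one method per token type is a hypothetical stand-in for Banana's real token
encoders, and anything outside the promised constraints would fall back to
the normal Slicer path:

{{{
def serialize_fast(obj, emit):
    t = type(obj)
    if t is int or t is long:
        emit.send_int(obj)
    elif t is float:
        emit.send_float(obj)
    elif t is str:
        emit.send_string(obj)
    elif t is list or t is tuple or t is set:
        emit.send_open(t.__name__)
        for child in obj:
            serialize_fast(child, emit)   # recursion bounded by graph depth
        emit.send_close()
    elif t is dict:
        emit.send_open('dict')
        for k, v in obj.iteritems():
            serialize_fast(k, emit)
            serialize_fast(v, emit)
        emit.send_close()
    else:
        raise NotImplementedError(obj)    # fall back to the Slicer path
}}}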


But start with measuring the receive-side tokenizer. I'm sure there's a
pretty simple way to use buffer objects and replace that ""rest =
buffer[pos+1:]"" with an incrementing offset: that would eliminate the
per-token string copies and probably give you a big win on the rx side.
"	enhancement	new	major	undecided	banana	0.3.0		performance	
