Opened 15 years ago

Last modified 14 years ago

#117 new enhancement

performance improvements: receive-side tokenization, send-side streamingness

Reported by: Brian Warner Owned by:
Priority: major Milestone: undecided
Component: banana Version: 0.3.0
Keywords: performance Cc:

Description

Brian Granger asked me (in email) to identify some of the places where we might be able to improve Foolscap's performance. Here's my list.

What is your sense of where the bottlenecks in Foolscap are that would affect latency? Am I correct that send and produce are the primary candidates for optimization?

Yes, I think you're right. I'm seeing 15ms operation times in a similar Tahoe environment. I don't really know how much of that time the two endpoints are spending themselves, but I expect that a lot of that 15ms is in Foolscap serialization/deserialization. (We're considering moving to a simpler HTTP-based protocol for Tahoe, because the higher-layer data structures already have encryption and integrity checks, so we don't really need those at the transport level.)

I haven't spent any time trying to optimize/improve performance (or really even measuring it) yet.

First thing I'd do is to figure out whether to focus on the sending side or on the receiving side. To measure the receiving side separately, I'd generate a bunch of pre-serialized test cases, instantiate a Banana object, poke it a bit to make it think that it's undergone negotiation already, attach it to a Broker that will ignore receiveChild() (so the message-send won't actually do anything), then time/profile banana.dataReceived(teststring). Measuring the sending side is easier, since serialization is complete by the time callRemote() returns. (There is additional time spent by Twisted moving the serialized data from the Transport's send buffer into the kernel, but that's probably pretty efficient compared to what Foolscap is doing.)
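Something like this for the dataReceived() timing (a minimal sketch: building a Banana that believes negotiation already happened, and stubbing its Broker so receiveChild() is a no-op, are Foolscap internals left to the caller here):

    import cProfile, pstats

    def profile_rx(banana, teststring, iterations=1000):
        # time repeated delivery of one pre-serialized message through
        # the tokenizer/deserializer, with the upcall stubbed out
        def run():
            for _ in range(iterations):
                banana.dataReceived(teststring)
        prof = cProfile.Profile()
        prof.runcall(run)
        pstats.Stats(prof).sort_stats("cumulative").print_stats(15)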

The lowest-hanging fruit I can think of is the receive-side tokenization loop, in Banana.handleData(), since it uses a lot of string copies (one per extracted token). You might try:

  • modify Banana.handleData() to use python "buffer" objects instead of doing so many string copies: the "rest = buffer[pos+1:]" slice is the biggest culprit, doing a fresh copy for each received token (see the offset-based sketch after this list)
  • write a C function that takes a buffer of input and returns a list of tokens plus a leftover string, and use that to drive the loop instead of that big "while" loop (i.e. move the tokenization into C). Have that function return tokens of type (int, float, str, tuple), with tuples being used to indicate OPEN, CLOSE, ABORT, LIST, etc:
    tokens, rest = c_tokenizer(buffer)
    for t in tokens:
        t1 = type(t)
        if t1 is tuple:
            if t[0] == OPEN:
                self.handleOpen(t[1:]) ...
            elif t[0] == CLOSE:
                self.handleClose(t[1:]) ...
        elif t1 in (int, long, float, str):
            obj = t
            ...
  • inline some of the handleFOO() methods, to reduce the number of function calls
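Here's a rough sketch of the offset-based idea from the first bullet. The token format is simplified to a fixed one-byte length header purely for illustration (real banana tokens use base-128 length headers), and Python 2 byte strings are assumed:

    def tokenize(data, offset=0):
        # Instead of "rest = buffer[pos+1:]" after every token (an
        # O(remaining-bytes) copy each time), keep one incrementing
        # offset and slice out only the token bodies.
        tokens = []
        end = len(data)
        while offset + 2 <= end:
            length = ord(data[offset])       # fake 1-byte length header
            if offset + 2 + length > end:
                break                        # incomplete token: wait for more data
            tokens.append(data[offset+2 : offset+2+length])  # copy just the body
            offset += 2 + length
        return tokens, offset   # caller keeps data[offset:] as the leftover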

The performance of the tokenizer will depend upon how many tokens arrive in a single TCP hunk: an argument object graph with lots of nodes will cause lots of copies. The size of each copy will depend upon how much data is in the graph: you might benchmark

callRemote("foo", for i in range(1000)?)

separately from

callRemote("foo", ["a"*10 for i in range(100)])

to measure the difference. The first will create 1000 STRING("a") tokens (each of which will be 3 bytes long), the second will create 100 STRING("a"*10) tokens (each of which will be 12 bytes long). Assuming that both the 3000ish-long message and the 1200ish-long message make it into a single socket read() call, the first will do 1000 copies of successively shorter leftover strings, while the second will only do 100.
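(Reusing the profile_rx() harness sketched above; serialize_call() is a hypothetical helper that captures the bytes callRemote() would put on the wire, not a real Foolscap API.)

    many_small   = serialize_call("foo", ["a" for i in range(1000)])
    fewer_larger = serialize_call("foo", ["a"*10 for i in range(100)])
    profile_rx(banana, many_small)    # ~1000 copies with the current loop
    profile_rx(banana, fewer_larger)  # ~100 copies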

If the workload includes a lot of simple graphs of primitive types (list of ints, no cycles), you might be able to have a C-based tokenizer look at the full list of tokens that came out of the current hunk, locate the matched OPEN/CLOSE pairs, and create composite python objects directly (rough sketch below). This would probably break under more complicated object graphs, so using it safely might require a protocol change, like a new OPEN_UNSHARED token (which promises that nobody will ever reference any of its child objects from outside, so you don't need to keep track of openID numbers for them).
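A rough sketch of that direct assembly, using the tuple-marked tokens from the c_tokenizer example above (shared references and cycles are exactly what this cannot handle):

    OPEN, CLOSE = "OPEN", "CLOSE"   # stand-ins for the real token markers

    def assemble(tokens):
        # fold a flat token list into nested python lists by matching
        # OPEN/CLOSE pairs; assumes a simple acyclic, unshared graph
        stack = [[]]
        for t in tokens:
            if type(t) is tuple and t[0] == OPEN:
                stack.append([])            # start a new child container
            elif type(t) is tuple and t[0] == CLOSE:
                child = stack.pop()
                stack[-1].append(child)     # attach finished child to its parent
            else:
                stack[-1].append(t)         # primitive token
        return stack[0]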

On the transmit side, I'd guess the main bottleneck to be the overly flexible serialization scheme. In particular, any Slicer can return a Deferred, so the banana serialization loop must be prepared to handle that at any time, even though this is only used for streaming serialization and there isn't any API to enable that. It must also be prepared to handle a new Slicer. And it's using a lot of delegate-to-self (which introduces extra method calls, but was handy for testing). Banana.produce() is the method to look at.

I've got one note in there which suggests an optimization to try (cache 'next'). I'd also look at inlining some of the delegate-to-self methods (making produce() bigger). Some other options:

  • stop sending the openID numbers with the OPEN and CLOSE tokens: those are mainly there to detect programming errors that cause loss-of-sync, and aren't needed once you've got confidence in the code
  • find a way to avoid tracking references for every object we send out. There's a dictionary that maps openID to object (used to avoid getting trapped in cycles, and to allow the recipient to correctly reconstruct cycles in the object graph). We might be putting more objects into this dictionary than we really need to (but maybe not).
  • consider ripping out produce() and replacing it with a synchronous stack-based serializer, possibly keyed by a Schema note that says "I promise that all of my arguments can be serialized synchronously, none of them contains a shared reference or a cycle, and the diameter of the graph is smaller than python's maximum stack depth".
    • if that is true, the serialization code can use Slicers as usual, but drive them with a different loop that isn't quite so flexible. It could special-case all the primitive types (dict, list, tuple, set, int, str) instead of using the general-purpose Slicers; that could be a lot faster (see the sketch after this list).
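A minimal sketch of that less-flexible loop, assuming the schema promise holds (no shared references, no cycles). Token encoding is faked with plain values and tuples, and anything outside the special cases would fall back to the general Slicer machinery:

    _CLOSE = object()   # sentinel: emit a CLOSE token when popped

    def serialize_simple(obj, emit):
        # synchronous, non-recursive walk: no Deferreds, no per-object
        # Slicer instances for the common container types
        # (Python 2 era, hence "long", matching the ticket)
        stack = [obj]
        while stack:
            o = stack.pop()
            if o is _CLOSE:
                emit("CLOSE")
            elif type(o) in (int, long, float, str):
                emit(o)                            # primitive: a single token
            elif type(o) in (list, tuple, set):
                emit(("OPEN", type(o).__name__))
                stack.append(_CLOSE)               # close after all children
                stack.extend(reversed(list(o)))    # children in original order
            elif type(o) is dict:
                emit(("OPEN", "dict"))
                stack.append(_CLOSE)
                for k, v in o.items():
                    stack.append(v)                # popped after its key
                    stack.append(k)
            else:
                raise TypeError("needs the general Slicer path: %r" % (o,))

Feed a graph that does contain a cycle through this loop and it never terminates, which is why the schema promise matters.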

But start with measuring the receive-side tokenizer. I'm sure there's a pretty simple way to use buffer objects and replace that "rest = buffer[pos+1:]" with a simple incrementing offset (as in the tokenize() sketch above): that would eliminate the string copies and probably give you a big win on the rx side.

Change History (2)

comment:1 Changed 14 years ago by Brian Warner

Component: performance → banana
Keywords: performance added

comment:2 Changed 14 years ago by Brian Warner

Summary: potential performance improvements → performance improvements: receive-side tokenization, send-side streamingness