Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#149 closed defect (fixed)

O(n**2) CPU/malloc during receive of large strings

Reported by: Brian Warner Owned by:
Priority: major Milestone: 0.5.1
Component: performance Version: 0.5.0
Keywords: Cc:


While tracking down some performance problems in Tahoe, I found a smoking gun in Foolscap's receive-side handling of large tokens (in particular large strings, more than about 10MB).

Specifically, the inbound token parser, which maintains a single linear buffer of all unhandled inbound data, performs a copy for each dataReceived chunk. For large strings, each copy handles more bytes than the previous, until the whole token has been received. In addition, the start of the parser loop performs a copy of the post-typebyte body, which again grows in size for large strings.

Some quick performance tests (sending a large string to a receiver which merely discards it) show the following. "pack" is the time for the sender to serialize the data, "unpack" is the time for the recipient to deserialize it (this also includes network transit time, but the test was run on a LAN), and the right-most number is the increase in the recipient's "VmPeak" value, giving a rough estimate of the impact on memory footprint:

# size     sender            recipient
#  40kB: 742us pack,  40ms unpack,       0B
# 0.5MB: 2.4ms pack,  56ms unpack,   1.44MB
#   1MB: 4.2ms pack, 198ms unpack,   2.96MB
#   2MB: 7.3ms pack, 726ms unpack,   5.96MB
#   4MB:  14ms pack,  2.8s unpack,  11.95MB
#   8MB:  26ms pack,  9.6s unpack,  23.96MB
#  16MB:  52ms pack,   36s unpack,  47.96MB
#  32MB: 103ms pack,  150s unpack,  95.95MB
#  64MB: 206ms pack,  593s unpack, 191.96MB

Clearly the sender's time is linear, the recipient's memory is linear (but about 3x, much larger than I'd like), and the unpacking time is quadratic.

I have some plans to fix this.. more in a followup comment.
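The quadratic behavior can be checked with a little arithmetic (a hypothetical sketch, not code from foolscap): count the bytes copied when a large token arrives in fixed-size dataReceived chunks, comparing the naive concatenating buffer against a list-of-chunks approach.

```python
# Hypothetical sketch (not foolscap code): count bytes copied when a
# large token arrives in fixed-size dataReceived chunks.

def bytes_copied_naive(total_size, chunk_size):
    """'buffer = self.buffer + chunk' re-copies the whole buffer each
    time, so total copying is roughly S*S/(2*C) bytes."""
    copied = 0
    buffered = 0
    for _ in range(total_size // chunk_size):
        buffered += chunk_size
        copied += buffered   # the concatenation copies the full new buffer
    return copied

def bytes_copied_chained(total_size, chunk_size):
    """Appending chunks to a list costs only a reference each; one final
    ''.join() copies each byte once."""
    return total_size

S, C = 1_000_000, 4_000
print(bytes_copied_naive(S, C))    # 125500000: ~125 MB copied for a 1 MB string
print(bytes_copied_chained(S, C))  # 1000000
```

This matches the ~125 MB of copying for a 1 MB string mentioned in the notes below, and explains why unpack time grows quadratically while pack time stays linear.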

Attachments (2)

149.WIP.diff (40.8 KB) - added by Brian Warner 14 years ago.
here's my current WIP patch. Tests are still failing, but I like the overall design.
stringchain.patch.txt (11.5 KB) - added by Zooko 14 years ago.


Change History (19)

comment:1 Changed 14 years ago by Brian Warner

Here are some quick notes:

# banana.Banana.handleChunk#L814, "rest=buffer[pos+1]" is the
# post-TYPEBYTE pointer. This could probably be replaced with a buffer
# object, because for large tokens, this will invoke a successively
# larger and larger string copy on each dataReceived call.

# later, "buffer=rest" means the token has no body: OPEN/CLOSE/ABORT and
# "buffer=rest[LEN:] , x=rest[:LEN]" is where the body gets consumed.
#  (STRING/LONGINT/LONGNEG have variable length, INT/NEG/FLOAT are constant)

# large tokens suffer from the "rest=buffer[pos+1]" copy. If we have a
# 1MB string and get 4KB dataReceived chunks, we'll perform 250 copies
# with a total of about 125MB copied. They also suffer from the
# "buffer=self.buffer+chunk" at the beginning of handleData(): this will
# perform an additional 250 copies over about 125MB of data. Worse, this
# is O(N**2). If the string is S bytes long, and we get chunks of size C,
# we'll perform 2*S/C copies, with an average size of S/2, for a total
# size of S*S/C. Ouch.

# lots of small tokens will suffer from the "buffer=rest[LEN:]" copy, if
# we can extract a lot of tokens from a single buffer. The number of
# tokens processed in a single dataReceived() call is bounded by the
# chunk size (expected 4KB) and the minimum token size. Worst case would
# be a chunk full of OPENs, CLOSEs, or small INTs, since those are only
# a few bytes long. callRemote("rx",[0 for i in range(4096)]) would
# probably do it. Again, if "buffer=rest[LEN:]" actually used a buffer
# object instead of a copy, this would go away. The "x=rest[:LEN]" should
# continue to use a copy, so that the inbound chunk can be retired
# quickly.
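As a side note on the "buffer object" idea in the notes above: in modern Python, slicing a memoryview yields a zero-copy view, which is one way to get this effect. An illustrative sketch only, not foolscap's actual code:

```python
# Illustrative sketch (not foolscap's actual code): the "buffer object"
# mentioned above corresponds to Python's memoryview, whose slices are
# zero-copy views rather than new string objects.
data = b"X" * 1_000_000

tail_copy = data[1:]           # bytes slice: copies ~1 MB every time
mv = memoryview(data)
tail_view = mv[1:]             # memoryview slice: no copy, just a view

assert len(tail_view) == len(tail_copy)
assert tail_view.tobytes() == tail_copy   # materialize only when needed
```

With this, "rest = buffer[pos+1:]" need not copy the tail of the buffer on every dataReceived call; a real copy happens only when the token body is finally extracted.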

comment:2 Changed 14 years ago by Zooko

The line of code in question is visible on this trac at this link: foolscap/

comment:3 Changed 14 years ago by Zooko

I don't understand Brian's quick notes from comment:1. Oh! The problem is not rest=buffer[pos+1], it is rest=buffer[pos+1:].

comment:4 Changed 14 years ago by Zooko

I've had success in the past with using a Python list of references to Python strings instead of a single Python string. For example, self.buffer as used in handleData() could be a list: it could be renamed self.bufferchain, and could be accompanied by a self.bufferchainignore (an integer giving how many of the leading bytes of the first element of self.bufferchain to ignore) and a self.bufferchainlen (the sum of the lengths of the elements, excluding the initial self.bufferchainignore bytes).

foolscap/ would say:


instead of:

        buffer = self.buffer  + chunk

Of course there are some places where you want to process the leading few bytes of the input, such as for ch in buffer: in foolscap/ or buffer = rest[8:] in foolscap/ What you do there is join the first and second elements of self.bufferchain until the first element is big enough to handle the processing you need in a single string. (Technically that joining is O(N**2), but N is small: 64 for foolscap/ and 8 for foolscap/) Finally, the places in handleData() that check whether the size is >= strlen and then do something, such as foolscap/, can be rewritten to test self.bufferchainlen and then do something specific for that type. foolscap/ is the important case that causes Tahoe-LAFS's issue . Here is how I would write it.

Instead of the current:

elif typebyte  == STRING:
    strlen = header
    if len(rest) >= strlen:
        # the whole string is available
        buffer = rest[strlen:]
        obj = rest[:strlen]

you could have:

elif typebyte == STRING:
    strlen = header
    if self.bufferchainlen >= strlen:
        # the whole string is available
        # trim off the ignored bytes from the first element of the buffer chain
        self.bufferchain[0] = self.bufferchain[0][self.bufferchainignore:]
        self.bufferchainignore = 0
        # find which element is the final one we need to put together the result string
        l = 0; i = 0
        while l < strlen:
            l += len(self.bufferchain[i])
            i += 1
        # collect the elements we need
        strchain = self.bufferchain[:i]
        # the last element may contain (l - strlen) trailing bytes that are
        # not part of the result string; keep only its consumed prefix
        lastconsumed = len(strchain[-1]) - (l - strlen)
        strchain[-1] = strchain[-1][:lastconsumed]
        # construct the result string
        obj = ''.join(strchain)

        # remove completely consumed elements from the bufferchain
        self.bufferchain = self.bufferchain[i-1:]
        # ignore the leading (already consumed) bytes of the last element
        self.bufferchainignore = lastconsumed

This is uglier than the old version! And it requires careful testing of edge cases such as what happens when this code leaves self.bufferchainignore == len(self.bufferchain[0]) or what happens when at various places in the code self.bufferchain becomes an empty list.

However, this code is relatively efficient, requiring only two relatively small copies of data (one at the beginning and one at the end of the buffer chain) and finally only one copy of the entire string.

Like I said, I've used this idiom in the past in Mojo Nation, Mnet, and Mountain View and it worked out.
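For illustration, the idiom could be encapsulated roughly like this (a minimal hypothetical sketch, not the actual stringchain class — the name and API here are invented):

```python
from collections import deque

class StringChainSketch:
    """Minimal sketch of the list-of-strings idiom (hypothetical; not
    Zooko's actual stringchain). Appends are O(1); popleft(n) copies
    only the n bytes being consumed."""

    def __init__(self):
        self._chunks = deque()
        self._len = 0

    def append(self, s):
        # storing a reference is O(1); no bytes are copied here
        self._chunks.append(s)
        self._len += len(s)

    def __len__(self):
        return self._len

    def popleft(self, n):
        """Remove and return the first n bytes (requires n <= len(self))."""
        assert n <= self._len
        parts, need = [], n
        while need > 0:
            head = self._chunks.popleft()
            if len(head) <= need:
                parts.append(head)
                need -= len(head)
            else:
                # split the chunk: consume a prefix, push the rest back
                parts.append(head[:need])
                self._chunks.appendleft(head[need:])
                need = 0
        self._len -= n
        return ''.join(parts)

sc = StringChainSketch()
sc.append("hello ")
sc.append("world")
print(sc.popleft(8))   # -> hello wo
print(len(sc))         # -> 3
```

Keeping the bookkeeping inside one tested class is exactly what avoids the edge-case bugs (empty chain, fully-consumed first element) that the inline version above has to worry about.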

comment:5 Changed 14 years ago by Zooko

        # remove completely consumed elements from the bufferchain
        self.bufferchain = self.bufferchain[i-1:]

Oh, I guess that ought to be done like:

        # remove completely consumed elements from the bufferchain
        del self.bufferchain[:i-1]

comment:6 Changed 14 years ago by Zooko

I cooked up a class which encapsulates this idiom of managing a list of strings. Hopefully a class like this could manage these chunks of data efficiently without complicating (i.e. buggifying) the surrounding code in

Changed 14 years ago by Brian Warner

Attachment: 149.WIP.diff added

here's my current WIP patch. Tests are still failing, but I like the overall design.

comment:7 Changed 14 years ago by Brian Warner

the patch I just attached is how I'm thinking of addressing this. It splits the receive stack into two layers. The lower layer alternates between WANT_HEADER and WANT_BODY, and uses a list-of-strings to efficiently produce tokens. These tokens are then fed to the upper layer.

I'm afraid I haven't looked closely enough at stringchain to see if it would meet my needs. I suspect that the alternating-modes stuff means foolscap is a weird use case. But, once I get the upper-layer stuff working and tests passing, I'll look at it and see if it'd be cleaner to use stringchain (or its approach) instead of the "bufferChunks" implementation that I came up with.

comment:8 Changed 14 years ago by Zooko

Argh. As far as I can tell your bufferChunks is almost the same as my stringchain except:

  • bufferChunks logic is mixed into the Banana logic, so it is harder (for me) to tell what it does
  • bufferChunks is probably a tad slower than stringchain, since I've carefully optimized and benchmarked stringchain
  • bufferChunks is probably buggier than stringchain, since I have tests of stringchain's behavior as an encapsulated object in isolation; (And it turns out that I wrote plenty of bugs in stringchain which were caught by the tests...)

I'll see if I can hammer my point home more sometime, for example by copying your bufferChunks code out of this patch and into stringchain's tests and benchmarks, or by writing benchmarks of Banana and then writing a patch that replaces bufferChunks (per your new WIP patch) with stringchain. Anyway, I do wonder why you didn't read . Is it hard to read? Did you start to read it and something about it turned you off so you stopped? If so, what? Thanks.

comment:9 Changed 14 years ago by Zooko

Here is a patch which plugs stringchain into the current HEAD of foolscap. (I started by trying to apply stringchain to your WIP patch but I couldn't figure out how to swap out the bufferChunks.) It took me a long time to fix all the bugs in this, and of course the only reason that I fixed them is that you have such good unit tests. However the resulting patch is actually pretty straightforward -- it replaces self.buffer = '' with self.buffer = StringChain(), and it changes the pattern of keeping two different buffers and swapping their pointers with the pattern of popping things from the front of the buffer and then pushing the things back onto the front if it turns out that you aren't ready to consume them.

I added a benchmark script, which you can run like this:

HACL Wonwin-McBrootles-Computer:~/playground/foolscap/foolscap-trunk$ PYTHONPATH= python -OOu foolscap/test/
    1000 best: 2.980e+01,   3th-best: 3.004e+01, mean: 3.285e+01,   3th-worst: 3.195e+01, worst: 5.293e+01 (of     10)
   10000 best: 6.080e+00,   3th-best: 6.294e+00, mean: 6.409e+00,   3th-worst: 6.604e+00, worst: 6.604e+00 (of     10)
  100000 best: 4.774e+01,   3th-best: 4.809e+01, mean: 5.050e+01,   3th-worst: 5.208e+01, worst: 5.566e+01 (of     10)
 1000000 best: 4.439e+02,   3th-best: 4.446e+02, mean: 4.489e+02,   3th-worst: 4.511e+02, worst: 4.601e+02 (of     10)
10000000 best: 5.377e+03,   1th-best: 5.377e+03, mean: 5.414e+03,   1th-worst: 5.450e+03, worst: 5.450e+03 (of      2)
HACL Wonwin-McBrootles-Computer:~/playground/foolscap/foolscap-trunk$ cd ../stringchain/
HACL Wonwin-McBrootles-Computer:~/playground/foolscap/stringchain$ PYTHONPATH= python -OOu foolscap/test/
    1000 best: 5.388e+01,   3th-best: 5.507e+01, mean: 5.991e+01,   3th-worst: 5.698e+01, worst: 9.322e+01 (of     10)
   10000 best: 1.121e+01,   3th-best: 1.142e+01, mean: 1.166e+01,   3th-worst: 1.159e+01, worst: 1.259e+01 (of     10)
  100000 best: 8.929e+00,   3th-best: 9.038e+00, mean: 1.068e+01,   3th-worst: 1.114e+01, worst: 1.917e+01 (of     10)
 1000000 best: 8.766e+00,   3th-best: 8.929e+00, mean: 9.466e+00,   3th-worst: 9.750e+00, worst: 1.147e+01 (of     10)
10000000 best: 9.486e+00,   3th-best: 9.499e+00, mean: 9.849e+00,   3th-worst: 9.751e+00, worst: 1.192e+01 (of     10)

The left-hand column is the number of bytes in the string being decoded. The timings are in nanoseconds per byte.

Changed 14 years ago by Zooko

Attachment: stringchain.patch.txt added

comment:10 Changed 14 years ago by Zooko

Oh there are two other things in this patch that need comment:

@@ -701,26 +703,22 @@ class Banana(protocol.Protocol):
         # buffer, assemble into tokens
         # call self.receiveToken(token) with each
         if self.skipBytes:
-            if len(chunk) < self.skipBytes:
+            if len(chunk) <= self.skipBytes:
                 # skip the whole chunk
                 self.skipBytes -= len(chunk)

I guess this was just a tiny mistake in the original.

--- a/foolscap/test/    Sun Mar 14 18:44:52 2010 -0700
+++ b/foolscap/test/    Thu Mar 25 02:53:09 2010 -0600
@@ -143,8 +143,8 @@ class Copyable(TargetMixin, unittest.Tes
         self.failUnlessEqual(f2.frames, [])
         self.failUnlessEqual(f2.tb, None)
         self.failUnlessEqual(f2.stack, [])
-        self.failUnless(f2.traceback.find("raise RuntimeError") != -1,
-                        "no 'raise RuntimeError' in '%s'" % (f2.traceback,))
+        self.failUnless(f2.traceback.find("RuntimeError") != -1,
+                        "no 'RuntimeError' in '%s'" % (f2.traceback,))
     def testFailure2(self):
         self.callingBroker.unsafeTracebacks = False

I got a test failure because it was generating a string saying something like "exception RuntimeError" instead of saying "raise RuntimeError". When I added debugprints to learn more about it and re-ran the tests then it stopped happening. I decided it was too late at night to care about such things so I made this change.

comment:11 Changed 14 years ago by Brian Warner

Hey, this looks great! I'm going to apply this patch, make a few changes, and release the result as 0.5.1 . Some comments:

  • the if len(chunk) < self.skipBytes: was a bug.. thanks for the catch! It was pretty harmless though, I think it would merely have resulted in an extra loop pass, so just a tiny performance hit in a very rare case.
  • The f2.traceback.find("raise RuntimeError") clause was intentional.. the test asserts that the copied failure contains a traceback (with source), and "raise" is only present if that is the case. I've seen this test fail in environments where the source is not available (or at least not available to the traceback module), such as when the .pyc is present but not the .py, or some sys.path or sys.modules magic has confused the location of the source, or when the source is coming from an .egg and pathnames are all messed up. It passes for me, so I'm going to leave it in place.
  • I copied in your test suite, adapted it to trial, and removed the dependency on pyutil

After 0.5.1 is out, I plan to adapt my WIP cleanup effort to use stringchain, since I really want the simplification of separating the header-scanning routine from the body-scanning code. To do this, I'll be wanting a "peekleft" method in stringchain, which would return the first N bytes of the string but not remove them. I'll use this in the header-scanning section, to non-destructively grab the first 65 bytes and look for a complete header. If and only if the header is there, I'll then use stringchain.trim to discard them. I'd prefer this to the pop-and-pushback approach that you were forced to use. (I apologize for the behavior of that loop and applaud your efforts to fix it with minimal changes! I look forward to killing that whole function in 0.5.2.)

I think I didn't investigate stringchain earlier because:

  • I was turned off by the idea of an entire Trac instance for a single class
  • I didn't understand what I needed until I wrote something similar myself
  • I didn't want to use the pop-and-pushback approach, and believed (correctly, as it turns out) that stringchain wouldn't have a peek() method (although it's easy to implement, having just done it myself..)

Thanks for writing this.. it will save me a lot of time, especially with the test suite.
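The peekleft behavior described above could be sketched like so over a plain list of strings (hypothetical; the real stringchain API may differ):

```python
def peekleft(chunks, n):
    """Return the first n bytes of a list of strings without consuming
    them (sketch; 'chunks' is a plain list standing in for stringchain)."""
    parts, have = [], 0
    for chunk in chunks:
        if have >= n:
            break
        take = chunk[:n - have]   # copies only the bytes being peeked
        parts.append(take)
        have += len(take)
    return ''.join(parts)

buf = ["OPEN", "65:header-bytes...", "BODY"]
print(peekleft(buf, 6))   # -> OPEN65
assert buf == ["OPEN", "65:header-bytes...", "BODY"]   # chain is unchanged
```

Because nothing is removed, a failed header parse needs no compensating pushback; a separate trim() call commits the consumption only once a complete header is found.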

comment:12 Changed 14 years ago by Brian Warner


comment:13 Changed 14 years ago by Brian Warner

Resolution: fixed
Status: new → closed

comment:14 Changed 14 years ago by Zooko

I reviewed [a7221ebb6a06d89a561027f54aea591c60c79a3a] and it looks right to me. You tidied up a couple of things that my patch didn't do quite right, such as logging the first 65 bytes as well as the next 200 in case of an illegally large prefix.

However, did you forget to hg add

comment:15 Changed 14 years ago by Zooko

Here is the request to upload foolscap v0.5.1 to Lucid:

comment:16 in reply to:  11 Changed 14 years ago by Zooko

Replying to warner:

  • I was turned off by the idea of an entire Trac instance for a single class

I'm not surprised! I know this about you. I do think it is kind of funny. :-) I've been considering including stringchain in pyutil, but at least for the time being I wanted to draw more attention to stringchain by announcing it as its own package. Anyway, I'll probably fold it into pyutil in the future. How would you feel about that? Good, bad, indifferent?

It would be nice if foolscap's use of stringchain and other people's uses of stringchain could share new features (such as peek), bugfixes, improved unit tests, optimizations, etc...

Anyway, just to be clear, are you saying that you would have looked at it more closely had I attached it to a ticket on foolscap instead of putting a link to a trac instance?

  • I didn't understand what I needed until I wrote something similar myself

I know how that goes.

  • I didn't want to use the pop-and-pushback approach, and believed (correctly, as it turns out) that stringchain wouldn't have a peek() method (although it's easy to implement, having just done it myself..)

Can you say more about this? I guess you want this for clarity of code rather than for performance. One might think that pop-and-push has worse performance than peek, but I'll bet that the performance isn't worse.

Thanks for writing this.. it will save me a lot of time, especially with the test suite.

You're welcome! Thanks for accepting my patch! :-)

comment:17 Changed 14 years ago by Brian Warner

Yeah, I think I'd have looked at it slightly sooner if you'd attached to this ticket, or perhaps a summary of its API. But really, it was more about the frame of mind I was in than the way stringchain was presented.

As for pop-and-pushback: I remain nervous about your code in, because it follows a pattern of:

  • 1: pop 65 bytes off
  • 2: think about it a lot, maybe pull more, maybe push some back
  • 3: when done, push any unused portion back

If an exception occurs during this sequence, it won't do step three, and the inbound buffer will be corrupted. I checked all the code paths (as I'm sure you did too), and confirmed that there are no unexpected exits from the loop, but it's an awfully big function, and some future author might add a 'break' or a 'return' or an exception that is then caught by some enclosing function, without paying careful attention to the need to push exactly the right set of (unused) bytes back onto the buffer.

(of course, you know that I know that your patch was exactly the right way to do it, given the goal of minimizing the delta so we could get this released quickly)

The replacement code (in my WIP) would reduce the size of this "precious data is being held in a temporary variable and needs to be pushed back onto the input stream or else" window, by splitting the code into separate WANT_HEADER and WANT_BODY methods, since once you're in the WANT_BODY phase, we don't need any peek-ahead. But I'd still like the WANT_HEADER code to survive a non-local exit (most likely a 'raise Violation') without losing sync.
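One way to shrink that window is to guarantee the pushback on every exit path with try/finally. A toy sketch (TinyBuf, popleft, and appendleft are invented names here, standing in for whatever the real buffer class provides — this is not foolscap's or stringchain's API):

```python
class TinyBuf:
    """Toy buffer for the sketch: a single string with popleft/appendleft."""
    def __init__(self, s):
        self.s = s
    def __len__(self):
        return len(self.s)
    def popleft(self, n):
        head, self.s = self.s[:n], self.s[n:]
        return head
    def appendleft(self, s):
        self.s = s + self.s

def scan_header(buf):
    """Pop up to 65 bytes, parse a 'NN:' header, and push back whatever
    was not consumed -- even if parsing raises mid-way."""
    head = buf.popleft(min(65, len(buf)))
    consumed = 0
    try:
        colon = head.find(":")
        if colon == -1:
            raise ValueError("no complete header yet")  # e.g. a Violation
        consumed = colon + 1
        return head[:consumed]
    finally:
        # runs on normal return AND on exception: restore unused bytes
        buf.appendleft(head[consumed:])

b = TinyBuf("12:somebodybytes")
print(scan_header(b))   # -> 12:
print(b.s)              # -> somebodybytes
```

Even a future 'return', 'break', or caught exception inside the try block cannot skip the finally clause, so the inbound buffer stays in sync.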

Of course, I could also accomplish this with:

 left65 = sc.popleft(65)

I'd probably want a peekleft() function that was more efficient than that, else I'd just go with this pop-then-pushback-right-away style. In particular, if the buffer contains a lot of small tokens (perhaps missing the header completely), I'd like to avoid doing a lot of string copies while parsing it. So I kinda want peekleft() to give me back a buffer object if possible. Also, I'd like to avoid chopping the buffer into a bunch of tiny pieces.

It might be useful to benchmark a second case, in which you deliver a bunch of small tokens, say INT(5) which can be expressed in two bytes. If you get a 2KiB buffer full of 1024 INTs, how many string copies get made?
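A rough count for that worst case, under the old "buffer = rest[LEN:]" pattern (hypothetical arithmetic, not a measurement of foolscap):

```python
# Hypothetical arithmetic (not measured): with the old "buffer = rest[LEN:]"
# pattern, extracting each small token re-copies the remaining buffer.

def bytes_copied_small_tokens(buffer_size, token_size):
    copied, remaining = 0, buffer_size
    while remaining > 0:
        remaining -= token_size
        copied += remaining   # the rest[LEN:] slice copies the whole tail
    return copied

# a 2 KiB chunk holding 1024 two-byte INT tokens:
print(bytes_copied_small_tokens(2048, 2))   # 1047552: ~1 MiB of copying
```

So even a single 2 KiB chunk of tiny tokens would have cost about a mebibyte of copying under the old scheme, which is the case this proposed benchmark would expose.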
