#149 closed defect (fixed)
O(n**2) CPU/malloc during receive of large strings
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.5.1 |
Component: | performance | Version: | 0.5.0 |
Keywords: | Cc: |
Description
While tracking down some performance problems in Tahoe, I found a smoking gun in Foolscap's receive-side handling of large tokens (in particular large strings, more than about 10MB).
Specifically, the inbound token parser, which maintains a single linear buffer of all unhandled inbound data, performs a copy for each dataReceived chunk. For large strings, each copy handles more bytes than the previous, until the whole token has been received. In addition, the start of the parser loop performs a copy of the post-typebyte body, which again grows in size for large strings.
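The quadratic cost is easy to see with a back-of-the-envelope sketch. This is illustrative code, not Foolscap's; it only models the "copy the whole buffer on every chunk" behavior described above:

```python
# Illustrative sketch (not Foolscap code): a single linear receive buffer
# copies the entire unconsumed buffer on every dataReceived() chunk, so
# the total bytes copied while accumulating one large token grow
# quadratically with the token size.
def total_bytes_copied(token_size, chunk_size):
    copied = 0
    buffered = 0
    while buffered < token_size:
        buffered = min(buffered + chunk_size, token_size)
        copied += buffered  # 'buffer = buffer + chunk' copies the whole buffer
    return copied

# Doubling the token size roughly quadruples the total copying (~4x),
# which matches the unpack times in the table below.
ratio = (total_bytes_copied(2_000_000, 10_000)
         / total_bytes_copied(1_000_000, 10_000))
```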
Some quick performance tests (sending a large string to a receiver which merely discards it) show the following. "pack" is the time for the sender to serialize the data, "unpack" is the time for the recipient to deserialize it (and also includes network transit time, but this was on a LAN), and the right-most number is the increase in the recipient's "VmPeak" value, giving a rough estimate of the impact on memory footprint:
```
#  size    sender       recipient
#  40kB:   742us pack,  40ms unpack,  0B
#  0.5MB:  2.4ms pack,  56ms unpack,  1.44MB
#  1MB:    4.2ms pack,  198ms unpack, 2.96MB
#  2MB:    7.3ms pack,  726ms unpack, 5.96MB
#  4MB:    14ms pack,   2.8s unpack,  11.95MB
#  8MB:    26ms pack,   9.6s unpack,  23.96MB
# 16MB:    52ms pack,   36s unpack,   47.96MB
# 32MB:    103ms pack,  150s unpack,  95.95MB
# 64MB:    206ms pack,  593s unpack,  191.96MB
```
Clearly the sender's time is linear, the recipient's memory is linear (but about 3x the payload size, much larger than I'd like), and the unpacking time is quadratic.
I have some plans to fix this.. more in a followup comment.
Attachments (2)
Change History (19)
comment:1 Changed 15 years ago by
comment:2 Changed 15 years ago by
The line of code in question is visible on this trac at this link: foolscap/banana.py@2a063689bfdb#L813.
comment:3 Changed 15 years ago by
I don't understand Brian's quick notes from comment:1. Oh! The problem is not `rest=buffer[pos+1]`, it is `rest=buffer[pos+1:]`.
comment:4 Changed 15 years ago by
I've had success in the past with using a Python list of references to Python strings instead of a single Python string. So for example `self.buffer` as used in `handleData()` could be a list, could be renamed `self.bufferchain`, and could be accompanied by a `self.bufferchainignore`, which is an integer giving how many of the leading bytes of the first element of `self.bufferchain` to ignore, and a `self.bufferchainlen`, which is the sum of the lengths of the elements excluding the initial `self.bufferchainignore` bytes.
foolscap/banana.py@2a063689bfdb#L710 would say:
```python
self.bufferchain.append(chunk)
```
instead of:
```python
buffer = self.buffer + chunk
```
Of course there are some places where you want to process the leading few bytes of the input, such as `for ch in buffer:` in foolscap/banana.py@2a063689bfdb#L722 or `buffer = rest[8:]` on foolscap/banana.py@2a063689bfdb#L963. What you do there is join the first and second elements of `self.bufferchain` until the size of the first element of `self.bufferchain` is sufficient to handle the processing you need in a single string. (Technically O(N**2), but that size is capped at 64 for foolscap/banana.py@2a063689bfdb#L722, and 8 for foolscap/banana.py@2a063689bfdb#L963.)
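The head-coalescing trick described above can be sketched in a few lines (the function name is hypothetical; this is not code from the patch):

```python
# Sketch (hypothetical helper) of the "coalesce just the head" trick:
# merge leading chunks only until the first element is big enough for a
# fixed-size processing step, leaving the rest of the chain untouched.
def ensure_head(bufferchain, n):
    # bufferchain: list of strings, oldest data first
    while len(bufferchain) > 1 and len(bufferchain[0]) < n:
        bufferchain[0] = bufferchain[0] + bufferchain.pop(1)
    return bufferchain
```

Since n is a small constant (64 or 8 here), each call copies at most a bounded amount of data no matter how large the buffered token is.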
Finally, the places in `handleData()` that check whether the size is >= strlen and then do something, such as foolscap/banana.py@2a063689bfdb#L910, can be rewritten to test `self.bufferchainlen` and then do something specific for that type.

foolscap/banana.py@2a063689bfdb#L910 is the important case that is the cause of Tahoe-LAFS's issue http://allmydata.org/trac/tahoe-lafs/ticket/983 . Here is how I would write it.
Instead of the current:
```python
elif typebyte == STRING:
    strlen = header
    if len(rest) >= strlen:
        # the whole string is available
        buffer = rest[strlen:]
        obj = rest[:strlen]
```
you could have:
```python
elif typebyte == STRING:
    strlen = header
    if self.bufferchainlen >= strlen:
        # the whole string is available
        # trim off the ignored bytes from the first element of the buffer chain
        self.bufferchain[0] = self.bufferchain[0][self.bufferchainignore:]
        self.bufferchainignore = 0
        # find which element is the final one we need to put together the
        # result string
        l = 0
        i = 0
        while l < strlen:
            l += len(self.bufferchain[i])
            i += 1
        # collect the elements we need
        strchain = self.bufferchain[:i]
        # trim off any trailing bytes that aren't part of the result string
        strchain[-1] = strchain[-1][:len(strchain[-1]) - (l - strlen)]
        # construct the result string
        obj = ''.join(strchain)
        # remove completely consumed elements from the bufferchain
        self.bufferchain = self.bufferchain[i-1:]
        # ignore the already-consumed leading bytes of the partially
        # consumed element
        self.bufferchainignore = len(self.bufferchain[0]) - (l - strlen)
```
This is uglier than the old version! And it requires careful testing of edge cases, such as what happens when this code leaves `self.bufferchainignore == len(self.bufferchain[0])`, or what happens when at various places in the code `self.bufferchain` becomes an empty list.

However, this code is relatively efficient, requiring only two relatively small copies of data (one at the beginning and one at the end of the buffer chain) and finally only one copy of the entire string.
Like I said, I've used this idiom in the past in Mojo Nation, Mnet, and Mountain View and it worked out.
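A stripped-down, encapsulated sketch of this idiom might look like the following. This is illustrative only; the class and method names are assumptions, not the actual stringchain API:

```python
# Minimal sketch of the list-of-strings buffer idiom: O(1) appends on
# receive, and popleft(n) copies only the bytes being consumed (plus at
# most one partial-chunk split), instead of the whole remaining buffer.
from collections import deque

class StringChainSketch:
    def __init__(self):
        self.chunks = deque()   # unconsumed chunks, oldest first
        self.length = 0         # total unconsumed bytes

    def append(self, chunk):
        self.chunks.append(chunk)   # no copy on receive
        self.length += len(chunk)

    def popleft(self, n):
        # Gather just enough chunks to return n bytes.
        assert n <= self.length
        pieces, got = [], 0
        while got < n:
            head = self.chunks.popleft()
            pieces.append(head)
            got += len(head)
        if got > n:
            # split the last chunk; push its unconsumed tail back
            pieces[-1], tail = pieces[-1][:n - got], pieces[-1][n - got:]
            self.chunks.appendleft(tail)
        self.length -= n
        return ''.join(pieces)
```

The point of the encapsulation is exactly what is argued above: the fiddly index arithmetic lives in one small, separately testable class rather than being interleaved with the protocol parser.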
comment:5 Changed 15 years ago by
```python
# remove completely consumed elements from the bufferchain
self.bufferchain = self.bufferchain[i-1:]
```
Oh, I guess that ought to be done like:
```python
# remove completely consumed elements from the bufferchain
del self.bufferchain[:i-1]
```
comment:6 Changed 15 years ago by
I cooked up a class which encapsulates this idiom of managing a list of strings. Hopefully a class like this could manage these chunks of data efficiently without complicating (i.e. buggifying) the surrounding code in banana.py.
Changed 15 years ago by
Attachment: | 149.WIP.diff added |
---|
here's my current WIP patch. Tests are still failing, but I like the overall design.
comment:7 Changed 15 years ago by
the patch I just attached is how I'm thinking of addressing this. It splits the receive stack into two layers. The lower layer alternates between WANT_HEADER and WANT_BODY, and uses a list-of-strings to efficiently produce tokens. These tokens are then fed to the upper layer.
I'm afraid I haven't looked closely enough at stringchain to see if it would meet my needs. I suspect that the alternating-modes stuff means foolscap is a weird use case. But, once I get the upper-layer stuff working and tests passing, I'll look at it and see if it'd be cleaner to use stringchain (or its approach) instead of the "bufferChunks" implementation that I came up with.
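The two-layer split described in this comment could be sketched roughly as follows. The framing here is a netstring-style stand-in (ASCII length, ':', then the body), not Banana's actual token encoding; it only illustrates the lower layer alternating between WANT_HEADER and WANT_BODY while feeding whole tokens to an upper-layer callback:

```python
# Sketch of a two-layer receive stack (illustrative, not the WIP patch):
# the lower layer alternates WANT_HEADER/WANT_BODY over a list-of-strings
# buffer, joining chunks only when a complete unit is available.
WANT_HEADER, WANT_BODY = "WANT_HEADER", "WANT_BODY"

class Tokenizer:
    def __init__(self, receive_token):
        self.receive_token = receive_token  # the "upper layer"
        self.mode = WANT_HEADER
        self.chunks = []      # list-of-strings receive buffer
        self.buflen = 0       # total buffered bytes
        self.bodylen = 0      # body length from the last parsed header

    def data_received(self, chunk):
        self.chunks.append(chunk)
        self.buflen += len(chunk)
        while True:
            if self.mode == WANT_HEADER:
                joined = ''.join(self.chunks)  # headers are tiny
                colon = joined.find(':')
                if colon == -1:
                    self.chunks, self.buflen = [joined], len(joined)
                    return
                self.bodylen = int(joined[:colon])
                rest = joined[colon + 1:]
                self.chunks, self.buflen = [rest], len(rest)
                self.mode = WANT_BODY
            else:
                # WANT_BODY: accumulate chunks with no copying until the
                # whole body has arrived, then join exactly once.
                if self.buflen < self.bodylen:
                    return
                joined = ''.join(self.chunks)
                self.receive_token(joined[:self.bodylen])
                rest = joined[self.bodylen:]
                self.chunks, self.buflen = [rest], len(rest)
                self.mode = WANT_HEADER
```

The key property is that while in WANT_BODY a large string is joined exactly once, when complete, rather than re-copied on every incoming chunk.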
comment:8 Changed 15 years ago by
Argh. As far as I can tell your `bufferChunks` is almost the same as my `stringchain`, except:

- the `bufferChunks` logic is mixed into the `Banana` logic, so it is harder (for me) to tell what it does
- `bufferChunks` is probably a tad slower than `stringchain`, since I've carefully optimized and benchmarked `stringchain`
- `bufferChunks` is probably buggier than `stringchain`, since I have tests of `stringchain`'s behavior as an encapsulated object in isolation (and it turns out that I wrote plenty of bugs in `stringchain` which were caught by the tests...)

I'll see if I can hammer my point home more sometime, for example by copying your `bufferChunks` code out of this patch and into stringchain's tests and benchmarks, or by writing benchmarks of `Banana` and then writing a patch that replaces `bufferChunks` (per your new WIP patch) with `stringchain`.
Anyway, I do wonder why you didn't read http://tahoe-lafs.org/trac/stringchain/browser/stringchain/stringchain.py . Is it hard to read? Did you start to read it and something about it turned you off so you stopped? If so, what?
Thanks.
comment:9 Changed 15 years ago by
Here is a patch which plugs `stringchain` into the current HEAD of foolscap. (I started by trying to apply `stringchain` to your WIP patch but I couldn't figure out how to swap out the `bufferChunks`.) It took me a long time to fix all the bugs in this, and of course the only reason that I fixed them is that you have such good unit tests. However the resulting patch is actually pretty straightforward -- it replaces `self.buffer = ''` with `self.buffer = StringChain()`, and it changes the pattern of keeping two different buffers and swapping their pointers to the pattern of popping things from the front of the buffer and then pushing them back onto the front if it turns out that you aren't ready to consume them.
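The pop-and-pushback pattern described here can be sketched as follows. The StringChain-like stub and the header format (ASCII digits terminated by ':') are stand-ins for illustration, not the patch's actual code or Banana's encoding:

```python
# Sketch of the pop-and-pushback pattern: pop a candidate off the front,
# and if it can't be consumed yet (or only partially), push the unused
# bytes back onto the front of the buffer.
class ChainStub:
    # minimal stand-in for a StringChain-like API
    def __init__(self, data=''):
        self.data = data
    def __len__(self):
        return len(self.data)
    def popleft(self, n):
        head, self.data = self.data[:n], self.data[n:]
        return head
    def appendleft(self, s):
        self.data = s + self.data

def try_parse_header(sc, max_header=65):
    candidate = sc.popleft(min(max_header, len(sc)))
    end = candidate.find(':')
    if end == -1:
        # no complete header yet: push everything back untouched
        sc.appendleft(candidate)
        return None
    # push back only the bytes that follow the header
    sc.appendleft(candidate[end + 1:])
    return int(candidate[:end])
```

Note the fragility discussed later in this ticket: between the popleft and the appendleft, the popped bytes exist only in a local variable, so any unexpected exit loses them.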
I added a benchmark script, which you can run like this:
```
HACL Wonwin-McBrootles-Computer:~/playground/foolscap/foolscap-trunk$ PYTHONPATH= python -OOu foolscap/test/bench_banana.py
1000 best: 2.980e+01, 3th-best: 3.004e+01, mean: 3.285e+01, 3th-worst: 3.195e+01, worst: 5.293e+01 (of 10)
10000 best: 6.080e+00, 3th-best: 6.294e+00, mean: 6.409e+00, 3th-worst: 6.604e+00, worst: 6.604e+00 (of 10)
100000 best: 4.774e+01, 3th-best: 4.809e+01, mean: 5.050e+01, 3th-worst: 5.208e+01, worst: 5.566e+01 (of 10)
1000000 best: 4.439e+02, 3th-best: 4.446e+02, mean: 4.489e+02, 3th-worst: 4.511e+02, worst: 4.601e+02 (of 10)
10000000 best: 5.377e+03, 1th-best: 5.377e+03, mean: 5.414e+03, 1th-worst: 5.450e+03, worst: 5.450e+03 (of 2)
HACL Wonwin-McBrootles-Computer:~/playground/foolscap/foolscap-trunk$ cd ../stringchain/
HACL Wonwin-McBrootles-Computer:~/playground/foolscap/stringchain$ PYTHONPATH= python -OOu foolscap/test/bench_banana.py
1000 best: 5.388e+01, 3th-best: 5.507e+01, mean: 5.991e+01, 3th-worst: 5.698e+01, worst: 9.322e+01 (of 10)
10000 best: 1.121e+01, 3th-best: 1.142e+01, mean: 1.166e+01, 3th-worst: 1.159e+01, worst: 1.259e+01 (of 10)
100000 best: 8.929e+00, 3th-best: 9.038e+00, mean: 1.068e+01, 3th-worst: 1.114e+01, worst: 1.917e+01 (of 10)
1000000 best: 8.766e+00, 3th-best: 8.929e+00, mean: 9.466e+00, 3th-worst: 9.750e+00, worst: 1.147e+01 (of 10)
10000000 best: 9.486e+00, 3th-best: 9.499e+00, mean: 9.849e+00, 3th-worst: 9.751e+00, worst: 1.192e+01 (of 10)
```
The left-hand column is the number of bytes of the string being decoded. The units are nanoseconds per byte.
Changed 15 years ago by
Attachment: | stringchain.patch.txt added |
---|
comment:10 Changed 15 years ago by
Oh there are two other things in this patch that need comment:
```diff
@@ -701,26 +703,22 @@ class Banana(protocol.Protocol):
         # buffer, assemble into tokens
         # call self.receiveToken(token) with each
         if self.skipBytes:
-            if len(chunk) < self.skipBytes:
+            if len(chunk) <= self.skipBytes:
                 # skip the whole chunk
                 self.skipBytes -= len(chunk)
                 return
```
I guess this was just a tiny mistake in the original.
```diff
--- a/foolscap/test/test_copyable.py Sun Mar 14 18:44:52 2010 -0700
+++ b/foolscap/test/test_copyable.py Thu Mar 25 02:53:09 2010 -0600
@@ -143,8 +143,8 @@ class Copyable(TargetMixin, unittest.Tes
         self.failUnlessEqual(f2.frames, [])
         self.failUnlessEqual(f2.tb, None)
         self.failUnlessEqual(f2.stack, [])
-        self.failUnless(f2.traceback.find("raise RuntimeError") != -1,
-                        "no 'raise RuntimeError' in '%s'" % (f2.traceback,))
+        self.failUnless(f2.traceback.find("RuntimeError") != -1,
+                        "no 'RuntimeError' in '%s'" % (f2.traceback,))

     def testFailure2(self):
         self.callingBroker.unsafeTracebacks = False
```
I got a test failure because it was generating a string saying something like "exception RuntimeError" instead of "raise RuntimeError". When I added debug prints to learn more about it and re-ran the tests, it stopped happening. I decided it was too late at night to care about such things, so I made this change.
comment:11 follow-up: 16 Changed 15 years ago by
Hey, this looks great! I'm going to apply this patch, make a few changes, and release the result as 0.5.1. Some comments:

- the `if len(chunk) < self.skipBytes:` was a bug.. thanks for the catch! It was pretty harmless though, I think it would merely have resulted in an extra loop pass, so just a tiny performance hit in a very rare case.
- The `f2.traceback.find("raise RuntimeError")` clause was intentional.. the test asserts that the copied failure contains a traceback (with source), and "raise" is only present if that is the case. I've seen this test fail in environments where the source is not available (or at least not available to the traceback module), such as when the .pyc is present but not the .py, or some sys.path or sys.modules magic has confused the location of the source, or when the source is coming from an .egg and pathnames are all messed up. It passes for me, so I'm going to leave it in place.
- I copied in your test suite, adapted it to trial, and removed the dependency on pyutil.

After 0.5.1 is out, I plan to adapt my WIP cleanup effort to use stringchain, since I really want the simplification of separating the header-scanning routine from the body-scanning code. To do this, I'll be wanting a "peekleft" method in stringchain, which would return the first N bytes of the string but not remove them. I'll use this in the header-scanning section, to non-destructively grab the first 65 bytes and look for a complete header. If and only if the header is there, I'll then use `stringchain.trim` to discard them. I'd prefer this to the pop-and-pushback approach that you were forced to use. (I apologize for the behavior of that loop and applaud your efforts to fix it with minimal changes! I look forward to killing that whole function in 0.5.2.)
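A peekleft() along these lines could be sketched as follows, against an assumed list-of-strings representation (this is illustrative, not stringchain's actual code):

```python
# Sketch of a possible peekleft(): return the first n bytes of a
# list-of-strings buffer without consuming them. When the first chunk
# alone covers the request, a single slice suffices and the chain is
# never re-packed.
def peekleft(chunks, n):
    # chunks: list of strings, oldest data first
    if chunks and len(chunks[0]) >= n:
        return chunks[0][:n]          # common case: one slice, no join
    pieces, got = [], 0
    for c in chunks:
        pieces.append(c)
        got += len(c)
        if got >= n:
            break
    return ''.join(pieces)[:n]        # returns fewer bytes if n > total
```

Because nothing is removed, a non-local exit (such as a raised Violation) during header scanning cannot lose buffered bytes, which is the advantage over pop-and-pushback discussed later in this ticket.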
I think I didn't investigate stringchain earlier because:
- I was turned off by the idea of an entire Trac instance for a single class
- I didn't understand what I needed until I wrote something similar myself
- I didn't want to use the pop-and-pushback approach, and believed (correctly, as it turns out) that stringchain wouldn't have a peek() method (although it's easy to implement, having just done it myself..)
Thanks for writing this.. it will save me a lot of time, especially with the test suite.
comment:12 Changed 15 years ago by
Milestone: | 0.6.0 → 0.5.1 |
---|
comment:13 Changed 15 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:14 Changed 15 years ago by
I reviewed [a7221ebb6a06d89a561027f54aea591c60c79a3a] and it looks right to me. You tidied up a couple of things that my patch didn't do quite right, such as logging the first 65 bytes as well as the next 200 in case of an illegally large prefix.
However, did you forget to `hg add stringchain.py`?
comment:15 Changed 15 years ago by
Here is the request to upload foolscap v0.5.1 to Lucid: https://bugs.launchpad.net/ubuntu/+source/foolscap/+bug/548993
comment:16 Changed 15 years ago by
Replying to warner:
- I was turned off by the idea of an entire Trac instance for a single class
I'm not surprised! I know this about you. I do think it is kind of funny. :-) I've been considering including stringchain in pyutil, but at least for the time being I wanted to draw more attention to stringchain by announcing it as its own package. Anyway, I'll probably fold it into pyutil in the future. How would you feel about that? Good, bad, indifferent?
It would be nice if foolscap's use of stringchain and other people's uses of stringchain could share new features (such as peek), bugfixes, improved unit tests, optimizations, etc...
Anyway, just to be clear, are you saying that you would have looked at stringchain.py more closely had I attached stringchain.py to a ticket on foolscap instead of putting a link to a trac instance?
- I didn't understand what I needed until I wrote something similar myself
I know how that goes.
- I didn't want to use the pop-and-pushback approach, and believed (correctly, as it turns out) that stringchain wouldn't have a peek() method (although it's easy to implement, having just done it myself..)
Can you say more about this? I guess you want this for clarity of code rather than for performance. One might think that pop-and-push has worse performance than peek, but I'll bet that the performance isn't worse.
Thanks for writing this.. it will save me a lot of time, especially with the test suite.
You're welcome! Thanks for accepting my patch! :-)
comment:17 Changed 15 years ago by
Yeah, I think I'd have looked at it slightly sooner if you'd attached stringchain.py to this ticket, or perhaps a summary of its API. But really, it was more about the frame of mind I was in than the way stringchain was presented.
As for pop-and-pushback: I remain nervous about your code in banana.py, because it follows a pattern of:
1. pop 65 bytes off
2. think about it a lot, maybe pull more, maybe push some back
3. when done, push any unused portion back
If an exception occurs during this sequence, it won't do step three, and the inbound buffer will be corrupted. I checked all the code paths (as I'm sure you did too), and confirmed that there are no unexpected exits from the loop, but it's an awfully big function, and some future author might add a 'break' or a 'return' or an exception that is then caught by some enclosing function, without paying careful attention to the need to push exactly the right set of (unused) bytes back onto the buffer.
(of course, you know that I know that your patch was exactly the right way to do it, given the goal of minimizing the delta so we could get this released quickly)
The replacement code (in my WIP) would reduce the size of this "precious data is being held in a temporary variable and needs to be pushed back onto the input stream or else" window, by splitting the code into separate WANT_HEADER and WANT_BODY methods, since once you're in the WANT_BODY phase, we don't need any peek-ahead. But I'd still like the WANT_HEADER code to survive a non-local exit (most likely a 'raise Violation') without losing sync.
Of course, I could also accomplish this with:
```python
left65 = sc.popleft(65)
sc.appendleft(left65)
PROCESS(left65)...
sc.trim(len(header)+1)
```
I'd probably want a peekleft() function that was more efficient than that, else I'd just go with this pop-then-pushback-right-away style. In particular, if the buffer contains a lot of small tokens (perhaps missing the header completely), I'd like to avoid doing a lot of string copies while parsing it. So I kinda want peekleft() to give me back a buffer object if possible. Also, I'd like to avoid chopping the buffer into a bunch of tiny pieces.
It might be useful to benchmark a second case, in which you deliver a bunch of small tokens, say INT(5) which can be expressed in two bytes. If you get a 2KiB buffer full of 1024 INTs, how many string copies get made?
Here are some quick notes: