-*- outline -*-

Reasonably independent newpb sub-tasks that need doing. Most important come first.

* decide on a version negotiation scheme

Should be able to telnet into a PB server and find out that it is a PB server. Pointing a PB client at an HTTP server (or an HTTP client at a PB server) should result in an error, not a timeout. Implement in banana.Banana.connectionMade().

desiderata:

 negotiation should take place with regular banana sequences: don't invent a new protocol that is only used at the start of the connection

 Banana should be usable one-way, for storage or high-latency RPC (the mnet folks want to create a method call, serialize it to a string, then encrypt and forward it on to other nodes, sometimes storing it in relays along the way if a node is offline for a few days). It should be easy for the layer above Banana to feed it the results of what its negotiation would have been (if it had actually used an interactive connection to its peer). Feeding the same results to both sides should have them proceed as if they'd agreed to those results.

 negotiation should be flexible enough to be extended but still allow old code to talk with new code. Magically predict every conceivable extension and provide for it from the very first release :).

 There are many levels to banana, all of which could be useful targets of negotiation:

  which basic tokens are in use? Is there a BOOLEAN token? a NONE token? Can it accept a LONGINT token or is the target limited to 32-bit integers?

  are there any variations in the basic Banana protocol being used? Could the smaller-scope OPEN-counter decision be deferred until after the first release and handled later with a compatibility negotiation flag?

  What "base" OPEN sequences are known? 'unicode'? 'boolean'? 'dict'? This is an overlap between expressing the capabilities of the host language, the Banana implementation, and the needs of the application. How about 'instance', probably only used for StorageBanana?

  What "top-level" OPEN sequences are known? PB stuff (like 'call', and 'your-reference')? Are there any variations or versions that need to be known? We may add new functionality in the future, and it might be useful for one end to know whether this functionality is available or not. (The PB 'call' sequence could some day take numeric argument names to convey positional parameters, a 'reference' sequence could take a string to indicate globally-visible PB URLs, and it could become possible to pass target.remote_foo directly to a peer and have a callable RemoteMethod object pop out the other side.)

  What "application-level" sequences are available? (Which RemoteInterface classes are known and valid in 'call' sequences? Which RemoteCopy names are valid for targets of the 'copy' sequence?) This is not necessarily within the realm of Banana negotiation, but applications may need to negotiate this sort of thing, and any disagreements will be manifested when Banana starts raising Violations, so it may be useful to include it in the Banana-level negotiation.

On the other hand, negotiation is only useful if one side is prepared to accommodate a peer which cannot do some of the things it would prefer to use, or if it wants to know about the incapabilities so it can report a useful failure rather than have an obscure protocol-level error message pop up an hour later. So negotiation isn't the only goal: simple capability awareness is a useful lesser goal.

It kind of makes sense for the first object of a stream to be a negotiation blob.
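As a rough illustration only (hypothetical keys, not a committed format), such a blob could be as simple as a heavily-constrained dict describing the levels listed above:

 # hypothetical negotiation blob: the first object a peer emits,
 # describing which tokens and OPEN sequences it understands
 offer = {
     "banana-version": 1,
     "tokens": ["BOOLEAN", "NONE", "LONGINT"],          # basic tokens accepted
     "base-opentypes": ["unicode", "boolean", "dict"],
     "top-opentypes": ["call", "your-reference"],
 }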
We could make a new 'version' opentype, and declare that the contents will be something simple and forever-after-parseable (like a dict, with heavy constraints on the keys and values, all strings emitted in full).

DONE, at least the framework is in place. Uses HTTP-style header-block exchange instead of banana sequences, with client-sends-first and server-decides. This correctly handles PB-vs-HTTP, but requires a timeout to detect oldpb clients vs newpb servers. No actual feature negotiation is performed yet, because we still only have the one version of the code.

* connection initiation

** define PB URLs

[newcred is the most important part of this, the URL stuff can wait]

A URL defines an endpoint: a pb.Referenceable, with methods. Somewhere along the way it defines a transport (tcp+host+port, or unix+path) and an object reference (pathname). It might also define a RemoteInterface, or that might be put off until we actually invoke a method.

 URL = f("pb:", host, port, pathname)
 d = pb.callRemoteURL(URL, ifacename, methodname, args)

probably give an actual RemoteInterface instead of just its name

a pb.RemoteReference claims to provide access to zero-or-more RemoteInterfaces. You may choose which one you want to use when invoking callRemote.

TODO: decide upon a syntax for URLs that refer to non-TCP transports: pb+foo://stuff, pby://stuff (for yURL-style self-authenticating names)

TODO: write the URL parser, implementing pb.getRemoteURL and pb.callRemoteURL
 DONE: use a Tub/PBService instead

TODO: decide upon a calling convention for callRemote when specifying which RemoteInterface is being used.
 DONE, PB-URL is the way to go.

** more URLs

relative URLs (those without a host part) refer to objects on the same Broker. Absolute URLs (those with a host part) refer to objects on other Brokers.

SKIP, interesting but not really useful

** build/port pb.login: newcred for newpb

Leave cred work for Glyph, who has some enhanced PB cred stuff (challenge/response, pb.Copyable credentials, etc).

 URL = pb.parseURL("pb://lothar.com:8789/users/warner/services/petmail", IAuthorization)
 URL = doFullLogin(URL, "warner", "x8yzzy")
 URL.callRemote(methodname, args)

NOTDONE

* constrain ReferenceUnslicer properly

The schema can use a ReferenceConstraint to indicate that the object must be a RemoteReference, and can also require that the remote object be capable of handling a particular Interface. This needs to be implemented: slicer.ReferenceUnslicer must somehow actually ask the constraint about the incoming tokens.

An outstanding question is "what counts". The general idea is that RemoteReferences come over the wire as a connection-scoped ID number and an optional list of Interface names (strings and version numbers). In this case it is the far end which asserts that its object can implement any given Interface, and the receiving end just checks to see if the schema-imposed required Interface is in the list.

This becomes more interesting when applied to local objects, or if a constraint is created which asserts that its object is *something* (maybe a RemoteReference, maybe a RemoteCopy) which implements a given Interface. In this case, the incoming object could be an actual instance, but the class name must be looked up in the unjellyableRegistry (and the class located, and the __implements__ list consulted) before any of the object's tokens are accepted.

* decide upon what the "Shared" constraint should mean

The idea of this one was to avoid some vulnerabilities by rejecting arbitrary object graphs.
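As a quick illustration of the kind of graph at issue (hypothetical call; under the default strict-tree rules described below, the shared list would be rejected):

 # ref: an existing pb.RemoteReference
 shared = [1, 2, 3]
 # the same list object is referenced twice within a single method call
 ref.callRemote("process", first=shared, second=shared)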
Fundamentally Banana can represent most anything (just like pickle), including objects that refer to each other in exciting loops and whorls. There are two problems with this: it is hard to enforce a schema that allows cycles in the object graph (indeed it is tricky to even describe one), and the shared references could be used to temporarily violate a schema.

I think these might be fixable (the sample case is where one tuple is referenced in two different places, each with a different constraint, but the tuple is incomplete until some higher-level node in the graph has become referenceable, so [maybe] the schema can't be enforced until somewhat after the object has actually finished arriving).

However, Banana is aimed at two different use-cases. One is kind of a replacement for pickle, where the goal is to allow arbitrary object graphs to be serialized but have more control over the process (in particular we still have an unjellyableRegistry to prevent arbitrary constructors from being executed during deserialization). In this mode, a larger set of Unslicers are available (for modules, bound methods, etc), and schemas may still be useful but are not enforced by default.

PB will use the other mode, where the set of conveyable objects is much smaller, and security is the primary goal (including putting limits on resource consumption). Schemas are enforced by default, and all constraints default to sensible size limits (strings to 1k, lists to [currently] 30 items). Because complex object graphs are not commonly transported across process boundaries, the default is to not allow any Copyable object to be referenced multiple times in the same serialization stream. The default is to reject both cycles and shared references in the object graph, allowing only strict trees, making life easier (and safer) for the remote methods which are being given this object tree.

The "Shared" constraint is intended as a way to turn off this default strictness and allow the object to be referenced multiple times. The outstanding question is what this should really mean: must it be marked as such on all places where it could be referenced, what is the scope of the multiple-reference region (per-method-call, per-connection?), and finally what should be done when the limit is violated. Currently Unslicers see an Error object which they can respond to any way they please: the default containers abandon the rest of their contents and hand an Error to their parent, the MethodCallUnslicer returns an exception to the caller, etc. With shared references, the first recipient sees a valid object, while the second and later recipients see an error.

* figure out Deferred errors for immutable containers

Somewhat related to the previous one. The now-classic example of an immutable container which cannot be created right away is the object created by this sequence:

 t = ([],)
 t[0].append((t,))

This serializes into (with implicit reference numbers on the left):

 [0] OPEN(tuple)
 [1]  OPEN(list)
 [2]   OPEN(tuple)
 [3]    OPEN(reference #0)
      CLOSE
     CLOSE
    CLOSE

In newbanana, the second TupleUnslicer cannot return a fully-formed tuple to its parent (the ListUnslicer), because that tuple cannot be created until the contents are all referenceable, and that cannot happen until the first TupleUnslicer has completed. So the second TupleUnslicer returns a Deferred instead of a tuple, and the ListUnslicer adds a callback which updates the list's item when the tuple is complete.
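A rough sketch of that arrangement (simplified, with hypothetical method names; the real Unslicer API differs):

 from twisted.internet import defer

 class SketchListUnslicer:
     """Simplified: collects deserialized children into self.list."""
     def __init__(self):
         self.list = []

     def receiveChild(self, obj):
         if isinstance(obj, defer.Deferred):
             # The child (e.g. a tuple involved in a cycle) is not ready yet:
             # reserve its slot now and patch it in when the Deferred fires.
             index = len(self.list)
             self.list.append(obj)            # placeholder for now
             obj.addCallback(self.update, index)
             # ...but what should the errback do, this late in the game?
         else:
             self.list.append(obj)

     def update(self, child, index):
         self.list[index] = child
         return child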
The problem here is that of error handling. In general, if an exception is raised (perhaps a protocol error, perhaps a schema violation) while an Unslicer is active, that Unslicer is abandoned (all its remaining tokens are discarded) and the parent gets an Error object. (The parent may give up too: the basic Unslicers all behave this way, so any exception will cause everything up to the RootUnslicer to go boom, and the RootUnslicer has the option of dropping the connection altogether.) When the error is noticed, the Unslicer stack is queried to figure out what path was taken from the root of the object graph to the site that had an error. This is really useful when trying to figure out which exact object caused a SchemaViolation: rather than being told a call trace or a description of the *object* which had a problem, you get a description of the path to that object (the same series of dereferences you'd use to print the object: obj.children[12].peer.foo.bar).

When references are allowed, these exceptions could occur after the original object has been received, when that Deferred fires. There are two problems: one is that the error path is now misleading, the other is that it might not have been possible to enforce a schema because the object was incomplete.

The most important thing is to make sure that an exception that occurs while the Deferred is being fired is caught properly and flunks the object just as if the problem were caught synchronously. This may involve discarding an otherwise complete object graph and blaming the problem on a node much closer to the root than the one which really caused the failure.

* adaptive VOCAB compression

We want to let banana figure out a good set of strings to compress on its own. In Banana.sendToken, keep a list of the last N strings that had to be sent in full (i.e. they weren't in the table). If the string being sent appears more than M times in that table, then before we send the token we emit an ADDVOCAB sequence, add a vocab entry for it, and send a numeric VOCAB token instead of the string.

Make sure the vocab mapping is not used until the ADDVOCAB sequence has been queued. Sending it inline should take care of this, but if for some reason we need to push it on the top-level object queue, we need to make sure the vocab table is not updated until it gets serialized. Queuing a VocabUpdate object, which updates the table when it gets serialized, would take care of this. The advantage of doing it inline is that later strings in the same object graph would benefit from the mapping. The disadvantage is that the receiving Unslicers must be prepared to deal with ADDVOCAB sequences at any time (so really they have to be stripped out). This disadvantage goes away if ADDVOCAB is a token instead of a sequence.

Reasonable starting values for N and M might be 30 and 3.

* write oldbanana compatibility code?

An oldbanana peer can be detected because the server side sends its dialect list from connectionMade, and oldbanana lists are sent with OLDLIST tokens (the explicit-length kind).

* add .describe methods to all Slicers

This involves setting an attribute between each yield call, to indicate what part is about to be serialized.
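A minimal sketch of what that might look like for a hypothetical dict-like Slicer (the class, attribute, and method names here are assumptions, not the real API):

 class SketchDictSlicer:
     """Yields key/value pairs; tracks position so errors can be located."""
     def __init__(self, obj):
         self.obj = obj
         self._where = "<root>"

     def slice(self):
         for key, value in self.obj.items():
             self._where = "[%r]" % (key,)   # about to serialize this entry
             yield key
             yield value

     def describe(self):
         # consulted when building the error path (e.g. obj.children[12].peer)
         return self._where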
* serialize remotely-callable methods?

It might be useful to be able to do something like:

 class Watcher(pb.Referenceable):
     def remote_foo(self, args): blah

 w = Watcher()
 ref.callRemote("subscribe", w.remote_foo)

That would involve looking up the method and its parent object, reversing the remote_*->* transformation, then sending a sequence which contained both the object's RemoteReference and the appropriate method name.

It might also be useful to generalize this: passing a lambda expression to the remote end could stash the callable in a local table and send a Callable Reference to the other side. I can smell a good general-purpose object classification framework here, but I haven't quite been able to nail it down exactly.

* testing

** finish testing of LONGINT/LONGNEG

test_banana.InboundByteStream.testConstrainedInt needs implementation

** thoroughly test failure-handling at all points of in/out serialization

places where BananaError or Violation might be raised

sending side:
 Slicer creation (schema pre-validation? no): no pre-validation is done before sending the object (Broker.callFinished, RemoteReference.doCall); slicer creation is done in newSlicerFor
 .slice (called in pushSlicer) ?
 .slice.next raising Violation
 .slice.next returning Deferrable when streaming isn't allowed
 .sendToken (non-primitive token, can't happen)
 .newSlicerFor (no ISlicer adapter)
 top.childAborted

receiving side:
 long header (>64 bytes)
 checkToken (top.openerCheckToken)
 checkToken (top.checkToken)
 typebyte == LIST (oldbanana)
 bad VOCAB key
 too-long vocab key
 bad FLOAT encoding
 top.receiveClose
 top.finish
 top.reportViolation
 oldtop.finish (in from handleViolation)
 top.doOpen
 top.start
 plus all of these when discardCount != 0
 OPENOPEN

send-side uses:
 f = top.reportViolation(f)
receive-side should use it too (instead of f.raiseException)

** test failure-handling during callRemote argument serialization

** implement/test some streaming Slicers

** test producer Banana

* profiling/optimization

Several areas where I suspect performance issues but am unwilling to fix them before having proof that there is a problem:

** Banana.produce

This is the main loop which creates outbound tokens. It is called once at connectionMade() (after version negotiation) and thereafter is fired as the result of a Deferred whose callback is triggered by a new item being pushed on the output queue. It runs until the output queue is empty, or the production process is paused (by a consumer who is full), or streaming is enabled and one of the Slicers wants to pause.

Each pass through the loop pushes a single token into the transport, resulting in a number of short writes. We can do better than this by telling the transport to buffer the individual writes and calling a flush() method when we leave the loop. I think Itamar's new cprotocol work provides this sort of hook, but it would be nice if there were a generalized Transport interface so that Protocols could promise their transports that they will use flush() when they've stopped writing for a little while.

Also, I want to be able to move produce() into C code. This means defining a CSlicer in addition to the cprotocol stuff before. The goal is to be able to slice a large tree of basic objects (lists, tuples, dicts, strings) without surfacing into Python code at all, only coming "up for air" when we hit an object type that we don't recognize as having a CSlicer available.
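Returning to the write-buffering idea above, a rough sketch of the shape it could take (the names outgoingQueue and tokenize are placeholders, not the real Banana attributes):

 class BufferingProduceSketch:
     """Sketch only: batch each pass's tokens into a single transport write."""
     def __init__(self, transport):
         self.transport = transport
         self.outgoingQueue = []
         self.paused = False

     def produce(self):
         pieces = []
         while self.outgoingQueue and not self.paused:
             obj = self.outgoingQueue.pop(0)
             pieces.append(self.tokenize(obj))   # one small chunk per token
         if pieces:
             # one write per pass instead of one write per token; a
             # transport-level flush() hook would serve the same purpose
             self.transport.write(b"".join(pieces))

     def tokenize(self, obj):
         return repr(obj).encode("ascii")        # placeholder tokenization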
** Banana.handleData

The receive-tokenization process wants to be moved into C code. It's definitely on the critical path, but it's ugly because it has to keep calling into python code to handle each extracted token. Maybe there is a way to have fast C code peek through the incoming buffers for token boundaries, then give a list of offsets and lengths to the python code. The b128 conversion should also happen in C.

The data shouldn't be pulled out of the input buffer until we've decided to accept it (i.e. the memory-consumption guarantees that the schemas provide do not take any transport-level buffering into account, and doing cprotocol tokenization would represent memory that an attacker can make us spend without triggering a schema violation).

Itamar's CLineReceiver is a good example: you tokenize a big buffer as much as you can, pass the tokens upstairs to Python code, then hand the leftover tail to the next read() call. The tokenizer always works on the concatenation of two buffers: the tail of the previous read() and the complete contents of the current one.

** Unslicer.doOpen delegation

Unslicers form a stack, and each Unslicer gets to exert control over the way that its descendants are deserialized. Most don't bother: they just delegate the control methods up to the RootUnslicer. For example, doOpen() takes an opentype and may return a new Unslicer to handle the new OPEN sequence. Most of the time, each Unslicer delegates doOpen() to its parent, all the way up the stack to the RootUnslicer who actually performs the UnslicerRegistry lookup.

This provides an optimization point. In general, the Unslicer knows ahead of time whether it cares to be involved in these methods or not (i.e. whether it wants to pay attention to its children/descendants or not). So instead of delegating all the time, we could just have a separate Opener stack. Unslicers that care would be pushed on the Opener stack at the same time they are pushed on the regular unslicer stack, and likewise removed. The doOpen() method would only be invoked on the top-most Opener, removing a lot of method calls. (I think the math is something like turning avg(treedepth)*avg(nodes) into avg(nodes).) A sketch of this idea appears at the end of this file.

There are some other methods that are delegated in this way. open() is related to doOpen(). setObject()/getObject() keep track of references to shared objects and are typically only intercepted by a second-level object which defines a "serialization scope" (like a single remote method call), as well as connection-wide references (like pb.Referenceables) tracked by the PBRootUnslicer. These would also be targets for optimization.

The fundamental reason for this optimization is that most Unslicers don't care about these methods. There are far more uses of doOpen() (one per object node) than there are changes to the desired behavior of doOpen().

** CUnslicer

Like CSlicer, the unslicing process wants to be able to be implemented (for built-in objects) entirely in C. This means a CUnslicer "object" (a struct full of function pointers), a table accessible from C that maps opentypes to both CUnslicers and regular python-based Unslicers, and CProtocol tokenization code fed by a CTransport. It should be possible for the python->C transition to occur in the reactor when it calls ctransport.doRead, and then not come back up to Python until Banana.receivedObject(), at least for built-in types like dicts and strings.
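To make the Opener-stack idea from the doOpen delegation item a bit more concrete, here is a minimal sketch (hypothetical names; the real Unslicer/RootUnslicer interfaces differ):

 class ReceiveBananaSketch:
     """Sketch of a receive side that keeps a separate Opener stack."""
     def __init__(self, rootUnslicer):
         self.receiveStack = [rootUnslicer]
         self.openerStack = [rootUnslicer]   # only Unslicers that care live here

     def pushUnslicer(self, unslicer):
         self.receiveStack.append(unslicer)
         if getattr(unslicer, "wantsToOpen", False):
             self.openerStack.append(unslicer)

     def popUnslicer(self):
         unslicer = self.receiveStack.pop()
         if self.openerStack and self.openerStack[-1] is unslicer:
             self.openerStack.pop()

     def handleOpen(self, opentype):
         # one call to the top-most Opener, instead of delegating up
         # through every Unslicer on the receiveStack
         return self.openerStack[-1].doOpen(opentype)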