Opened 8 years ago

Last modified 8 years ago

#257 new enhancement

use CBOR instead of Banana for serialization?

Reported by: Brian Warner Owned by:
Priority: major Milestone: undecided
Component: banana Version: 0.9.1
Keywords: Cc:

Description

I've been reading up on CBOR, which is like JSON except dense (binary) and with proper support for bytestrings. There are some accelerated C codecs for it, as well as pure-Python ones. It can handle cyclic object graphs. And it has a "tag" mechanism to mark objects for special processing, which we could use for our my-reference/your-reference/call sequences.

http://cbor.io/

I'm thinking it might be faster than our Banana implementation (which has a lot of unused extensibility, and schema-enforcement hooks that we've already given up on), and might make it a bit easier for someone to write a compatible non-Python implementation.

Of course, to deploy it gradually, we'd need the Negotiation phase to decide whether a given connection will use Banana or CBOR.

Change History (1)

comment:1 Changed 8 years ago by Brian Warner

I skimmed through the spec (RFC 7049), and it looks like CBOR is *very* similar to Banana: type marker, numeric header, optional body. The main differences:

Feature                      | Banana                                   | CBOR
-----------------------------+------------------------------------------+--------------------------------------------------
order                        | header-TYPE-body                         | TYPE-header-body
header                       | <=64 bytes with high-bit clear (base128) | 0/1/2/4/8 bytes following type byte
type                         | one byte with high-bit set               | first byte, 3-bit major, 5-bit minor
definite-length lists/maps   | none                                     | header indicates length
indefinite-length lists/maps | (OPEN "list" items.. CLOSE) sequences    | (INDEFINITE marker (minor=31) items.. STOP-token)
max "small" int              | 2**31-1                                  | 2**64-1
bignum                       | really big                               | really big
floats                       | 64-bit doubles only                      | 16/32/64-bit
special values               | True/False/None (as sequences)           | True/False/null/undef (as tokens)
higher-level objects         | (OPEN type items.. CLOSE) sequences      | (TAG item) pairs

Both handle bytes and UTF-8-encoded unicode objects (as separate types).
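To make the "order" and "header" rows concrete, here is a rough sketch of both header layouts in pure Python. The function names are made up for illustration. The Banana side assumes the base-128 digits go least-significant first and that 0x81 is the INT type byte, which is my reading of the existing codec rather than something stated in this ticket. The CBOR side follows the initial-byte layout from RFC 7049.

```python
import struct


def banana_header(value, type_byte):
    """Banana order: base-128 digits (high bit clear), then the type byte.

    Assumption: digits are emitted least-significant first.
    """
    if value < 0:
        raise ValueError("header value must be non-negative")
    if not type_byte & 0x80:
        raise ValueError("type byte must have its high bit set")
    digits = bytearray()
    while True:
        digits.append(value & 0x7F)
        value >>= 7
        if not value:
            break
    return bytes(digits) + bytes([type_byte])


def cbor_head(major, value):
    """CBOR order: one byte (3-bit major, 5-bit minor), then 0/1/2/4/8 bytes."""
    if not 0 <= major <= 7:
        raise ValueError("major type must be 0..7")
    if value < 0:
        raise ValueError("argument must be non-negative")
    if value < 24:
        # Small arguments live in the 5-bit minor field itself.
        return bytes([(major << 5) | value])
    for minor, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                              (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if value < limit:
            return bytes([(major << 5) | minor]) + struct.pack(fmt, value)
    raise OverflowError("argument does not fit in 8 bytes")


ASSUMED_BANANA_INT = 0x81  # assumption, see the note above

banana_500 = banana_header(500, ASSUMED_BANANA_INT)
cbor_500 = cbor_head(0, 500)          # major type 0 = unsigned integer
cbor_abc = cbor_head(2, 3) + b"abc"   # major type 2 = byte string
```

In both formats the integer 500 takes three bytes; the difference is only where the type marker sits and how the number is packed.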

CBOR has optional "tags" that can be put in front of an item to signal the decoder to handle it specially. Decoders can either deliver a "tagged item" marker (and let the application decide how to interpret the body) or can do additional decoding itself.

There's an IANA registry for tag values: the assigned ones include datetime, epoch time, bignums, decimal fractions, "bigfloat", a regular expression, and various markers to enable JSON/CBOR roundtrips where bytestrings are stored as base64, etc. in the JSON form. For Foolscap, I think we'd turn off automatic tag handling (the regexp option scares me), and use some Foolscap-specific tag values to encode our "sequences" (for "call" / "my-reference" / "your-reference" / "their-reference", but not "list"/"dict" which can be handled natively).
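As a sketch of what "(TAG item) pairs" could look like on the wire, here is a tiny pure-Python encoder that wraps an item in a tag (major type 6). Everything Foolscap-shaped in it is hypothetical: the tag numbers are invented for the example and are not registered or proposed anywhere, and the contents of the "call" payload are placeholders rather than the real sequence layout.

```python
import struct
from collections import namedtuple

# Hypothetical tag numbers, chosen only for this example.
TAG_CALL = 0x10000
TAG_MY_REFERENCE = 0x10001

Tagged = namedtuple("Tagged", ["tag", "value"])


def _head(major, value):
    """Encode the CBOR initial byte plus its argument."""
    if value < 24:
        return bytes([major << 5 | value])
    for minor, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < 1 << (8 * struct.calcsize(fmt)):
            return bytes([major << 5 | minor]) + struct.pack(fmt, value)
    raise OverflowError("argument does not fit in 8 bytes")


def encode(obj):
    """Encode a small subset of Python values as CBOR."""
    if isinstance(obj, Tagged):
        # Major type 6: the tag number, followed by exactly one item.
        return _head(6, obj.tag) + encode(obj.value)
    if obj is False:
        return b"\xf4"
    if obj is True:
        return b"\xf5"
    if obj is None:
        return b"\xf6"
    if isinstance(obj, int):
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, bytes):
        return _head(2, len(obj)) + obj
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(obj, list):
        # Lists need no tag: definite-length arrays are native.
        return _head(4, len(obj)) + b"".join(encode(x) for x in obj)
    raise TypeError("unsupported type: %r" % type(obj))


my_ref = encode(Tagged(TAG_MY_REFERENCE, 5))
call = encode(Tagged(TAG_CALL, [1, "ping"]))  # placeholder payload
```

A decoder with automatic tag handling turned off would hand back the tag number and the inner item, leaving the interpretation to us.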

The format is designed to enable streaming (via "indefinite-length" strings, arrays, and maps), and the codecs appear to support encoding from a generator into a file, and decoding from a file into a generator of items. I don't yet know how decoding would work from a Twisted Protocol, where you typically accumulate inbound data chunks until you have a complete message: we'd need the CBOR parser to tell us how many complete messages are in our buffer, and how many bytes are left over for next time.
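One way to answer "how many complete messages are in our buffer" without a full decoder is to walk the item headers and only track lengths. The following is a minimal sketch of that idea, not an existing codec API: `Incomplete`, `skip_item`, and `split_complete` are names invented here. It does not validate chunk types inside indefinite-length strings or key/value pairing inside indefinite-length maps.

```python
class Incomplete(Exception):
    """The buffer ends before the current item does."""


def _read_head(buf, pos):
    """Return (major, argument, next_pos); argument is None for minor=31."""
    if pos >= len(buf):
        raise Incomplete()
    major, minor = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if minor < 24:
        return major, minor, pos
    if minor <= 27:
        size = 1 << (minor - 24)  # 1, 2, 4 or 8 bytes follow
        if pos + size > len(buf):
            raise Incomplete()
        return major, int.from_bytes(buf[pos:pos + size], "big"), pos + size
    if minor == 31:
        return major, None, pos
    raise ValueError("reserved minor value %d" % minor)


def skip_item(buf, pos=0):
    """Return the offset just past the item that starts at pos."""
    major, arg, pos = _read_head(buf, pos)
    if major == 7:
        if arg is None:
            raise ValueError("unexpected STOP token")
        return pos  # simple values and floats are fully covered by the head
    if arg is None:
        if major in (0, 1, 6):
            raise ValueError("indefinite length not allowed here")
        # Indefinite-length string, array or map: items until STOP (0xFF).
        while True:
            if pos >= len(buf):
                raise Incomplete()
            if buf[pos] == 0xFF:
                return pos + 1
            pos = skip_item(buf, pos)
    if major in (0, 1):
        return pos
    if major in (2, 3):
        if pos + arg > len(buf):
            raise Incomplete()
        return pos + arg
    if major == 6:
        return skip_item(buf, pos)  # a tag wraps exactly one item
    count = arg if major == 4 else 2 * arg  # array items, or map keys+values
    for _ in range(count):
        pos = skip_item(buf, pos)
    return pos


def split_complete(buf):
    """Return (list of complete encoded items, leftover bytes)."""
    items, pos = [], 0
    while pos < len(buf):
        try:
            end = skip_item(buf, pos)
        except Incomplete:
            break
        items.append(bytes(buf[pos:end]))
        pos = end
    return items, bytes(buf[pos:])


items, leftover = split_complete(b"\x83\x01\x02\x03" + b"\x63abc" + b"\x82\x01")
```

Each complete item could then be passed to whatever third-party decoder we pick. The sketch rescans a partial item from its start on every call and recurses once per nesting level, so a real version would probably want to keep state between chunks and cap the depth.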

I'd really prefer to use someone else's parser, but that might either require emitting an extra length prefix (redundant) or doing trial-decoding after each received chunk (wasteful).
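For comparison, the length-prefix option is only a few lines. This is a sketch under assumptions: a 4-byte big-endian prefix, an arbitrary size cap, and a made-up class name. Twisted's `twisted.protocols.basic.Int32StringReceiver` provides this style of framing for a real Protocol; the version below is plain Python so the buffering logic is visible.

```python
import struct

PREFIX = struct.Struct(">I")  # assumed framing: 4-byte big-endian length


def frame(payload):
    """Prepend the length prefix to one already-encoded CBOR message."""
    return PREFIX.pack(len(payload)) + payload


class LengthPrefixedFramer:
    """Accumulate chunks and release only complete messages."""

    def __init__(self, max_length=1 << 20):
        self._buf = bytearray()
        self._max_length = max_length  # arbitrary cap chosen for the sketch

    def data_received(self, chunk):
        self._buf += chunk
        messages = []
        while len(self._buf) >= PREFIX.size:
            (length,) = PREFIX.unpack_from(self._buf, 0)
            if length > self._max_length:
                raise ValueError("message too large: %d bytes" % length)
            end = PREFIX.size + length
            if len(self._buf) < end:
                break  # wait for more data
            messages.append(bytes(self._buf[PREFIX.size:end]))
            del self._buf[:end]
        return messages


framer = LengthPrefixedFramer()
wire = frame(b"\x01") + frame(b"\x63abc")
first = framer.data_received(wire[:3])    # not even a full prefix yet
second = framer.data_received(wire[3:9])  # completes the first message
third = framer.data_received(wire[9:])    # completes the second message
```

The cost is the redundancy mentioned above: four extra bytes per message, carrying a length that the CBOR headers already imply.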
