Opened 16 years ago

Closed 16 years ago

#85 closed defect (fixed)

incident gatherer needs clustering/rate-limiting/batching?

Reported by: Brian Warner
Owned by:
Priority: major
Milestone: 0.3.1
Component: logging
Version: 0.3.0
Keywords:
Cc:

Description

"When it rains, it pours": failures tend to occur in groups. There are a couple of places in Tahoe where, when a connection is lost, several log.WEIRD-level error messages are fired at the same time, resulting in a huge swarm (100 to 150) of Incidents. This is likely to swamp the incident-gatherer, or rather the subsequent flurry of get-incident messages will cause a lot of traffic: I've seen the sending node go to 100% CPU, and I'm worried about the outbound foolscap queue growing without bound. Each incident tends to be about 100kB uncompressed; they're more like 30kB in .bz2 form, but that doesn't help on the wire.

One improvement would be to make the incident-gatherer only request one at a time.
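
A rough sketch of the one-at-a-time idea, in plain Python rather than real Foolscap/Twisted code. The class name `SerialFetcher` and the `send_request` callable are hypothetical, made up for illustration; they are not part of the Foolscap API. The gatherer queues every announced incident but keeps at most one get-incident request outstanding, starting the next only when the previous one finishes.

```python
from collections import deque


class SerialFetcher:
    """Hypothetical gatherer-side queue: at most one fetch outstanding."""

    def __init__(self, send_request):
        # send_request(name) is an assumed callable that starts a single
        # get-incident request; the caller reports completion via done().
        self._send_request = send_request
        self._pending = deque()
        self._active = None

    def announce(self, name):
        """Record a newly announced incident; fetch it now if idle."""
        self._pending.append(name)
        self._maybe_start()

    def done(self):
        """Call when the outstanding fetch has finished (or failed)."""
        self._active = None
        self._maybe_start()

    def _maybe_start(self):
        # Only start a new request when nothing is in flight.
        if self._active is None and self._pending:
            self._active = self._pending.popleft()
            self._send_request(self._active)
```

With this shape, a swarm of 150 announcements costs the sending node one transfer at a time instead of 150 concurrent ones; the rest just sit in the gatherer's queue as short names.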

Another would be to have the incident-qualifier not declare a new incident if there's already an active reporter. Maybe the reporter could instead be told that there's an additional trigger, and simply extend the trailing-event period. This would mean an Incident could have multiple triggers, and would probably give us the cleanest behavior on the gatherer end.
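
A minimal sketch of that multiple-trigger idea, again in plain Python. `IncidentQualifier`, `IncidentReporter`, the numeric value of `WEIRD`, and the explicit `now` timestamps are all assumptions for illustration, not the actual Foolscap classes or their signatures. A trigger that arrives while a reporter is active is appended to that reporter and pushes its trailing deadline out, instead of starting a second incident.

```python
WEIRD = 30  # assumed numeric severity level, for illustration only


class IncidentReporter:
    """Hypothetical reporter that can absorb additional triggers."""

    def __init__(self, trigger, now, trailing_delay):
        self.triggers = [trigger]
        self.trailing_delay = trailing_delay
        self.deadline = now + trailing_delay

    def add_trigger(self, trigger, now):
        # Record the extra trigger and extend the trailing-event period.
        self.triggers.append(trigger)
        self.deadline = now + self.trailing_delay


class IncidentQualifier:
    """Hypothetical qualifier: one active reporter at a time."""

    def __init__(self, threshold=WEIRD, trailing_delay=5.0):
        self.threshold = threshold
        self.trailing_delay = trailing_delay
        self.active = None
        self.finished = []  # one list of triggers per completed incident

    def event(self, level, message, now):
        self.poll(now)  # close out an expired reporter first
        if level < self.threshold:
            return
        if self.active is None:
            self.active = IncidentReporter(message, now, self.trailing_delay)
        else:
            self.active.add_trigger(message, now)

    def poll(self, now):
        # Finish the incident once the trailing period has elapsed.
        if self.active is not None and now >= self.active.deadline:
            self.finished.append(self.active.triggers)
            self.active = None
```

Feeding 100 WEIRD-level events within a second into this qualifier yields one incident with 100 triggers rather than 100 incidents. One caveat of the sketch: a steady stream of triggers keeps extending the deadline forever, so a real implementation would probably want a cap on total incident duration.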

Change History (2)

comment:1 Changed 16 years ago by Brian Warner

Ok, [4a1be0f81c8014c5f5936ea41d0b364bcefd0164] adds incident-batching, combining multiple overlapping incidents into a single one. That should help a bit.

comment:2 Changed 16 years ago by Brian Warner

Resolution: fixed
Status: new → closed

.. and [f598b555378595fd3011492c8cdd39e4a01a8543] adds the only-fetch-one-incident-at-a-time change. With those two, I think this problem is resolved.
