Opened 16 years ago
Closed 16 years ago
#85 closed defect (fixed)
incident gatherer needs clustering/rate-limiting/batching?
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.3.1 |
Component: | logging | Version: | 0.3.0 |
Keywords: | | Cc: | |
Description
"When it rains, it pours": failures tend to occur in groups. There are a couple of places in Tahoe where, when a connection is lost, several log.WEIRD-level error messages are fired at the same time, resulting in a huge swarm (100 to 150) of Incidents. This is likely to swamp the incident-gatherer, or rather the subsequent flurry of get-incident messages will cause a lot of traffic: I've seen the sending node go to 100% CPU, and I'm worried that the outbound foolscap queue will grow without bound. Each incident tends to be about 100kB uncompressed; they're more like 30kB in .bz2 form, but that doesn't help on the wire.
One improvement would be to make the incident-gatherer only request one at a time.
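A minimal sketch of the one-at-a-time idea, in plain Python rather than Foolscap's real gatherer code. The class `OneAtATimeFetcher` and the `start_fetch(name, done)` callback convention are hypothetical names made up for illustration; the point is only that requests queue up and the next one is not started until the previous one reports completion.

```python
from collections import deque


class OneAtATimeFetcher:
    """Hypothetical helper: allow at most one outstanding incident fetch."""

    def __init__(self, start_fetch):
        # start_fetch(name, done) is assumed to begin an asynchronous fetch
        # and to call done() exactly once when it finishes (or fails).
        self._start_fetch = start_fetch
        self._pending = deque()
        self._busy = False

    def request(self, name):
        # Queue the incident; it is fetched only when nothing else is in flight.
        self._pending.append(name)
        self._maybe_start()

    def _maybe_start(self):
        if self._busy or not self._pending:
            return
        self._busy = True
        name = self._pending.popleft()
        self._start_fetch(name, self._done)

    def _done(self):
        # The previous fetch has finished, so the next queued one may start.
        self._busy = False
        self._maybe_start()
```

With this shape, a swarm of 150 new incidents costs the sending node one transfer at a time instead of 150 concurrent ones, at the price of a longer total time to drain the queue.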
Another would be to have the incident-qualifier not declare a new incident if there's already an active reporter. Maybe the reporter could be told that there's an additional trigger and simply extend the trailing-event period. This would mean an Incident could have multiple triggers, and would probably give us the cleanest behavior on the gatherer end.
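A rough sketch of that batching behavior, again in plain Python and not Foolscap's actual qualifier or reporter classes. `BatchingQualifier`, its dict-based incident record, and the 5-second trailing period are all assumptions for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
class BatchingQualifier:
    """Hypothetical qualifier: overlapping triggers share one incident."""

    def __init__(self, trailing_period=5.0):
        self.trailing_period = trailing_period  # assumed value, in seconds
        self.active = None    # the incident currently being recorded, if any
        self.finished = []    # incidents that have been closed out

    def trigger(self, event, now):
        # Close out an expired incident before handling the new trigger.
        self.tick(now)
        if self.active is None:
            # No active reporter: declare a new incident.
            self.active = {"triggers": [], "deadline": None}
        # Record the trigger and extend the trailing-event period.
        self.active["triggers"].append(event)
        self.active["deadline"] = now + self.trailing_period

    def tick(self, now):
        # Finish the active incident once its trailing period has elapsed.
        if self.active is not None and now >= self.active["deadline"]:
            self.finished.append(self.active)
            self.active = None
```

Under these assumptions, 100 triggers fired within a second of each other produce a single incident with 100 triggers, so the gatherer has one file to fetch instead of 100.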
Change History (2)
comment:1 Changed 16 years ago by
Ok, [4a1be0f81c8014c5f5936ea41d0b364bcefd0164] adds incident-batching, combining multiple overlapping incidents into a single one. That should help a bit.
comment:2 Changed 16 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
.. and [f598b555378595fd3011492c8cdd39e4a01a8543] adds the only-fetch-one-incident-at-a-time change. With those two, I think this problem is resolved.