Opened 8 years ago
Closed 8 years ago
#258 closed defect (fixed)
allocate_tcp_port sometimes returns in-use port
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.12.3 |
Component: | network | Version: | 0.12.2 |
Keywords: | Cc: |
Description
I've noticed intermittent unit test failures (in foolscap and in tahoe, which uses the same code) in which the port number returned by allocate_tcp_port() causes an EADDRINUSE when actually used to listen.
I did some quick experiments, but I still don't entirely understand what's going on. I know one issue is that allocate_tcp_port() binds the port to 127.0.0.1, and it seems like some operating systems will give us a port that someone else is already listening on (but bound to 0.0.0.0). It feels a bit like a kernel bug: it's comparing the interface identifiers and concluding that they don't overlap, when in fact 0.0.0.0 overlaps with everything.
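For reference, the single-phase approach described above looks roughly like the following. This is a minimal sketch, not the actual foolscap code: the function name naive_allocate_tcp_port is hypothetical, and socket options such as SO_REUSEADDR are left out.

```python
import socket


def naive_allocate_tcp_port():
    """Ask the kernel for a free port by binding to 127.0.0.1, port 0.

    Hypothetical illustration of the single-phase approach. The kernel
    only promises the port was free for 127.0.0.1 at bind time; per the
    report above, another process may already hold it on 0.0.0.0.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", 0))      # port 0 means "kernel, pick one"
        return s.getsockname()[1]     # the port number the kernel chose
    finally:
        s.close()                     # release it so the caller can use it
```

The caller gets back a bare port number with nothing holding it open, so a later bind() or listen() on that number is where the EADDRINUSE shows up.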
But I'm pretty sure that's not the only failure mode. I think I tried changing our function to bind to 0.0.0.0, and there were other ports that got returned incorrectly. Also I think it behaves differently on linux and OS-X.
It's a rare failure, as there are only a handful of ports in use by other processes, and the kernel allocation uses a range of at least 10000 port numbers. But it's a drag to re-run the tests each time it happens.
Change History (5)
comment:1 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:2 Changed 8 years ago by
Incidentally Tahoe#2795 is about copying this fix into the Tahoe tree.
comment:3 Changed 8 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Version: | 0.9.1 → 0.12.2 |
I'm seeing this happen again. I think it's because some other process is listening on a socket that is bound to 127.0.0.1, and the kernel is willing to give us that same number. (in this case, it's a Tahoe node's controlport/logport causing the conflict).
I think the fix might be to make it a three-phase test:
- create socket, bind to 0.0.0.0 port 0, get port number, close
- create socket, bind to 0.0.0.0 port NUMBER, attempt to listen, close
- create socket, bind to 127.0.0.1 port NUMBER, attempt to listen, close
comment:4 Changed 8 years ago by
Milestone: | 0.12.0 → 0.12.3 |
---|
comment:5 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
re-closed in [a1cde254e3c8f22334bb6de595b66462ae0ceb96], which does the three-phase test described above. Some manual testing suggests that this should work.
Fixed, in [6a1de25d].
There turned out to be multiple issues, some of them platform-specific.
listen() on the socket, both to follow the same philosophy, and to avoid causing security alerts for monitoring tools like LittleSnitch or SELinux.
If we only had to worry about Linux, we could just bind to 0.0.0.0. To handle OS-X too, we need a two-phase test:
One final note: on OS-X this process seems to give us sequential ports in the range 49152-65535. On Linux we get random ports from 32768-49152 (ish), because when SO_REUSEADDR is in use, the kernel tries to assign a port from the lower half of the /proc/sys/net/ipv4/ip_local_port_range (which defaults to 32768-60999), and only uses the upper half if the lower half is full. Sockets that do not set SO_REUSEADDR use the whole range.
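As a rough illustration of the range arithmetic, the sketch below parses the two numbers stored in /proc/sys/net/ipv4/ip_local_port_range and splits them at the simple arithmetic midpoint. The function name split_local_port_range is hypothetical, and the midpoint is an assumption made for illustration: the kernel's exact split point may differ.

```python
def split_local_port_range(text):
    """Split an ip_local_port_range string into (lower, upper) halves.

    `text` is the file's content: two integers separated by whitespace.
    The midpoint used here is plain integer arithmetic, an assumption
    for illustration and not necessarily the kernel's own split.
    """
    low, high = (int(field) for field in text.split())
    midpoint = (low + high) // 2
    return (low, midpoint), (midpoint + 1, high)


# On Linux, the current setting could be read like this:
# with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
#     lower_half, upper_half = split_local_port_range(f.read())
```

The read is left commented out because the file only exists on Linux.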