Lost events bug in kqueue?
Hi all,
I have been trying to track down some odd behavior in a multithreaded mac os x program in which the program gets stuck waiting for events on files forever. I wrote a small C program that exhibits the same problem and I think that this may be causing the problem in my original program, too.
The test program is a tiny "web server" and you can find it here: https://github.com/AndreasVoellmy/epollbug/blob/master/kqueueserver.c
The test program simply creates NUM_WORKER+1 threads where 1 thread runs a loop to accept new connections and the other NUM_WORKER "worker" threads actually serve the connections. Note that the accept thread distributes the accepted TCP connections among the different workers and the workers serve disjoint sets of connections. Each worker thread runs a loop that uses kevent64 to wait for any of its sockets to be readable and uses the EV_ONESHOT flag when adding sockets to the watched files.
I test the program using the http server benchmarking program weighttp (http://redmine.lighttpd.net/projects/weighttp/wiki). I run it like this:
./weighttp -n 100 -c 10 -t 4 -k "10.12.0.8:8080"
where the IP address and port are where my test program are runnning. -n is the number of requests to make, -c is the number of concurrent requests to make and -t is the number of threads that weighttp will use to generate the http requests. weighttp makes the requests but gets stuck with some responses never coming back. For example, on a run with the above parameters, I get:
weighttp - a lightweight and simple webserver benchmarking tool
starting benchmark...
spawning thread #1: 3 concurrent requests, 25 total requests
spawning thread #2: 3 concurrent requests, 25 total requests
spawning thread #3: 2 concurrent requests, 25 total requests
spawning thread #4: 2 concurrent requests, 25 total requests
progress: 8% done
progress: 16% done
progress: 24% done
progress: 32% done
progress: 40% done
progress: 48% done
progress: 56% done
progress: 64% done
progress: 72% done
progress: 80% done
progress: 88% done
progress: 96% done
finished in 728 sec, 373 millisec and 310 microsec, 0 req/s, 0 kbyte/s
requests: 100 total, 100 started, 100 done, 95 succeeded, 5 failed, 0 errored
status codes: 95 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 30685 bytes total, 16340 bytes http, 14345 bytes data
In other words, 5 requests failed to get responses. On the test program (i.e. server) side, after 10 seconds, the test program prints out (using ioctl) which sockets have any data ready. Sure enough, I see that 5 sockets have a request waiting (weighttp requests are 90 bytes long):
socket 6 has 90 bytes of data ready
socket 8 has 90 bytes of data ready
socket 10 has 90 bytes of data ready
socket 12 has 90 bytes of data ready
socket 14 has 90 bytes of data ready
I would really appreciate getting some advice and help on this problem. I may be misunderstanding the semantics of kqueue operations and flags, or this may be a kernel bug. By the way, this program is based on a very similar program I wrote, which uncovered a bug in Linux, which has since been fixed (http://lkml.indiana.edu/hypermail/linux/kernel/1301.0/00065.html). To Linux's credit, the bug only appeared using lots of cores, whereas on os x it appears to occur using just two cores and two threads!
Regards,
Andi
P.S. in case it helps I also took a "sample" of the kqueueserver program at the point that it was waiting on the sockets while the sockets have data ready. Here is the result:
Sampling process 17982 for 3 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
Analysis of sampling kqueueserver (pid 17982) every 1 millisecond
Process: kqueueserver [17982]
Path: /Users/andre/repos/epollbug/kqueueserver
Load Address: 0x10b22b000
Identifier: kqueueserver
Version: ??? (???)
Code Type: X86-64 (Native)
Parent Process: bash [17853]
Date/Time: 2013-02-06 17:22:26.909 -0500
OS Version: Mac OS X 10.8.2 (12C3012)
Report Version: 7
Call graph:
2665 Thread_568458 DispatchQueue_1: com.apple.main-thread (serial)
+ 2665 start (in libdyld.dylib) + 1 [0x7fff9099c7e1]
+ 2665 main (in kqueueserver) + 85 [0x10b22cb15]
+ 2665 acceptLoop (in kqueueserver) + 334 [0x10b22c6ee]
+ 2665 __accept (in libsystem_kernel.dylib) + 10 [0x7fff8eff9996]
2665 Thread_568459
+ 2665 thread_start (in libsystem_c.dylib) + 13 [0x7fff85c63181]
+ 2665 _pthread_start (in libsystem_c.dylib) + 327 [0x7fff85c76742]
+ 2665 workerLoop (in kqueueserver) + 95 [0x10b22c8cf]
+ 2665 kevent64 (in libsystem_kernel.dylib) + 10 [0x7fff8effad2a]
2665 Thread_568460
2665 thread_start (in libsystem_c.dylib) + 13 [0x7fff85c63181]
2665 _pthread_start (in libsystem_c.dylib) + 327 [0x7fff85c76742]
2665 workerLoop (in kqueueserver) + 95 [0x10b22c8cf]
2665 kevent64 (in libsystem_kernel.dylib) + 10 [0x7fff8effad2a]
Total number in stack (recursive counted multiple, when >=5):
Sort by top of stack, same collapsed (when >= 5):
kevent64 (in libsystem_kernel.dylib) 5330
__accept (in libsystem_kernel.dylib) 2665
Binary Images:
0x10b22b000 - 0x10b22cfff +kqueueserver (??? - ???) <10550045-081D-3F86-A765-EE0D08E3F398> /Users/andreas/repos/epollbug/kqueueserver
0x7fff6ae2b000 - 0x7fff6ae5f93f dyld (210.2 - ???) <A40597AA-5529-3337-8C09-D8A014EB1578> /usr/lib/dyld
0x7fff857d2000 - 0x7fff85800ff7 libsystem_m.dylib (??? - ???) <B434BE5C-25AB-3EBD-BAA7-5304B34E3441> /usr/lib/system/libsystem_m.dylib
0x7fff859c0000 - 0x7fff859c0fff libkeymgr.dylib (25.0.0 - compatibility 1.0.0) <CC9E3394-BE16-397F-926B-E579B60EE429> /usr/lib/system/libkeymgr.dylib
0x7fff85ac4000 - 0x7fff85b13ff7 libcorecrypto.dylib (??? - ???) <CE0C29A3-C420-339B-ADAA-52F4683233CC> /usr/lib/system/libcorecrypto.dylib
0x7fff85c62000 - 0x7fff85d2efe7 libsystem_c.dylib (825.25.0 - compatibility 1.0.0) <8CBCF9B9-EBB7-365E-A3FF-2F3850763C6B> /usr/lib/system/libsystem_c.dylib
0x7fff8614d000 - 0x7fff86158fff libsystem_notify.dylib (98.5.0 - compatibility 1.0.0) <C49275CC-835A-3207-AFBA-8C01374927B6> /usr/lib/system/libsystem_notify.dylib
0x7fff86159000 - 0x7fff86161fff liblaunch.dylib (442.26.2 - compatibility 1.0.0) <2F71CAF8-6524-329E-AC56-C506658B4C0C> /usr/lib/system/liblaunch.dylib
0x7fff87246000 - 0x7fff8724dfff libcopyfile.dylib (89.0.0 - compatibility 1.0.0) <876573D0-E907-3566-A108-577EAD1B6182> /usr/lib/system/libcopyfile.dylib
0x7fff87437000 - 0x7fff8745cff7 libc++abi.dylib (24.4.0 - compatibility 1.0.0) <E7BD9363-1D25-3551-A68A-2E2FF6ABECD7> /usr/lib/libc++abi.dylib
0x7fff89df0000 - 0x7fff89df8ff7 libsystem_dnssd.dylib (??? - ???) <F0AE5B8E-C7CC-3393-927F-B65DB04016AA> /usr/lib/system/libsystem_dnssd.dylib
0x7fff89df9000 - 0x7fff89e61ff7 libc++.1.dylib (65.1.0 - compatibility 1.0.0) <20E31B90-19B9-3C2A-A9EB-474E08F9FE05> /usr/lib/libc++.1.dylib
0x7fff8a2a7000 - 0x7fff8a3bf92f libobjc.A.dylib (228.0.0 - compatibility 1.0.0) <90D31928-F48D-3E37-874F-220A51FD9E37> /usr/lib/libobjc.A.dylib
0x7fff8a8ff000 - 0x7fff8a901fff libquarantine.dylib (52.0.0 - compatibility 1.0.0) <4BE2E642-A14F-340A-B482-5BD2AEFD9C24> /usr/lib/system/libquarantine.dylib
0x7fff8af64000 - 0x7fff8af65ff7 libremovefile.dylib (23.1.0 - compatibility 1.0.0) <DBBFAF35-AC78-3856-92F6-6E4FD9DF14A2> /usr/lib/system/libremovefile.dylib
0x7fff8bc91000 - 0x7fff8bc9ffff libcommonCrypto.dylib (50000.0.0 - compatibility 1.0.0) <DA98AE56-BC3F-3AD0-9660-8CC79E4AECE6> /usr/lib/system/libcommonCrypto.dylib
0x7fff8c0db000 - 0x7fff8c111fff libsystem_info.dylib (??? - ???) <4FFCA242-7F04-365F-87A6-D4EFB89503C1> /usr/lib/system/libsystem_info.dylib
0x7fff8c61d000 - 0x7fff8c61eff7 libSystem.B.dylib (169.3.0 - compatibility 1.0.0) <365477AB-D641-389D-B8F4-A1FAE9657EEE> /usr/lib/libSystem.B.dylib
0x7fff8c6a9000 - 0x7fff8c6abff7 libunc.dylib (25.0.0 - compatibility 1.0.0) <92805328-CD36-34FF-9436-571AB0485072> /usr/lib/system/libunc.dylib
0x7fff8c6ac000 - 0x7fff8c6b2ff7 libunwind.dylib (35.1.0 - compatibility 1.0.0) <21703D36-2DAB-3D8B-8442-EAAB23C060D3> /usr/lib/system/libunwind.dylib
0x7fff8c9de000 - 0x7fff8ca00ff7 libxpc.dylib (140.41.0 - compatibility 1.0.0) <FAC04D8B-680E-325F-8F0C-DD69859D0E01> /usr/lib/system/libxpc.dylib
0x7fff8de1f000 - 0x7fff8de25fff libmacho.dylib (829.0.0 - compatibility 1.0.0) <BF332AD9-E89F-387E-92A4-6E1AB74BD4D9> /usr/lib/system/libmacho.dylib
0x7fff8efe8000 - 0x7fff8f003ff7 libsystem_kernel.dylib (2050.20.9 - compatibility 1.0.0) <EC0A9F5B-C9F5-336B-A7DD-49A718042F39> /usr/lib/system/libsystem_kernel.dylib
0x7fff8f02e000 - 0x7fff8f033fff libcache.dylib (57.0.0 - compatibility 1.0.0) <65187C6E-3FBF-3EB8-A1AA-389445E2984D> /usr/lib/system/libcache.dylib
0x7fff8f14e000 - 0x7fff8f15cff7 libsystem_network.dylib (??? - ???) <0D99F24E-56FE-380F-B81B-4A4C630EE587> /usr/lib/system/libsystem_network.dylib
0x7fff8f1aa000 - 0x7fff8f1abfff libsystem_blocks.dylib (59.0.0 - compatibility 1.0.0) <D92DCBC3-541C-37BD-AADE-ACC75A0C59C8> /usr/lib/system/libsystem_blocks.dylib
0x7fff8f71c000 - 0x7fff8f731ff7 libdispatch.dylib (228.23.0 - compatibility 1.0.0) <D26996BF-FC57-39EB-8829-F63585561E09> /usr/lib/system/libdispatch.dylib
0x7fff8f732000 - 0x7fff8f733ff7 libsystem_sandbox.dylib (??? - ???) <3C3B03CF-C525-3CB3-8557-62E91B93AC95> /usr/lib/system/libsystem_sandbox.dylib
0x7fff8f742000 - 0x7fff8f747fff libcompiler_rt.dylib (30.0.0 - compatibility 1.0.0) <08F8731D-5961-39F1-AD00-4590321D24A9> /usr/lib/system/libcompiler_rt.dylib
0x7fff8fd39000 - 0x7fff8fd85ff7 libauto.dylib (??? - ???) <73CDC482-16E3-3FC7-9BB4-FBA2DA44DBC2> /usr/lib/libauto.dylib
0x7fff9099a000 - 0x7fff9099dff7 libdyld.dylib (210.2.3 - compatibility 1.0.0) <F59367C9-C110-382B-A695-9035A6DD387E> /usr/lib/system/libdyld.dylib
0x7fff91130000 - 0x7fff91131ff7 libdnsinfo.dylib (453.18.0 - compatibility 1.0.0) <E7595861-ECF9-336E-9901-BED2620FAA80> /usr/lib/system/libdnsinfo.dylib
Sample analysis of process 17982 written to file /dev/stdout
MacBook Pro, OS X Mountain Lion (10.8.2)