<div dir="ltr"><div>Hi all,</div><div>I have run into an infinite loop in red_worker twice.</div><div>Both times it happened after I shut down a Windows guest in 'spicy',</div>
<div>watched "Windows is shutting down...", then closed the 'spicy' window.</div><div><br></div><div>Usually the qemu process quits after a few seconds,</div><div>but there is a small chance it loops forever.</div><div><br></div><div>Here are the details (long but fun, I think :)</div>
<div><br></div><div>===================== process status ==============================</div><div><br></div><div>top - 17:54:37 up 65 days, 8:02, 3 users, load average: 1.93, 1.69, 1.68</div><div>Tasks: 321 total, 2 running, 316 sleeping, 3 stopped, 0 zombie</div>
<div>Cpu(s): 3.6%us, 1.8%sy, 0.0%ni, 94.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st</div><div>Mem: 49401784k total, 37162056k used, 12239728k free, 600832k buffers</div><div>Swap: 39535612k total, 15148k used, 39520464k free, 28640352k cached</div>
<div><br></div><div> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND </div><div> 7409 root 20 0 1620m 1.1g 7660 S 100 2.4 53:03.95 qemu-system-x86 <== our qemu process (pid 7409) keeps consuming 1 cpu on host</div>
<div><br></div><div>$ ps -p 7409 -L -l</div><div>F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD</div><div>6 S 0 7409 1 7409 0 80 0 - 414790 unix_s pts/1 00:00:15 qemu-system-x86</div>
<div>3 S 0 7409 1 7431 0 80 0 - 414790 kvm_vc pts/1 00:00:22 qemu-system-x86</div><div>3 S 0 7409 1 7432 0 80 0 - 414790 kvm_vc pts/1 00:00:21 qemu-system-x86</div><div>3 R 0 7409 1 7433 97 80 0 - 414790 ? pts/1 02:07:27 qemu-system-x86 <== one thread goes mad :)</div>
<div><br></div><div>$ sudo netstat --tcp -np | grep 7409</div><div>tcp 1 0 172.168.12.23:10088 172.168.12.168:33085 CLOSE_WAIT 7409/qemu-system-x8 <== one of the spice channels failed to clean up </div>
<div><br></div><div>$ sudo netstat -nlp | grep 7409</div><div>tcp 0 0 0.0.0.0:10088 0.0.0.0:* LISTEN 7409/qemu-system-x8</div><div>unix 2 [ ACC ] STREAM LISTENING 6310958 7409/qemu-system-x8 /home/grizzlybears/q131/dig_into/qmon_win7qq</div>
<div><br></div><div>================ full qemu command line and version =========================</div><div><br></div><div>/home/grizzlybears/q131/qemu/x86_64-softmmu/qemu-system-x86_64 </div><div> -enable-kvm -L /home/grizzlybears/q131/qemu-kvm-build/usr/share/qemu -nodefconfig -nodefaults </div>
<div> -smp 2,sockets=2,cores=8,threads=1 -m 1024 </div><div> -drive if=virtio,aio=native,file=/home/grizzlybears/images/win7_qq.qcow2 </div><div> -spice port=10088,addr=0.0.0.0,disable-ticketing,image-compression=auto_glz,streaming-video=filter,disable-copy-paste </div>
<div> -vga qxl </div><div> -soundhw hda </div><div> -readconfig /home/grizzlybears/q131/qemu/docs/ich9-ehci-uhci.cfg </div><div> -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 </div>
<div> -device usb-tablet,id=input0,bus=ehci.0 </div><div> -netdev tap,id=nic1,script=/home/grizzlybears/q131/qemu-ifup,downscript=/home/grizzlybears/q131/qemu-ifdown -device virtio-net-pci,netdev=nic1,mac=52:54:00:fb:d5:88 </div>
<div> -balloon none -no-hpet -rtc base=localtime,driftfix=none -global kvm-pit.lost_tick_policy=discard </div><div> -chardev socket,id=charmonitor,path=/home/grizzlybears/q131/dig_into/qmon_win7qq,server,nowait -mon chardev=charmonitor,id=monitor,mode=readline</div>
<div><br></div><div>BTW, my source versions:</div><div>qemu 1.3.1 </div><div>spice 0.12.3</div><div><br></div><div>============== then profiling ======================</div><div><br></div><div>I used 'perf' to inspect qemu and found something interesting.</div>
<div><br></div><div> 1 # Events: 78K cycles</div><div> 2 #</div><div> 3 # Overhead Command Shared Object Symbol</div><div> 4 # ........ ............... .................... ..............................................</div>
<div> 5 #</div><div> 6 9.73% qemu-system-x86 [kernel.kallsyms] [k] fget_light</div><div> 7 |</div><div> 8 --- fget_light</div><div> 9 |</div><div> 10 |--97.60%-- sys_ioctl</div>
<div> 11 | system_call_fastpath</div><div> 12 | __ioctl</div><div> 13 | |</div><div> 14 | --100.00%-- spice_timer_queue_cb</div>
<div> 15 | red_worker_main</div><div> 16 | start_thread</div><div> 17 |</div><div> 18 --2.40%-- system_call_fastpath</div>
<div> 19 __ioctl</div><div> 20 |</div><div> 21 --100.00%-- spice_timer_queue_cb</div><div> 22 red_worker_main</div>
<div> 23 start_thread</div><div> 24 </div><div> 25 6.47% qemu-system-x86 qemu-system-x86_64 [.] ring_add</div><div> 26 |</div><div> 27 --- ring_add</div>
<div> 28 |</div><div> 29 |--96.68%-- ring_add_before</div><div> 30 | _spice_timer_set</div><div> 31 | |</div><div> 32 | |--52.97%-- spice_timer_set</div>
<div> 33 | | red_channel_client_start_ping_timer</div><div> 34 | | red_channel_client_ping_timer</div><div> 35 | | spice_timer_queue_cb</div>
<div> 36 | | red_worker_main</div><div> 37 | | start_thread</div><div> 38 | |</div><div> 39 | --47.03%-- spice_timer_queue_cb</div>
<div>...</div><div><br></div><div>The whole profiling graph is very long, so I have attached it to this message.</div><div>From the graph we can see that almost every path</div><div>goes through 'spice_timer_queue_cb()' and 'red_worker_main()'.</div>
<div><br></div><div><br></div><div>============= So, let's take a look at red_worker_main() =============</div><div><br></div><div>SPICE_GNUC_NORETURN void *red_worker_main(void *arg)</div><div>{</div><div> RedWorker *worker = spice_malloc(sizeof(RedWorker));</div>
<div><br></div><div> ... do some initialization ...</div><div><br></div><div> worker->event_timeout = INF_EVENT_WAIT;</div><div> for (;;) { <== the big loop</div><div> int i, num_events;</div><div> unsigned int timers_queue_timeout;</div>
<div><br></div><div> timers_queue_timeout = spice_timer_queue_get_timeout_ms();</div><div> worker->event_timeout = MIN(red_get_streams_timout(worker), worker->event_timeout);</div><div> worker->event_timeout = MIN(timers_queue_timeout, worker->event_timeout);</div>
<div> num_events = poll(worker->poll_fds, MAX_EVENT_SOURCES, worker->event_timeout);</div><div> red_handle_streams_timout(worker);</div><div><br></div><div> spice_timer_queue_cb(); <== Yes, this is where the cpu is actually consumed</div>
<div><br></div><div> if (worker->display_channel) {</div><div> /* during migration, in the dest, the display channel can be initialized</div><div> while the global lz data not since migrate data msg hasn't been received yet */</div>
<div> red_channel_apply_clients(&worker->display_channel->common.base, red_display_cc_free_glz_drawables);</div><div> }</div><div><br></div><div> worker->event_timeout = INF_EVENT_WAIT;</div>
<div> if (num_events == -1) {</div><div> if (errno != EINTR) { spice_error("poll failed, %s", strerror(errno)); }</div><div> }</div><div><br></div><div> for (i = 0; i < MAX_EVENT_SOURCES; i++) {</div>
<div> process any 'ready' io watchers ...</div><div> }</div><div><br></div><div> Clear the poll_fd for any removed watches ...</div><div><br></div><div> if (worker->running) {</div>
<div> int ring_is_empty;</div><div> red_process_cursor(worker, MAX_PIPE_SIZE, &ring_is_empty);</div><div> red_process_commands(worker, MAX_PIPE_SIZE, &ring_is_empty);</div><div> }</div>
<div> red_push(worker);</div><div> }</div><div> abort();</div><div>}</div><div><br></div><div>============= gdb failed to attach to the live process (I don't know why), so I made a core dump and examined that with gdb ==========</div>
<div><br></div><div>(gdb) thread 2</div><div>[Switching to thread 2 (Thread 7433)]#0 0x00007fa93df59c87 in ioctl () at ../sysdeps/unix/syscall-template.S:82</div><div>82    ../sysdeps/unix/syscall-template.S: No such file or directory.</div>
<div>    in ../sysdeps/unix/syscall-template.S</div><div>(gdb) bt</div><div>#0 0x00007fa93df59c87 in ioctl () at ../sysdeps/unix/syscall-template.S:82</div><div>#1 0x00007fa9410000d3 in red_channel_client_ping_timer (opaque=0x7fa8e4070070) at red_channel.c:752</div>
<div>#2 0x00007fa940f45370 in spice_timer_queue_cb () at spice_timer_queue.c:262</div><div>#3 0x00007fa940f289e6 in red_worker_main (arg=0x7fffed47bd00) at red_worker.c:12265</div><div>#4 0x00007fa93ec1a9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300</div>
<div>#5 0x00007fa93df6221d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112</div><div>#6 0x0000000000000000 in ?? ()</div><div>(gdb) frame 3</div><div>#3 0x00007fa940f289e6 in red_worker_main (arg=0x7fffed47bd00) at red_worker.c:12265</div>
<div>12265     spice_timer_queue_cb();</div><div>(gdb) l</div><div>12260     timers_queue_timeout = spice_timer_queue_get_timeout_ms();</div>
<div>12261     worker->event_timeout = MIN(red_get_streams_timout(worker), worker->event_timeout);</div><div>12262     worker->event_timeout = MIN(timers_queue_timeout, worker->event_timeout);</div>
<div>12263     num_events = poll(worker->poll_fds, MAX_EVENT_SOURCES, worker->event_timeout);</div><div>12264     red_handle_streams_timout(worker);</div>
<div>12265     spice_timer_queue_cb();</div><div>12266    </div><div>12267     if (worker->display_channel) {</div>
<div>12268     /* during migration, in the dest, the display channel can be initialized</div><div>12269     while the global lz data not since migrate data msg hasn't been</div>
<div>(gdb) print num_events</div><div>$1 = 0 <== no fd was ready; poll() returned due to timeout</div><div><br></div><div>(gdb) print worker->poll_fds</div><div>$4 = {{fd = 20, events = 1, revents = 0}, {fd = 29, events = 5, revents = 0}, {fd = 31, events = 1, revents = 0}, {fd = -1, events = 0, revents = 0} <repeats 17 times>}</div>
<div>$ ls -l /proc/7409/fd (captured before I made the coredump)</div><div>lrwx------ 1 root root 64 2013-12-12 18:08 20 -> socket:[6309964]</div><div>lrwx------ 1 root root 64 2013-12-12 18:08 29 -> socket:[6299877]</div><div>lrwx------ 1 root root 64 2013-12-12 18:08 31 -> socket:[6261651]</div><div><br></div><div>(gdb) print worker->event_timeout</div><div>$5 = 10 <== a curious number </div><div><br></div><div>(gdb) print worker->running</div>
<div>$7 = 1</div><div><br></div><div>=========== Finally, my questions: ==================</div><div><br></div><div>1. The big loop has no exit path; I cannot even find 'thread_exit' anywhere in the spice source.</div>
<div>How does the red_worker thread quit in the normal case? Or does it simply run until the qemu main thread quits?</div><div><br></div><div>2. In my understanding, spice communicates with spice-gtk over TCP,</div><div>but here we see no TCP fds, only 3 unix sockets. What are those unix sockets for?</div>
<div><br></div><div>3. Even though the poll() timeout is non-zero, the big loop still consumes 100% CPU.</div><div>Does this mean something is wrong in the timer function, red_channel_client_ping_timer()?</div>
<div><br></div><div><br></div><div>Thanks for reading,</div><div>and even more thanks for your hints and ideas :)</div><div><br></div></div>