SPICE connection to VM no longer seamless with 12th login
@teuf
Submitted by Christophe Fergeau. Assigned to Spice Bug List
Description
Moving here the content of https://bugzilla.redhat.com/show_bug.cgi?id=1104697
Markus Stockhausen 2014-06-04 10:04:06 EDT
Description of problem:
After Live migration of a Windows VM several times between two hypervisor nodes SPICE and RDP connection suddenly stops working.
Version-Release number of selected component (if applicable):
oVirt: 3.4.1
Hypervisor nodes: Fedora 20
QEMU: 1.6.2
SPICE: 0.12.5
VM: Windows XP
VM drivers: guest tools 0.74
How reproducible:
100%
Steps to Reproduce:
- Create a spice based windows VM
- Install XP into that VM
- Migrate VM several times between two hypervisor hosts
Actual results:
- After 10-20 migrations the VM console can no longer be opened, neither through RDP nor through SPICE
Expected results:
Connection should work.
Additional info:
SPICE log (level 4) attached
Screenshot of SPICE connection attempt attached
Comment 1 Markus Stockhausen 2014-06-04 10:07:36 EDT
Created attachment 902192 [details] qemu/spice log
Comment 2 Markus Stockhausen 2014-06-04 10:08:23 EDT
Created attachment 902193 [details] spice console
Comment 3 Markus Stockhausen 2014-06-04 10:24:53 EDT
In case this was not clearly explained: I have spice-server 0.12.5 installed from the Fedora testing repositories. In 0.12.4 we had several endless loops because 64-bit timestamps were truncated to 32 bits.
Comment 4 Markus Stockhausen 2014-06-04 10:46:26 EDT
A working VM shows the following SPICE log lines around the error:
((null):19914): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: creating encoder with id == 0
<----- Here the connection hangs ---->
<----- Below, the lines of a normally working VM ---->
((null):19914): Spice-Info **: reds.c:2014:reds_handle_auth_mechanism: Auth method: 1
((null):19914): Spice-Info **: reds.c:1421:reds_info_new_channel: channel 3:0, connected successfully, over Secure link
inputs_connect: inputs channel client create
Comment 5 Markus Stockhausen 2014-06-04 15:30:01 EDT
Compiled spice-server 0.12.5 from scratch and added further logging statements.
Output:
((null):10731): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: MST creating encoder with id == 0
((null):10731): SpiceWorker-Info **: red_worker.c:9637:display_channel_client_wait_for_init: MST glz_encoder_created
((null):10731): SpiceWorker-Info **: red_worker.c:9668:on_new_display_channel_client: MST red_channel_client_ack_zero_messages_window
((null):10731): SpiceWorker-Info **: red_worker.c:9678:on_new_display_channel_client: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:10652:handle_new_display_channel: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:11449:handle_dev_display_connect: MST connected
((null):10731): SpiceWorker-Info **: red_worker.c:10119:display_channel_handle_message: MST before_handle_message
((null):10731): Spice-Info **: red_channel.c:1531:red_channel_client_handle_message: MST start type=1
<---- The last tracepoint that is reached in the error case. The next tracepoint that is reached in the ok case: ---->
((null):10731): Spice-Info **: reds.c:2266:reds_accept_ssl_connection: MST start
((null):10731): Spice-Info **: reds.c:2230:reds_init_client_ssl_connection: MST start
Conclusion: The SPICE connection to the bad VM does not reach the function reds_accept_ssl_connection() after the display_channel_client_wait_for_init() call.
I'm willing to enhance the tracepoints further. Any ideas where to search next? I'm having some trouble understanding the watcher concept and finding a good point to debug the sent/received packets.
Oved Ourfali 2014-06-05 01:06:52 EDT Whiteboard: virt
Comment 6 Omer Frenkel 2014-06-05 03:06:45 EDT
please attach vdsm.log as well
Flags: needinfo?(mst@collogia.de)
Comment 7 Markus Stockhausen 2014-06-05 03:28:24 EDT
Created attachment 902418 [details] vdsm
Flags: needinfo?(mst@collogia.de)
Comment 8 Markus Stockhausen 2014-06-05 03:30:19 EDT
VDSM log from migration target host attached. The log starts before the migration is initiated and ends after I tried to logon to SPICE console via webadmin.
Comment 9 Markus Stockhausen 2014-06-05 04:42:14 EDT
Remark: The following behaviour can be reproduced even without opening a SPICE console connection; an online migration of the VM to the SPICE-debug-enabled Fedora 20 hypervisor is enough.
I narrowed the error down a bit in the function red_worker_main(). In a normally working VM the endless loop in this function runs permanently. In the "damaged" VM this loop suddenly halts. I placed two debugging messages before and after the call:

spice_info("before_poll");
num_events = poll(worker->poll_fds, MAX_EVENT_SOURCES, worker->event_timeout);
spice_info("after_poll");

On the bad VM that call suddenly stops returning and stalls; the last log line shows "before_poll". On the good VM the call always returns and the log keeps being written.
Comment 10 Markus Stockhausen 2014-06-05 10:10:13 EDT
Please ignore my last comments.
I made some further analysis and the outcome is as follows:
When doing live migrations in the above described setup the SPICE performance might deteriorate on the target host. This can lead to a stall of the console display. Several problems might occur:
- SPICE console lags. If you drag windows across the WinXP desktop very fast, they react slowly and run "behind" the mouse pointer.
- SPICE console is no longer responsive. You cannot select any item on the WinXP desktop. Nevertheless the console display can be opened from webadmin.
- SPICE console stays blank with the text "connecting ...".
This has been verified with both the updated 0.12.5 spice-server and the original FC20 spice-server 0.12.4.
Videos of case 2 are attached.
Comment 11 Markus Stockhausen 2014-06-05 10:10:49 EDT
Created attachment 902559 [details] normal performance before migration
Comment 12 Markus Stockhausen 2014-06-05 10:11:18 EDT
Created attachment 902561 [details] bad performance after migration
Itamar Heim 2014-06-08 02:30:00 EDT Target Release: --- → 3.5.0
Comment 13 Christophe Fergeau 2014-06-10 09:46:43 EDT
(In reply to Markus Stockhausen from comment 5)
Conclusion: The SPICE connection to the bad VM does not reach the function reds_accept_ssl_connection() after the display_channel_client_wait_for_init() call.
I'm willing to enhance the tracepoints further. Any ideas where to search next? I'm having some trouble understanding the watcher concept and finding a good point to debug the sent/received packets.
Have you tried getting a gdb backtrace when the server is hanging? ('debuginfo-install qemu-kvm' 'gdb --pid $qemu_pid' followed by 'thread apply all bt')
(In reply to Markus Stockhausen from comment 10)
Please ignore my last comments.
All of them? There is no actual hang, but 'only' some very slow behaviour?
This has been verified with the updated 0.12.5 spice-server and the original FC20 spice-server 0.12.4
Just to make sure I got you right, 0.12.4 was working ok (save for this 32 bit timer overflow), and this issue appeared in 0.12.5 ?
Comment 14 Markus Stockhausen 2014-06-10 10:10:47 EDT
Some further tests showed that this problem occurs only in Windows XP VMs. To be precise we see the system performance degrading after each migration. So we can divide the error into several stages.
- After 1-5 migrations the SPICE display gets slower. The system seems to work normally. See video above.
- Another 5-10 migrations later the system halts completely.
Exchanging SPICE 0.12.5 with 0.12.4 and vice versa does not change the situation.
So to get closer to the error I will follow your advice and analyze what is going on, when the system stalls completely. Depending on that result I will open a new BZ or will continue this one.
Comment 15 Markus Stockhausen 2014-06-10 10:28:19 EDT
Created attachment 907289 [details] gdb output
Comment 16 Markus Stockhausen 2014-06-10 10:30:26 EDT
GDB Output attached. The current version state in that situation is:
- qemu 1.6.2
- libspice-server 0.12.4
Comment 17 Markus Stockhausen 2014-06-10 14:30:03 EDT
BZ1107835 has been created to analyze the major problem -> VM will stall after some migrations.
Markus Stockhausen 2014-06-10 14:30:33 EDT Depends On: 1107835
Omer Frenkel 2014-06-11 01:53:52 EDT Target Release: 3.5.0 → --- Component: ovirt-engine-webadmin → spice Version: 3.4 → 20 Assignee: bugs@ovirt.org → cfergeau@redhat.com Product: oVirt → Fedora QA Contact: pstehlik@redhat.com → extras-qa@fedoraproject.org
Markus Stockhausen 2014-06-13 15:11:16 EDT Summary: Spice & RDP connection broken after live migration → SPICE connection to VM no longer seamless with 12th login
Comment 18 Markus Stockhausen 2014-06-13 15:22:36 EDT
During long tests we isolated several bugs we encountered in the oVirt environment. I have therefore changed this bug's summary to better reflect a truly reproducible test case. I hope this is OK for the assignees.
Currently it boils down to a very simple and unwanted behaviour.
Server side:
oVirt 3.4.2
Fedora 20 hypervisor nodes
qemu 1.6.2
libspice-server 0.12.4
Client side:
Windows 7
virt-viewer 0.6.0
Steps to reproduce:
1. Start a Windows XP SPICE VM from the oVirt webadmin
2. Connect to the console of the VM
3. virt-viewer will start and open the console window
4. The admin can control the VM "seamlessly". That means the mouse is not trapped in the console window: simply by dragging the mouse over the console window you can interact with the VM.
5. Log out from the console by closing virt-viewer
6. Repeat steps 2-5 exactly 11 times
7. Log on to the VM for the 12th time
Result:
When the admin wants to interact with the console, the mouse gets trapped in the console window. The admin must press SHIFT+F12 to get out of the window.
Expected result:
Seamless integration. No trapped mouse.
Markus Stockhausen 2014-06-13 15:45:53 EDT No longer depends On: 1107835
Comment 19 Marc-Andre Lureau 2014-06-13 17:21:33 EDT
When the mouse is trapped, can you verify whether the vdagent is still running? Most likely it is not. Could you get the vdagent & vdservice logs from when the bug occurs? Thanks
Flags: needinfo?(mst@collogia.de)
Comment 20 Markus Stockhausen 2014-06-14 10:35:52 EDT
Created attachment 908784 [details] vdservice
Flags: needinfo?(mst@collogia.de)
Comment 21 Markus Stockhausen 2014-06-14 10:36:22 EDT
Created attachment 908785 [details] vdagent
Comment 22 Markus Stockhausen 2014-06-14 10:37:36 EDT
Logs attached - please don't tell me vdagent/vdservice are stopped if a user logs into a server too often :(
Comment 23 Markus Stockhausen 2014-06-14 10:49:16 EDT
...
if (!normal_restart && ++_agent_restarts > VD_AGENT_MAX_RESTARTS) {
    vd_printf("Agent restarted too many times");
    ret = false;
    stop();
}
if (ret && kill_agent() && launch_agent()) {
    if (time - _last_agent_restart_time > VD_AGENT_RESTART_COUNT_RESET_INTERVAL) {
        _agent_restarts = 0;
    }
...
That arithmetic looks too eager. In this setup vdservice will give up restarting the agent if a user logs into the console at intervals of less than 60 seconds. Is there no better way to handle that?
Comment 24 Markus Stockhausen 2014-06-14 10:55:01 EDT
Windows XP is not treated as gracefully on agent restarts as newer versions:
case WAIT_OBJECT_0 + VD_STATIC_EVENTS_COUNT:
    vd_printf("Agent killed");
    if (_system_version == SYS_VER_WIN_XP_CLASS) {
        restart_agent(false);
    } else if (_system_version == SYS_VER_WIN_7_CLASS) {
        kill_agent();
        // Assume agent was killed due to console disconnect, and wait for agent
        // normal restart due to console connect. If the agent is not alive yet,
        // it was killed manually (or crashed), so let's restart it.
        if (WaitForSingleObject(_control_event, VD_AGENT_RESTART_INTERVAL) == WAIT_OBJECT_0) {
            handle_control_event();
        }
        if (_running && !_agent_alive) {
            restart_agent(false);
        }
Any reason for that?
Comment 25 Christophe Fergeau 2014-06-16 04:50:34 EDT
The bit of code you quote was added as part of https://bugzilla.redhat.com/show_bug.cgi?id=845222. Thanks for the very detailed investigation! I'm not sure if we have a spice-vdagent-win upstream bugzilla component though :((
Comment 26 Uri Lublin 2014-06-16 05:32:41 EDT
(In reply to Markus Stockhausen from comment 23)
The logic of this code is: when spice-vdagent exits, spice-vdservice runs a new instance of spice-vdagent. If something prevents spice-vdagent from starting, for example a problem with the virtio-serial device/driver, spice-vdservice tries VD_AGENT_MAX_RESTARTS consecutive times and then stops trying. The code assumes that if there is no problem, at least one instance out of the VD_AGENT_MAX_RESTARTS vdagents will live for more than VD_AGENT_RESTART_COUNT_RESET_INTERVAL.
The problem is partially due to making the Windows spice vdagent exit when the spice client disconnects.
Comment 27 Markus Stockhausen 2014-06-26 06:08:25 EDT
(In reply to Christophe Fergeau from comment 25)
The bit of code you quote was added as part of https://bugzilla.redhat.com/show_bug.cgi?id=845222 Thanks for the very detailed investigation! I'm not sure if we have a spice-vdagent-win upstream bugzilla component though :((
Is there some better place to ask for a fix (outside RH bugzilla)?
Flags: needinfo?(cfergeau@redhat.com)
Comment 28 Cole Robinson 2015-03-31 15:41:02 EDT
Markus, there is an upstream spice w32 agent bug tracker, but no guarantees a fix will be generated. At least the bug has a better chance of surviving there, since this one will be auto-closed when F20 reaches end-of-life:
https://bugs.freedesktop.org/buglist.cgi?component=win32%20agent&product=Spice