
Troubleshooting XenServer deployments


Presentation Transcript


  1. Troubleshooting XenServer deployments Tomasz Czajka, Sr. Support Engineer 8th of October 2010

  2. Agenda • Case Study: “Production down” • Learn: “XenServer crash” • Case study: “Singlepathing” • Q & A

  3. “Production down”

  4. VMs don't start - why? Basic troubleshooting in XenCenter: cannot start a VM; "The SR is not available" error; Storage Repository (SR) in "broken" state; "Repair" does not work. Use the CLI to troubleshoot.

  5. Broken storage - what is "broken"? Each host (XenServer_1, XenServer_2) attaches an SR through its own PBD (Physical Block Device), and each PBD has a UUID (unique ID). The SR's Volume Group is named "<prefix> + SR UUID" and sits on a LUN identified by a SCSI ID. Detached PBDs show up as:
# xe pbd-list currently-attached=false

  6. Storage troubleshooting. Goal: reproduce the problem and analyse the logs: /var/log/xensource.log*, SMlog*, messages*. Mark the reproduction in the log yourself:
# tail -f /var/log/messages > /tmp/ShortLog
# date
# echo "Unplugging cable" >> /var/log/messages
Note: messages uses UTC timestamps while xensource.log uses local time.
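Before correlating entries across the two logs, note the current offset between local time and UTC (a minimal sketch using tools already in dom0):
# date      (local time - matches the xensource.log timestamps)
# date -u   (UTC - matches the messages timestamps)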

  7. PBD unplugged. Plugging the PBD manually:
# xe pbd-list host-uuid=... sr-uuid=...
# xe pbd-plug uuid=...
SR_BACKEND_FAILURE_47: The SR is not available
no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
# xe sr-list name-label="My SR" params=uuid
uuid: 19856cba-830c-e298-79fa-84a79eb658f4
# grep "PBD.plug" xensource.log
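To find every detached PBD for a given SR and plug each one, the --minimal output format of xe is handy (a minimal sketch; substitute the real SR UUID):
# xe pbd-list sr-uuid=<SR UUID> currently-attached=false params=uuid --minimal
# xe pbd-plug uuid=<each UUID printed above>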

  8. Volume Group - what is a VG? The Logical Volume Manager (LVM) stack, illustrated with 3 VMs holding 1 virtual disk each: every HDD/LUN is a Physical Volume (PV); the PVs are grouped into a Volume Group (VG), which corresponds to the Storage Repository (SR); each Logical Volume (LV) inside the VG holds one virtual disk (VDI).
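Each layer of this stack can be listed with one command on the host (a minimal sketch; the VG name follows the VG_XenStorage-<SR UUID> convention shown on the following slides):
# pvs                            (PVs - one per HDD/LUN)
# vgs                            (VGs - one per LVM-based SR)
# lvs VG_XenStorage-<SR UUID>    (LVs - one per VDI, plus the MGT volume)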

  9. Volume Group - matching the UUID:
# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
# vgs
VG                                                  #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9    1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853    1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88    1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3    1   1   0 wz--n-   1.99G   1.98G
The VG carrying the failing SR's UUID is missing from the list.

  10. Examining the HDD/LUN - checking the SCSI ID (unique for each SCSI device):
# xe pbd-list params=device-config sr-uuid=...
device-config: SCSIid: 360a9800050334f49633459
This is the SCSI ID the PBD expects to find.

  11. Examining the HDD/LUN - can the Linux kernel see this block device (SCSI device)?
# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...
Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec
(LUN readable!)
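If hdparm is not conclusive, a raw sequential read with dd gives a second data point; dd prints the throughput when it finishes (a minimal sketch; it only reads, but double-check the if= path anyway):
# dd if=/dev/disk/by-id/scsi-<SCSI ID> of=/dev/null bs=1M count=512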

# 
  12. Addressing SCSI disks:
# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546
/dev/disk/by-id:
scsi-360a9800050334f4963345767656c546a -> /dev/sde
/dev/disk/by-scsibus:
360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde
The multipath node is /dev/mapper/360a9800050334f4963345767656c546. Also check /dev/disk/by-path.

  13. Examining the HDD/LUN - is the LUN empty?
# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...
...
ID_FS_TYPE=LVM2_member
...
"If this is an LVM member, why is there no VG on it?"
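LVM itself can be asked what it sees on that device: pvs accepts a device path and reports which VG, if any, the PV belongs to (a minimal sketch; substitute the real SCSI ID):
# pvs /dev/disk/by-id/scsi-<SCSI ID>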

  14. Examining the HDD/LUN - is there a VG created on the PV?
# pvs
PV                                             VG                                       Fmt  Attr PSize   PFree
/dev/mapper/360a9800050334f4963                VG_XenStorage-090d4717-9f91-92de-83c3-   lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963                VG_XenStorage-70a029cf-7f35-c035-4af7-   lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963                VG_XenStorage-19856cba-830c-e298-79fa-   lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965                VG_XenStorage-9be18df5-3fd2-4835-b864-   lvm2 a-     1.99G   1.98G
/dev/sda3                                      VG_XenStorage-5239de43-6a74-0365-f825-   lvm2 a-   129.07G 129.05G
# pvs | grep 360a9800050334f496334595a32306431
/dev/mapper/360a9800050334f496334595a32306431  VG_XenStorage-332432-430d-3423-4332434-5485974  lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
uuid: 19856cba-830c-e298-79fa-84a79eb658f4
The VG_XenStorage-<UUID> name on the LUN differs from the SR UUID!

  15. No original VG on the LUN. Potential reasons:
- (Re)installation of a host in the same pool (unplug FC / zoning)
- (Re)installation of a host in another pool (zoning)
- Adding an SR with "xe sr-create" in the CLI, which initialises the LUN and wipes any existing VG ... BE VERY CAREFUL!

  16. Volume Group ... has been recreated! The LVM metadata is lost, along with 100 MB of the VDI data. Action steps: don't shut down the running VMs; take an online backup of the running VMs (now); take a block-level clone of the whole LUN (now); assess professional data recovery.
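One way to carry out the two backup steps without stopping the VMs (a minimal sketch: the snapshot name, export filename and /backup path are hypothetical, and the dd clone must be written to a different disk than the damaged LUN):
# xe vm-snapshot uuid=<VM UUID> new-name-label=rescue-snap
# xe snapshot-export-to-template snapshot-uuid=<snapshot UUID> filename=/backup/rescue.xva
# dd if=/dev/mapper/<SCSI ID> of=/backup/lun-clone.img bs=1M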

  17. Volume Group - looking for the LVM metadata backup: /etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4. Make a copy first:
# cp /etc/lvm/backup/* /root/backup/
Check the backup timestamp (within the file), then verify that the LVs in the backup file match the VDIs in the xapi database one to one:
# cat /etc/lvm/backup/VG... | grep VHD      (LVs in the backup file)
# xe vdi-list sr-uuid=<uuid> params=uuid    (VDIs in the xapi database)
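LVM metadata backups are plain text with a description and creation_time header near the top, so the timestamp check is a one-liner (a minimal sketch):
# head /etc/lvm/backup/VG_XenStorage-<SR UUID>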

  18. Volume Group - removing the newly created VG and PV:
# vgremove "VG_XenStorage-<new SR uuid>"
# pvremove /dev/mapper/<SCSI ID>

  19. Volume Group - recreating the PV and VG from backup:
# pvcreate --uuid <PV uuid from backup file> --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> /dev/mapper/<SCSI ID>
# vgcfgrestore VG_XenStorage-<SR UUID> -f /etc/lvm/backup/VG_XenStorage-<SR UUID>

  20. Examining the HDD/LUN - confirm that the VG name now contains the SR UUID:
# pvs | grep 360a9800050334f496334595a32306431
/dev/mapper/360a9800050334f496334595a32306431  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4  lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
uuid: 19856cba-830c-e298-79fa-84a79eb658f4
VG_XenStorage-<UUID> matches the SR UUID.

  21. Volume Group - checking the Logical Volumes:
# lvs
MGT                                       VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M

  22. Storage Repository - plugging the PBD again:
# xe pbd-plug uuid=...
Success! But no VDIs shown...
# xe sr-scan uuid=...
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]
# xe vdi-list uuid=<the UUID above>
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=...
Success! All VDIs shown... Well done!

  23. What we've learned by troubleshooting the "Production down" issue: for a PBD to get plugged it needs LUN/HDD  PV  VG (SR)  LV (VDI); the VG name is generated from the SR UUID (+ prefix) and the LV name from the VDI UUID (+ prefix); displaying VGs (vgs), PVs (pvs) and LVs (lvs); addressing block devices (/dev/disk); examining a HDD/LUN with "hdparm -t"; restoring a PV & VG from backup.

  24. “The XenServer Crash”

  25. The XenServer crash? An unresponsive or rebooting host.
Kernel panic or crash dump: error on the console, host locked; causes include memory addressing, a bug in the OS, hardware failure.
No kernel panic and no crash dump: host rebooting / frozen / no errors on the console; causes include hardware failure, OS busy (I/O), user action.

  26. Two symptoms, two paths.
Symptom: host is unresponsive.
- Serial console available: generate a crashdump (CTX120540) & reboot, then review the crashdump.
- No serial console: connect a local console. Any errors on the console? Take photos and reboot. Otherwise boot the host to the console (CTX120540) & reboot.
Symptom: host rebooted itself.
- /var/crash/<date> exists: review the crashdump.
- No crashdump, HA enabled: check /var/log/xha.log - was the host fenced? Analyse /var/log/messages and xensource.log for HA reasons; disable HA.
- No crashdump, HA disabled: analyse /var/log/messages and xensource.log; add the "noreboot" option in extlinux.conf (see the sketch below). Still rebooting?  examine the hardware.
If in doubt, contact Citrix Tech Support.
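The "noreboot" change itself is one edit to the Xen line in the boot configuration (a minimal sketch, assuming the stock /boot/extlinux.conf layout where the hypervisor options precede the first "---" separator; back the file up first):
# cp /boot/extlinux.conf /boot/extlinux.conf.bak
# vi /boot/extlinux.conf
    append /boot/xen.gz noreboot <existing Xen options> --- <kernel line unchanged>
With noreboot, Xen halts after a panic instead of rebooting, so the error stays on the console long enough to photograph.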

  27. Getting into details... Analyse /var/log/messages and xensource.log. Finding the startup strings is as easy as grep:
# cd /var/log
# grep "klogd" messages -B100
# grep "SERVER START" xensource.log -B100

  28. Review crashdump - inside the crash log directory /var/crash/<stamp>:
- Domain0.log - the Domain0 console ring
- Domain1,2,3...log
- Debug.log
- crash.log - the hypervisor console ring (HA activity, page faults, driver and storage issues)
- xen-memory-dump - the CPU stack, to be analysed by Citrix Tech Support

  29. Investigating crash.log - review crashdump (cont). The Xen console ring is located at the bottom of the file:
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
Why did the watchdog trigger?  /var/log/xha.log (network or storage heartbeat failed)
Why did the heartbeat fail?  /var/log/messages (DMP, kernel, drivers, I/O errors)

  30. Investigating crash.log - other examples: a page fault.
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

  31. What we've learned from "The XenServer crash": did the host really crash? Kernel panics and crashdumps; triggering a crashdump manually; locating a host reboot in the logs; reviewing the crashdump logs.

  32. “Single-Pathing”

  33. Storage performance issue. DMP has been enabled to improve performance, and virtual machines are running on different iSCSI SRs - yet throughput is poor:
LinuxGuestVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec

  34. Storage performance - checking multipath status:
# mpathutil status
360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
   \_ 3:0:0:2 sdk 8:160 [active][ready]
   \_ 4:0:0:2 sdj 8:144 [active][ready]
The multipath device lives under /dev/mapper/..., the individual paths under /dev/.

  35. Storage performance - determining current performance on domain0.
Testing the multipath device:
# hdparm -t /dev/mapper/<scsi id>
Testing the single-path devices:
# hdparm -t /dev/sdj
# hdparm -t /dev/sdk
In all cases: 30 MB/sec.
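The three tests can be run back to back in one loop (a minimal sketch; substitute the real SCSI ID, and run it while the system is otherwise idle so the numbers are comparable):
# for d in /dev/mapper/<scsi id> /dev/sdj /dev/sdk; do hdparm -t $d; done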

  36. Storage performance - determining usage of the paths:
# iostat -x /dev/sdk /dev/sdj 5
Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk     803.50      33.0        4122      160
sdj     784.00      32.8        3922      155
Both paths are used equally.

  37. Storage performance - checking if there really are 2 iSCSI sessions:
# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"
ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj

  38. Storage performance - checking if different paths are really used:
# tcpdump -i any port 3260
# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)'"
eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C
  RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)
eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E
  RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)
eth1 is carrying far less traffic than eth0.

  39. Storage performance - checking the source IP addresses of the iSCSI sessions:
# netstat -at | grep iscsi
10.1.200.138:53049  10.1.200.40:iscsi-target  ESTABLISHED
10.1.200.178:46684  10.1.201.40:iscsi-target  ESTABLISHED
Both sessions originate from the 10.1.200.0/24 subnet - even the one going to the 10.1.201.40 target.

  40. Storage performance - checking the kernel routing table:
# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.200.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
Both bridges carry a route for the same 10.1.200.0 network, so traffic to 10.1.201.40 leaves via the default route on xenbr0.
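To confirm which interface the kernel actually picks for each iSCSI target, ip route get is more direct than reading the table (a minimal sketch, assuming the iproute2 tools present in dom0):
# ip route get 10.1.200.40
# ip route get 10.1.201.40
With the routing table above, both lookups resolve to xenbr0 - the second target never uses xenbr1.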

  41. Storage performance - fixing the configuration of the management interfaces in XenCenter: change the ISCSI_2 interface address to 10.1.201.78, so each path gets its own subnet.

  42. Storage performance - the routing table after the change:
# route
Destination  Gateway     Genmask        Iface
10.1.200.0   *           255.255.255.0  xenbr0
10.1.201.0   *           255.255.255.0  xenbr1
default      10.1.200.1  0.0.0.0        xenbr0
Each subnet now has its own bridge.

  43. Storage performance - configuring the kernel routing table by hand instead (not recommended): add to /etc/rc.local:
# route add -host 10.1.200.40 xenbr0
# route add -host 10.1.201.40 xenbr1
What about Pool Upgrade and Pool Join?

  44. Storage performance - determining current performance on the VM:
LinuxVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 45 MB/sec
Well done!

  45. What we've learned from the "Single-pathing" case study: the /dev/ locations for single- and multi-path devices; # mpathutil status; # hdparm -t; # iostat; # ifconfig, # tcpdump, # netstat, # route; # watch; best practices for iSCSI storage.

  46. Questions

  47. Resources - first aid kit:
http://docs.xensource.com - XenServer documentation
http://support.citrix.com/product/xens/ - Knowledge Center
http://forums.citrix.com/support - Support forums
http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)

  48. Before you leave... Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October. Provide your feedback and pick up a complimentary gift card at the registration desk. Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account.
