This is Part 1 of a series on iSCSI All Paths Down (APD) and Permanent Device Loss (PDL) situations, which can wreak havoc in a VMware virtualization environment. In this first part I will focus on how to diagnose the situation, what to look for in the log files, and how to verify what has actually happened.
NOTE: This is not a definitive guide and should not be treated as one; it is only meant to help diagnose the problem. Your case will most likely vary and the error messages might be different. The following article is based on my experience with APD and PDL cases on Dell EqualLogic PS6xxx arrays and VMware ESXi 5.1.
1. What happens to the ESXi host when an iSCSI LUN is inaccessible
There are two types of All Paths Down (APD) and Permanent Device Loss (PDL) scenarios: the expected one and the unexpected one. As you can probably guess, the unexpected one is the one I will focus on in this article. If a LUN is taken offline as part of a planned operation, there are rarely any problems.
When an ESXi host loses its connection to a LUN unexpectedly, there are two possible outcomes. Either the host resumes normal operation and the LUN in question is grayed out and marked with the word "inaccessible" in parentheses. This is the better outcome, and you can still work normally on the ESXi host using either the vSphere Web Client or the desktop vSphere Client. The other possible outcome is that the ESXi host goes into a non-responsive state and the management connection to the host is lost. Obviously this is much worse, and the first natural reaction is to panic a little; I know I did when this first happened.
The reason an ESXi host goes into a non-responsive state is that the host is trying to reconnect to the LUN in question with all available resources, and the management connections are treated as secondary services. To start debugging it IS required that you can make an SSH connection to the server; if you can't, there isn't much you can do at this point. Personally I keep the SSH service enabled on all hosts, just to have one more way to reach them in a worst-case scenario. ESXi hosts should not be reachable from the outside, so this isn't a security problem. I'd also like to note that in my experience the local console (KVM/iDRAC/local) is also really slow in this state and can't be used for anything useful. You can try to restart the management services (see the example below), but that won't help if the LUN in question is still inaccessible.
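One common way to restart the management agents from an SSH session is shown below. These are standard commands on ESXi 5.x, but treat the snippet as a sketch rather than a fix, since it won't bring back a LUN that is still unreachable:
~ # /etc/init.d/hostd restart
~ # /etc/init.d/vpxa restart
Running services.sh restart restarts all management agents at once, but it takes noticeably longer and is usually not needed for this.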
What makes these two outcomes different is the virtual machines (VMs) themselves. If a host has VMs running on it that use the lost LUN, the host will go into a non-responsive state. If the host doesn't have any such VMs running on it, you can simply reconnect to the host from vCenter and everything is fine, except that the LUN will remain inaccessible. What happens is that the VMs lose access to their virtual hard disks and keep running from whatever they have stored in RAM. Any action that requires the VM to access its disk will make the VM crash, and usually they crash right away. I've seen cases where a Linux machine still replied to ICMP pings until its web page was accessed (it was a web server), at which point it stopped replying to ICMP as well.
2. How to diagnose APD/PDL with the help of log files
There are a few key elements you can look for in the log files that will help you diagnose the problem at hand. As mentioned before, an SSH connection to the server is required. The logs can be found under /var/log/ on the ESXi host, so navigate there first. In this directory you will find some logs that are vital in an iSCSI environment. The logs we want to look at are the following:
- hostd.log (ESX Service Log)
- vmkernel.log (VMKernel Messages)
- vmkwarning.log (VMKernel Warnings)
Each of the log files mentioned above contains some clues to what has happened, and to verify a lost LUN it's a good idea to check them all so you can be reasonably certain which LUN is the root cause of the problem. A quick way to skim them is shown below.
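Before digging into individual entries, a quick skim is usually enough to spot the misbehaving device. A minimal example; the grep pattern is simply the device identifier prefix, not anything specific to my setup:
~ # tail -n 200 /var/log/vmkernel.log
~ # grep "naa." /var/log/vmkwarning.log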
2.1 hostd.log
In the hostd.log file there is mainly one entry we are interested in, and it contains some helpful information that tells us in clear text which LUN is the root cause of the problem.
timeofevent [77180B90 info 'VmkVprobSource'] VmkVprobSource::Post event: (vim.event.EventEx) {
--> dynamicType = ,
--> key = 1990376604,
--> chainId = 0,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> datacenter = (vim.event.DatacenterEventArgument) null,
--> computeResource = (vim.event.ComputeResourceEventArgument) null,
--> host = (vim.event.HostEventArgument) {
--> dynamicType = ,
--> name = "esxi-host-name",
--> host = 'vim.HostSystem:ha-host',
--> },
--> vm = (vim.event.VmEventArgument) null,
--> ds = (vim.event.DatastoreEventArgument) {
--> dynamicType = ,
--> name = "datastore-name",
--> datastore = 'vim.Datastore:12312323-1234aaa9-a1a1-ac333a3a3ac3',
--> },
--> net = (vim.event.NetworkEventArgument) null,
--> dvs = (vim.event.DvsEventArgument) null,
--> fullFormattedMessage = ,
--> changeTag = ,
--> eventTypeId = "esx.problem.vmfs.heartbeat.timedout",
--> severity = ,
--> message = ,
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> dynamicType = ,
--> key = "1",
--> value = "12312323-1234aaa9-a1a1-ac333a3a3ac3",
--> },
--> (vmodl.KeyAnyValue) {
--> dynamicType = ,
--> key = "2",
--> value = (vim.event.DatastoreEventArgument) {
--> dynamicType = ,
--> name = "datastore-name",
--> datastore = 'vim.Datastore:12312323-1234aaa9-a1a1-ac333a3a3ac3',
--> },
--> }
--> ],
--> objectId = "12312323-1234aaa9-a1a1-ac333a3a3ac3",
--> objectType = "vim.Datastore",
--> objectName = "datastore-name",
--> fault = (vmodl.MethodFault) null,
--> }
Note that both the ESXi host name (esxi-host-name) and the lost datastore (datastore-name) appear in clear text, together with the eventTypeId esx.problem.vmfs.heartbeat.timedout. You can also search for that event directly, as shown below.
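If you just want the relevant events without scrolling through the whole file, grepping for the event type works well. A minimal example using the eventTypeId from the entry above:
~ # grep "esx.problem.vmfs.heartbeat.timedout" /var/log/hostd.log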
2.2 vmkernel.log
In the vmkernel.log file you can find a lot more information than in the hostd.log file, including multiple entries regarding the broken iSCSI path and the lost LUN. What you should be able to find in the log file is the following:
NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x28 (0x41248919c240, 8235) to dev "naa.6090a000d0ea5550006666c333fbb9bb" on path "vmhba40:C1:T7:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0. Act:EVAL
WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba40:CH:1 T:7 L:0 : Task mgmt "Abort Task" with itt=0x2b5bd (refITT=0x2b5bc) timed out.
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6090a000d0ea5550006666c333fbb9bb" state in doubt; requested fast path state update...
ScsiDeviceIO: 2316: Cmd(0x41248919c240) 0x28, CmdSN 0xc42383 from world 8235 to dev "naa.6090a000d0ea5550006666c333fbb9bb" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
These are the types of error messages I've found that clearly relate to the same LUN; the naa identifier matches the LUN found in the hostd.log file. As you can see, the LUN in question doesn't respond to the ESXi host, and the host keeps trying to reconnect to the SAN. You can also query the device and path state directly, as shown below.
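To cross-check what the logs say, the device and its paths can be queried from the shell. These are standard esxcli namespaces on ESXi 5.x, but take the output as illustrative only, and substitute your own naa identifier:
~ # esxcli storage core device list -d naa.6090a000d0ea5550006666c333fbb9bb
~ # esxcli storage core path list -d naa.6090a000d0ea5550006666c333fbb9bb
~ # esxcli storage nmp device list -d naa.6090a000d0ea5550006666c333fbb9bb
The path list shows whether each path is active or dead, and the device list shows the overall device status.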
Now you could try a rescan of the datastores by running esxcfg-rescan vmhbaNN, where NN is the number of the iSCSI adapter (see the example after the log excerpt below). If the LUN is still in use, as will be discussed later, this won't give the desired result, but I do recommend trying it: if your problem is not the same as the one described here, a rescan could solve your issue. However, if you rescan the datastores and they are still inaccessible, you will find the following lines in the log file:
Partition: 414: Failed read for "naa.6090a000d0ea5550006666c333fbb9bb": I/O error
Partition: 1020: Failed to read protective mbr on "naa.6090a000d0ea5550006666c333fbb9bb" : I/O error
WARNING: Partition: 1129: Partition table read from device naa.6090a000d0ea5550006666c333fbb9bb failed: I/O error
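For reference, this is one way to trigger the rescan from the shell. vmhba40 is the iSCSI adapter in my environment, so substitute your own adapter name; vmkfstools -V simply asks the host to re-read its VMFS volumes:
~ # esxcfg-rescan vmhba40
~ # vmkfstools -V
The newer syntax, esxcli storage core adapter rescan -A vmhba40, does the same thing as esxcfg-rescan.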
2.3 vmkwarning.log
In the vmkwarning.log file you will find entries similar to those in the vmkernel.log file. Since vmkwarning.log only contains the warnings, it can be quicker to find the following line there than in vmkernel.log (a quick grep is shown after the log line below). Personally I don't check this log file anymore, as the same messages can be found in vmkernel.log.
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6090a098d0ea5330680825c322fbb9db" state in doubt; requested fast path state update...
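If you do check it, a simple grep for the warning text is enough; an illustrative one-liner using the message above:
~ # grep "state in doubt" /var/log/vmkwarning.log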
3. Further information available
In addition to the log files available on the ESXi host, you can run a few commands on the host for additional verification and debugging of the problem. There is one key command that shows whether a LUN is still in use by VMs even though it is inaccessible.
First of all, we should always verify that the LUN does indeed exist on this particular host. Obviously, if the log files contain information about this LUN, the following command will find it. But if you are even a little bit uncertain, this will help you verify:
~ # ls /vmfs/devices/disks/ | grep naa.6090a098d0ea5330680825c322fbb9db
naa.6090a098d0ea5330680825c322fbb9db
naa.6090a098d0ea5330680825c322fbb9db:1
~ #
To find out whether a LUN is stuck, we can check the storage information on the ESXi host with the following command:
esxcli storage core device world list -d naa.6090a098d0ea5330680825c322fbb9db
Device                                World ID  Open Count  World Name
------------------------------------  --------  ----------  ----------
naa.6090a098d0ea5330680825c322fbb9db      8224           1  idle0
The result shows two key elements that I've found in every such situation: "Open Count" is always 1 and the world name is idle0. This means there are powered-on virtual machines that are still trying to use the LUN. To cross-check which VMs are running on the host, you can list the VM worlds as shown below.
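This cross-check uses the standard esxcli vm namespace on ESXi 5.x; it prints the world ID of every running VM, which you can compare against the world IDs holding the device open:
~ # esxcli vm process list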
Additional information can be found in VMware KB article 2014155.
4. Conclusion
As you can see from the log files, the LUN is inaccessible, and the logs also tell you which LUN it is. In most cases the SAN itself hasn't broken down, so there is no point in cold-booting the whole ESXi host, as that would risk corrupting more than just the VMs that are affected. Hopefully this approach helps you diagnose the problem at hand.
In the next part, I will go through how to recover from an APD/PDL situation, which does work when using Dell EqualLogic and ESXi 5.1. If you use another iSCSI SAN than EqualLogic, the same method might still apply to your environment.
Just wondering if the above behavior & solution (both Part 1 & 2) is valid for ESXi hosts 5.5 & 6.0 as well?
Many thanks and a great article!!
The principle is the same, although newer ESXi versions handle datastore connection issues a lot better and hosts no longer end up in a state where they use 100% CPU trying to find the lost datastore, for example. Diagnosing the issue can be done in a similar fashion, but the official VMware KB articles should be used as the preferred source of information, as they are always accurate and sanctioned by VMware :)