ESX VMware “File not Found” when starting a machine

Experience is something you never have when you need it. Restarting our websites’ SQL server resulted in downtime when the server would not start up. The start-up progress bar on the ESX task bar in VMware Infrastructure Client reached 95%, then the start-up terminated with a dialog box saying “File not found” and an OK button. I had not used anything in VMware Infrastructure Client other than starting and stopping machines and taking a snapshot; anything else was the VM guy’s role. However, we had a server down and customers getting a bad experience, so some clicking around and a few Google searches got it sorted out.

Seeking a solution

Using Google I established that there were more comprehensive log files in the same directory as the virtual machine. Reading these logs, I found that the failure occurred while loading one of the virtual disks, as it could not find the disk file (virtual disks are just files in the ESX server file system).

The log file showed that the file that could not be found had a different path from the others belonging to that virtual machine, all of which loaded successfully. This was the clue as to what was wrong.

 
Jun 10 10:11:10.988: vmx| DISKLIB-VMFS : "/vmfs/volumes/4908a5f1-67541468-21fa-0016357ea69b/websqlserver01VM/websqlserver01VM-000009-delta.vmdk" : open successful (23) size = 32225215488, hd = 0. Type 8
Jun 10 10:11:10.990: vmx| DISKLIB-VMFS : "/vmfs/volumes/4908a5f1-67541468-21fa-0016357ea69b/websqlserver01VM/websqlserver01VM-000006-delta.vmdk" : open successful (23) size = 32225215488, hd = 0. Type 8
Jun 10 10:11:10.992: vmx| DISKLIB-VMFS : "/vmfs/volumes/4908a5f1-67541468-21fa-0016357ea69b/websqlserver01VM/websqlserver01VM-000003-delta.vmdk" : open successful (23) size = 32225215488, hd = 0. Type 8
Jun 10 10:11:10.994: vmx| DISKLIB-LINK  : "/vmfs/volumes/4901e93d-93a8aeed-12b7-0016357ea69b/websqlserver01VM/websqlserver01VM.vmdk" : failed to open (The system cannot find the file specified).  

The storage section of the Infrastructure Client showed the location of each data store. All the other files were in one data store, but the file with the issue was in another. Opening up that data store in the Infrastructure Client, I found the folder for the file causing the error had been renamed to websqlserver01VM_old, which no longer matched the path listed in the log file.
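If the log is long, the same comparison can be done with a few lines of script. The following is only a rough sketch in Python, assuming the log has been downloaded to a local machine and that the DISKLIB lines look like the ones quoted above. The function names (disk_events, volume_of, odd_ones_out) are my own and are not part of any VMware tool.

```python
import re
from collections import Counter

# Three of the log lines quoted earlier in this post, used as sample input.
SAMPLE_LOG = '''\
Jun 10 10:11:10.988: vmx| DISKLIB-VMFS : "/vmfs/volumes/4908a5f1-67541468-21fa-0016357ea69b/websqlserver01VM/websqlserver01VM-000009-delta.vmdk" : open successful (23) size = 32225215488, hd = 0. Type 8
Jun 10 10:11:10.990: vmx| DISKLIB-VMFS : "/vmfs/volumes/4908a5f1-67541468-21fa-0016357ea69b/websqlserver01VM/websqlserver01VM-000006-delta.vmdk" : open successful (23) size = 32225215488, hd = 0. Type 8
Jun 10 10:11:10.994: vmx| DISKLIB-LINK  : "/vmfs/volumes/4901e93d-93a8aeed-12b7-0016357ea69b/websqlserver01VM/websqlserver01VM.vmdk" : failed to open (The system cannot find the file specified).
'''

# Assumed line shape: DISKLIB-<something> : "<path>" : <status>
LINE_RE = re.compile(
    r'DISKLIB-\w+\s*:\s*"(?P<path>[^"]+)"\s*:\s*'
    r'(?P<status>open successful|failed to open)'
)


def disk_events(log_text):
    """Return (path, opened_ok) pairs for every DISKLIB open attempt found."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            ok = match.group("status") == "open successful"
            events.append((match.group("path"), ok))
    return events


def volume_of(path):
    """Return the data store ID from a /vmfs/volumes/<id>/... path, else None."""
    parts = path.split("/")  # ['', 'vmfs', 'volumes', '<id>', ...]
    if len(parts) > 3 and parts[1] == "vmfs" and parts[2] == "volumes":
        return parts[3]
    return None


def odd_ones_out(events):
    """Return the paths that live on a different data store from the majority."""
    volumes = Counter(volume_of(path) for path, _ in events)
    if not volumes:
        return []
    common, _ = volumes.most_common(1)[0]
    return [path for path, _ in events if volume_of(path) != common]


for path in odd_ones_out(disk_events(SAMPLE_LOG)):
    print("Different data store:", path)
```

On the sample lines this flags the websqlserver01VM.vmdk path, the same file the log reports as failed to open. A disk on a different data store is not wrong in itself; it is just the first place worth looking.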

Solution

Renaming the folder back to its original name allowed the machine to boot up. At some point while the machine was running, this file’s folder must have been renamed, which only came to light on reboot when the file was needed again for start-up.

At least by having to jump in at the deep end I have a much better understanding of how the ESX server runs, and having scoured the Infrastructure Client for clues, I knew my way around it well by the end of the issue.

Quick Clues for places to look

Find out where the machine is located. This is listed under Resources in the summary of the virtual machine when the machine is selected in the tree view in VMware Infrastructure Client.

Virtual Machine's data store locations

You can double-click on the data store to open it up and see the files in that data store. Navigate to the sub folder of the machine of interest; there should be .log files in there. Pick the latest one, then right-click and download it to your local machine for examination in Notepad.

While you are here, for interest, have a look at the .vmx, .vmxf and .vmsd files, and check the paths in there too for clues.
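Checking those files by eye is fine for one machine, but the same check can be scripted. This is a minimal sketch in Python, assuming the file has been downloaded locally and uses the usual key = "value" layout of a .vmx file. The sample content below is made up for illustration and is not copied from our server, and the helper names are my own.

```python
import re

# Illustrative .vmx-style content; the values are invented for this example.
SAMPLE_VMX = '''\
displayName = "websqlserver01VM"
scsi0:0.present = "true"
scsi0:0.fileName = "websqlserver01VM-000009.vmdk"
'''

# One setting per line: key = "value". Lines starting with # are skipped.
PAIR_RE = re.compile(r'^\s*([^#\s=][^=]*?)\s*=\s*"(.*)"\s*$')


def parse_pairs(text):
    """Return a dict of key -> value for every key = "value" line."""
    pairs = {}
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


def path_entries(pairs):
    """Keep only the settings whose value looks like a disk file or a VMFS path."""
    return {
        key: value
        for key, value in pairs.items()
        if value.lower().endswith(".vmdk") or value.startswith("/vmfs/")
    }


for key, value in path_entries(parse_pairs(SAMPLE_VMX)).items():
    print(key, "->", value)
```

Any path this prints can then be compared against what is really in the data store browser.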

You can find and confirm the data store paths on the actual ESX server by clicking the ESX server of interest in the Hosts and Clusters tree on the left-hand side, then selecting Configuration and clicking on the data store of interest. Hover over the Location and the path will show as a mouse-over tool tip.

On our server the virtual hard disks are broken up into 2GB files, and if you have snapshots this may result in a lot more files than you see in this example. Each file name has a dash and a number showing which disk it belongs to and which part it is.

I hope this helps someone else who may be losing orders by the hour. If you can, you really need your ESX administrator to help you.

Delete All Snapshots VMware Infrastructure Client 95%

If you have decided to delete all the snapshots on a Virtual Machine, beware. This can take HOURS.

If you take snapshots, don’t keep them for too long unless you have a reason to, or they will grow and grow on your ESX server and take even longer to delete when you get around to it.

Delete All Snapshots

In Snapshot Manager in VMware Infrastructure Client there is a facility to delete all snapshots. This merges all the changes through all the snapshots back into the original disk and finally commits those changes. On a server I recently had to deal with, we had a busy SQL server with 6 months’ worth of snapshot activity.


The data store was a mess, with many GB of data to merge back in from a number of snapshots.

Misinformation

Once the snapshot delete was started, the task line in the VMware Infrastructure Client sat at 95%, then claimed that the task had timed out! Yet the machine would not shut down or do anything else, as a pending task was waiting to complete. It turns out this timeout is actually the Infrastructure Client timing out while monitoring the task, not the task itself timing out.

After logging out of Virtual Centre and logging in to the actual ESX server of interest, better information was shown: the task that could no longer be seen in VC was visible, sitting at 95%.

Snapshot Delete at 95%

There it remained at 95%, looking as if the task had crashed. I took faith from the ESX server’s activity monitor, which showed extremely high disk and processor use. Then I checked Google and found others stating that, whatever you do, don’t try to stop it running, and that it may take some time. If you force the server to stop running the task you can destroy your VM image. It took over 3 hours for our server to process this delete, during which time the virtual machine was unavailable. This was unanticipated downtime, as merging snapshots had happened very quickly for me in the past, but then they had always been small.

It seems the server does a merge and commit. Just wait, wait, wait and it will finish, but you will not see the progress move past 95%, and the cancel option on the right-click menu of the task will be greyed out…

You can see the vmdk files disappear one at a time as they are merged back into the base drive if you keep refreshing the data store they reside in. This may give a feeling for how quickly your server is getting through them. Eventually the disk activity dropped to zero and the virtual machine could be used again.
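Instead of refreshing by hand, the same check can be scripted. This is a rough sketch in Python 3, assuming the virtual machine's folder can be read as an ordinary directory from wherever the script runs, and that the snapshot files end in -delta.vmdk as in the log extract earlier. The function names and the interval are my own choices. It only reads the directory listing and does not touch the running task.

```python
import os
import time


def delta_summary(folder):
    """Count the *-delta.vmdk files in folder and add up their sizes in bytes."""
    count = 0
    total = 0
    for name in os.listdir(folder):
        if name.endswith("-delta.vmdk"):
            count += 1
            total += os.path.getsize(os.path.join(folder, name))
    return count, total


def watch(folder, interval_seconds=60, rounds=10):
    """Print the delta file count and total size once per interval."""
    for _ in range(rounds):
        count, total = delta_summary(folder)
        print(f"{count} delta files still present, {total / 1024**3:.1f} GB")
        if count == 0:
            break  # nothing left to merge
        time.sleep(interval_seconds)
```

Calling watch() with the path to the machine's folder prints one line per interval. A falling file count is a hint that the merge is still progressing, even though the task line stays at 95%.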

Lesson learned

Keep your snapshots under control: delete them if you no longer need them, and don’t expect large snapshots to merge and commit quickly.

Results after the snapshot delete.