Monday, February 09, 2009

Vmotion with RDM problem solved!

I've finally figured it out, and in doing so I've discovered some other potential problems. First, the solution to not being able to move a VM from ESX host to ESX host even though the LUNs are shared to both hosts within Navisphere: it turns out that I needed to have both of my hosts in the same storage group on the Navi side. I had them in their own groups. So I moved all the VMs over to one host, made sure all the LUNs were visible to the second host, and then added that now-empty ESX host to the other host's storage group. Then I renamed the storage group in Navi to something that represents what's going on.

The reason for this is that although VMotion was working with all the other LUNs, RDMs are a bit different. When you add a regular LUN to the ESX servers, both hosts read the same LUN ID, but that isn't the case with RDMs. An RDM gets a LUN ID of whatever the next available number is within ESX, and for VMotion to work everything has to match up on both ESX hosts. In my case the RDM LUN IDs on the two ESX hosts were never the same. Now, with the change on the Navi side, all my LUNs have the same ID within ESX.
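
If you want to check this without clicking through both hosts in the VI client, here is a minimal sketch that prints each host's LUN numbers next to the LUN's canonical name so you can compare the hosts side by side. It assumes pyVmomi and a reachable Virtual Center; the server name and credentials are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- point these at your own Virtual Center.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        storage = host.config.storageDevice
        # Map each ScsiLun's internal key to its canonical name (naa.* / vml.*).
        names = {lun.key: lun.canonicalName for lun in storage.scsiLun}
        print(host.name)
        for adapter in storage.scsiTopology.adapter:
            for target in adapter.target:
                for lun in target.lun:
                    # lun.lun is the host LUN number that has to line up on both hosts.
                    print("  LUN %-3d -> %s" % (lun.lun, names.get(lun.scsiLun, "?")))
finally:
    Disconnect(si)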

Now I can move all my VMs back and forth instead of just the ones that were VMDKs.

The other potential problem I found was that four of my initiator names were pointing to the wrong server host name within Navi under Connectivity Status. All of my servers should have four initiators, but one had eight. The initiators are the WWNs that come from the HBAs in the servers. One of my ESX hosts wasn't showing up properly. It hadn't caused a real problem yet, but I caught it before it became one.
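
To cross-check that, here's a similar sketch (same pyVmomi and placeholder-credentials assumptions as above) that dumps each ESX host's fibre channel WWNs so you can match them against the initiator records Navisphere shows under Connectivity Status.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        print(host.name)
        for hba in host.config.storageDevice.hostBusAdapter:
            if isinstance(hba, vim.host.FibreChannelHba):
                # WWNs come back as 64-bit integers; print them as hex for comparison.
                print("  %s  WWNN %016x  WWPN %016x"
                      % (hba.device, hba.nodeWorldWideName, hba.portWorldWideName))
finally:
    Disconnect(si)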

Wednesday, February 04, 2009

My first day back from vacation and problems...

After my vacation was over I was well rested and stress free. I get to work Monday morning, Jan 5th, and that's when it starts. My Exchange cluster starts to freak out for a bit. I investigate what's going on, and PowerPath is telling me that a connection to the SAN has been lost. WTF!!!! Now I'm frantic b/c I just got back, and I'm grilling everyone to see if anything went on while I was away (I stayed here BTW: www.rayonhotels.com). Everything was quiet, I was told, and I already knew that b/c I had been checking in via SSL VPN. I went into Navi, saw errors, and the email-home alerting was going off. I checked the error and it was an SFP error. Now, this could be real simple or real bad. Then guess what, the errors go away, and I check the SFPs and they're fine, nice and snug in there.

I come in Tuesday morning and it's happening all over again. This time the errors are more frequent, dropping Outlook connections to the server all day long. I put a call into Dell right away, and to my surprise they didn't have my service tag updated on file, so they couldn't find my SAN in their system. WHAT, are you F***ing serious? WTF is going on? So I call my account rep and ask him what the deal is. He said we'd see what's going on. All this time I'm rebooting my Exchange cluster to reestablish the connection and trying to make sense of this thing.

After 2 hours they finally got their act together and updated their system. Sign of the times over at Dell, huh? Anyhow, they're dispatching parts that should arrive by 11am, and a tech should call me shortly after. I get the parts at 11am but no call from the tech. There are two boxes. Now, from my knowledge, an SFP is the size of a thumb. I did get a small box, but I also got a much larger box. Now I'm thinking WTF and open it: it's a storage processor. Why the ef would I need this? And this is what I mean about it being real simple or real bad. I send emails to the people working on this case. I get replies saying I should hear from someone soon. I get a call from the technical account manager saying the tech should be here by 1pm EST. 1pm EST rolls by and no call, no one. I send more emails and they keep sending me ETAs they can't keep. It's now going on 3pm and still no one. I get a call from my reps and we get on a conference call. Since the error says it's an SFP, I know I can change that myself, and I ask if that's a good idea. They say if I feel comfortable, do so. Great, so much for gold support, right? I end up changing the DAMN thing, and immediately the errors in Navi are gone and everything is back to normal.

After all that, after all of that, I get a call from a tech saying he's at another site, can come over, and just got the call. Hmmmmm..... what happened to all those ETAs? I told him not to bother since I'd already changed the SFP myself. But what if that hadn't been the problem, and the port itself was bad? I wasn't going to try to change a storage processor myself. I mean, if push came to shove they could walk me through it, but WTF is up with my gold support?

Vmotion with RDM problem

So I've run into a bit of a snag with VMotion. To do VMotion, it's best to have a VMDK partition shared between both ESX hosts. In a perfect world this would be the case, but we're not in a perfect world, and some of us can't convert/migrate everything to a VMDK partition. What I mean by this, based on my situation: I can convert servers into VMDKs without much trouble, but what about the data the server serves up? Remember my 3.14TB migration that is now an 8TB partition and took 2 months? How long would VMware Converter take to convert that data and move it to a VMDK partition of equal size, if I even had one? And if I tried to convert the data partition from the ESX host by adding it as storage, it would format my 8TB, and that would be BAD.

So with all that in mind, there is another option, with a BUT, that allows me to add the existing partition as an RDM (raw device mapping). I make sure to share the LUN with both ESX hosts. I add it to my VM through the edit settings option and it shows up, no problem. Remember that BUT? Well, you can't VMotion the RDM b/c it thinks it's not shared with the other host, even though it is.
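
For what it's worth, the edit-settings step can also be scripted. This is only a rough sketch of what that API call looks like, assuming pyVmomi; the Virtual Center name, credentials, VM name, the vml device path of the LUN, the controller key, and the SCSI unit number are all placeholders you'd have to fill in, and the exact spec details may need tweaking.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "file-server-vm")  # placeholder VM name

    # Raw device mapping backing that points at the shared LUN (placeholder path).
    backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo()
    backing.deviceName = "/vmfs/devices/disks/vml.020000000060060160..."  # placeholder
    backing.compatibilityMode = "physicalMode"   # or "virtualMode"
    backing.diskMode = "independent_persistent"

    disk = vim.vm.device.VirtualDisk()
    disk.backing = backing
    disk.controllerKey = 1000   # assumed key of the first SCSI controller
    disk.unitNumber = 1         # placeholder: any free unit number except 7

    op = vim.vm.device.VirtualDeviceSpec()
    op.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    op.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    op.device = disk

    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[op])))
finally:
    Disconnect(si)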

So I have a call in with HP/VMware to help resolve this issue. I'll post back with my findings.

More VMware, Virtual Center Server and Vmotion

After the LUN migration fiasco I finally have the space to move forward with VMware and continue converting more servers. It's even more important now b/c we are supposed to be moving our data center, and the less I have to physically transport the better. It's also about high availability, flexibility, and efficiency. A lot of buzzwords, right!

I've created a 1TB LUN for all my VMDK files, mostly the guest OSes. This should be more than enough for what I need currently, and leaves plenty of room for swap space and backups. I've then shared this massive LUN to both ESX hosts inside Navisphere (select the LUN and add it to a storage group; in this case you add the LUN to both ESX hosts' storage groups). That's the basics.
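
Once the LUN is in both storage groups, each ESX host just needs a rescan to see it. Here's a quick sketch of doing that for both hosts at once (pyVmomi again; the Virtual Center name and credentials are placeholders).

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        ss = host.configManager.storageSystem
        ss.RescanAllHba()   # pick up the newly presented LUN on every HBA
        ss.RescanVmfs()     # pick up the VMFS datastore once it exists
        print("rescanned", host.name)
finally:
    Disconnect(si)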

I'm going to take a step back and discuss the overall picture. The gist of the matter is keeping my servers up all the time, or with as little downtime as possible. VMware already lets you reboot a server in about a minute and a half, which is already faster than a physical box rebooting. But in cases where you can't afford any dropped connections whatsoever, VMotion is the way to go. This is what's meant by high availability. To accomplish this you'll need two ESX servers and one Virtual Center server with the appropriate license to unlock VMotion.

So I went and installed Virtual Center Server on a Windows 2003 server and pointed it to my two ESX hosts. You'll need a central license server that manages all the licenses for the ESX hosts and the Virtual Center server. This is no biggie: go to your account page on the VMware site and convert your licenses to a centralized license file, or something along those lines. I pretty much edited one of my license files and appended the rest to the end of it. Then I installed the license manager on the Virtual Center server and updated the ESX hosts to look there, and I was good to go. Then you'll need to update your Infrastructure Client by pointing it at the Virtual Center server so that you can see both hosts. Create a new cluster and add your hosts. Any existing VMs on either host will show up under it.

The final step will be to create a VMotion network, which is actually called a VMkernel VMotion port. You must have a few NICs in both hosts that you can allocate to network redundancy and VMotion. You assign a NIC for the VMotion network and give it a private IP address, x.x.x.1, and on the other host you do the same thing with x.x.x.2. At this point you physically get a crossover cable and connect the NICs you just assigned. You'll have to go to the back of the hosts and start plugging until you find the right NICs, LOL. Or you can just put those NICs on their own VLAN and you should be set. I chose the first method this time around.
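
The same VMkernel/VMotion setup can be done through the API instead of the VI client. Below is a hedged sketch for one host, assuming pyVmomi; the host name, vSwitch name, port group name, and IP are placeholders, and you'd repeat it on the second host with the x.x.x.2 address.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esx01.example.com")  # placeholder
    netsys = host.configManager.networkSystem

    # Port group for VMotion on the vSwitch whose uplink is the dedicated NIC.
    pg = vim.host.PortGroup.Specification(
        name="VMotion", vlanId=0, vswitchName="vSwitch1",
        policy=vim.host.NetworkPolicy())
    netsys.AddPortGroup(portgrp=pg)

    # VMkernel NIC with the private VMotion address (the x.x.x.1 side).
    ip = vim.host.IpConfig(dhcp=False, ipAddress="10.10.10.1",
                           subnetMask="255.255.255.0")
    vnic = netsys.AddVirtualNic(
        portgroup="VMotion", nic=vim.host.VirtualNic.Specification(ip=ip))

    # Mark that vmknic as the one VMotion should use on this host.
    host.configManager.vmotionSystem.SelectVnic(device=vnic)
finally:
    Disconnect(si)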

Sounds confusing but it really isn't. Once you get started the info just flows to your brain from out of nowhere....LOL!

Once all that was out of the way, I created a test VM on the shared LUN I created above. I put it on the network and opened a console or VNC session to it. I then started a continuous ping from inside the VM to our DNS server (since that's always on) and proceeded to do a manual VMotion. Drag and drop the VM from its ESX host to the other ESX host, answer a few questions in the pop-up window, and it's a done deal. You can see exactly when the VM moves by watching the ping hiccup, but you don't lose a single connection.
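
And if you'd rather script the move than drag and drop, the same thing can be kicked off with one API call. A short sketch (pyVmomi; the VM name and destination host name are placeholders):

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine, vim.HostSystem], True)
    objs = {o.name: o for o in view.view}
    vm, target = objs["test-vm"], objs["esx02.example.com"]  # placeholders

    # Live-migrate the running VM to the other host; its disks stay on the shared LUN.
    task = vm.MigrateVM_Task(host=target,
                             priority=vim.VirtualMachine.MovePriority.highPriority)
    WaitForTask(task)
    print("VM is now running on", vm.runtime.host.name)
finally:
    Disconnect(si)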

Amazing ain't it!

Happy New Year......yeah I'm late but here is an update

The last time I posted, I had just started my LUN migration to our new space. I thought it would take about a week, two tops, to migrate 3.14TB to a standard LUN twice the size. Well, it took 2 MONTHS. I started mid-Nov and it ended mid-Jan.

Those who have experienced this before will say it depends on what my priority rate was, and would probably bet my priority was on the lowest setting. NOPE! I had it on ASAP for a bit and it freaked the user base out; they all complained about slow connectivity to the file server. So I dropped it down to medium, and that's where it sat for the better part of 2 months. I went on vacation and everything (had a blast, by the way, at www.rayonhotels.com).

So after that was done, I expanded the partition in Windows 2003 by using diskpart and voilà, instant space.
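
For reference, the diskpart part is just a handful of commands typed at a prompt. Here's a small sketch that wraps those same commands in a script file and runs it with "diskpart /s"; the volume number is a placeholder you'd confirm from "list volume" first.

import os
import subprocess
import tempfile

# Placeholder volume number -- run diskpart's "list volume" to find yours.
script = "\n".join([
    "rescan",            # make Windows notice the LUN's new size
    "select volume 3",   # the data volume sitting on the expanded LUN
    "extend",            # grow the volume into the new free space
]) + "\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(script)
    path = f.name
try:
    subprocess.run(["diskpart", "/s", path], check=True)
finally:
    os.remove(path)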