Monday, October 26, 2009

It's been a while...

I'm still here. I just got back from London. Some restructuring is going on in that office and a few folks migrated over to a new company. Most of the IT guys went, so I had to go over there and have a look. It was a nice trip, my first time in that office and in London. The goal is to bring that office more in line with the New York HQ on all levels.

What else have I been up to since April? Well, LOTS! So much that I can't even remember it all. Our Hong Kong office moved to a bigger space in the same building and we had to move the MPLS line that I talked about in April. The line was finally brought up last week. It takes months to move international circuits; if you ever have to do anything like that, make sure you plan around it. Don't expect the line to be moved overnight from the time you request it.

I've also had some SAN issues since my April post. One drive showed up as dead, which was fine, normal. They shipped us a new drive, no problem. The drive comes, I swap it, and the SAN still thinks no drive is in the bay. WHAT! I call Dell/EMC support, they send a new drive, same deal. WOW! WTF! Then they want me to pull SP collects. Done; they found nothing. Next they WebEx in to run a tool and dig deeper, and problems are found: another disk is bad but not showing up in Navi, and I had to replace that disk as well. That's two bad disks at the same time. I forget exactly what we did, but the process was long and took days b/c we had to put a new disk into the slot where the ghosted bad disk was (the disk that was bad on the back end) and wait for it to transition. That's a day gone. The next day, after the transition, we had to go back to the original bad disk, swap that out, and wait for another transition, another day down. The third day of swapping was really the fourth day overall if you count the day the original call was placed. By Thursday the transitioning was done and all was well again. Four days to address drive problems. But I have to tell you this: NO user complained and there was NO downtime. Gotta love EMC!

Then there's internet slowness. Our users are consuming bandwidth at an alarming rate. Even with Websense to block sites and protocols, there still isn't enough. Then again, we only have 3mb to the internet at the NY HQ. But I did get a 12mb DS3 installed at the 42nd St office we are supposed to be moving to but haven't yet. So in the meantime I will be routing internet traffic for the users over that 12mb line and keeping my servers and public IPs here at the main office. This is b/c all of our public IPs are tied to our circuits here at the HQ and we can't afford the downtime to move the block of IPs to the other office. Even though Verizon says it takes 5 minutes. Yeah, right! Something will break and everyone will be on my a$$ b/c there is downtime and something isn't working, like email.

We've also installed and rolled out Microsoft's Office Communications Server (OCS). Now all our users have IM. Microsoft has a really nice product here. Their plan is to turn OCS and Exchange into a VoIP solution. I saw a demo and it looked promising. So of course I came back to the office, upgraded to R2, and attempted to connect the OCS server to our Cisco Call Manager. I was fine right up until I was supposed to activate the trunk line, and I decided not to. The last thing I want is to bring down our phone system. So I put that project on hold until someone with more experience is available. I've contacted some of our vendors and they don't even know how to do it.
There's more stuff, but I'll leave it at this for now.

Tuesday, April 14, 2009

Setting up our Hong Kong office network from NY

I'm in the process of setting up our Hong Kong (HK) office. We installed an MPLS line there over 6 months ago. Our plan was to send someone over from NY to install everything and set them up. Then the economy tanked and personnel got laid off. As a result the HK office has been using our Juniper SSL VPN exclusively. They are able to grab files, access email and the intranet, and work like any remote office. But the MPLS line is just sitting there costing us money every month.

I've finally been able to think about them again and wanted to get the ball rolling on getting them set up to work exactly like our Shanghai office, meaning connected to the MPLS. My counterpart in London and another admin didn't like the idea I presented of setting them up remotely. They'd rather send someone over and train them how to do things differently. My take is that we've been waiting on someone to go over for more than 6 months, and we're paying for a line we're not using. In that time we could have had them set up remotely, using local resources when needed. I met with my boss and fleshed out a plan to get our local consultants involved. It's not a difficult thing to do at all. We also already had a server ordered and delivered that's been sitting there, unused, for over 6 months as well (reasons explained later).

I drafted my plan, sent it to my boss, and we got on a conference call with the HK folks one night EST. We all agreed to attempt to make this happen. The HK folks wanted a direct connection to our NY office and I wanted to give it to them. I sent my plan over to the local consultants. The plan was to have them install VMware on the server, set it up according to the instructions I sent over (IP addresses etc.), and connect the ESX server to a switch that connects to the MPLS router, and I'd take it from there. At first they didn't know how to install VMware. So I sent them the youtube video VMWare ESX 3 Training CBT - Installing ESX and a link to the site How to Install VMware ESX Server. I also stated in an email to them that if they know how to install any *NIX distro they should be able to install ESX with no problem. Once they got that email with the links they said yeah, they knew how to do it.

So now the server is installed and connected to the MPLS. One morning I got an email from the office manager saying it was done, followed by an email from the person in charge there (above the office manager), and you would not believe this. The person in charge said the server was too loud and decided to turn it off. Granted, the HK office is one room in a business center and the server is in the corner. So I asked if they could leave the server on when they leave and turn it off in the morning. Their night is our day and vice versa. They did that for me the next day, and I was able to access the server via the Infrastructure Client and SSL. I sent over our Windows image, which took about 30 mins. Once that was done I installed the first VM, a Windows DC for that office. The second VM installed was a file server. There were some pain points with the installation. I remote desktop into our Virtual Center server, which has all my VMware tools, then VNC into the VMs once they're up, and that was a bit slow since I didn't have VMware Tools installed yet. Once the Tools were installed I was good to go. That took pretty much my entire day.

I come in the next day and the server is off. Yes, off. They couldn't even turn it back on when they left the office. Either they forgot or don't care. So that's where that project is right now. They can't stand the noise, and even if I did put the finishing touches on it they would keep turning it off anyway. Now imagine if we had sent someone over like everyone wanted to do. That would have been thousands of dollars in travel expenses, and once they left the machine would be turned off O_o!

Monday, February 09, 2009

Vmotion with RDM problem solved!

I've finally figured it out, and in doing so I've discovered some other potential problems. First, the solution to not being able to move a VM from ESX host to ESX host even though the LUNs are shared to both hosts within Navisphere: it turns out that I needed to have both of my hosts in the same storage group on the Navi side. I had them in their own groups. So I moved all the VMs over to one host, made sure all the LUNs were visible to the second host, and added the ESX host with no VMs to the other host's storage group. Then I renamed the storage group in Navi to something that represents what's going on.
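By the way, if you'd rather script that storage group change than click around the Navi GUI, naviseccli can do it. A rough sketch, where the SP address, host name and group name are all made-up examples:

  naviseccli -h 10.0.0.50 storagegroup -connecthost -host esx02 -gname ESX_Cluster -o
  naviseccli -h 10.0.0.50 storagegroup -list -gname ESX_Cluster

The first line adds the second ESX host to the existing group; the second just lists the group so you can confirm both hosts and all the HLU/ALU pairs are in there.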

The reason for this is b/c although VMotion was working with all the other LUNs, RDMs are a bit different. With a regular LUN both ESX servers see the same LUN ID, but that wasn't the case with the RDMs: each host was handing the RDM whatever the next available LUN ID was within ESX. For VMotion to work everything has to match up on both ESX hosts, and in my case the RDM LUN IDs on the two hosts were never the same. Now, with the change on the Navi side, all my LUNs have the same ID within ESX.
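If you want to double-check that the IDs really line up, you can compare them from each ESX host's service console and from the Navi side. A quick sketch (the SP address and group name are placeholders):

  esxcfg-mpath -l     # paths show up as vmhbaC:T:L - the last number is the LUN ID
  naviseccli -h 10.0.0.50 storagegroup -list -gname ESX_Cluster     # shows the HLU/ALU pairs presented to the hosts

Run the first command on both hosts; a given LUN should report the same LUN ID on each one.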

Now I can move all my VMs back and forth instead of just the ones that were VMDKs.

The other potential problem I found was that four of my initiator names were pointing to the wrong server host name within Navi under connectivity status. All of my servers should have four initiators, and one had eight. The initiators are the WWN names that come from the HBAs in the servers. One of my ESX hosts wasn't showing up properly. There was no real problem yet, but I caught it before it became one.

Wednesday, February 04, 2009

My first day back from vacation and problems...

After my vacation was over I was well rested and stress free. I get to work Monday morning, Jan 5th, and that's when it starts. My Exchange cluster starts to freak out for a bit. I investigate what's going on and PowerPath is telling me that a connection to the SAN is lost. WTF!!!! Now I'm frantic b/c I just got back, and I'm grilling everyone to see if anything went on while I was away (I stayed here BTW www.rayonhotels.com). Everything was quiet, I was told, and I already knew that b/c I was checking in via SSL VPN. I went into Navi and saw errors, and the email home was going off. I checked the error and it was an SFP error. Now this could be real simple or real bad. Then guess what, the errors go away, and I check the SFPs and they are fine, nice and snug in there.

I come in Tuesday morning and it's happening all over again. This time the errors are more frequent, dropping Outlook connections to the server all day long. I put a call into Dell right away and to my surprise they didn't have my service tag updated on file, so they couldn't find my SAN in their system. WHAT, are you F***ing serious? WTF is going on? So I call my account rep and ask him what the deal is. He said he'd see what's going on. At this point I'm rebooting my Exchange cluster to reestablish connections and trying to make sense of this thing.

After 2 hours they finally got their act together and updated their system. Sign of the times over at Dell, huh? Anyhow, they are dispatching parts that should come by 11am, and a tech should call me shortly after. I get the parts at 11am but no call from the tech. There are two boxes. Now, from my knowledge an SFP is the size of a thumb. I did get a small box, and I also got a much larger box. Now I'm thinking WTF and open it: it's a storage processor. Why the ef would I need this? And this is what I mean, this could be real simple or real bad. I send emails to the people working on this case. I get replies saying I should hear from someone soon. I get a call from the technical account manager saying the tech should be here by 1pm EST. 1pm EST rolls by and no call, no one. I send more emails and they keep sending me ETAs they can't keep. It's now going on 3pm and still no one. I get a call from my reps and we are on a conference call. Since the error says it's an SFP, I know I can change that myself, and I ask if this is a good idea. They say if I feel comfortable, go ahead. Great, so much for gold support, right? I end up changing the DAMN thing and immediately the errors in Navi are gone and everything is back to normal.

After all that, after all of that, I get a call from a tech saying he's at another site, can come over, and that he just got the call. Hmmmmm..... what happened to all those ETAs? I told him not to bother since I'd changed the SFP myself. Now what if that wasn't the problem, and the port itself was bad? I wasn't going to try to change a storage processor myself. I mean, if push came to shove they could walk me through it, but WTF is up with my gold support?

Vmotion with RDM problem

So I've run into a bit of a snag with VMotion. To do VMotion it's best to have a VMDK partition shared amongst both ESX hosts. In a perfect world this would be the case, but we're not in a perfect world and some of us can't convert/migrate everything to a VMDK partition. What I mean by this, and this is based on my situation: I can convert servers into VMDK partitions without much problem, but what about the data the server serves up? Remember my 3.14TB migration that is now an 8TB partition and took 2 months? How long would VMware Converter take to convert that data and move it to a VMDK partition of equal size, if I even had one? And if I tried to convert the data partition from the ESX host by adding storage, it would format my 8TB, and that would be BAD.

So with all that in mind, there is another option, with a BUT, that allows me to add the existing partition as an RDM (raw device mapping). I make sure to share the LUN with both ESX hosts. I add it to my VM through the edit settings option and it shows up no problem. Remember that BUT? Well, you can't VMotion the RDM b/c it thinks it's not shared with the other host, even though it is.
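Side note: the same mapping can also be created from the service console with vmkfstools instead of the edit settings dialog. A rough sketch, where the device path, datastore and file names are made-up examples:

  # -r makes a virtual compatibility RDM pointing at the raw LUN (use -z for physical compatibility)
  vmkfstools -r /vmfs/devices/disks/vmhba1:0:25:0 /vmfs/volumes/vmfs01/fileserver/data_rdm.vmdk

Either way you end up with a small pointer .vmdk on the VMFS volume while the data stays on the raw LUN.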

So I have a call in with HP/Vmware to help resolve this issue. I'll post back with my findings.

More VMware, Virtual Center Server and Vmotion

After the LUN migration fiasco I finally have the space to move forward with VMware and continue converting more servers. It's even more important now b/c we are supposed to be moving our data center, and the less I have to transport the better. Also for high availability, flexibility and efficiency. A lot of buzzwords, right?

I've created a 1TB LUN for all my VMDK files, mostly the guest OS's. This should be more than enough for what I need currently, with plenty of room for swap space and backups. I then shared this massive LUN to both ESX hosts inside of Navisphere (select the LUN and add it to a storage group - in this case you add the LUN to both ESX storage groups). That's the basics.
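For the record, the storage group part can also be scripted with naviseccli instead of the GUI. A sketch, with the SP address, group names and LUN numbers being placeholders:

  # present LUN 25 (ALU) to each ESX host's storage group as host LUN 10 (HLU)
  naviseccli -h 10.0.0.50 storagegroup -addhlu -gname ESX01_Group -hlu 10 -alu 25
  naviseccli -h 10.0.0.50 storagegroup -addhlu -gname ESX02_Group -hlu 10 -alu 25

Giving it the same HLU in both groups keeps the LUN ID consistent between the hosts, which matters later for VMotion.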

I'm going to take a step back and discuss the overall picture. The gist of the matter is to be able to have my servers up all the time, or with as little downtime as possible. VMware already allows you to reboot servers in about a minute and a half, so that's already faster than a physical box rebooting. But in cases where you can't afford to have any dropped connections whatsoever, VMotion is the way to go. This is what's meant by high availability. To accomplish this you'll need two ESX servers and one Virtual Center server with the appropriate license to unlock VMotion.

So I went and installed Virtual Center server on a Windows 2003 server and pointed it to my two ESX hosts. You'll need a central license server that manages all the licenses for the ESX hosts and the Virtual Center server. This is no biggie. Go to your account page on the VMware site and convert your licenses to a centralized file, or something like that. I pretty much edited one of my license files and added the rest to the last line. Then I installed the license manager on the Virtual Center server and updated the ESX hosts to look there. I was good to go. Then you'll need to update your Infrastructure Client by pointing it to the Virtual Center server so that you can see both hosts. Create a new cluster and add your hosts. All your existing VMs will show up there if you have any.

The final step will be to create a VMotion network, which is actually a VMkernel port for VMotion. You must have a few NICs in both hosts that you can allocate to network redundancy and VMotion. You assign a NIC to the VMotion network and give it a private IP address, x.x.x.1, then on the other host do the same thing with x.x.x.2. At this point you can physically get a crossover cable and connect those NICs you just assigned. You'll have to go to the back of the hosts and start plugging until you find the right NICs, LOL. Or you can just VLAN those NICs and you should be set. I chose the first method this time around.
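For anyone who prefers the service console over the GUI, here's roughly what one host's side looks like on ESX 3.x. The vSwitch name, vmnic number and IP are just examples; the second host gets the same treatment with its own IP:

  esxcfg-vswitch -a vSwitch1                                # new vSwitch for VMotion
  esxcfg-vswitch -L vmnic2 vSwitch1                         # link the spare physical NIC to it
  esxcfg-vswitch -A VMotion vSwitch1                        # add a port group called VMotion
  esxcfg-vmknic -a -i 10.10.10.1 -n 255.255.255.0 VMotion   # VMkernel interface on the private network

Then flip on VMotion for that VMkernel port in the VI client and you're done.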

Sounds confusing but it really isn't. Once you get started the info just flows to your brain from out of nowhere....LOL!

Once all that was out of the way, I created a test VM on the shared LUN I created above. I put it on the network and opened a console (or VNC) to it. I then started a continuous ping from inside the VM to our DNS server (since that's always on) and proceeded to do a manual VMotion: drag and drop the VM from its ESX host to the other ESX host, answer a few questions in the pop-up window, and it's a done deal. You can see exactly when the VM moves by watching the ping hiccup, but you don't lose a single connection.
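Nothing fancy about the test itself; something like this, where the DNS server and the other host's VMotion IP are placeholders:

  ping -t 10.1.1.10     # inside the Windows test VM - continuous ping to something that's always up
  vmkping 10.10.10.2    # from each ESX host's service console - proves the VMkernel VMotion interfaces see each other

If vmkping works both ways and the VM only hiccups on a ping or two during the move, you're in good shape.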

Amazing ain't it!

Happy New Year......yeah I'm late but here is an update

The last time I posted I had started my LUN migration to our new space. I thought it would take about a week, two tops, to migrate 3.14TB to a standard LUN twice the size. Well, it took 2 MONTHS. I started mid-Nov and it ended mid-Jan.

Those who have experienced this before will say it depends on what my migration priority was, and would probably bet it was on the lowest setting. NOPE! I had it on ASAP for a bit and it freaked the user base out. They all complained about slow connectivity to the file server. So I dropped it down to medium and that's where it sat for the better part of 2 months. I went on vacation and everything (had a blast by the way at www.rayonhotels.com).

So after that was done I expanded the partition in Windows 2003 by using diskpart and voilà, instant space.
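For anyone who hasn't used it, the diskpart side is only a few commands once the expanded LUN is visible to Windows (the volume number is just an example - pick yours from the list; on 2003 this works on basic data volumes, not the system volume):

  diskpart
  DISKPART> rescan            # pick up the LUN's new size
  DISKPART> list volume       # find the data volume that needs the space
  DISKPART> select volume 3
  DISKPART> extend            # grow the NTFS volume into the new free space

No reboot needed for a basic data volume.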