Monday, October 30, 2006

More on storage

Like I said earlier, storage is the single most challenging area for me. We are ordering another DAE (disk array enclosure) for our CLARiiON CX300 SAN. This is the last one we can add before we have to upgrade to the CX500. This new DAE will add about 3.8TB of usable storage. We are also getting another Apple Xserve RAID, a 7TB raw unit. Our archive has used up 5TB already, so to keep up with the growth we need more archive space.

Thursday, October 26, 2006

Centralizing storage has allowed us to decommission 5 servers

Now that we've had our SAN in place for over a year, we are finally able to get rid of a few servers. Consolidating mostly file servers is the biggest benefit: no more hanging on to these old servers, no more paying for Care Packs, no more worrying about hardware failures due to age, no more buying racks and UPSes like mad for space. I think I've even seen the UPS meter lights drop a few bubbles. LOL.

Wednesday, October 25, 2006

Event ID: 55 continued Not what I expected

The server froze up again yesterday. That makes 3 times in the past week this has happened. I never ran the chkdsk /f last night. Instead one of my admins called up HP to see if they had anything to say on the matter. It came down to our array controller card's firmware being outdated. So outdated that the HP array configuration utility wouldn't open up. And on top of that, so outdated or buggy that the array didn't show that we had 2 failed drives. Yes, these failed drives were causing the Event ID 55 errors in the OS. The firmware was updated and newer diagnostic utilities were loaded on the server. Running these new tools we can see the failed drives. HP is sending us 2 replacements this morning.

I must say that of all the servers I've administered, old and new, I have never seen this problem before: drives failing and not blinking red inside the array. This server is not that old either, ~3 years, and still pretty beefy. This is why Care Packs are so important. So for all you cheap companies out there that like to cut corners: budget for server warranties and service contracts, and stop blaming your admins for not being able to fix things fast enough.

Tuesday, October 24, 2006

Event ID: 55

So I'm getting this error in my system event log. No big deal, except this is on a 1.7TB volume that is in production. I'm not sure how long it will take to run chkdsk /f on this volume. The utility forces the volume to dismount, kicking everyone off, and around here people don't like it when I have to take the server offline, even for after-hours maintenance. I'll post an update on how long it took once I run it.

Event Type: Error
Event Source: Ntfs
Event Category: Disk
Event ID: 55
Date: 10/24/2006
Time: 9:31:15 AM
User: N/A
Computer: **********
Description:
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume New Volume.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 00 00 04 00 02 00 52 00 ......R.
0008: 02 00 00 00 37 00 04 c0 ....7..À
0010: 00 00 00 00 02 01 00 c0 .......À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: c2 00 22 00 Â.".
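For what it's worth, the Data section of an NTFS Event 55 usually carries the underlying NTSTATUS codes, stored little-endian. Decoding the four bytes at offset 0x14 in the dump above (02 01 00 c0) is a one-liner; reading the codes this way is my own interpretation of the layout, not something documented in the event itself:

```python
import struct

# The four bytes at offset 0x14 of the event's Data section, copied from
# the hex dump above. Assumption: NT status codes are stored little-endian.
raw = bytes.fromhex("020100c0")

(ntstatus,) = struct.unpack("<I", raw)
print(f"0x{ntstatus:08X}")  # prints 0xC0000102
```

0xC0000102 is STATUS_FILE_CORRUPT_ERROR in ntstatus.h, which lines up with the "corrupt and unusable" description in the event text.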

Thursday, October 19, 2006

phone calls

I get all kinds of phone calls. These days I don't even answer my phone anymore. My wife says to me, "I tried calling you but you must have been busy." I said, "Yeah, I was busy screening my calls." LOL! Anyhow, I'll blow off a few people for a long time until one day I just have to deal with them. One guy called me at the beginning of the year; I told him to call me back b/c I didn't feel like getting into whatever it was he was trying to sell me. He calls back a few weeks later asking if I remember him. I say no (of course I did), so he goes on to tell me who he is, what he has, and how I can benefit from it. I say I'm not interested right now and I have to go. He asks if he can try back at a later time and gives a specific date. Out of anger and frustration, b/c I'm busy, I say yeah, try back then, just to get him off my phone. Sure enough, this same damn guy calls back on that date, LOL. I tell him to call back again, and I tell him I'm going on leave for a month so I won't be around. Anyway, when I got back from my leave (newborn baby girl :D), I must have had a change of heart, because when this guy called back I talked to him for about 40 minutes and explained how we do things here and why his product is just not right for us. I even voluntarily gave him more information about what I am looking for in regards to the issues I am currently facing. He then recommended some expensive products to me. We ended the call and I was sure that was the end of him. A few weeks later this same guy calls one of my Admins singing the same damn song. O_o!

Old Story***Exchange 5.5 migration gone wrong

A couple of years ago, when we were in the midst of migrating from Exchange 5.5 to Exchange 2003, we had planned the process for weeks. A consultant that I've used for a long time and I set a date for a Saturday morning. Early morning, to get a head start. We had both servers ready, new and old (I mean real old).

We started the day by installing Windows Server 2003 on the new box as well as Exchange 2003. We added the server to the Organization and both servers were talking. We ran tests and checks the whole way and everything was looking fine. We went back to the 5.5 server to load up the tools to do the mailbox moves. With NT 4.0 and Exchange 5.5, everything needed a reboot. We proceeded to reboot the server. It shuts down and comes up in a BSOD. Blue Screen of Death.....WTF. We do a hard restart and cross our fingers. Same thing when it restarted: BSOD. We put a call into Microsoft, and after an hour on the phone it boils down to reinstalling the OS. So we do that, and Exchange is pretty much F@&%ed.

We were able to see the files that represent both stores. We copied the stores over to another server as a near-line backup just in case, but at that time both stores were about 130GB in total, which took a few hours to copy. Once that was done, we reinstalled Exchange 5.5 and its service packs. Then we had to copy the data back to the Exchange 5.5 server, so that took another few hours. It's going from evening to night now. The files are done copying, but we have to run isinteg and other tools to defrag the database stores. We locate another area on the network to defrag the stores to. This process takes a few more hours. Once the defrag is done, we have to remove the old database stores and copy the newly defragged ones back to the server. This was going to take a few more hours to copy.
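Those "few hours" estimates are about what back-of-the-envelope arithmetic predicts. As a rough sketch (the 100 Mbit/s LAN speed and the usual eseutil offline-defrag rule of thumb of ~110% of the store size in free temp space are my assumptions, not figures from the post):

```python
# Rough arithmetic for the copy and defrag steps above.
# Assumptions (mine, not from the post): a 100 Mbit/s LAN, and the
# eseutil /d rule of thumb of ~110% of the store size in temp space.
store_gb = 130                      # combined size of both stores
lan_bits_per_sec = 100e6            # 100 Mbit/s Fast Ethernet

copy_secs = store_gb * 1e9 * 8 / lan_bits_per_sec
print(f"one copy pass: ~{copy_secs / 3600:.1f} hours")        # ~2.9 hours

temp_space_gb = store_gb * 1.10
print(f"defrag temp space needed: ~{temp_space_gb:.0f} GB")   # ~143 GB
```

Which is why finding "another area on the network" with that much free space was a step in itself, and why every copy pass ate a few hours.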

During these night-time copies we decide to attempt to sleep. There is no place comfy to sleep in the office, so I grab some old boxes and lay them out on the floor. One of my coworkers (female) had a sweater on her chair that she leaves at work for when the office gets cold; I used that to keep warm that night. Shhh, don't tell her. So I get a nap on the floor and Sunday morning rolls around.

I have now been here a full 24 hours. The files are done copying. We attempt to start the Exchange 5.5 services and they come up. So we are now back to square one, right where we were 24 hours before. We install the tools again and cross our fingers that the server doesn't BSOD again. The server reboots and all is fine. Now we are using the mailbox move tool to move hundreds of mailboxes over to the new server. This too takes time, so we wait and watch them move over one by one (in groups of 10). It's midday on Sunday and we are about halfway done with the mailboxes, and we still have 80GB worth of public folders to do after this. We ended up finishing all the mailboxes sometime in mid afternoon. We run tests to see if emails are working on the new server and they are.

Half the battle done; now for the public folders. Getting these over is different from the mailbox move wizard. I can't remember what processes were available at the time, but I ended up exporting my entire public folder tree to a file and importing it into my new Exchange 2003 server. This caused problems b/c not all of the data was exported/imported. My public folders run deep and have many sub folders. Some of the sub folders either didn't show up on the new server or their contents were missing. So I had to individually export and import specific folders. All of this exporting and importing took the rest of Sunday night into Monday morning. I was done for the most part by 9am Monday.

I still didn't go home. As users began to log in, I stuck around to make sure everyone was OK with Outlook and could access everything on the new Exchange 2003 server. I got a few calls about data in the public folders not being there, but that was an easy fix: export, import. I ended up going home about 4pm Monday afternoon and still went in on Tuesday at the regular time. The missing-data calls kept coming in for months after the migration was complete, but I kept the old Exchange 5.5 server powered on and unplugged from the network just in case. To this day I still have that old server sitting there, and I am now ready to throw it out for good. If calls about missing data come up, we tell them that data has been removed for good.

I was at work for about 52 hours straight that time. No shower, no shave, no teeth brushing (ginger ale :D), no deodorant, no change of clothes, and I hardly ate. I was a mess, but I got the job done and that's all that mattered to me. Yes, I have been through the trenches, and that wasn't the last time I laid my head on the office floor either, but I'll leave that story for another time.

Monday, October 16, 2006

8 hour tech support call

About a week and a half ago, on a Thursday, everything was well for the most part. One of my admins said he was getting errors in Exchange System Manager when trying to click on objects. The first thing I thought was hmm... how long has that server been up and running? You know, the usual thought when something goes wrong in a Windows environment. So I checked my Outlook to see if emails were flowing, and it was fine. It was already after 9am, so rebooting the server to clear up the errors wasn't an option at that point. I made a note to myself to reboot the server in the morning.

A few hours passed by and I got a call from our overseas office asking if we were having issues with email. I logged into the Exchange server and it seemed fine. No errors on the screen, Task Manager clear, no memory leaks, nothing frozen. As soon as I got off the phone I got calls from users saying they were getting email bouncebacks. Then I knew something was wrong. I ran to the server to see what was going on, and again: NOTHING! All looked fine from the typical admin perspective. So, to not waste time, I rebooted the server. Once I did that, NONE of the Exchange services started. I tried to start them manually, no joy. So I knew something was really wrong. I had done nothing in Exchange for weeks; how could something just happen all of a sudden? I called the overseas office back to tell them yes, we are having a problem, and asked how they had noticed it. They said they had a consultant in installing Intellysync software on their end, and that they did nothing to anything else.

So I called Microsoft. Hours rolled by after explaining the problem. These were what seemed to be textbook guys, and they went by the book. They wanted to stray from the matter at hand and resolve little errors in AD first. We did have a DC error, but it was a rogue DC acting up, NOT affecting AD itself. They spent a few hours on that alone, and I basically had to tell them we are going to demote this DC, get it offline, and continue with the matter at hand. At that point it had only been 6 hours that I was on the phone.

To make a long story short: the permissions for the Exchange site name, and anywhere that inherited those permissions, had been stripped somehow. We had to add them back one by one, and even when we finished doing this the services still would not start. The phone call was already at 7.5 hours, and at this point the MS tech decided to say, let's reinstall Exchange. I'm like, WHAT! He says it again, but we will select Reinstall from the drop-down box. Basically we reinstalled Exchange right on top of the current install, wiping out the service pack as well. After the install we went ahead and reinstalled the service pack too. Then we attempted to restart the services and they came up. I rebooted the server to make sure the services started on their own, and they did.

After 8 hours the problem was finally resolved. No one still knows why these permissions were stripped out but I have my suspicions.

Tuesday, October 03, 2006

1.7 TB took 4 days to copy

So it took 4 days to copy 1.7TB from my file server to my EMC SAN via HBAs at a 2Gb data rate. 4 days, DAYMN!!!!! It would have been less if the script didn't get stuck on files that didn't have Admin rights. I would have posted the copy stats, but the script is set up to sort of loop itself and start over if anything changes. No, I didn't set it to write to a log. I'm guessing the log file would have been in the hundred-MB range, and I can't chance a log file crashing my server.
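For perspective, a rough sketch of the arithmetic (the 500KB average file size and ~120 bytes per log line are my assumptions, not measured figures): a 2Gbit/s link could move 1.7TB in under two hours as one giant sequential stream, so nearly all of those 4 days went to per-file overhead, permission errors, and the script restarting itself. And the feared log file really would land in the hundreds of MB:

```python
# Back-of-the-envelope numbers for the 4-day copy above.
# Assumptions (mine, not from the post): decimal TB, an average file size
# of 500 KB, and ~120 bytes per log line if the script logged every file.
total_bytes = 1.7e12                 # 1.7 TB
link_bits_per_sec = 2e9              # 2 Gbit/s Fibre Channel HBA

ideal_secs = total_bytes * 8 / link_bits_per_sec
print(f"ideal sequential copy: ~{ideal_secs / 3600:.1f} hours")   # ~1.9 hours

file_count = total_bytes / 500e3     # assumed 500 KB average file
log_mb = file_count * 120 / 1e6      # assumed ~120 bytes per line
print(f"estimated log: ~{log_mb:.0f} MB for ~{file_count / 1e6:.1f}M files")
```

For what it's worth, a tool like robocopy run in backup mode (/B or /ZB) would have stepped over most of the permission snags, and its /LOG switch writes progress to a file without the loop-and-restart behavior.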