I upgraded my test lab from SP1 to R2-RTM this weekend.
My current test lab consists of the following servers:
OMRMS – Server 2003 - RMS role
OMMS3 – Server 2008 - MS role, Web Console
OMMS – Server 2003 - MS role, ACS collector
OMDB – Server 2003/SQL 2005 - OperationsManager Database
OMDW – Server 2003/SQL 2005 - OperationsManagerDW database, Reporting, SRS, ACSDB roles
There are 18 agents reporting to this management group.
So – I start – with a little light reading.
I begin with the release notes. These are available from the R2 CD, and on the web at Operations Manager 2007 R2 Release Notes. I don't see anything in there that is terribly applicable to me… but these are good to commit to short term memory, in case we hit a snag during or after the upgrade.
Next – I move on to the Upgrade Guide. This is available in the TechNet Library, at Operations Manager 2007 Upgrade Guide. I need to spend a little time on this one, mapping out the pre-upgrade steps and then planning the order of my upgrade based on how my management group is deployed.
So – I start by running down the pre-upgrade checklist at: Preparing to Upgrade Operations Manager 2007
I record my service accounts, make sure my DBs have plenty of free space, and confirm my t-logs are sized appropriately. I also make sure the volume with TempDB has plenty of free disk space in case TempDB needs to auto-grow.
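For the space checks, a few standard SQL commands cover it. This is just a sketch of what I run – the database name below assumes the default of OperationsManager, and all three commands ship with SQL 2005:

-- Database size and free space inside the OperationsManager database:
USE OperationsManager;
EXEC sp_spaceused;
-- Log space used/free for every database on the instance:
DBCC SQLPERF(LOGSPACE);
-- Free megabytes per drive letter (covers the TempDB auto-grow concern):
EXEC master..xp_fixeddrives;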
Next – I map out my plan – and order of operations, for my management group, and share the plan with my team:
- Get the most recent backup of the database and encryption key, and export unsealed MPs for safekeeping.
- Go to pending actions – and reject/remove anything in there.
- Verify free space on SQL database and validate log size is appropriate.
- I need to uninstall the agent from OMTERM – my terminal server, which has only a console and an agent. I decide to go ahead and uninstall the agent, the console, and the SP1 authoring console as well, since I will be replacing the latter with the R2 authoring console. I will reinstall the agent and consoles when the upgrade is complete for the management group.
- I need to disable all my notification subscriptions, and disable my product connectors. I am running a custom internal product connector – which runs as a service and updates alert properties – so I will stop and disable that service for the duration of the upgrade.
- I see a section on Improving Upgrade Performance so I will add that step here – right before I upgrade the first component.
- I am now ready to establish the upgrade order for my management group – this is available at: Planning your Operations Manager 2007 Upgrade
- RMS (OMRMS)
- Reporting Server (OMDW)
- Stand Alone Consoles (None – I uninstalled this already in my case)
- Management Servers (OMMS3, OMMS)
- Gateway Servers (None)
- Agents
- Web Console (on OMMS3)
- Post-Upgrade validation steps
Ok – that's my plan. Time to get rolling.
The SP1 to R2 steps are outlined here: Upgrading from Operations Manager 2007 SP1 to R2
I know from experience with customers – the success of your upgrade HINGES on how well you read AND follow the upgrade steps – VERBATIM. The majority of issues we see (especially on clustered RMS) are when a customer does not follow the steps exactly as written, in the correct order.
I complete steps 1-7 in the plan above, and then start the RMS upgrade at step 8. I run “SetupOM.exe” and kick off the pre-req checker before starting the install, where I hit my first snag. I need to install WS-Management v1.1, because I do plan on monitoring Unix/Linux machines in the future with this management group. (This was documented in the release notes and in the upgrade guide, so I was expecting this… I should have added it to my plan.) So I install WS-Man from the link provided in the pre-req check, which just takes a few minutes. Now the pre-req checker comes back clean.
The install instructions provided on TechNet are very straightforward. The install took about 20 minutes for my small environment, waiting the longest on the “Loading Management Packs” screen. It finally ended with an error.
The guide has a note on this – that you might get a warning about a service failing to start, and to hit OK. However, this is a different error – this is a service failing to stop… I click OK, and a few minutes later setup completes. I uncheck the boxes to start the console and to back up the encryption key.
I then ran the RMS upgrade validation steps – checking the registry and the services. Registry setup version shows me all is good.
***Note: We have changed the service display names for R2. “OpsMgr Health Service” is now “System Center Management”, “OpsMgr SDK Service” is now “System Center Data Access”, and “OpsMgr Config Service” is now “System Center Management Configuration”.
I moved on to Reporting. My SRS, Reporting, and DataWarehouse are all shared on a single server – OMDW.
As I read the guide at Upgrading from Operations Manager 2007 SP1 to R2, I notice this little tidbit – which needs to be given STRONG attention before I kick off the upgrade:
Prior to running the upgrade on the Reporting server, you must remove the Operations Manager 2007 agent; the upgrade will fail if this is not done.
So – I kick off the uninstall of the agent on the Reporting/SRS server (OMDW in my case) from Add/Remove Programs before I start the upgrade. Missing little steps like this will drive you nuts if you aren't methodical.
After the agent uninstall – I pick back up on the guide and kick off “SetupOM.exe”. Since I am a freak, I go ahead and run a pre-req check first, just to make sure all is good.
Moving on… I start the install according to the guide. The install went without a hitch, and took about 10 minutes to complete.
Next up – management servers. I start with OMMS3. I run the pre-req check and notice I already have WS-Man installed, so away I go. The installer immediately fails with a pre-req failure. I realize I have the web console installed on this management server, and I forgot to include that role when running the pre-req check manually. When I do, the checker flags a missing prerequisite.
So – I need to grab the ASP.NET AJAX extensions… this is to support the cool new Health Explorer in the Web Console. I click “More” on the pre-req check, which gives me a link to the download.
After this little hurdle – the management servers upgraded very quickly. Once again – I got an expected error about a failure to stop a service.
I click OK and setup completes. I repeat this upgrade on the other management server (OMMS) and these are done. A quick check of the registry – and the setup version is indeed 6.1.7221.0.
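If you like to validate from the database side too, the version is also stamped in the OperationsManager database. An optional cross-check – the table below is the one I have seen holding this in 2007 databases, so treat it as an assumption on your build:

USE OperationsManager;
-- Database version for the management group; should match the R2 build:
SELECT DBVersion FROM dbo.__MOMManagementGroupInfo__;
-- Expect 6.1.7221.0 after a successful upgrade.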
I don't have any gateways in this lab – so next up is agents.
Lucky me – all 18 agents show up in pending actions for an update. I will approve them all – and let the management server push the update down and upgrade them.
***Note – do not upgrade more than 299 agents in this manner at a time. This is documented in the Upgrade Guide.
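By the way, if you want to sanity-check what is sitting in pending actions without clicking through the console, you can peek at the backing table in the OperationsManager database. A sketch – AgentPendingAction is the table I have seen behind this view in 2007, so verify on your own build first:

USE OperationsManager;
-- One row per agent action awaiting approval (install, update, repair, etc.):
SELECT * FROM dbo.AgentPendingAction;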
All my agents upgraded successfully except for two. Both of the failures happened to be on the two servers where I had manually removed the SP1 agent – OMTERM and OMDW. (I forgot to delete their “agent managed” objects from the management group.) Each failed with a different error. OMTERM is failing to install with a push failure for MOMAgentInstaller. I have had trouble with this agent before – possibly because of the TS role – so I just do a manual agent install here. OMDW is different – the console push said it was a success, however the System Center Management service (HealthService) will not start. It gives an error:
Event Type: Error
Event Source: Service Control Manager
Event Category: None
Event ID: 7024
Date: 5/23/2009
Time: 1:09:37 AM
User: N/A
Computer: OMDW
Description:
The System Center Management service terminated with service-specific error 2147500037 (0x80004005).
I ran a repair action from the console – but got the same error. So – I manually uninstalled the broken agent, deleted the agent from the Agent Managed section of the console, and re-pushed the agent. I had a little trouble getting these two to come into the management group… but eventually, after a couple of delete/reinstall cycles, they appear to be working OK. I'd recommend uninstalling from the console next time… that removes both the agent and the computer object in one shot.
Next on the list: Web Console
From the upgrade guide I see this note:
If your Web console server is on the same computer as a management server, the Web console server is upgraded when the management server is upgraded, rendering this upgrade procedure unnecessary. You can still run the verification procedure to ensure that the Web console server upgrade was successful.
Good – my web console is not stand-alone – it was running on a management server (OMMS3), so that is already taken care of.
Aha – I found something we forgot in the plan… the ACS collector. This role is missing from the table at Planning your Operations Manager 2007 Upgrade, so I completely missed it as a planning step. However, the process is documented at Upgrading from Operations Manager 2007 SP1 to R2. So – we need to do this; I will do it last, since it comes last in the detailed upgrade steps. Following the guide, I walked through the steps with no issues.
Looks like we are done! I will now start the post-upgrade validation steps to make sure my management group is actually working as it should without any major issues.
There is a list of post-upgrade checks at Completing the Post-Upgrade Tasks
I am going to walk through those here:
1. I open up discovered inventory – change the target to “Health Service Watcher” – and compare this to the list I had before the upgrade. These are agents that have a problem from the management server perspective, which causes them to appear “grey” in all other views. My list is the same as before I started – I have 6 in this list as critical. 5 of them are agents on VMs that are currently down, so this is expected. One of them is an old management server… for some reason we don't groom these out of the view/database, and they seem to stick around forever.
2. I review the event logs on the RMS and all MS roles. I am seeing some events like the ones below:
Event Type: Warning
Event Source: HealthService
Event Category: Health Service
Event ID: 2120
Date: 5/23/2009
Time: 10:02:15 AM
User: N/A
Computer: OMRMS
Description:
The Health Service has deleted one or more items for management group "OPS" which could not be sent in 1440 minutes.
This is normal – it happens when you have agents that are down in your environment.
Event Type: Error
Event Source: Health Service Modules
Event Category: Data Warehouse
Event ID: 31552
Date: 5/23/2009
Time: 10:03:38 AM
User: N/A
Computer: OMRMS
Description:
Failed to store data in the Data Warehouse.
Exception 'SqlException': Sql execution failed. Error 777971002, Level 16, State 1, Procedure StandardDatasetGroom, Line 303, Message: Sql execution failed. Error 2812, Level 16, State 62, Procedure StandardDatasetGroom, Line 145, Message: Could not find stored procedure 'KMS_EventGroom'.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: KMS Activation Event Data Set
Instance ID: {800D8126-6F72-CA84-A76B-A94F7E3C93CF}
Management group: OPS
This is not normal – this looks like an issue with the KMS MP. R2's more verbose Data Warehouse logging is picking up on an error that has likely been there all along; I just didn't know it.
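A quick way to confirm the event is telling the truth is to look for the stored procedure in the warehouse itself. A sketch, assuming the default OperationsManagerDW database name:

USE OperationsManagerDW;
-- The 31552 event claims this sproc is missing; zero rows back confirms it:
SELECT name FROM sys.procedures WHERE name = 'KMS_EventGroom';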
That is all from the RMS – pretty clean. On the management servers… I found a bit more, but it was all due to the problems I was having with a handful of agents. Once I removed and fixed those agents, the MS logs are clean.
3. No cluster in this lab – so nothing to test there.
4. Review alerts in the console. I sort by Repeat Count and Last Modified (I add these columns to all my alert views) and look for anything that stands out as repeating a LOT, or something new that looks like a problem. I don't see anything here – so that is good!
5. The DB server looks good in perfmon. I examine % Processor Time, and Logical Disk Avg. Disk sec/Read and Avg. Disk sec/Write. Both average under 15 ms (0.015) on the DB and log volumes, so that looks good. CPU averages under 25%.
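You can pull the same latency picture from inside SQL, since SQL 2005 exposes cumulative file stats through a DMV. A sketch – keep in mind these are averages since the instance last started, not a live sample like perfmon:

-- Average I/O latency per database file, from SQL 2005's file-stats DMV:
SELECT DB_NAME(database_id) AS [database],
       io_stall_read_ms / NULLIF(num_of_reads, 0) AS avg_read_ms,
       io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);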
6. Check all the console views. Much snappier than in SP1. Nice.
7. I opened up reporting – and ran the “Microsoft ODR Report Library > Most Common Alerts” report – to test out reporting. It ran with no issues. I test a few of my saved custom and favorite reports – no errors – all good.
8. Authoring pane looks good – I can see my groups, monitors, rules – and wow – they open a LOT faster than before. Very nice.
9. I check out my MP versions. The install upgraded all my core MPs to 6.1.7221.0. I was already pretty current on my MPs, so nothing here needs my immediate attention.
10. Re-enable notification subscriptions and product connectors. I turn my subscriptions back on and fire off a test event that I use to generate an alert and email me a notification. Works great. Next, I go to my custom product connector, re-enable the service, and start it back up. I run some test alerts to make sure my product connector is taking all the necessary actions on the alerts and forwarding them appropriately. All good.
11. Review My Workspace. Yep – all my old custom views are there.
12. Re-deploy agents. I already did this. Perhaps I should have waited on this step… since I spent so much time troubleshooting those last few pesky agents.
13. Oh – the BIG ONE. This step is a bit odd – we tell you to go run this SQL query. LET ME WARN YOU – this is not a “quick job”. This is the script that is documented and discussed in my blog post: Does your OpsDB keep growing? Is your localizedtext table using all the space? Don't take this step lightly – running this script could take several hours, so plan accordingly. Read the link above for the details, and consider skipping this step for now… until you are sure you are ready to execute it. Do some calculations based on the blog post above – how long it will take, and how severely you are impacted (the row count of your LocalizedText table) – and make sure you have a LOT of free space for TempDB and its log to grow if needed. My LT table was already really small, so running this was no issue for me – it completed in less than a minute.
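Gauging your exposure first is cheap. A sketch of the sizing check, against the OperationsManager database (table name as discussed in the blog post above):

USE OperationsManager;
-- Row count of the LocalizedText table; the bigger this is, the longer the
-- cleanup script runs and the more TempDB space it will need:
SELECT COUNT(*) AS LTRows FROM dbo.LocalizedText;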
Done! (with the “official” steps)
Now – I just have a couple cleanup steps I need to do – like go back and install the Ops Console and the Auth Console back on my terminal server. Did that without issue. All looks good.
And then I realized – we are missing another step in our plan, under the post-upgrade tasks: make sure the web console is working! I saw lots of items in the release notes about how this might break… and I imagine someone will complain rather quickly if it isn't working, so we better go check that out.
Sweet! I hit up the web console and it is all good. I check out several of the new views and run Health Explorer from the web console. I have tasks, maintenance mode, and Health Explorer. Very cool. I even execute some of my favorite reports under “My Workspace” just to make sure those are good – ouch – not working. I will have to look into that one.
Ok – that's enough for today. All in all, a successful upgrade. A good plan, written out at the beginning and based on the upgrade guide, makes all the difference.