Today, I came across something funny. The servers had been restarted some time ago, and we had decided to shut down the second node, in order to be able to determine the effect on the application the developers were working on. DBConsole hadn't been switched on yet, so I started it. And that is when the trouble began.
Within seconds, the CPU and I/O usage went way up. DBConsole was still (somewhat) responsive, so I investigated the problem using OEM. Two major problems were visible:
- Thousands of "active sessions" reported on the Cluster Home page
- A lot of active SYSMAN sessions, mainly for locking a MGMT_FAILOVER_TABLE and for executing EMD_NOTIFICATION.OMS_FAILOVER
One of the SYSMAN sessions was blocking several others, but getting apparently nowhere with the work it should be doing. Searching the internet I found some possible causes, but none of them seemed to help me here.
Then I decided to just use trial-and-error. Obviously, the MGMT_FAILOVER_TABLE was involved somewhere. Looking at the data in that table revealed that there were 135 rows in there. Not knowing whether this is normal, I looked it up in other RAC databases. These had only 1 row in the table, so I stopped DBConsole, made a backup of the table contents, removed all rows (probably could have left the most recent one in there..) and started DBConsole. It's been quite some time already, and the system is still running smoothly with low CPU and I/O usage, and a normal load from the developers.
I still don't know why there were that many rows in this table, but I am confident that there shouldn't be. Luckily, the system is smart enough to recover if you delete all the rows and after that, normal operations resume. If anyone knows what causes this, feel free to comment.
I'm looking forward to any answer that clarifies what really happened here ;-)
For those who wish to reproduce: the DB and OS used are
- 11gR2 (11.2.0.2.0)
- RedHat Linux 5.5