Wednesday, August 8, 2012

OEM DBconsole locking and huge resource usage

For some time now, we have had an 11g 2-node RAC database up and running (actually, we have more, but this one behaved somewhat different than expected). It has never been really necessary to plug it into Grid Control, so we still use DBConsole to monitor it when we need. It obviously is no production system we are talking about here...

Today, I came across something funny. The servers had been restarted some time ago, and we had decided to shut down the second node, in order to be able to determine the effect on the application the developers were working on. DBConsole hadn't been switched on yet, so I started it. And that is when the trouble began.

Within seconds, the CPU and I/O usage went way up. DBConsole was still (somewhat) responsive, so I investigated the problem using OEM. Two major problems were visible:
  • Thousands of "active sessions" reported on the Cluster Home page
  • A lot of active SYSMAN sessions, mainly for locking a MGMT_FAILOVER_TABLE and for executing EMD_NOTIFICATION.OMS_FAILOVER

One of the SYSMAN sessions was blocking several others, but getting apparently nowhere with the work it should be doing. Searching the internet I found some possible causes, but none of them seemed to help me here.

Then I decided to just use trial-and-error. Obviously, the MGMT_FAILOVER_TABLE was involved somewhere. Looking at the data in that table revealed that there were 135 rows in there. Not knowing whether this is normal, I looked it up in other RAC databases. These had only 1 row in the table, so I stopped DBConsole, made a backup of the table contents, removed all rows (probably could have left the most recent one in there..) and started DBConsole. It's been quite some time already, and the system is still running smoothly with low CPU and I/O usage, and a normal load from the developers.

I still don't know why there were that many rows in this table, but I am confident that there shouldn't be. Luckily, the system is smart enough to recover if you delete all the rows and after that, normal operations resume. If anyone knows what causes this, feel free to comment.

I'm looking forward to any answer that clarifies what really happened here ;-)


For those who wish to reproduce: the DB and OS used are
  • 11gR2 (11.2.0.2.0)
  • RedHat Linux 5.5

Friday, February 24, 2012

SRW.RUN_REPORT and the dreaded REP-0178

I was working on a Forms&Reports migration last month (6i to 11gR2) and had little trouble upgrading the entire application, except for one small part: some reports were used as a starting report, calling other reports (based on various criteria), using SRW.RUN_REPORT to do so.

SRW.RUN_REPORT was calling the other reports using both the "userid" en "server" parameters, which were deprecated in 10g. Using the "userid" will net you an Oracle error: REP-01434.
If you set the "REPORTS_SRWRUN_TO_SERVER" environment variable to YES, this error can be avoided, but I chose to eliminate these keywords altogether (as there weren't that many reports that used them). So after this, I expected the reports to run smoothly...

When invoking the starting report via the URL, the report itself ran OK, but the SRW.RUN_REPORT did not. Eventually, in the rwEng-0_diagnostic.log file In found only “REP-1428 An error occurred while running procedure SRW.RUN_REPORT in program unit beforereport”. Not much help there.

Because the SRW.RUN_REPORT effectively calls rwclient.exe (it’s a Windows 2008 implementation), I tried using rwclient from the command box. At first, this didn’t work either (REP-0178 Cannot connect to Reports Server), until I saw that one of the rwclient.bat scripts invoked reports.bat (setting the environment) and the other did not. When setting only two environment variables (ORACLE_HOME and ORACLE_INSTANCE), rwclient.exe also worked.

Encouraged by this, I set these two environment variables as system variables, so they will always be available for any program. Just to be sure, I rebooted the entire server as this was only an acceptation server at the moment. After reboot, the reports using SRW.RUN_REPORT also worked. Mission accomplished ;-).


Some details of the environment:
• OS = Windows 2008 R2
• AS = WebLogic 10.3.5
• FMW = 11.1.2 Forms&Reports Services
• Database (different server) = 11.2.0.3