TheBonsai's Blog

About the days and nights of TheBonsai

Archive for the 'Work' Category

Fingerprint daemon tickles DBUS daemon

March 2nd, 2018 by TheBonsai

On a customer Linux system (OEL 7.3, an RHEL derivative) we noticed a huge number of open files held by the dbus-daemon process. After the open file limit was reached, the process began to use one whole CPU core and weird effects took place. All open file descriptors pointed to different /proc/PID/cmdline files, all belonging to processes that no longer existed.
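For illustration, the symptom can be made visible with a few /proc inspections. This is a hedged sketch (run as root; it assumes a single dbus-daemon process and the usual /proc layout of the affected RHEL 7 system):

```shell
# Count open file descriptors held by dbus-daemon:
pid=$(pidof dbus-daemon | awk '{print $1}')
ls "/proc/$pid/fd" | wc -l

# List descriptors that still point at cmdline files of (possibly dead) processes:
ls -l "/proc/$pid/fd" 2>/dev/null | grep cmdline
```

A steadily growing count, with most entries pointing at cmdline files of vanished PIDs, is the pattern described above.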

The problem was detected a while after a job with a ten-minute frequency was deployed. By itself this means nothing, but the time correlation was strong enough to keep in mind.

After some hours of troubleshooting, log analysis, process analysis and some common sense, I found a closed bug report in the RH Bugzilla about a very similar – or the same – problem, which pointed me in the right direction. The key is that the mode in which this job is run involves PAM loading (it doesn’t actually switch users, but it could switch user on start). With the RH standard PAM configuration, this loads pam_fprintd.so if it is installed (even if unused). That in turn triggers a request to start the fprintd daemon, which goes from systemd-logind over the bus to the dbus-daemon. The fprintd daemon process is started and terminates again. This constellation leaves dbus-daemon holding open file descriptors pointing to /proc/PID/cmdline of the job process (which terminates afterwards), and these descriptors are never closed by the dbus-daemon itself. A typical file descriptor leak.

The workaround was simply to disable the fingerprint PAM module in the relevant PAM configuration file(s) and to restart the DBUS daemon as well as some DBUS-using processes that don’t recover by themselves (such as systemd-logind and NetworkManager).
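On RHEL/OEL 7 this can be done with authconfig, which rewrites the generated PAM files. A hedged sketch (service names as shipped by Red Hat; run as root, and note that restarting dbus on a live system is itself disruptive):

```shell
# Remove pam_fprintd.so from the generated PAM configuration:
authconfig --disablefingerprint --update

# Restart the DBUS daemon and the DBUS clients that don't recover on their own:
systemctl restart dbus
systemctl restart systemd-logind
systemctl restart NetworkManager
```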

Unfortunately I wasn’t able to understand the exact anatomy of the problem, since another bug referenced in the bug report I mentioned above isn’t available to me.

Category: Linux, Work | No Comments »

20 years of *NIX stuff

May 8th, 2016 by TheBonsai

Heyah,

yesterday I celebrated my 36th birthday (yes, I really got old..!).

What I suddenly realized was that I can also look back on 20 years of Linux and UNIX passion. I don’t know the exact day, so I just declared it to be on my birthday!

So, also celebrating 20 years of personal *NIX passion!

Happy anniversary Jan and *NIX 😉

Category: english, Linux, Technology, Work | No Comments »

10gR2 (10.2.0.5) on top of existing 11gR2 (11.2.0.3) GI

February 5th, 2012 by TheBonsai

Hello Oracle fans and victims out there,

I tried to create a 10.2.0.5 database on top of an existing 11.2.0.3 GI (with an already fine-running 11.2.0.3 RDBMS as a second database) on a 2-node RAC.

Installed the 10.2.0.1 + 10.2.0.5 PS + 10.2.0.5.6 PSU… worked like a charm.

After pinning the nodes (remember, you have to do that for a pre-11gR2 database on 11gR2 Clusterware!), I created the RAC database with DBCA… worked like a charm.
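The pinning step can be sketched like this (node names are made up; `crsctl pin css` is the 11gR2 Clusterware command for it, run as root from the GI home):

```shell
# Pin both cluster nodes so pre-11.2 tooling (DBCA, srvctl) sees stable node numbers:
crsctl pin css -n node1 node2

# Verify: pinned nodes are flagged in the output of:
olsnodes -t -n
```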

I started up the database with srvctl – crash. Both instances were killed by their LMON because of a KGXGN polling error. It looks like this in the alert logs:

Instance #1:

Sat Feb 04 12:44:20 CET 2012
lmon registered with NM – instance id 1 (internal mem no 0)
Sat Feb 04 12:46:50 CET 2012
oracle@svdbslx060 (LMON) (ospid: 28370) detects hung instances during IMR reconfiguration
oracle@svdbslx060 (LMON) (ospid: 28370) tries to kill the instance 2.
Please check instance 2’s alert log and LMON trace file for more details.
Sat Feb 04 12:48:05 CET 2012
Remote instance kill is issued with system inc 0 and reason 0x20000000
Remote instance kill map (size 1) : 2
Sat Feb 04 12:49:20 CET 2012
Error: KGXGN polling error (15)
Sat Feb 04 12:49:20 CET 2012
Errors in file /opt/oracle/base/admin/REDSYS/bdump/redsys1_lmon_28370.trc:
ORA-29702: Fehler bei Vorgang von Cluster Group Service
LMON: terminating instance due to error 29702
Sat Feb 04 12:49:20 CET 2012
System state dump is made for local instance
Sat Feb 04 12:49:20 CET 2012
Errors in file /opt/oracle/base/admin/REDSYS/bdump/redsys1_diag_28366.trc:
ORA-29702: Fehler bei Vorgang von Cluster Group Service
Sat Feb 04 12:49:20 CET 2012
Trace dumping is performing id=[cdmp_20120204124920]
Sat Feb 04 12:49:20 CET 2012
Instance terminated by LMON, pid = 28370

Instance #2:

Sat Feb 04 12:44:20 CET 2012
lmon registered with NM – instance id 2 (internal mem no 1)
Sat Feb 04 12:49:20 CET 2012
Error: KGXGN polling error (15)
Sat Feb 04 12:49:20 CET 2012
Errors in file /opt/oracle/base/admin/REDSYS/bdump/redsys2_lmon_1695.trc:
ORA-29702: Fehler bei Vorgang von Cluster Group Service
LMON: terminating instance due to error 29702
Sat Feb 04 12:49:20 CET 2012
Trace dumping is performing id=[cdmp_20120204124920]
System state dump is made for local instance
Sat Feb 04 12:49:21 CET 2012
Errors in file /opt/oracle/base/admin/REDSYS/bdump/redsys2_diag_1691.trc:
ORA-29702: Fehler bei Vorgang von Cluster Group Service
Sat Feb 04 12:49:21 CET 2012
Trace dumping is performing id=[cdmp_20120204124921]
Sat Feb 04 12:49:21 CET 2012
Instance terminated by LMON, pid = 1695

The LMON trace files revealed only the same information. (The German ORA-29702 message in the logs is just the localized text for “error occurred in Cluster Group Services operation”.) I installed the 10.2.0.5.2 CRS Bundle into this ORACLE_HOME – no change. The internet – including MOS – gave hints in many directions, but nothing really seemed to match.

Finally, after a day off (you sometimes need distance!) I got it:

The error messages strongly indicate an interconnect problem. The fact that a second (11gR2) database works fine at the same time rules out physical problems and other issues at the base of the stack. The server and the 11gR2 world have 2 interconnect interfaces. Solution: I just burned the IP of one specific interconnect interface into the init.ora parameters of the 10gR2 instance – and we got a take-off. It works like a charm.
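In init.ora terms this means the CLUSTER_INTERCONNECTS parameter, which pins each instance to one interconnect address instead of letting it pick. A minimal sketch – the IP addresses are hypothetical placeholders for the chosen interconnect interface of each node, and the SID names follow the REDSYS instances from the logs:

```
# Per-instance interconnect pinning in the 10gR2 init.ora/spfile.
# 192.168.10.x are example addresses; replace with your interconnect IPs.
REDSYS1.cluster_interconnects='192.168.10.1'
REDSYS2.cluster_interconnects='192.168.10.2'
```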

Category: Oracle, Work | 2 Comments »