Thursday, August 26, 2010

How-To: Fix service check timeouts in a Nagios + NRPE deployment

Once you get used to writing plug-ins for Nagios and the plug-ins you write grow more complex, you may encounter this error: service check timed out.

If some of your service checks have this problem, you can trace it to these 3 values:

1. how slow is the plugin

This is the first thing you should do. Check how much time your plugin needs before it can finish checking and return an exit status. Log on to the server you’re monitoring and run the plugin locally. Use the time command to measure.

$ time /usr/lib/nagios/plugins/check_service
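A single run can be misleading if the plugin's runtime varies, so it may help to time it a few times and keep the slowest result. The sketch below is one way to do that in plain sh. The helper names measure_seconds and worst_case_seconds are made up for illustration, and the timing is only accurate to whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print how many whole seconds it took.
# The plugin's own output is discarded, and a non-zero exit status is ignored,
# because plugins exit non-zero for WARNING and CRITICAL states.
measure_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: run a command N times and print the slowest run in seconds.
worst_case_seconds() {
    runs=$1
    shift
    worst=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        t=$(measure_seconds "$@")
        if [ "$t" -gt "$worst" ]; then
            worst=$t
        fi
        i=$((i + 1))
    done
    echo "$worst"
}

# Example (path taken from the command above):
#   worst_case_seconds 3 /usr/lib/nagios/plugins/check_service
```

The number it prints is the value to carry into step #2.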

2. how short is NRPE’s patience

Once you have the value (in seconds) from step #1, check the NRPE configuration on that same server. The default location of NRPE’s configuration is /etc/nagios/nrpe.cfg
Find the command_timeout parameter. Its value, in seconds, must be greater than the value you got in step #1.
Once the parameter has been set, restart the NRPE service (service nrpe restart).
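To see what NRPE is currently using, you can read the value straight out of the configuration file. This is a rough sketch, assuming the setting appears as an uncommented command_timeout=<seconds> line. The function name nrpe_command_timeout is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the active command_timeout value from an NRPE
# configuration file. Commented-out lines (starting with #) are skipped, and
# if the parameter appears more than once, the last occurrence is printed.
nrpe_command_timeout() {
    sed -n 's/^[[:space:]]*command_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Example (default path mentioned above):
#   nrpe_command_timeout /etc/nagios/nrpe.cfg
```

If it prints nothing, the parameter is probably commented out or missing from that file.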

3. how short is Nagios’ patience

Nagios executes a command, check_nrpe, to connect to an NRPE service. check_nrpe has a timeout parameter, -t. This parameter must have a bigger value than the one you set in step #2.
Log on to your Nagios server and set this by opening the commands configuration file, /etc/nagios/objects/commands.cfg
Find check_nrpe, edit its command_line, and set the -t parameter. For instance, if you want the timeout value to be 500 seconds, it will look like this:
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 500
Restart the Nagios service afterwards (service nagios restart).
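The three values only work if they are in the right order: plugin runtime, then NRPE’s command_timeout, then check_nrpe’s -t, each bigger than the one before. A small sanity check along these lines can catch a wrong ordering before you restart anything. The function name check_timeout_chain is made up for illustration, and all three arguments are whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: verify that
#   plugin runtime < NRPE command_timeout < check_nrpe -t
# Prints a message and returns 1 on the first violated rule.
check_timeout_chain() {
    plugin_seconds=$1
    nrpe_timeout=$2
    check_nrpe_timeout=$3

    if [ "$plugin_seconds" -ge "$nrpe_timeout" ]; then
        echo "command_timeout ($nrpe_timeout) must be greater than plugin runtime ($plugin_seconds)"
        return 1
    fi
    if [ "$nrpe_timeout" -ge "$check_nrpe_timeout" ]; then
        echo "check_nrpe -t ($check_nrpe_timeout) must be greater than command_timeout ($nrpe_timeout)"
        return 1
    fi
    echo "OK: $plugin_seconds < $nrpe_timeout < $check_nrpe_timeout"
}

# Example: a plugin that takes 400 seconds, command_timeout=450, -t 500
#   check_timeout_chain 400 450 500
```

Nagios itself also enforces a service_check_timeout in its main configuration file (nagios.cfg). If that value is lower than the -t you set, it may be worth raising it as well.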

    
