Troubleshooting

Troubleshooting Job Failure

First, you should determine if a Job failed as a result of an ActiveBatch error, or if the Job failed for another reason (e.g. the payload of the Job did not run successfully).

ActiveBatch Errors

In some scenarios, if a Job fails due to an ActiveBatch error, you will see an "ABAT" error in the Exit Code Description column of the Instances pane. An ABAT error means that there was an ActiveBatch-specific reason for the Job failure. Each ABAT error name is designed to be helpful in identifying the problem that was encountered. Here are a few examples:

%ABAT-E-INVALID_WORKING_DIRECTORY, the working directory set for the Job is invalid.

%ABAT-E-INSTSKIPPED, instance skipped due to active instance (this means a trigger occurred, but multiple instances of the same Job are not allowed to run - this is based on a Job property setting).

To see a list of ABAT messages, see ActiveBatch Messages.

In other cases, if a Job fails due to an ActiveBatch error, you will not see an ABAT message, but you will see a reason for the failure in the Exit Code description. Here are a few examples:

Logon error - This failure occurs because the user credentials specified on the Job did not authenticate on the Agent system (where the Job was dispatched to run).

Some variables are missing - This failure occurs because the Job author specified that one or more configured variables must resolve properly in order for the Job to run. If they don't, the Job will fail.

Runtime underrun - This failure occurs because the Job author enabled a feature called Run Time Monitoring. If the Job does not run for a long enough period of time, it will fail for this reason. Note: There is also a Runtime overrun option, where a Job is auto-canceled if it runs for a longer period of time than allowed.

Some ActiveBatch runtime errors result in additional information added to the Execution Agent log file. The Agent log file is located on the Agent system where the Job was dispatched. Navigate to the ActiveBatch installation directory and look for a Logs subdirectory.

Agent log files are named abateagent.log for Windows.

Agent log files are named abatemgr.log for Unix/Linux.

Older log files have a number appended at the end of the file name, where 1 is the previous log file when compared to the current one, and 10 is the oldest. Open the Agent log file and look for a date/time that matches the failed Job. There may be additional details in the file.
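
When several rotated Agent log files exist, it can help to list them from newest to oldest before searching for the date/time of the failed Job. The following is a minimal sketch, not an ActiveBatch utility. The function names are hypothetical, and it assumes that rotated files follow the base name plus a number (for example abateagent1.log or abateagent01.log), with lower numbers being newer, as described above.

```python
import re
from pathlib import Path

def rotation_index(file_name, base="abateagent"):
    """Return 0 for the current log, N for a rotated log, or None if the name does not match."""
    # Assumption: rotated logs are named <base><number>.log
    match = re.fullmatch(re.escape(base) + r"(\d*)\.log", file_name, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1) or 0)

def logs_newest_first(log_dir, base="abateagent"):
    """List the current and rotated log files in log_dir, newest first."""
    matches = []
    for path in Path(log_dir).iterdir():
        index = rotation_index(path.name, base)
        if index is not None:
            matches.append((index, path))
    # Lower rotation numbers are newer, so sort ascending by index.
    return [path for index, path in sorted(matches, key=lambda item: item[0])]
```

For a Unix/Linux Agent, pass base="abatemgr" to match the log name used on that platform.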

Note: Non-ActiveBatch errors are not posted in the Agent log file. For example, you will not see anything in the Agent log file if a Job failed because the payload it is running (let's say a script), failed due to a script syntax error.

As a general rule, you can see how far an instance was processed by looking at the audits of the instance history. Right click on the failed instance, select Properties, then Audits. If you do not see a "Started" audit item in the audit trail, that means the Job was never started by the Execution Agent. Typically this type of situation falls into the realm of an ActiveBatch error. The instance did not get far enough into the processing to run. A logon error (mentioned above) would be a good example of this. The Agent could not authenticate the User Account the Job was configured to run under, and therefore the process was not started. Here is the audit trail of a logon Job failure. Notice there is no "Started" audit.
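
The audit check described above can be expressed as a small decision rule. This is an illustrative sketch only. It assumes you have copied the audit item names from the Audits tab into a list by hand, and the function names and returned hints are hypothetical rather than part of ActiveBatch.

```python
def instance_was_started(audit_events):
    """Return True if the audit trail contains a "Started" item."""
    return any(event.strip().lower() == "started" for event in audit_events)

def where_to_look_first(audit_events):
    """Suggest a starting point for the investigation based on the audit trail."""
    if instance_was_started(audit_events):
        # The Agent started the process, so the payload or a Job property
        # setting (for example a runtime overrun) is the more likely cause.
        return "Job log file and Job property settings"
    # No "Started" audit: the Agent never started the process,
    # which typically points to an ActiveBatch error such as a logon error.
    return "Exit Code Description and Agent log file"
```

For example, an audit trail containing only "Created" and "Failed" items would return the second hint, because no "Started" item is present.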

If the Agent did not start the process, you will often not see a log file if you click on the Log tab, depicted in the above image. This is because the Agent manages the Job log file, and if processing did not get far enough along (before Job failure) to create a Job log, then one will not be present.

Non-ActiveBatch Errors

These types of errors typically occur when the payload of the Job fails to execute properly. For example, a script Job fails because it is referencing a Windows Share that cannot be accessed on the network. In this scenario, you will see a "Started" audit in the failed instance. Somewhere along the way the payload failed to execute properly. Not all failures are due to payload errors. The system can fail a Job due to specific property settings configured by the Job author - such as a breach of an SLA (Service Level Agreement), or a runtime overrun (i.e. the Job ran for longer than expected, so the system marked it a failure).

ActiveBatch supports 3 Job types - Process, Script and Jobs Library. What determines the success or failure of these Job types? Let's discuss this.

For Process and Script Jobs, the Job author sets a Job property to determine the success or failure of the Job. The property name is the Completion Status Rule. The Completion Status Rule is located on the Job property sheet named Process or Script, depending on the Job type configured. The Completion Status Rule provides two choices:

Success: This property is used to set one or more exit codes. When set, this means the Job must exit with a code specified in this property to be considered a success. For example, the exit code may be configured with a range of numbers, from 10 to 20. If the Job exits with any number from 10 to 20, it will be marked a success by ActiveBatch. If it exits with a number outside of the range, it will be marked a failure. The exit code the Job exits with will be displayed in the exit code column of the Instances pane. The system will not attempt to interpret the exit code, therefore the Exit Code Description column will remain blank. Alternatively, the Job log file associated with the Job may attempt to interpret the exit code if the Execution > Logging property named "Interpret Exit Code (applies to Process/Script Job types only)" is checked.
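
The Success rule can be pictured as a simple range check. The sketch below illustrates the logic described above and is not how ActiveBatch implements it. The function name and the (low, high) tuple representation are assumptions made for the example.

```python
def exit_code_is_success(exit_code, success_ranges):
    """Return True if exit_code falls inside any inclusive (low, high) range."""
    return any(low <= exit_code <= high for low, high in success_ranges)

# Example from the text: exit codes 10 through 20 count as success.
success_ranges = [(10, 20)]
print(exit_code_is_success(15, success_ranges))  # inside the range
print(exit_code_is_success(21, success_ranges))  # outside the range
```

A single allowed exit code can be written as a range whose low and high values are the same, for example (0, 0).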

Use Search String: This property is typically used when the exit code can't be relied upon to determine the success or failure of a Job. The Job author can specify a string or regular expression that will be used to search the Job log file (or another file generated by the Job, if any) to determine the success or failure of the Job. For example, if the Job author enters the word "Failure" as the search string, then checks the "Presence of string indicates Job Failure" checkbox - the Job will fail if the Job log file contains the word failure in it. If the word is not found, the Job will be marked a success. When the Job does fail due to using the search string, the failure exit code will be -805502974 (0xCFFD0002), and the Exit Code Description will be %ABAT-E-FAILURE, Failure in requested operation.
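
Two details in the paragraph above are easy to check by hand: how the decimal exit code -805502974 relates to 0xCFFD0002, and how a search string with "Presence of string indicates Job Failure" checked decides the outcome. The sketch below is illustrative only. The function names are hypothetical, and the case-insensitive match is an assumption, since the exact matching rules are not stated here.

```python
import re

ABAT_E_FAILURE = -805502974  # exit code reported for a search-string failure

def to_unsigned_hex(exit_code):
    """Show a signed 32-bit exit code as its unsigned hexadecimal form."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)

def fails_on_search_string(log_text, pattern):
    """Return True if the pattern is found, i.e. the Job would be marked a failure.

    Models only the case where "Presence of string indicates Job Failure" is checked.
    Assumption: matching is case-insensitive.
    """
    return re.search(pattern, log_text, flags=re.IGNORECASE) is not None

print(to_unsigned_hex(ABAT_E_FAILURE))  # prints 0xCFFD0002
```

The two values are the same 32-bit number: the decimal form is the signed interpretation and the hexadecimal form is the unsigned one.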

In summary, the Process or Script Job type failure or success is determined by property settings configured by the Job author. As an ActiveBatch operator, you may need to look at the Job property settings to see how it was configured to better understand Job failures.

For Jobs Library Jobs, the Job author does not set a property - rather, the system uses its own predefined exit codes - 100 for success, 102 for failure. The Job author can choose to ignore errors encountered while running this Job type. Therefore, it is possible to view a Job log file and see an error posted, yet find that ActiveBatch marked the Job a success. Typically errors are not ignored, and the Job will fail, but it is good to know about edge cases since the Job author has some control of the overall Job status. If a Jobs Library Job fails, there should be a reason in the Exit Code Description column of the Instances pane. This is because Jobs Library Jobs use building blocks provided by ActiveBatch, and therefore ActiveBatch can interpret the reason for the failure.

Next, it is always a good idea to check a failed Job's instance history (right click on the instance, then select Properties).

The instance history provides all the details about the instance, including:

General information (creation time, start time, instance ID, etc.)

How variables resolved (their values) - if a variable does not resolve correctly, it will be marked as "missing".

The audit trail (e.g. created, started, etc.)

The Job log file

Job properties, as they were set when the instance ran, and their current settings. Be sure to look at the Instance properties as they were set at the time the failed Job ran. If you look at the current properties, they may have been changed and therefore would not apply to the failed instance you are investigating. See View Instance History for more details.

Typically, if a Job fails due to a non-ActiveBatch reason, the first place to look is the Job log file. The Job log file captures standard output and standard error. Logging is enabled by default; however, it can be turned off by the Job author on a per-Job basis (meaning no log file will be generated). As a general best practice, keep logging enabled. For more information about Job log files, see Logging.

Note: Another reason for lack of a Job log file is it could have been purged by the system if the log file retention period was met.

Troubleshooting Plan Failure

A Plan is a wrapper around one or more Jobs. A Plan is triggerable, and when run, it does go into an executing state. There is no payload associated with a Plan. The Plan is in an executing state as long as its underlying Jobs are active. That said, a Plan's success or failure is determined by the Plan's Completion Rule property. A Job author configures this property. The default Completion Rule is that every Job that runs in the Plan must succeed, or the Plan will be marked a failure. However, there are other options. For example, the Job author can select which Jobs in the Plan must run, and can also select the status each Job must complete with in order for the Plan to be considered a success. Please see the Plan Object for more information about the Completion Rule property.
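
The default Completion Rule can be summarized in a few lines. This is a minimal sketch of the rule as described above, not ActiveBatch's implementation. The function name, the dictionary of Job names to statuses, and the "Success" status string are all assumptions made for the example.

```python
def plan_succeeded_by_default_rule(job_statuses):
    """Return True only if every Job that ran in the Plan completed successfully.

    job_statuses maps a Job name to its completion status, and should contain
    only the Jobs that actually ran.
    """
    return all(status == "Success" for status in job_statuses.values())

statuses = {"ExtractData": "Success", "LoadData": "Failure"}
print(plan_succeeded_by_default_rule(statuses))  # one Job failed, so the Plan fails
```

When investigating a failed Plan under the default rule, the practical consequence is that a single failed Job is enough, so start with the first failed Job in the Plan's instance list.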

Troubleshooting ActiveBatch Component Issues

Job Scheduler

The Job Scheduler runs as a Windows service. When the Scheduler server is rebooted, the Scheduler service is configured to automatically start. If the Scheduler service fails, the Windows Services recovery feature is set to restart up to 2 times. Subsequent failures result in no action. This is the out-of-the-box configuration.

If the Scheduler service fails (or won't start), check the Job Scheduler log file on the Scheduler server. The Scheduler log file captures information, warnings and errors. The file name is Abatjss.log, located in the Logs subdirectory of the installation directory, which by default is Program Files\ASCI\ActiveBatchVx\Logs. Older log files have a number appended to the end of the root of the file name (e.g. Abatjss01.log). If the Scheduler service fails, you can also check the Windows Event Viewer under Windows Logs > Application and Windows Logs > System. There may be some relevant information there.

Additionally, if the Scheduler service fails, you may see a dump file (generated through Windows Error Reporting - WER). See the ActiveBatch Installation folder, then navigate to the Dumps directory. A file in this directory would likely be useful to the ActiveBatch Technical Support team, when you open a support ticket.

If the Scheduler can't communicate with the ActiveBatch backend database, you will likely see some information about that in the above described Scheduler log. When the Job Scheduler was installed/configured, the person performing the configuration provided the backend database information. The backend database is either using the Job Scheduler service account for database access, or a database-specific account. If the database credentials change (user name and/or password), they must be updated for the Scheduler as well.

Execution Agent

The Windows Execution Agent runs as a Windows service. When the Agent server is rebooted, the service is configured to automatically start. If the Agent service fails, the Windows Services recovery feature is set to restart up to 2 times. Subsequent failures result in no action. This is the out-of-the-box configuration. Next, if the Execution Agent service fails, you may see a dump file (generated through Windows Error Reporting - WER). See the ActiveBatch Installation folder, then navigate to the Dumps directory. A file in this directory would likely be useful to the ActiveBatch Technical Support team when you open a support ticket. Lastly, if the Agent service fails, check the Windows Event Viewer under Windows Logs > Application and Windows Logs > System. There may be some relevant information there.

The Unix/Linux Execution Agent runs as a daemon. To have the Agent start automatically as part of system startup, run the 'abatconfig' procedure. This interactive procedure will ask whether you want the Agent to start when the system boots. Next, if the daemon fails, it will automatically restart as long as the Agent was started using the provided script named 'abatstartup'. For any ActiveBatch daemon failure, look for a core file using the following command: find / -name 'core*' The resulting file would likely be useful to the ActiveBatch Technical Support team when you open a support ticket.

Like the Job Scheduler, the Agent has a log file to capture information, warnings and errors. Below is the location where you can find the log files, based on the OS the Agent is running on. For Windows and Unix/Linux, older log files have a number appended to the end of the root of the file name (e.g. abateagent01.log).

Windows: Program Files\ASCI\ActiveBatchVx\Logs (default path)

File name: abateagent.log

Unix/Linux: /opt/ASCI/ActiveBatchVx/Logs (default path)

File name: abatemgr.log

OpenVMS: The log file is contained in the directory defined by the ABATOVMS_LOG_PRODUCT logical name.

File name: abatemgr.log

Troubleshooting Offline Execution Queues

The Job Scheduler connects to Execution Agents using the machine property set for each Execution Queue. If the connection is successful, the Queue will be in a "Started" state (see the Execution Queue's General property sheet where the Queue state is displayed). If the Job Scheduler cannot connect to the Execution Agent system, the Queue will be in a "Starting" state. The Scheduler will continuously attempt to connect to the Agent system until it does (the attempts to connect never time out).

Sometimes the Scheduler cannot connect to the Execution Agent system. Below you will find some reasons why.

Make sure the machine name property set on the Execution Queue is correct. If you are using the machine name and it is correct, try using the IP address of the Agent system, or the fully qualified domain name, instead.

Make sure the Execution Agent is running.

For Windows, launch the Windows Services window and check the state of the ActiveBatch Execution Agent service.

For UNIX/Linux, type: ps -ef | grep abatemgr

Try pinging the Agent machine from the Scheduler server.

Make sure the port that is being used (3655 is the default port) is not blocked.

Check to see if the firewall is disabled.

Check the Agent log file (discussed in the Components Issues section) to see if any error messages are logged there.

Check to see if you are experiencing a licensing issue. Execution Agents are separately licensed. Different license models are supported. The classic point system allocates points for each installed Agent. If an Agent won't start, it could be that you need to purchase more license points.
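
To combine the ping, port and firewall checks above, you can test whether a TCP connection to the Agent port can be opened from the Scheduler server. The following is a minimal sketch using Python's standard socket module. The function name is hypothetical, the host name in the usage line is a placeholder, and 3655 is used only because it is the default port mentioned above.

```python
import socket

def can_reach_agent(host, port=3655, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

# Hypothetical usage, run from the Scheduler server:
# print(can_reach_agent("agent-machine-name"))
```

A result of True only shows that something is listening on that port and that no firewall blocked the connection. It does not confirm that the listener is the Execution Agent, so still check the Agent service or daemon as described above.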