High Availability and Restart

In our examples and case studies to date, we’ve been blessed with plans and jobs that always succeed. Of course, that’s not the real world. In the real world, plans, jobs and machines do fail, and we need to design and provide for that possibility. ActiveBatch separates failure and restart/recovery into two categories: Plan/Job and Machine.

 

Plan/Job Failure and Restart

 

This type of failure occurs when the plan or job fails for a reason other than a machine failure. This is an important distinction and yields several benefits. For example, you may have a job that needs to access many network resources. These resources may take time to marshal, and the job may fail in the meantime. Executing the job a second time may yield a positive result. ActiveBatch allows for that possibility.

 

 

The above figure provides the restart choices. “No Restart” (the default) disallows an automatic restart. “Restart” performs a simple restart on the same execution queue/machine on which the job was originally scheduled. “Restart & Failover” restarts the plan/job and allows for the possibility of another execution queue running the job. “Disable Template” means that the plan/job definition is disabled or held, so no future jobs will be triggered. This may be useful to prevent a growing cascade of failed plans/jobs, and it forces manual intervention to correct the situation. While “No Restart” is the default, “Restart” is the typical selection for jobs that allow restart. Failover is only available if the job is associated with a generic queue. If the job is associated with an execution queue, then “Restart & Failover” and “Restart” behave the same.
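The interaction between the restart choice and the queue type can be summarized in a few lines of Python. This is only an illustrative sketch of the rules described above, not ActiveBatch code; the names RestartChoice, QueueKind and effective_action are hypothetical and exist only for this example.

```python
from enum import Enum


class RestartChoice(Enum):
    NO_RESTART = "No Restart"
    RESTART = "Restart"
    RESTART_AND_FAILOVER = "Restart & Failover"
    DISABLE_TEMPLATE = "Disable Template"


class QueueKind(Enum):
    EXECUTION = "execution"
    GENERIC = "generic"


def effective_action(choice, queue_kind):
    """Describe what happens after a failure (hypothetical helper)."""
    if choice is RestartChoice.NO_RESTART:
        return "no automatic restart"
    if choice is RestartChoice.DISABLE_TEMPLATE:
        return "disable the plan/job definition"
    # Failover only has an effect when the job came from a generic queue.
    if (choice is RestartChoice.RESTART_AND_FAILOVER
            and queue_kind is QueueKind.GENERIC):
        return "restart on any eligible execution queue"
    return "restart on the original execution queue"
```

For a job submitted directly to an execution queue, both restart choices give the same answer, which mirrors the last sentence of the paragraph above.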

 

When a job has been marked for automatic restart, several properties can be used to further influence the recovery.

 

 

The first property “Wait … secs before restarting” can be useful when the failure is deemed to be transient due to insufficient resources. In plain language, if you waited a few seconds or minutes the job would have run successfully. In the example above, 60 seconds would be used between restart attempts. The second property “Maximum Restarts” indicates the number of times the plan/job may be automatically restarted. In the example, the limit is set to 5. This means that after 5 attempts the plan/job will be marked as “Failed”. The checkbox “Reset on Restart” governs the reevaluation of active variables. If this checkbox is not enabled (default), active variables are not re-evaluated and whatever values were initially retrieved are used. If this checkbox is enabled, active variables are re-executed prior to the job’s execution.
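To make the effect of the wait interval and the restart limit concrete, the following Python sketch simulates them. It is not how ActiveBatch is implemented; run_with_restarts is a hypothetical function, and the sketch assumes that the initial execution does not count against “Maximum Restarts”. The “Reset on Restart” checkbox is not modeled.

```python
import time


def run_with_restarts(job, wait_secs=60, max_restarts=5, sleep=time.sleep):
    """Run job() until it succeeds or the restart limit is used up.

    job is any callable that returns True on success.
    Returns (final_state, restarts_used).
    """
    restarts = 0
    while True:
        if job():
            return ("Succeeded", restarts)
        if restarts >= max_restarts:
            # Limit reached: no further automatic restart is attempted.
            return ("Failed", restarts)
        sleep(wait_secs)  # corresponds to "Wait ... secs before restarting"
        restarts += 1
```

The sleep parameter is injectable so that the policy can be exercised without actually waiting 60 seconds between attempts.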

 

Machine Failure and Restart

 

This type of failure concerns the execution machine itself. In other words, the machine failed and that failure caused the job to fail. The restart choices and restart options are the same as previously mentioned.

 

 

Typically “Failover” or “Restart” are the most prevalent choices. You may select “Failover” if the job has been associated with a generic queue. You may select “Restart” if you need to wait for the failed machine to reboot and otherwise become available.

Regardless of the type of failure and recovery, the restart options are shared. In other words, in the above example, this job can be restarted a maximum of five (5) times.

 

Note: The Restart choices and limitations apply to the automatic restart facility. You may manually issue a Restart command to restart a plan or job. A manual restart resets any limitation counters (i.e. restart limit).

 

Checkpointing

 

Sometimes jobs take a long time to execute. A job may perform a series of discrete but related operations where each “step” is separate but the steps together can take a long time. ActiveBatch provides the ability to restart a plan or job. In addition to restart, ActiveBatch allows a job to associate non-volatile data with itself that can be used as a job checkpoint. A checkpoint, in restart/recovery terms, is data that allows a process to avoid having to start execution at the beginning. For example, if you were searching through files or directories you might decide to checkpoint every tenth filename you processed. That information would be made available to the job if the job were restarted. This is the basic premise for ActiveBatch’s checkpoint facility.

 

Taking the checkpoint is a fairly simple process. You invoke the AbatChkPt utility as part of your job. This utility requires a single parameter, the data you want to associate with this job as your checkpoint. Each time you invoke AbatChkPt you overwrite the previous checkpoint. The data you choose for your checkpoint is completely up to you and the demands of your job.
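As an illustration of the “every tenth filename” idea, a job written in Python might take checkpoints as sketched below. This is a sketch under assumptions: it assumes the utility can be launched from the job’s environment under the name AbatChkPt, and process_files and take_checkpoint are hypothetical helper names, not part of ActiveBatch.

```python
import subprocess


def take_checkpoint(data):
    """Store data as the job's checkpoint, replacing any earlier one.

    Assumption: the utility is on the PATH under the name "AbatChkPt".
    """
    subprocess.run(["AbatChkPt", data], check=True)


def process_files(filenames, handle, checkpoint=take_checkpoint, every=10):
    """Handle each file, checkpointing the name of every tenth one."""
    for count, name in enumerate(filenames, start=1):
        handle(name)
        if count % every == 0:
            # Only the most recent checkpoint is kept by ActiveBatch.
            checkpoint(name)
```

Passing the checkpoint function as a parameter keeps the loop testable outside of a running job.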

To retrieve a checkpoint you must perform several checks first.

 

  1. Check the ABAT_RESTART environment variable (or logical name) and verify it is set to TRUE. This means that the job has been restarted.

  2. Check the ABAT_CHECKPOINT environment variable (or logical name) and verify it is set to TRUE. This means that the checkpoint is legal and available.

  3. The environment variable ABAT_CHECKPOINTVAL contains the checkpoint value. ABAT_CHECKPOINTTIME contains the last checkpoint date and time.
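The three checks above could be written in Python roughly as follows. This is a minimal sketch: read_checkpoint is a hypothetical helper, the comparison with TRUE is assumed to be case-insensitive, and the date and time are returned as the raw string because their format is not described here.

```python
import os


def read_checkpoint(env=os.environ):
    """Return (value, time) when a usable checkpoint exists, else None."""
    def is_true(name):
        # Assumption: "TRUE" may appear in any letter case.
        return env.get(name, "").strip().upper() == "TRUE"

    if not is_true("ABAT_RESTART"):      # step 1: was the job restarted?
        return None
    if not is_true("ABAT_CHECKPOINT"):   # step 2: is a checkpoint available?
        return None
    # Step 3: fetch the checkpoint data and its timestamp.
    return (env.get("ABAT_CHECKPOINTVAL", ""),
            env.get("ABAT_CHECKPOINTTIME", ""))
```

A job would call read_checkpoint() once at startup and fall back to its normal starting point when the result is None.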

 

The checkpoint data may reflect a specific step of processing. For example, on an OpenVMS system the following syntax would be used:

 

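$ start = "begin"
$! Resume at the checkpointed step only if the job was restarted
$! and a valid checkpoint is available.
$ if f$trnlnm("ABAT_RESTART") .and. f$trnlnm("ABAT_CHECKPOINT")
$ then
$   start = f$trnlnm("ABAT_CHECKPOINTVAL")
$ endif
$ goto 'start'
$begin:
.
.
$second_step:
$! Record that processing has reached second_step.
$ abatchkpt "second_step"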

 

The simple passing of each completed step could avoid having to re-run a very long job.
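The same step-by-step pattern can be sketched in Python for jobs that are not DCL procedures. This is an illustration only; run_steps is a hypothetical helper, and the default checkpoint function is a stand-in that merely prints the step name where a real job would invoke AbatChkPt.

```python
def run_steps(steps, resume_from=None, checkpoint=print):
    """Run named steps in order, skipping those before resume_from.

    steps is a list of (name, callable) pairs. resume_from is the
    checkpoint value from a previous run, or None for a fresh start.
    """
    names = [name for name, _ in steps]
    # An unknown or missing checkpoint value means start at the beginning.
    start = names.index(resume_from) if resume_from in names else 0
    for name, action in steps[start:]:
        checkpoint(name)  # record the step that is about to run
        action()
```

On a restart, the value read from ABAT_CHECKPOINTVAL would be passed as resume_from, so steps that completed before the failure are not repeated.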

 

Progress Indicator

 

While the progress indicator is technically not a part of restart/recovery, it is typically used in the same places where a checkpoint might be used. The progress indicator feature of ActiveBatch allows a job author to relay the progress of a running job to clients through many of the GUI displays and views. The progress indicator is invoked as part of the AbatScripting COM object. Specific programming details can be found in the ActiveBatch Developer’s Guide. The purpose of this subsection is to acquaint you with the feature and how it may be used.

 

 

The above figure provides a simple VBS script that shows how the progress indicator is used.

 

 

This snapshot from the Daily Activity list shows the output of the sample progress script.