Run Recovery and Clean-Up Jobs
Applications often need to run recovery and cleanup periodically, checking either the state of the system or state specific to their Types. This is usually done by writing custom cron jobs. The C3 Agentic AI Platform supports recovery jobs that fire every minute, driven by the state objects that user code returns. This removes the need to create custom jobs solely for recovery-related tasks.
Implementation
Users can mix in the base Type Recoverable and provide an implementation for the following two functions:
- states: Indicates the state objects that qualify for recovery. These objects can be custom objects that mix in RecoveryState and can provide any number of custom fields along with the fields in the RecoveryState Type.
- recover: Action that performs the business logic to recover the jobs. This action receives the states returned by the Recoverable#states API and is expected to perform the recovery.
These jobs are invoked every minute, and any errors are logged in a formalized way.
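The two-function contract described above can be illustrated outside the platform. The following Python sketch is not C3 code and makes several assumptions: the class names only mirror Recoverable and RecoveryState, while the fields node_id and action_id, the StuckJob and StuckJobDoctor examples, and the run_once driver are hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryState:
    """Hypothetical stand-in for the RecoveryState Type."""
    node_id: str    # assumed field: node that owns the work
    action_id: str  # assumed field: action that owns the work


class Recoverable(ABC):
    """Hypothetical stand-in for the two functions a Recoverable provides."""

    @abstractmethod
    def states(self) -> List[RecoveryState]:
        """Return the state objects that qualify for recovery."""

    @abstractmethod
    def recover(self, states: List[RecoveryState]) -> None:
        """Perform the business logic that recovers the given states."""


@dataclass
class StuckJob(RecoveryState):
    """Custom state object: RecoveryState fields plus one custom field."""
    job_name: str = ""


class StuckJobDoctor(Recoverable):
    def __init__(self, jobs: List[StuckJob]):
        self.jobs = list(jobs)
        self.recovered: List[str] = []

    def states(self) -> List[StuckJob]:
        # Return a copy so recover() can safely mutate self.jobs.
        return list(self.jobs)

    def recover(self, states: List[StuckJob]) -> None:
        for state in states:
            self.recovered.append(state.job_name)
            self.jobs.remove(state)


def run_once(recoverable: Recoverable) -> List[str]:
    """One tick of an assumed scheduler: call states(), then recover().

    Errors are collected and returned instead of raised, as a stand-in
    for the platform logging them.
    """
    errors: List[str] = []
    try:
        pending = recoverable.states()
        if pending:
            recoverable.recover(pending)
    except Exception as exc:
        errors.append(repr(exc))
    return errors
```

In this sketch, a scheduler would call run_once once per minute. The real platform handles scheduling and error logging itself, so user code only supplies the two functions.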
Usage
Adding Recoverable
An example implementation of this is used in the recovery and cleanup of DbLock entries.
Imagine a scenario where a machine acquires a lock and then goes down. Any other application that requires that lock could be starved forever. Periodic recovery and cleanup are important in these scenarios.
You can add your own Recoverable Type. Create the specialized Type that mixes in Recoverable:
/**
* Recover / fix stuck db lock entries {@link DbLockEntry}
*/
private type DbLockDoctor mixes Recoverable<DbLockEntry>
DbLockDoctor mixes in Recoverable, and its specific state object is the DbLockEntry Type:
- DbLockDoctor implements the states and recover functions and, in the states method, provides the nodeId and action id under which the lock was acquired.
- The recovery engine checks whether the action is still running on the given nodeId and, if not, calls the recovery action with only those states objects that did not have the action running on the given node. In the example above, the engine calls the recover method on DbLockDoctor with the states that did not have any action running, and DbLockDoctor should then recover the lock.
By default, the engine checks whether the action is running on the given node before calling the Recoverable#recover function. Any other recovery logic can be implemented by the user in either the Recoverable#states code or the Recoverable#recover code.
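The default check described above amounts to a filter that runs before recover is called. The following Python sketch shows that filtering idea only and is not the engine's implementation: LockState, select_recoverable, recovery_tick, and the representation of running actions as a set of (node_id, action_id) pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class LockState:
    """Hypothetical state object, loosely modeled on a lock entry."""
    lock_name: str
    node_id: str
    action_id: str


def select_recoverable(
    states: List[LockState], running: Set[Tuple[str, str]]
) -> List[LockState]:
    """Keep only states whose (node_id, action_id) is not currently running."""
    return [s for s in states if (s.node_id, s.action_id) not in running]


def recovery_tick(
    states_fn: Callable[[], List[LockState]],
    recover_fn: Callable[[List[LockState]], None],
    running: Set[Tuple[str, str]],
) -> List[LockState]:
    """One assumed engine tick: fetch states, filter, then recover the rest."""
    orphaned = select_recoverable(states_fn(), running)
    if orphaned:
        # recover is called only with states whose owning action is gone.
        recover_fn(orphaned)
    return orphaned
```

With two lock states where only the first still has its action running, recovery_tick passes just the second state to recover_fn. If every action is still running, recover_fn is not called at all.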
Debug
To identify if there are any errors occurring as a part of this timer task, enable debug logging:
// Set the "c3.server" logger to DEBUG. TIME_IN_SECONDS is a placeholder; replace it with a duration in seconds.
Cluster.setLogLevel("c3.server", "DEBUG", TIME_IN_SECONDS)