crabOverseer documentation

crabOverseer is a tool to manage CRAB tasks and the following post-processing steps.

crabOverseer (and, in general, RunKit) is intended as a submodule for a task-specific framework that sets up the environment and provides the necessary configuration files. Here are some examples of frameworks that use RunKit:

Logical flow

Preparation steps:
1. Set up the framework area and environment on the site where the CRAB outputs will be staged out
2. Define the crabOverseer configurations
3. Set up VOMS certificates
4. Make sure that there is enough space to store the CRAB outputs
  - with the current implementation (to be improved), you should reserve 2x the expected size to store intermediate outputs, which can be removed after the post-processing is finished
5. Run a dry-run submission to make sure that the setup works as expected

Submission and the main crabOverseer loop:
1. Update the status of all active tasks:
  - if a task is new, submit it
  - if a task is submitted on CRAB, update the status
  - if a task has failed on CRAB, create and submit a CRAB recovery task
    - if the maximal number of recovery attempts is reached, create local recovery jobs
  - if a task is finished on CRAB, check that all outputs are present
    - if not, create and submit a CRAB (or local) recovery task
    - if yes, create a local post-processing job
2. Submit and monitor the list of local recovery and post-processing jobs defined in the previous step
  - law is used to submit and monitor the jobs on a local grid
3. If there are unfinished tasks, wait for updateInterval (measured from the start of the most recent step 1) and repeat the loop
4. If all tasks are finished, finish crabOverseer.
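
The per-task decisions in step 1 and the waiting rule in step 3 can be sketched as below. This is only an illustration of the logic described above, not the crabOverseer implementation: the `Status` enum, the function names, and the returned action strings are all hypothetical.

```python
from enum import Enum, auto


class Status(Enum):
    # Hypothetical task states, named after the cases listed in step 1.
    NEW = auto()
    SUBMITTED = auto()
    FAILED = auto()
    CRAB_FINISHED = auto()


def next_step(status, recovery_count, max_recovery_count, outputs_complete):
    """Return the action the loop would take for one task (illustrative only)."""
    if status is Status.NEW:
        return "submit"
    if status is Status.SUBMITTED:
        return "update_status"
    if status is Status.FAILED:
        # Once the recovery budget is used up, fall back to local recovery jobs.
        if recovery_count >= max_recovery_count:
            return "create_local_recovery_jobs"
        return "submit_crab_recovery"
    if status is Status.CRAB_FINISHED:
        if outputs_complete:
            return "create_local_postprocessing_job"
        return "submit_recovery"
    raise ValueError(f"unknown status: {status}")


def seconds_to_wait(iteration_start, now, update_interval_minutes):
    """Time left before the next iteration, counted from the start of the last one."""
    elapsed = now - iteration_start
    return max(0.0, update_interval_minutes * 60 - elapsed)
```

Because the wait is counted from the start of the iteration, an iteration that takes longer than updateInterval is followed immediately by the next one in this sketch.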

Usage

python RunKit/crabOverseer.py [-h] [--work-area WORK_AREA] [--cfg CFG] [--no-status-update] [--update-cfg] [--no-loop] [--select SELECT] [--action ACTION] [--verbose VERBOSE] [task_file ...]

Command line arguments:

Argument	Description
-h, --help	Show the help message and exit
--work-area WORK_AREA	The working area to store crabOverseer state (default: $PWD/.crabOverseer)
--cfg CFG	The main crabOverseer configuration file (see description below)
task_file	The list of files with a description of tasks to be managed by crabOverseer (see description below)
--no-status-update	If specified, do not call CRAB to update task statuses and proceed with the next steps
--update-cfg	Update the main and all task configurations from the config files provided in --cfg and task_file arguments
--no-loop	Run one iteration of task update and submission and exit
--action ACTION	Apply an action on the selected tasks and exit (see description below)
--select SELECT	Select tasks to which the action should be applied (default: select all)
--verbose VERBOSE	Verbosity level (default: 1)

After the first call, the crabOverseer state is stored in the working area and reused by subsequent calls, so those calls need no further arguments. The following command is enough:

python RunKit/crabOverseer.py

Alternatively, if a non-default working area is used:

python RunKit/crabOverseer.py --work-area <working_area>

Main configuration file format

The main crabOverseer configuration file uses the YAML format. It contains definitions that are common to all tasks. Task-specific values can be defined (or overwritten) in the task configuration file.

Supported parameters

Parameter	Description
cmsswPython	Path to the CMSSW Python configuration file. For a nanoAOD production, use `RunKit/nanoProdWrapper.py`.
params	List of parameters that will be passed to `cmsRun` when it executes the `cmsswPython` file
splitting	CRAB job splitting. Currently, only `FileBased` splitting is supported
unitsPerJob	Number of units per job (i.e. files per job for `FileBased` splitting) for the initial task submission. This parameter is halved for each consecutive recovery submission. Suggested value: 16
scriptExe	Executable script that will be run on the remote nodes. Suggested value: `RunKit/crabJob.sh`
outputFiles	List of output files produced by the CRAB job. The suggested value for a nanoAOD production: `- nano.root`
filesToTransfer	List of files that CRAB will transfer to the remote nodes
site	Site where CRAB will transfer job outputs. Example: `T2_CH_CERN`
crabOutput	Path where job outputs will be stored, using `/store/...` notation
localCrabOutput	Path where `crabOutput` is mounted in the local file system
finalOutput	Path in the local file system where the final post-processed outputs will be stored
maxMemory	Memory requirement per job in MB. Suggested value: 2500
numCores	Number of cores per job. Suggested value: 1
inputDBS	Input DBS. Suggested value: global
allowNonValid	Allow processing datasets listed as not VALID on DAS. Suggested value: False
dryrun	Run CRAB in dry-run mode (for testing). Suggested value: False
maxRecoveryCount	Maximum number of recovery attempts. Suggested value: 3
updateInterval	Interval in minutes between the task update and post-processing iterations
localProcessing	Parameters for the local recovery and post-processing step.
localProcessing / lawTask	Name of the law task. Suggested value: `ProdTask`
localProcessing / workflow	Workflow type. Currently, only the `htcondor` workflow is supported
localProcessing / bootstrap	Bootstrap file to load the environment on a remote node. Suggested value: `bootstrap.sh`
localProcessing / requirements	(optional) Additional requirements for a remote node
targetOutputFileSize	Desired size of the output files in MiB. Suggested value: 2048
renewKerberosTicket	Periodically renew the validity of a Kerberos ticket. Suggested value: `True` if run on AFS; otherwise `False`
whitelistFinalRecovery	List of "most reliable" sites where the final recovery will be performed
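
To illustrate how unitsPerJob and maxRecoveryCount interact, the sketch below lists the number of units per job for the initial submission and each recovery attempt. The function name is hypothetical, and two details are assumptions rather than documented behavior: integer division is used when halving, and the value is never allowed to drop below 1.

```python
def units_per_job_schedule(units_per_job, max_recovery_count):
    """List unitsPerJob for the initial submission and each recovery attempt."""
    schedule = [units_per_job]
    for _ in range(max_recovery_count):
        # Halve for each recovery; the floor of one unit per job is an assumption.
        schedule.append(max(1, schedule[-1] // 2))
    return schedule


# With the suggested values unitsPerJob=16 and maxRecoveryCount=3:
print(units_per_job_schedule(16, 3))  # prints [16, 8, 4, 2]
```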

Example configuration file:

cmsswPython: RunKit/nanoProdWrapper.py
params:
  customise: NanoProd/NanoProd/customize.customize
  skimCfg: skim_htt.yaml
  skimSetup: skim
  skimSetupFailed: skim_failed
  maxEvents: -1
splitting: FileBased
unitsPerJob: 16
scriptExe: RunKit/crabJob.sh
outputFiles:
  - nano.root
filesToTransfer:
  - RunKit/crabJob.sh
  - RunKit/crabJob.py
  - RunKit/crabJob_nanoProd.py
  - RunKit/skim_tree.py
  - RunKit/sh_tools.py
  - NanoProd/config/skim_htt.yaml
  - NanoProd/python/customize.py
site: T2_CH_CERN
crabOutput: /store/group/phys_tau/kandroso/prod
localCrabOutput: /eos/cms/store/group/phys_tau/kandroso/prod
finalOutput: /eos/cms/store/group/phys_tau/kandroso/final
maxMemory: 2500
numCores: 1
inputDBS: global
allowNonValid: False
dryrun: False
maxRecoveryCount: 3
updateInterval: 60
localProcessing:
  lawTask: ProdTask
  workflow: htcondor
  bootstrap: bootstrap.sh
targetOutputFileSize: 2048
renewKerberosTicket: True
whitelistFinalRecovery:
  - T1_DE_KIT
  - T2_CH_CERN
  - T2_DE_DESY
  - T2_IT_Legnaro
  - T3_CH_PSI

Task configuration file format

The task configuration file uses the YAML format. Each file contains a list of tasks with the same set of parameters. The parameters defined in the task configuration file overwrite the parameters defined in the main configuration file.
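
One way to picture the override rule is a recursive dictionary merge, sketched below on already-parsed YAML content. This is an assumption about the behavior: the text above only states that task parameters overwrite main ones, and it does not say whether nested sections such as `params` are merged key by key (as done here) or replaced as a whole. The function name `merge_config` is hypothetical.

```python
import copy


def merge_config(main_cfg, task_cfg):
    """Return main_cfg with task_cfg applied on top, leaving both inputs untouched."""
    merged = copy.deepcopy(main_cfg)
    for key, value in task_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumed behavior: nested mappings are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


main_cfg = {"unitsPerJob": 16, "params": {"maxEvents": -1}}
task_cfg = {"unitsPerJob": 8, "params": {"sampleType": "mc"}}
merged = merge_config(main_cfg, task_cfg)
# merged["unitsPerJob"] is 8; merged["params"] holds both maxEvents and sampleType
```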

Format

config: section with the task-specific parameters, in the same format as the main configuration

All items with names other than "config" are considered task descriptions. Two formats are possible:

short format:

dataset_name: dataset_path

Example:

TTTo2L2Nu: /TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v1/MINIAODSIM

long format, where some parameters are overwritten:

dataset_name:
    path: dataset_path
    param1: value1
...

Example:

QCD_HT200to300:
    inputDataset: /QCD_HT200to300_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM
    ignoreFiles:
        - /store/mc/RunIISummer20UL18MiniAODv2/QCD_HT200to300_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/70000/C38FF40C-E9F0-CB48-B9C7-1E874A4AF010.root
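
The two formats can be brought to a common shape as sketched below, working on the already-parsed YAML content of a task file. The function name is hypothetical, and storing the short-format path under the key `inputDataset` is an assumption taken from the long-format example above. The dataset paths in the usage lines are placeholders.

```python
def parse_task_file(content):
    """Split parsed task-file content into the shared config and the task descriptions."""
    shared = content.get("config", {})
    tasks = {}
    for name, desc in content.items():
        if name == "config":
            continue
        if isinstance(desc, str):
            # Short format: the value is the dataset path itself.
            desc = {"inputDataset": desc}
        tasks[name] = dict(desc)
    return shared, tasks


content = {
    "config": {"params": {"sampleType": "mc"}},
    "SampleA": "/SampleA/placeholder/MINIAODSIM",
    "SampleB": {"inputDataset": "/SampleB/placeholder/MINIAODSIM", "ignoreFiles": []},
}
shared, tasks = parse_task_file(content)
```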

Example configuration file:

config:
  params:
    sampleType: mc
    era: Run2_2018
    storeFailed: True

TTTo2L2Nu: /TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v1/MINIAODSIM
TTToHadronic: /TTToHadronic_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v1/MINIAODSIM
TTToSemiLeptonic: /TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM

Actions

When something goes wrong and automatic recovery fails, manual intervention on the tasks may be necessary. This can be done with crabOverseer actions. In complicated cases, manual editing of the task configuration and status files may also be required.

An action is applied from the command line to the selected tasks. Tasks can be selected with the --select argument using Python syntax. The selection is applied to each individual crabTask object.

Examples:

--select 'name == "TTTo2L2Nu"' : select task with name TTTo2L2Nu
--select 'task.params["sampleType"] == "mc"': select all tasks that process MC datasets
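
A minimal sketch of how such a selection could be evaluated is shown below. It is not the crabOverseer code: the function name is hypothetical, the stand-in task objects only carry the two attributes used in the examples, and the second task name is invented for the demonstration. Exposing both `task` and `name` in the expression scope is inferred from the two examples above.

```python
from types import SimpleNamespace


def select_tasks(tasks, expression):
    """Return the tasks for which the Python expression evaluates to a true value."""
    selected = []
    for task in tasks:
        # Expose both `task` and its `name`, matching the two example expressions.
        scope = {"task": task, "name": task.name}
        # Note: eval is not a sandbox; expressions must come from a trusted user.
        if eval(expression, {"__builtins__": {}}, scope):
            selected.append(task)
    return selected


tasks = [
    SimpleNamespace(name="TTTo2L2Nu", params={"sampleType": "mc"}),
    SimpleNamespace(name="DemoDataTask", params={"sampleType": "data"}),
]
mc_tasks = select_tasks(tasks, 'task.params["sampleType"] == "mc"')
```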

The following actions are supported:

print: print names of the selected tasks

Example:

python RunKit/crabOverseer.py --action print --select 'task.params["sampleType"] == "mc"'

list_files_to_process: print a list of files that are yet to be processed

Example:

python RunKit/crabOverseer.py --action list_files_to_process --select 'name == "TTTo2L2Nu"'

kill: kill selected tasks

Example:

python RunKit/crabOverseer.py --action kill --select 'name == "TTTo2L2Nu"'

remove: remove selected tasks (cannot be undone!)

Example:

python RunKit/crabOverseer.py --action remove --select 'name == "TTTo2L2Nu"'

remove_final_output: remove final outputs of selected tasks

Example:

python RunKit/crabOverseer.py --action remove_final_output --select 'name == "TTTo2L2Nu"'

run_cmd: execute the specified Python code on each selected task

Examples:

python RunKit/crabOverseer.py --action 'run_cmd task.kill()' --select 'name == "TTTo2L2Nu"'
python RunKit/crabOverseer.py --action 'run_cmd task.taskStatus.status = status.WaitingForRecovery' --select 'name == "TTTo2L2Nu"'
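
The run_cmd examples pass the action name and the code to run as a single --action value. The sketch below shows one way such a value could be split and the code executed per task. Both functions and the `DemoTask` class are hypothetical stand-ins, and the names visible to the executed code (here only `task`) are an assumption; the second example above suggests the real tool also exposes `status`.

```python
def split_action(action):
    """Split an --action value into the action name and its optional argument."""
    name, _, argument = action.strip().partition(" ")
    return name, argument.strip()


def run_cmd(task, code):
    """Execute the given Python code with the task available as `task`."""
    exec(code, {}, {"task": task})


class DemoTask:
    """Stand-in for a task object; only records whether kill() was called."""

    def __init__(self):
        self.killed = False

    def kill(self):
        self.killed = True


name, code = split_action("run_cmd task.kill()")
demo = DemoTask()
if name == "run_cmd":
    run_cmd(demo, code)
```

Actions without an argument, such as print or kill, yield an empty argument string in this sketch.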