
caper run: error: argument --pbs-extra-param: expected one argument #27

Closed
tanpuekai opened this issue Oct 29, 2019 · 5 comments

@tanpuekai

Hi Jin,

I figured out why my jobs were killed for excessive CPU use: all of the work was running inside a single job, so all of the CPU usage was concentrated in that one job.

So the apparent solution, from your Caper documentation, seems to be setting up a 'backend' so that the smaller tasks (such as alignment) are themselves submitted to the job queue to run.

Our HPC uses PBS. Below is my command for running Caper:

caper run \
    -i input.json \
    --tmp-dir /TMPDIR3 \
    --pbs-queue normal \
    --pbs-extra-param "-l nodes=1:ppn=12,mem=45g,walltime=96:00:00" \
    --pbs-extra-param "-N Task.backend" \
    --pbs-extra-param " -j oe" \
    ../atac-seq-pipeline/atac.wdl

and here is the content of ~/.caper/default.conf (it is empty):

Running the above command results in this error message:

usage: caper run [-h] [--dry-run] [-i INPUTS] [-o OPTIONS] [-l LABELS]
                 [-p IMPORTS] [-s STR_LABEL] [--hold]
                 [--singularity-cachedir SINGULARITY_CACHEDIR] [--no-deepcopy]
                 [--deepcopy-ext DEEPCOPY_EXT]
                 [--docker [DOCKER [DOCKER ...]]]
                 [--singularity [SINGULARITY [SINGULARITY ...]]]
                 [--no-build-singularity] [--slurm-partition SLURM_PARTITION]
                 [--slurm-account SLURM_ACCOUNT]
                 [--slurm-extra-param SLURM_EXTRA_PARAM] [--sge-pe SGE_PE]
                 [--sge-queue SGE_QUEUE] [--sge-extra-param SGE_EXTRA_PARAM]
                 [--pbs-queue PBS_QUEUE] [--pbs-extra-param PBS_EXTRA_PARAM]
                 [-m METADATA_OUTPUT] [--java-heap-run JAVA_HEAP_RUN]
                 [--db-timeout DB_TIMEOUT] [--file-db FILE_DB] [--no-file-db]
                 [--mysql-db-ip MYSQL_DB_IP] [--mysql-db-port MYSQL_DB_PORT]
                 [--mysql-db-user MYSQL_DB_USER]
                 [--mysql-db-password MYSQL_DB_PASSWORD] [--cromwell CROMWELL]
                 [--max-concurrent-tasks MAX_CONCURRENT_TASKS]
                 [--max-concurrent-workflows MAX_CONCURRENT_WORKFLOWS]
                 [--max-retries MAX_RETRIES] [--disable-call-caching]
                 [--backend-file BACKEND_FILE] [--out-dir OUT_DIR]
                 [--tmp-dir TMP_DIR] [--gcp-prj GCP_PRJ]
                 [--gcp-zones GCP_ZONES] [--out-gcs-bucket OUT_GCS_BUCKET]
                 [--tmp-gcs-bucket TMP_GCS_BUCKET]
                 [--aws-batch-arn AWS_BATCH_ARN] [--aws-region AWS_REGION]
                 [--out-s3-bucket OUT_S3_BUCKET]
                 [--tmp-s3-bucket TMP_S3_BUCKET] [--use-gsutil-over-aws-s3]
                 [-b BACKEND] [--http-user HTTP_USER]
                 [--http-password HTTP_PASSWORD] [--use-netrc]
                 wdl
caper run: error: argument --pbs-extra-param: expected one argument

Could you please take a look at what the problem is, and how to solve it?
Thanks

Chan

@leepc12 (Contributor) commented Oct 29, 2019

First of all, --pbs-extra-param cannot be used multiple times.

To pass a string starting with a dash (-) to Python argparse, you need a whitespace between the opening " and the -.

--pbs-extra-param " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00"

But those three parameters are already taken care of by the pipeline. Don't try to add them manually.

Your conf file is empty, which suggests you didn't run caper init pbs. Your conf file should have something like backend=pbs.

Please run caper init pbs first and follow the instructions for PBS.
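
For reference, a minimal sketch of the initialization plus a queue setting in the conf file (key names are assumed to mirror the CLI flag names, and the exact generated contents may differ by Caper version):

$ caper init pbs
$ cat ~/.caper/default.conf
backend=pbs
pbs-queue=normal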

@tanpuekai (Author)

Hi Jin,

Thanks.

I've now run caper init pbs, and ~/.caper/default.conf now looks like this:

$ cat ~/.caper/default.conf
backend=pbs

The caper command now looks like this (note that I added a space between " and -l):

caper run \
    -i input.json \
    --tmp-dir /TMPDIR3 \
    --pbs-queue normalQ \
    --pbs-extra-param " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00 -N TSK.backend -j oe" \
    ../atac-seq-pipeline/atac.wdl

It seems to be running now (it has been going for half an hour, longer than before I made the above two changes). Is this how it should be configured?

Or should I move the params " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00 -N TSK.backend -j oe" to the ~/.caper/default.conf file?

Thanks.

Best
Chan

@leepc12 (Contributor) commented Oct 30, 2019

Please don't define --pbs-extra-param for resources, job name, or join (-j). Resource parameters are already taken care of by the pipeline: for each task, the pipeline internally runs qsub with that task's resource parameters, like this:

qsub ... -N {TASK_NAME} -lselect=1:ncpus={CPU}:mem={MEM_MB}mb -lwalltime={WALLTIME_HOUR}:0:0

See this for details.
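
In other words, once the resource, job-name, and join flags are dropped, a minimal sketch of the command is roughly (queue name illustrative):

caper run \
    -i input.json \
    --pbs-queue normalQ \
    ../atac-seq-pipeline/atac.wdl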

@tanpuekai (Author) commented Oct 31, 2019

Hi Jin,

It seems there may be a bug in how Caper handles the PBS backend? Specifically, I am referring to the check-alive commands in your caper_backend.py.

Briefly, thanks to your help, I was able to run the job with the PBS backend, using this command:

caper run \
    -i input.json \
    --tmp-dir TMPDIR3 \
    --pbs-queue legacy \
    ../atac-seq-pipeline/atac.wdl

and with this Caper conf file:

user@hpc:/projet1$ cat ~/.caper/default.conf
backend=pbs
user@hpc:/projet1$ qstat
user@hpc:/projet1$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
334705.omics      cromwell_52491e  user            03:52:10 R legacy
334706.omics      cromwell_52491e  user            03:53:43 R legacy
334707.omics      cromwell_52491e  user            03:44:20 R legacy
334708.omics      cromwell_52491e  user            03:56:16 R legacy
334709.omics      cromwell_52491e  user            03:42:30 R legacy
334710.omics      cromwell_52491e  user            03:47:18 R legacy
334711.omics      cromwell_52491e  user            03:38:46 R legacy
334712.omics      cromwell_52491e  user            03:51:21 R legacy

I was very happy, but then one hour later, the main job was killed, and below is the log:

[2019-10-30 11:41:24,51] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: executing: qstat -j 334706
[2019-10-30 11:41:24,57] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:25,34] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: executing: qstat -j 334705
[2019-10-30 11:41:25,36] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:27,70] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: executing: qstat -j 334709
[2019-10-30 11:41:27,72] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:43,31] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: executing: qstat -j 334712
[2019-10-30 11:41:43,32] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:43,87] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: executing: qstat -j 334707
[2019-10-30 11:41:43,89] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:52,70] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: executing: qstat -j 334713
[2019-10-30 11:41:52,72] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:54,28] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: executing: qstat -j 334715
[2019-10-30 11:41:54,30] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:55,35] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: executing: qstat -j 334710
[2019-10-30 11:41:55,37] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:57,77] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: executing: qstat -j 334711
[2019-10-30 11:41:57,79] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:58,80] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: executing: qstat -j 334714
[2019-10-30 11:41:58,82] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:42:00,17] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: executing: qstat -j 334716
[2019-10-30 11:42:00,19] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:42:20,90] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: executing: qstat -j 334708
[2019-10-30 11:42:20,92] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:44:54,31] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:44:54,32] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:44:55,00] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:44:55,01] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:08,12] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:08,13] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:21,34] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:21,35] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:32,58] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:32,58] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:38,22] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:38,23] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:01,25] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:01,26] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:02,95] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:02,95] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:08,95] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:08,96] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:23,98] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:23,99] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:25,62] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:25,62] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:30,46] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:30,46] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:31,40] [error] WorkflowManagerActor Workflow 52491eaf-92d4-4091-a673-9ac6eabd2005 failed (during ExecutingWorkflowState): java.lang.Exception: The job was aborted from outside Cromwell
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:251)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:186)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
        at akka.actor.FSM.processEvent(FSM.scala:687)
        at akka.actor.FSM.processEvent$(FSM.scala:681)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:51)
        at akka.actor.LoggingFSM.processEvent(FSM.scala:820)
        at akka.actor.LoggingFSM.processEvent$(FSM.scala:802)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:51)
        at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:678)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:672)
        at akka.actor.Actor.aroundReceive(Actor.scala:517)
        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
....
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "message": "The job was aborted from outside Cromwell",
                "causedBy": []
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "message": "The job was aborted from outside Cromwell",
                "causedBy": []
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            }
        ],
        "message": "Workflow failed"
    }
]

atac.align_mito Failed. SHARD_IDX=0, RC=None, JOB_ID=334714, RUN_START=2019-10-30T03:38:32.673Z, RUN_END=2019-10-30T03:45:32.592Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-0/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-0/execution/stderr

atac.align_mito Failed. SHARD_IDX=1, RC=None, JOB_ID=334712, RUN_START=2019-10-30T03:38:28.671Z, RUN_END=2019-10-30T03:45:38.233Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-1/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-1/execution/stderr
STDERR_CONTENTS=


atac.align_mito Failed. SHARD_IDX=2, RC=None, JOB_ID=334711, RUN_START=2019-10-30T03:38:26.676Z, RUN_END=2019-10-30T03:46:25.631Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-2/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-2/execution/stderr
STDERR_CONTENTS=


atac.align_mito Failed. SHARD_IDX=3, RC=None, JOB_ID=334716, RUN_START=2019-10-30T03:38:36.665Z, RUN_END=2019-10-30T03:46:08.963Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-3/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-3/execution/stderr

atac.align_mito Failed. SHARD_IDX=4, RC=None, JOB_ID=334713, RUN_START=2019-10-30T03:38:30.665Z, RUN_END=2019-10-30T03:45:08.133Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-4/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-4/execution/stderr

atac.align_mito Failed. SHARD_IDX=5, RC=None, JOB_ID=334715, RUN_START=2019-10-30T03:38:34.666Z, RUN_END=2019-10-30T03:45:21.353Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-5/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-5/execution/stderr

atac.align Failed. SHARD_IDX=0, RC=None, JOB_ID=334707, RUN_START=2019-10-30T03:38:18.678Z, RUN_END=2019-10-30T03:46:30.471Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-0/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-0/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=1, RC=None, JOB_ID=334710, RUN_START=2019-10-30T03:38:24.685Z, RUN_END=2019-10-30T03:46:23.992Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=2, RC=None, JOB_ID=334705, RUN_START=2019-10-30T03:38:14.688Z, RUN_END=2019-10-30T03:46:02.961Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-2/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-2/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=3, RC=None, JOB_ID=334709, RUN_START=2019-10-30T03:38:22.685Z, RUN_END=2019-10-30T03:44:55.012Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-3/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-3/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=4, RC=None, JOB_ID=334708, RUN_START=2019-10-30T03:38:20.670Z, RUN_END=2019-10-30T03:46:01.263Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-4/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-4/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=5, RC=None, JOB_ID=334706, RUN_START=2019-10-30T03:38:16.667Z, RUN_END=2019-10-30T03:44:54.327Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr
STDERR_CONTENTS=

[Caper] run:  1 52491eaf-92d4-4091-a673-9ac6eabd2005 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/metadata.json

So I looked into the stderr files:

user@omics:/project1$ ll /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr*
-rw-r--r-- 1 user group1   0 Oct 30 11:38 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr
-rw-r--r-- 1 user group1 410 Oct 30 11:41 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr.check
-rw-r--r-- 1 user group1   0 Oct 30 11:38 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr.submit

So I cat the stderr.check file, which had non-zero size:

user@hpc:/project1$ cat atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stderr.check
qstat: invalid option -- 'j'
usage:
qstat [-f] [-J] [-p] [-t] [-x] [-E] [-F format] [-D delim] [ job_identifier... | destination... ]
qstat [-a|-i|-r|-H|-T] [-J] [-t] [-u user] [-n] [-s] [-G|-M] [-1] [-w]
        [ job_identifier... | destination... ]
qstat -Q [-f] [-F format] [-D delim] [ destination... ]
qstat -q [-G|-M] [ destination... ]
qstat -B [-f] [-F format] [-D delim] [ server_name... ]
qstat --version

I went and checked your source code:

$ grep "qstat" *.py
caper_backend.py:                        "check-alive": "qstat -j ${job_id}",
caper_backend.py:                        "check-alive": "qstat -j ${job_id}",

What I can confirm is that qstat -j is not right; qstat does not accept -j, at least on our HPC.
Should it be 'qstat -J' instead?

Thanks.

Best
Chan

@leepc12 (Contributor) commented Oct 31, 2019

Can you edit caper_backend.py to remove -j from the "check-alive" command line and try again?

To find the correct caper_backend.py:

$ python3 -c "import caper; print(caper.caper_args.__file__)"
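
A minimal sketch of the edit itself, assuming the PBS check-alive string matches the grep output above (a bare qstat <job_id> is accepted by both Torque and PBS Pro):

# in caper_backend.py, PBS backend section
# before:
#     "check-alive": "qstat -j ${job_id}",
# after (drop -j):
"check-alive": "qstat ${job_id}",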
