
caper run: error: argument --pbs-extra-param: expected one argument #27

Closed
tanpuekai opened this issue Oct 29, 2019 · 5 comments

@tanpuekai

Hi Jin,

I figured out why my jobs were killed for excessive CPU use: all of the work was running inside a single job, so all of the CPU usage was concentrated in that one job.

So the apparent solution, from your Caper documentation, seems to be setting up a 'backend' so that the smaller tasks (such as alignment) are themselves submitted to the job queue to run.

Our HPC uses PBS. Below is my command for running Caper:

caper run \
    -i input.json \
    --tmp-dir /TMPDIR3 \
    --pbs-queue normal \
    --pbs-extra-param "-l nodes=1:ppn=12,mem=45g,walltime=96:00:00" \
    --pbs-extra-param "-N Task.backend" \
    --pbs-extra-param " -j oe" \
    ../atac-seq-pipeline/atac.wdl

and here is the content of ~/.caper/default.conf (it is empty):

Running the above command results in this error message:

usage: caper run [-h] [--dry-run] [-i INPUTS] [-o OPTIONS] [-l LABELS]
                 [-p IMPORTS] [-s STR_LABEL] [--hold]
                 [--singularity-cachedir SINGULARITY_CACHEDIR] [--no-deepcopy]
                 [--deepcopy-ext DEEPCOPY_EXT]
                 [--docker [DOCKER [DOCKER ...]]]
                 [--singularity [SINGULARITY [SINGULARITY ...]]]
                 [--no-build-singularity] [--slurm-partition SLURM_PARTITION]
                 [--slurm-account SLURM_ACCOUNT]
                 [--slurm-extra-param SLURM_EXTRA_PARAM] [--sge-pe SGE_PE]
                 [--sge-queue SGE_QUEUE] [--sge-extra-param SGE_EXTRA_PARAM]
                 [--pbs-queue PBS_QUEUE] [--pbs-extra-param PBS_EXTRA_PARAM]
                 [-m METADATA_OUTPUT] [--java-heap-run JAVA_HEAP_RUN]
                 [--db-timeout DB_TIMEOUT] [--file-db FILE_DB] [--no-file-db]
                 [--mysql-db-ip MYSQL_DB_IP] [--mysql-db-port MYSQL_DB_PORT]
                 [--mysql-db-user MYSQL_DB_USER]
                 [--mysql-db-password MYSQL_DB_PASSWORD] [--cromwell CROMWELL]
                 [--max-concurrent-tasks MAX_CONCURRENT_TASKS]
                 [--max-concurrent-workflows MAX_CONCURRENT_WORKFLOWS]
                 [--max-retries MAX_RETRIES] [--disable-call-caching]
                 [--backend-file BACKEND_FILE] [--out-dir OUT_DIR]
                 [--tmp-dir TMP_DIR] [--gcp-prj GCP_PRJ]
                 [--gcp-zones GCP_ZONES] [--out-gcs-bucket OUT_GCS_BUCKET]
                 [--tmp-gcs-bucket TMP_GCS_BUCKET]
                 [--aws-batch-arn AWS_BATCH_ARN] [--aws-region AWS_REGION]
                 [--out-s3-bucket OUT_S3_BUCKET]
                 [--tmp-s3-bucket TMP_S3_BUCKET] [--use-gsutil-over-aws-s3]
                 [-b BACKEND] [--http-user HTTP_USER]
                 [--http-password HTTP_PASSWORD] [--use-netrc]
                 wdl
caper run: error: argument --pbs-extra-param: expected one argument

Could you please take a look at what the problem is, and how to solve it?
Thanks

Chan

@leepc12 (Contributor) commented Oct 29, 2019

First of all, --pbs-extra-param cannot be used multiple times.

To pass a string starting with a dash (-) to Python argparse, you need a whitespace between the opening " and the -.

--pbs-extra-param " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00"

But those three parameters are already taken care of by the pipeline. Don't try to add them manually.

Your conf file is empty, which suggests you didn't run caper init pbs. Your conf file should have something like backend=pbs.

Please run caper init pbs first and follow the instructions for PBS.
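
For reference, a minimal sketch of the initialization plus a queue setting in the conf file (key names are assumed to mirror the CLI flag names, and the exact generated contents may differ by Caper version):

$ caper init pbs
$ cat ~/.caper/default.conf
backend=pbs
pbs-queue=normal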

@tanpuekai (Author)

Hi Jin,

Thanks.

I've now run caper init pbs, and ~/.caper/default.conf now looks like this:

$ cat ~/.caper/default.conf
backend=pbs

The caper command now looks like this (note that I added a space between " and -l):

caper run \
    -i input.json \
    --tmp-dir /TMPDIR3 \
    --pbs-queue normalQ \
    --pbs-extra-param " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00 -N TSK.backend -j oe" \
    ../atac-seq-pipeline/atac.wdl

It seems to be running now (it has been going for half an hour, longer than before I made the above two changes). Is this how it should be configured?

Or should I move the params " -l nodes=1:ppn=12,mem=45g,walltime=96:00:00 -N TSK.backend -j oe" to the ~/.caper/default.conf file?

Thanks.

Best
Chan

@leepc12 (Contributor) commented Oct 30, 2019

Please don't define --pbs-extra-param for resources, job name, or join (-j). Resource parameters are already taken care of by the pipeline: for each task, the pipeline internally runs qsub with that task's resource parameters, like this:

qsub ... -N {TASK_NAME} -lselect=1:ncpus={CPU}:mem={MEM_MB}mb -lwalltime={WALLTIME_HOUR}:0:0

See this for details.
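
In other words, once the resource, job-name, and join flags are dropped, a minimal sketch of the command is roughly (queue name illustrative):

caper run \
    -i input.json \
    --pbs-queue normalQ \
    ../atac-seq-pipeline/atac.wdl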

@tanpuekai (Author) commented Oct 31, 2019

Hi Jin,

It seems there may be a bug in how Caper handles the PBS backend? Specifically, I am referring to the check-alive commands in your caper_backend.py.

Briefly, thanks to your help, I was able to run the job with the PBS backend, using this command:

caper run \
    -i input.json \
    --tmp-dir TMPDIR3 \
    --pbs-queue legacy \
    ../atac-seq-pipeline/atac.wdl

and with this Caper conf file:

user@hpc:/projet1$ cat ~/.caper/default.conf
backend=pbs
user@hpc:/projet1$ qstat
user@hpc:/projet1$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
334705.omics      cromwell_52491e  user            03:52:10 R legacy
334706.omics      cromwell_52491e  user            03:53:43 R legacy
334707.omics      cromwell_52491e  user            03:44:20 R legacy
334708.omics      cromwell_52491e  user            03:56:16 R legacy
334709.omics      cromwell_52491e  user            03:42:30 R legacy
334710.omics      cromwell_52491e  user            03:47:18 R legacy
334711.omics      cromwell_52491e  user            03:38:46 R legacy
334712.omics      cromwell_52491e  user            03:51:21 R legacy

I was very happy, but then one hour later, the main job was killed, and below is the log:

[2019-10-30 11:41:24,51] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: executing: qstat -j 334706
[2019-10-30 11:41:24,57] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:25,34] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: executing: qstat -j 334705
[2019-10-30 11:41:25,36] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:27,70] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: executing: qstat -j 334709
[2019-10-30 11:41:27,72] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:43,31] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: executing: qstat -j 334712
[2019-10-30 11:41:43,32] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:43,87] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: executing: qstat -j 334707
[2019-10-30 11:41:43,89] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:52,70] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: executing: qstat -j 334713
[2019-10-30 11:41:52,72] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:54,28] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: executing: qstat -j 334715
[2019-10-30 11:41:54,30] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:55,35] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: executing: qstat -j 334710
[2019-10-30 11:41:55,37] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:57,77] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: executing: qstat -j 334711
[2019-10-30 11:41:57,79] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:41:58,80] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: executing: qstat -j 334714
[2019-10-30 11:41:58,82] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:42:00,17] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: executing: qstat -j 334716
[2019-10-30 11:42:00,19] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:42:20,90] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: executing: qstat -j 334708
[2019-10-30 11:42:20,92] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Status change from Running to WaitingForReturnCode
[2019-10-30 11:44:54,31] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:44:54,32] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:5:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:44:55,00] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:44:55,01] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:3:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:08,12] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:08,13] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:4:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:21,34] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:21,35] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:5:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:32,58] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:32,58] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:0:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:45:38,22] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:45:38,23] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:1:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:01,25] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:01,26] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:4:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:02,95] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:02,95] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:2:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:08,95] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:08,96] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:3:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:23,98] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:23,99] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:1:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:25,62] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:25,62] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align_mito:2:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:30,46] [error] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Return file not found after 180 seconds, assuming external kill
[2019-10-30 11:46:30,46] [info] DispatchedConfigAsyncJobExecutionActor [52491eaf atac.align:0:1]: Status change from WaitingForReturnCode to Failed
[2019-10-30 11:46:31,40] [error] WorkflowManagerActor Workflow 52491eaf-92d4-4091-a673-9ac6eabd2005 failed (during ExecutingWorkflowState): java.lang.Exception: The job was aborted from outside Cromwell
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:251)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:186)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
        at akka.actor.FSM.processEvent(FSM.scala:687)
        at akka.actor.FSM.processEvent$(FSM.scala:681)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:51)
        at akka.actor.LoggingFSM.processEvent(FSM.scala:820)
        at akka.actor.LoggingFSM.processEvent$(FSM.scala:802)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:51)
        at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:678)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:672)
        at akka.actor.Actor.aroundReceive(Actor.scala:517)
        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
....
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "message": "The job was aborted from outside Cromwell",
                "causedBy": []
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            },
            {
                "message": "The job was aborted from outside Cromwell",
                "causedBy": []
            },
            {
                "causedBy": [],
                "message": "The job was aborted from outside Cromwell"
            }
        ],
        "message": "Workflow failed"
    }
]

atac.align_mito Failed. SHARD_IDX=0, RC=None, JOB_ID=334714, RUN_START=2019-10-30T03:38:32.673Z, RUN_END=2019-10-30T03:45:32.592Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-0/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-0/execution/stderr

atac.align_mito Failed. SHARD_IDX=1, RC=None, JOB_ID=334712, RUN_START=2019-10-30T03:38:28.671Z, RUN_END=2019-10-30T03:45:38.233Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-1/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-1/execution/stderr
STDERR_CONTENTS=


atac.align_mito Failed. SHARD_IDX=2, RC=None, JOB_ID=334711, RUN_START=2019-10-30T03:38:26.676Z, RUN_END=2019-10-30T03:46:25.631Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-2/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-2/execution/stderr
STDERR_CONTENTS=


atac.align_mito Failed. SHARD_IDX=3, RC=None, JOB_ID=334716, RUN_START=2019-10-30T03:38:36.665Z, RUN_END=2019-10-30T03:46:08.963Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-3/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-3/execution/stderr

atac.align_mito Failed. SHARD_IDX=4, RC=None, JOB_ID=334713, RUN_START=2019-10-30T03:38:30.665Z, RUN_END=2019-10-30T03:45:08.133Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-4/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-4/execution/stderr

atac.align_mito Failed. SHARD_IDX=5, RC=None, JOB_ID=334715, RUN_START=2019-10-30T03:38:34.666Z, RUN_END=2019-10-30T03:45:21.353Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-5/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align_mito/shard-5/execution/stderr

atac.align Failed. SHARD_IDX=0, RC=None, JOB_ID=334707, RUN_START=2019-10-30T03:38:18.678Z, RUN_END=2019-10-30T03:46:30.471Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-0/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-0/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=1, RC=None, JOB_ID=334710, RUN_START=2019-10-30T03:38:24.685Z, RUN_END=2019-10-30T03:46:23.992Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=2, RC=None, JOB_ID=334705, RUN_START=2019-10-30T03:38:14.688Z, RUN_END=2019-10-30T03:46:02.961Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-2/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-2/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=3, RC=None, JOB_ID=334709, RUN_START=2019-10-30T03:38:22.685Z, RUN_END=2019-10-30T03:44:55.012Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-3/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-3/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=4, RC=None, JOB_ID=334708, RUN_START=2019-10-30T03:38:20.670Z, RUN_END=2019-10-30T03:46:01.263Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-4/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-4/execution/stderr
STDERR_CONTENTS=


atac.align Failed. SHARD_IDX=5, RC=None, JOB_ID=334706, RUN_START=2019-10-30T03:38:16.667Z, RUN_END=2019-10-30T03:44:54.327Z, STDOUT=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stdout, STDERR=/project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr
STDERR_CONTENTS=

[Caper] run:  1 52491eaf-92d4-4091-a673-9ac6eabd2005 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/metadata.json

So I looked into the stderr files:

user@omics:/project1$ ll /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr*
-rw-r--r-- 1 user group1   0 Oct 30 11:38 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr
-rw-r--r-- 1 user group1 410 Oct 30 11:41 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr.check
-rw-r--r-- 1 user group1   0 Oct 30 11:38 /project1/atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-5/execution/stderr.submit

So I cat the stderr.check file, which had non-zero size:

user@hpc:/project1$ cat atac/52491eaf-92d4-4091-a673-9ac6eabd2005/call-align/shard-1/execution/stderr.check
qstat: invalid option -- 'j'
usage:
qstat [-f] [-J] [-p] [-t] [-x] [-E] [-F format] [-D delim] [ job_identifier... | destination... ]
qstat [-a|-i|-r|-H|-T] [-J] [-t] [-u user] [-n] [-s] [-G|-M] [-1] [-w]
        [ job_identifier... | destination... ]
qstat -Q [-f] [-F format] [-D delim] [ destination... ]
qstat -q [-G|-M] [ destination... ]
qstat -B [-f] [-F format] [-D delim] [ server_name... ]
qstat --version

I went and checked your source code:

$ grep "qstat" *.py
caper_backend.py:                        "check-alive": "qstat -j ${job_id}",
caper_backend.py:                        "check-alive": "qstat -j ${job_id}",

What I can confirm is that qstat -j is not right; qstat does not accept -j, at least on our HPC.
Should it be 'qstat -J' instead?

Thanks.

Best
Chan

@leepc12 (Contributor) commented Oct 31, 2019

Can you edit caper_backend.py to remove -j from the "check-alive" command line and try again?

To find the correct caper_backend.py:

$ python3 -c "import caper; print(caper.caper_args.__file__)"
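
A minimal sketch of the edit itself, assuming the PBS check-alive string matches the grep output above (a bare qstat <job_id> is accepted by both Torque and PBS Pro):

# in caper_backend.py, PBS backend section
# before:
#     "check-alive": "qstat -j ${job_id}",
# after (drop -j):
"check-alive": "qstat ${job_id}",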
