The following exercises deal with Condor-G and DAGMan.
Check the Condor queue with condor_q
Condor should already be set up and running on terminable.ci.uchicago.edu. You can check this by running condor_q:
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:36236> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
When you submit jobs using Condor, you will see your jobs listed in the output of condor_q. You might also see other students' jobs in the queue alongside your own.
Create Your Working Directories
Next, create some directories to work in. Make them under /work, in a directory named after your login:
$ cd /work
$ mkdir YOURLOGIN
$ cd YOURLOGIN
$ mkdir condor-tutorial
$ cd condor-tutorial
$ mkdir submit
Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.
There are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer terminable.ci.uchicago.edu and running under the "jobmanager-fork" job manager. We're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer.
For more information, see the condor_submit manual.
Feel free to use your favorite editor, but we will demonstrate with cat in the example below. When using cat to create files, press Ctrl+D to close the file - don't actually type Ctrl+D into the file. Whenever you create a file using cat, we suggest you use cat to display the file and confirm that it contains the expected text.
Move to our scratch submission directory and create the submit file. Verify that it was entered correctly:
$ cd /work/YOURLOGIN/condor-tutorial/submit
$ cat > myjob.submit
executable=/home/benc//primetest
arguments=117
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
queue
Ctrl+D
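If typing the file with cat and Ctrl+D proves error-prone, a here-document can write the same content in one step. The following is a minimal sketch rather than part of the original exercise: the workdir variable is an assumption made for illustration (a throwaway scratch directory), whereas in the tutorial you would run this in /work/YOURLOGIN/condor-tutorial/submit. The file content is copied from the example above.

```shell
#!/bin/sh
# Sketch: create myjob.submit non-interactively with a here-document.
# workdir is a throwaway scratch directory used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The quoted 'EOF' delimiter stops the shell from expanding anything inside.
cat > myjob.submit <<'EOF'
executable=/home/benc//primetest
arguments=117
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
queue
EOF

# Display the file to confirm that it contains the expected text.
cat myjob.submit
```

Because the delimiter is quoted, the text is written exactly as shown, which makes this form convenient to paste into a terminal or keep in a script.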
$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
Run condor_q to see the progress of your job. You may also want to run condor_q -globus at regular intervals to see Globus-specific status information. (See the condor_q manual for more information.)
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 YOURLOGIN 7/10 17:28 0+00:00:00 I 0 0.0 myscript.sh TestJo
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
1.0 YOURLOGIN UNSUBMITTED fork gridlab2.ci.uchicago.edu /home/YOURLOGIN/cond
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 YOURLOGIN 7/10 17:28 0+00:00:27 R 0 0.0 myscript.sh TestJo
1 jobs; 0 idle, 1 running, 0 held
$ condor_q -globus
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
1.0 YOURLOGIN ACTIVE fork gridlab2.ci.uchicago.edu /home/YOURLOGIN/cond
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 YOURLOGIN 7/10 17:28 0+00:00:40 C 0 0.0 myscript.sh
0 jobs; 0 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
1.0 adesmet DONE fork gridlab2.ci.uchicago.edu /home/YOURLOGIN/cond
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
In another window, run tail -f on the log file for your job to monitor progress. Re-run tail when you submit one or more jobs throughout this tutorial. You will see how typical Condor-G jobs progress. Use Ctrl+C to stop watching the file.
$ cd /work/YOURLOGIN/condor-tutorial/submit
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.135.125.193:35688>
...
017 (001.000.000) 03/24 19:13:30 Job submitted to Globus
    RM-Contact: terminable.ci.uchicago.edu/jobmanager-fork
    JM-Contact: https://terminable.ci.uchicago.edu:34127/28997/1174763610/
    Can-Restart-JM: 1
...
027 (001.000.000) 07/10 17:29:01 Job submitted to grid resource
    GridResource: gt2 terminable.ci.uchicago.edu/jobmanager-fork
    GridJobId: gt2 terminable.ci.uchicago.edu/jobmanager-fork https://terminable.ci.uchicago.edu:51277/31413/1174756212/
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
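The log can be long, so it can help to reduce it to one line per event. The sketch below assumes that each event starts with a three-digit event number followed by the job ID in parentheses, as in the output above. summarize_log is a hypothetical helper name used for illustration, not a Condor command, and the sample log is an excerpt of the output above so that the sketch runs without Condor installed.

```shell
#!/bin/sh
# Sketch: list the events recorded in a Condor user log, one per line.
# summarize_log is a hypothetical helper name; it is not a Condor command.
summarize_log() {
    awk '/^[0-9][0-9][0-9] \([0-9.]+\) / {
        # Drop the first four fields (event number, job ID, date, time)
        # so that only the event description remains.
        text = $0
        sub(/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +/, "", text)
        print $1, $2, text
    }' "$1"
}

# Sample excerpt copied from the tutorial output (assumed layout).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <128.135.125.193:35688>
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
    (1) Normal termination (return value 0)
...
EOF

summarize_log "$sample_log"
```

In the tutorial directory you would call summarize_log results.log instead of using the sample file.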
When the job is no longer listed in condor_q, or when the log file reports Job terminated, the results can be viewed using condor_history:
$ condor_history
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
1.0 YOURLOGIN 7/10 10:28 0+00:00:00 C ??? /home/YOURLOGIN/cond
When the job completes, verify that the output is as expected.
$ ls
myjob.submit myscript.sh* results.error results.log results.output
$ cat results.error
$ cat results.output
NO - 3 is a factor
If you didn't watch results.log with tail -f, you can examine the logged information afterwards with cat results.log.
When a problem occurs in the middleware, Condor-G will hold your job. Held jobs remain in the queue, waiting for user intervention. When you resolve the problem, you can use condor_release to free the job to continue.
Use condor_hold to manually place jobs on hold (e.g., to delay your run).
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
Submit the job again, but this time immediately after submitting it, mark the output file as read-only:
$ condor_submit myjob.submit ; chmod a-w results.output
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
Watch the job with tail. When the job goes on hold, use Ctrl+C to exit tail. Note that condor_q reports the job in the H ("held") state.
$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <128.135.125.193:32864>
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 terminable.ci.uchicago.edu/jobmanager-fork
    GridJobId: gt2 terminable.ci.uchicago.edu/jobmanager-fork https://terminable.ci.uchicago.edu:44026/31670/1174757075/
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
012 (003.000.000) 07/12 22:36:52 Job was held.
    Globus error 155: the job manager could not stage out a file
    Code 2 Subcode 155
...
Ctrl+C
$ condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:32864> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 YOURLOGIN 7/12 22:35 0+00:00:55 H 0 0.0 myscript.sh TestJo
1 jobs; 0 idle, 0 running, 1 held
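When a log contains many events, the hold reason is easy to miss. The sketch below prints the line that follows each hold event, assuming that hold events carry event number 012 and that the reason sits on the next line, as in the log above. hold_reasons is a hypothetical helper name, and the sample log is an excerpt of the tutorial output.

```shell
#!/bin/sh
# Sketch: print the reason recorded for each hold event (event number 012).
# hold_reasons is a hypothetical helper name, not a Condor command.
hold_reasons() {
    awk '
        # A line starting with 012 is the header of a hold event.
        /^012 \(/ { held = 1; next }
        # The first line after that header carries the reason text.
        held { sub(/^[ \t]+/, ""); print; held = 0 }
    ' "$1"
}

# Sample excerpt copied from the tutorial output (assumed layout).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
001 (003.000.000) 07/12 22:35:57 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
012 (003.000.000) 07/12 22:36:52 Job was held.
    Globus error 155: the job manager could not stage out a file
    Code 2 Subcode 155
...
EOF

hold_reasons "$sample_log"
```

For the log in this exercise, hold_reasons results.log should report the stage-out failure caused by the read-only output file.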
Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use -all to release all held jobs.
$ chmod u+w results.output
$ condor_release -all
All jobs released.
Run tail -f in another window to watch the log until the job finishes:
$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <128.135.125.193:32864>
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 terminable.ci.uchicago.edu/jobmanager-fork
    GridJobId: gt2 terminable.ci.uchicago.edu/jobmanager-fork https://terminable.ci.uchicago.edu:44026/31670/1174757075/...
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: terminable.ci.uchicago.edu
...
012 (003.000.000) 07/12 22:36:52 Job was held.
    Globus error 155: the job manager could not stage out a file
    Code 2 Subcode 155
...
013 (003.000.000) 07/12 22:44:33 Job was released.
    via condor_release (by user YOURLOGIN)
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 terminable.ci.uchicago.edu/jobmanager-fork
    GridJobId: gt2 terminable.ci.uchicago.edu/jobmanager-fork https://terminable.ci.uchicago.edu:44026/31670/1174757075/...
...
001 (003.000.000) 07/12 22:44:46 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (003.000.000) 07/12 22:44:51 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
Ctrl+C
After your job has finished running, check that the results have been retrieved successfully:
$ cat results.output
NO - 3 is a factor
Clean up the results before continuing:
$ rm results.*
Now we'll use DAGMan, a tool that helps us run several grid jobs at once.
Create a small shell script to monitor the Condor-G queue. We will use this throughout the rest of the tutorial:
$ cat > watch_condor_q
#! /bin/sh
while true; do
    condor_q YOURLOGIN
    condor_q -globus YOURLOGIN
    sleep 10
done
Ctrl+D
$ cat watch_condor_q
#! /bin/sh
while true; do
    condor_q YOURLOGIN
    condor_q -globus YOURLOGIN
    sleep 10
done
$ chmod a+x watch_condor_q
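The script above loops until you press Ctrl+C. As an optional variation, the loop can stop by itself once the queue is empty. This sketch assumes that condor_q ends its output with a summary line beginning with "0 jobs;" when nothing is queued, as in the sample output earlier in this tutorial; other Condor versions may format the summary differently. queue_is_empty, watch_until_empty, and the INTERVAL variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: a variant of the watch loop that stops once the queue is empty.
# queue_is_empty and watch_until_empty are hypothetical helper names.

# Succeeds when the summary line printed by condor_q starts with "0 jobs;".
queue_is_empty() {
    condor_q "$@" | grep -q '^0 jobs;'
}

# Poll every INTERVAL seconds (default 10) until no jobs remain.
watch_until_empty() {
    interval=${INTERVAL:-10}
    until queue_is_empty "$@"; do
        condor_q "$@"
        sleep "$interval"
    done
    echo "queue is empty"
}

# Usage (requires Condor to be installed):
#   watch_until_empty YOURLOGIN
```

Passing your login restricts the check to your own jobs, so other students' jobs do not keep the loop running.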
Create a minimal DAG for DAGMan. This DAG will have a single node.
$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Submit the DAG.
This section requires you to have three windows open. We will submit the DAG in the first window and watch the progress of it and the job in the other two. We will do these in the following order:
In the first window, submit the DAG and then watch condor with watch_condor_q.
In the second window, tail the results log.
In the third window, tail the DAGMan log.
Submit the DAG with condor_submit_dag and watch the run with watch_condor_q. condor_dagman is running as a job and submits your real job on your behalf, without your direct intervention. You might see the C (completed) state as your job finishes, but that often goes by too quickly to notice.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
-----------------------------------------------------------------------
$ ./watch_condor_q
In the second window, watch the job log file as your job runs:
$ tail -f --lines=500 results.log
In the third window, watch DAGMan's log file by running tail -f --lines=500 mydag.dag.dagman.out. We suggest that you re-run this command whenever you submit a DAG during the remainder of this tutorial. This will show you how a typical DAG progresses. Use Ctrl+C to stop watching the file. An example is shown below:
$ cd /work/YOURLOGIN/condor-tutorial/submit
$ tail -f --lines=500 mydag.dag.dagman.out
7/10 10:36:43 ******************************************************
7/10 10:36:43 ** condor_scheduniv_exec.6.0 (CONDOR_DAGMAN) STARTING UP
7/10 10:36:43 ** $CondorVersion: 6.8.4 Apr 22 2006 $
7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 10:36:43 ** PID = 26844
7/10 10:36:43 ******************************************************
7/10 10:36:44 DaemonCore: Command Socket at <128.135.125.193:34571>
7/10 10:36:44 argv[0] == "condor_scheduniv_exec.6.0"
7/10 10:36:44 argv[1] == "-Debug"
7/10 10:36:44 argv[2] == "3"
7/10 10:36:44 argv[3] == "-Lockfile"
7/10 10:36:44 argv[4] == "mydag.dag.lock"
7/10 10:36:44 argv[5] == "-Condorlog"
7/10 10:36:44 argv[6] == "results.log"
7/10 10:36:44 argv[7] == "-Dag"
7/10 10:36:44 argv[8] == "mydag.dag"
7/10 10:36:44 argv[9] == "-Rescue"
7/10 10:36:44 argv[10] == "mydag.dag.rescue"
7/10 10:36:44 Condor log will be written to results.log
7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
7/10 10:36:44 DAG Input file is mydag.dag
7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
7/10 10:36:44 Parsing mydag.dag ...
7/10 10:36:44 Dag contains 1 total jobs
7/10 10:36:44 Bootstrapping...
7/10 10:36:44 Number of pre-completed jobs: 0
7/10 10:36:44 Submitting Job HelloWorld ...
7/10 10:36:44 assigned Condor ID (7.0.0)
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0
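Most of the DAGMan output is start-up detail. The lines that track overall progress are the "N/M done, ... failed" summaries, and a short filter can isolate them. This is a sketch that assumes the line format shown in the example above. dag_progress is a hypothetical helper name, and the sample file is an excerpt of that output.

```shell
#!/bin/sh
# Sketch: pull the progress summaries out of a DAGMan output file.
# dag_progress is a hypothetical helper name, not a Condor command.
dag_progress() {
    # Progress lines look like:
    #   7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
    grep -E '[0-9]+/[0-9]+ done, [0-9]+ failed' "$1"
}

# Sample excerpt copied from the tutorial output (assumed layout).
sample_out=$(mktemp)
cat > "$sample_out" <<'EOF'
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
EOF

# Show only the most recent progress line.
dag_progress "$sample_out" | tail -n 1
```

Run against mydag.dag.dagman.out, the last progress line tells you at a glance how many nodes have finished and whether any failed.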
The first window, running watch_condor_q, should look something like the following:
$ ./watch_condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 YOURLOGIN 7/10 17:33 0+00:00:03 R 0 2.6 condor_dagman -f -
3.0 YOURLOGIN 7/10 17:33 0+00:00:00 I 0 0.0 myscript.sh TestJo
2 jobs; 1 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/YOURLOGIN-cond

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 YOURLOGIN 7/10 17:33 0+00:00:33 R 0 2.6 condor_dagman -f -
3.0 YOURLOGIN 7/10 17:33 0+00:00:15 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /home/YOURLOGIN/cond

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 YOURLOGIN 7/10 17:33 0+00:01:03 R 0 2.6 condor_dagman -f -
3.0 YOURLOGIN 7/10 17:33 0+00:00:45 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/YOURLOGIN-cond

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
Ctrl+C
Verify your results:
$ ls -l
total 12
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 28 Jul 10 10:35 mydag.dag
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 29 Jul 10 10:38 mydag.dag.lib.out
-rw------- 1 YOURLOGIN YOURLOGIN 0 Jul 10 10:36 mydag.dag.lock
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 175 Jul 9 18:13 myjob.submit
-rwxr-xr-x 1 YOURLOGIN YOURLOGIN 194 Jul 10 10:36 myscript.sh
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 31 Jul 10 10:37 results.error
-rw------- 1 YOURLOGIN YOURLOGIN 833 Jul 10 10:38 results.log
-rw-r--r-- 1 YOURLOGIN YOURLOGIN 261 Jul 10 10:37 results.output
-rwxr-xr-x 1 YOURLOGIN YOURLOGIN 81 Jul 10 10:35 watch_condor_q
$ cat results.error
$ cat results.output
NO - 3 is a factor
Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job):
$ ls
mydag.dag mydag.dag.dagman.log mydag.dag.lib.out myjob.submit results.error results.output
mydag.dag.condor.sub mydag.dag.dagman.out mydag.dag.lock myscript.sh results.log watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe = scheduler
executable = /path/to/condor/bin/condor_dagman
getenv = True
output = mydag.dag.lib.out
error = mydag.dag.lib.out
log = mydag.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <128.135.125.193:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <128.135.125.193:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:
$ cat mydag.dag.dagman.out
Clean up your results. Be careful to delete only the files matching mydag.dag.* and not mydag.dag itself. Note the trailing .*!
$ rm mydag.dag.* results.*
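If you are unsure what a pattern will match, you can rehearse the cleanup on empty stand-in files first. The sketch below does this in a scratch directory (the scratch variable and the stand-in files are assumptions made for illustration) and shows that the pattern mydag.dag.* leaves mydag.dag itself alone, because the pattern requires a further dot after the name.

```shell
#!/bin/sh
# Sketch: show that the pattern mydag.dag.* does not match mydag.dag itself.
# The files here are empty stand-ins created in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch mydag.dag mydag.dag.condor.sub mydag.dag.dagman.out mydag.dag.lock \
      results.log results.output results.error myjob.submit

# Preview what the patterns expand to before deleting anything.
echo mydag.dag.* results.*

rm mydag.dag.* results.*

# Only the hand-written inputs should remain.
ls
```

Previewing with echo before running rm is a cheap habit that applies equally well in the real submit directory.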
Typically each node in a DAG will have its own Condor submit file. Create some more submit files by copying our existing file. For simplicity during this tutorial, we'll keep the submit files very similar, notably using the same executable. In real-world use, your submit files and executables can differ.
$ cp myjob.submit job.setup.submit
$ cp myjob.submit job.work1.submit
$ cp myjob.submit job.work2.submit
$ cp myjob.submit job.workfinal.submit
$ cp myjob.submit job.finalize.submit
Change the output and error entries to point to results.NODE.output and results.NODE.error, where NODE is the middle word of the submit file's name (job.NODE.submit).
So job.finalize.submit would include:
output=results.finalize.output
error=results.finalize.error
Here is one possible set of settings for the output entries:
$ grep '^output=' job.*.submit
job.finalize.submit:output=results.finalize.output
job.setup.submit:output=results.setup.output
job.work1.submit:output=results.work1.output
job.work2.submit:output=results.work2.output
job.workfinal.submit:output=results.workfinal.output
This prevents the various nodes from overwriting each other's output.
Do not change the log entries. DAGMan requires that all nodes output their logs in the same location. Condor will ensure that the different jobs will not overwrite each other's entries in the log.
Change the arguments entries so that the first argument is something unique to each node (perhaps the NODE name).
For node work2, change the second argument to 120 so that it looks something like arguments=MyWorkerNode2 120
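Editing five files by hand is repetitive, so the copies and edits described above can also be scripted. The following is one possible sketch, not the tutorial's prescribed method: it uses sed to rewrite the output, error, and arguments entries, uses the node name as the unique first argument (an arbitrary choice), and leaves the log entry untouched. It runs in a scratch directory against a stand-in copy of myjob.submit; the scratch, node, and number variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: derive the five per-node submit files from myjob.submit with sed.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Stand-in for the myjob.submit created earlier in the tutorial.
cat > myjob.submit <<'EOF'
executable=/home/benc//primetest
arguments=117
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
queue
EOF

for node in setup work1 work2 workfinal finalize; do
    # work2 gets 120 as its second argument; every other node keeps 117.
    if [ "$node" = work2 ]; then number=120; else number=117; fi
    # Rewrite only output, error, and arguments; the log entry stays shared.
    sed -e "s|^output=.*|output=results.$node.output|" \
        -e "s|^error=.*|error=results.$node.error|" \
        -e "s|^arguments=.*|arguments=$node $number|" \
        myjob.submit > "job.$node.submit"
done

# Confirm the result, as in the grep example above.
grep '^output=' job.*.submit
```

Because each file is regenerated from myjob.submit, rerunning the loop after changing the original keeps all five copies consistent.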
$ cat mydag.dag
Job HelloWorld myjob.submit
$ cat >> mydag.dag
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
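To check that the PARENT and CHILD lines express the ordering you intend, you can compute one valid execution order yourself before submitting. The sketch below is an illustration that does not involve DAGMan: it turns each dependency into a "parent child" pair and feeds the pairs to the standard tsort utility. dag_order is a hypothetical helper name, and the parser only understands the simple Job and PARENT ... CHILD ... lines used in this tutorial.

```shell
#!/bin/sh
# Sketch: print one valid execution order for the nodes of a DAG file.
# dag_order is a hypothetical helper name; it ignores any line that is
# not a simple Job or PARENT ... CHILD ... line.
dag_order() {
    awk '
        # A pair of identical names tells tsort the node exists,
        # so nodes without dependencies still appear in the output.
        toupper($1) == "JOB" { print $2, $2 }
        toupper($1) == "PARENT" {
            child_at = 0
            for (i = 2; i <= NF; i++)
                if (toupper($i) == "CHILD") child_at = i
            if (child_at == 0) next
            # Emit one "parent child" pair per dependency edge.
            for (p = 2; p < child_at; p++)
                for (c = child_at + 1; c <= NF; c++)
                    print $p, $c
        }
    ' "$1" | tsort
}

# Stand-in copy of the DAG file shown above.
sample_dag=$(mktemp)
cat > "$sample_dag" <<'EOF'
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
EOF

dag_order "$sample_dag"
```

The order printed is only one of several valid ones: HelloWorld has no dependencies, and the two worker nodes are independent of each other, so DAGMan is free to run them at the same time. If the file contained a cycle, tsort would report it.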
condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:
$ rm watch_condor_q
$ cat > watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
    condor_q YOURLOGIN
    echo ....
    echo .... Output from condor_q -globus
    echo ....
    condor_q -globus YOURLOGIN
    echo ....
    echo .... Output from condor_q -dag
    echo ....
    condor_q -dag YOURLOGIN
    sleep 10
done
Ctrl+D
$ cat watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
    condor_q YOURLOGIN
    echo ....
    echo .... Output from condor_q -globus
    echo ....
    condor_q -globus YOURLOGIN
    echo ....
    echo .... Output from condor_q -dag
    echo ....
    condor_q -dag YOURLOGIN
    sleep 10
done
$ chmod a+x watch_condor_q
In separate windows, run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
-----------------------------------------------------------------------
$ ./watch_condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 adesmet 7/10 17:45 0+00:00:08 R 0 2.6 condor_dagman -f -
5.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo
6.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
5.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
6.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:00:08 R 0 2.6 condor_dagman -f -
5.0 |-HelloWorld 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo
6.0 |-Setup 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:00:12 R 0 2.6 condor_dagman -f -
5.0 YOURLOGIN 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo
6.0 YOURLOGIN 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
5.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
6.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:00:12 R 0 2.6 condor_dagman -f -
5.0 |-HelloWorld 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo
6.0 |-Setup 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:00:42 R 0 2.6 condor_dagman -f -
5.0 YOURLOGIN 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh TestJo
6.0 YOURLOGIN 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh Setup
3 jobs; 0 idle, 3 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.105.185.14:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
5.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
6.0 adesmet ACTIVE fork gk2 /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:00:42 R 0 2.6 condor_dagman -f -
5.0 |-HelloWorld 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh TestJo
6.0 |-Setup 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh Setup
3 jobs; 0 idle, 3 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:01:12 R 0 2.6 condor_dagman -f -
5.0 YOURLOGIN 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh TestJo
6.0 YOURLOGIN 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh Setup
3 jobs; 0 idle, 3 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
5.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
6.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:01:12 R 0 2.6 condor_dagman -f -
5.0 |-HelloWorld 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh TestJo
6.0 |-Setup 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh Setup
3 jobs; 0 idle, 3 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:01:42 R 0 2.6 condor_dagman -f -
7.0 YOURLOGIN 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh work1
8.0 YOURLOGIN 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh Worker
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
7.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
8.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:01:42 R 0 2.6 condor_dagman -f -
7.0 |-WorkerNode_ 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh work1
8.0 |-WorkerNode_ 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh Worker
3 jobs; 2 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:02:12 R 0 2.6 condor_dagman -f -
7.0 YOURLOGIN 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh work1
8.0 YOURLOGIN 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh Worker
3 jobs; 0 idle, 3 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
7.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
8.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:02:12 R 0 2.6 condor_dagman -f -
7.0 |-WorkerNode_ 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh work1
8.0 |-WorkerNode_ 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh Worker
3 jobs; 0 idle, 3 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:02:42 R 0 2.6 condor_dagman -f -
7.0 YOURLOGIN 7/10 17:46 0+00:00:57 R 0 0.0 myscript.sh work1
8.0 YOURLOGIN 7/10 17:46 0+00:00:57 R 0 0.0 myscript.sh Worker
3 jobs; 0 idle, 3 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
7.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
8.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:02:43 R 0 2.6 condor_dagman -f -
7.0 |-WorkerNode_ 7/10 17:46 0+00:00:58 R 0 0.0 myscript.sh work1
8.0 |-WorkerNode_ 7/10 17:46 0+00:00:58 R 0 0.0 myscript.sh Worker
3 jobs; 0 idle, 3 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:03:13 R 0 2.6 condor_dagman -f -
8.0 YOURLOGIN 7/10 17:46 0+00:01:28 R 0 0.0 myscript.sh Worker
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
8.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:03:13 R 0 2.6 condor_dagman -f -
8.0 |-WorkerNode_ 7/10 17:46 0+00:01:28 R 0 0.0 myscript.sh Worker
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:03:43 R 0 2.6 condor_dagman -f -
8.0 YOURLOGIN 7/10 17:46 0+00:01:58 R 0 0.0 myscript.sh Worker
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
8.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:03:43 R 0 2.6 condor_dagman -f -
8.0 |-WorkerNode_ 7/10 17:46 0+00:01:58 R 0 0.0 myscript.sh Worker
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:04:13 R 0 2.6 condor_dagman -f -
9.0 YOURLOGIN 7/10 17:49 0+00:00:02 R 0 0.0 myscript.sh workfi
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
9.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:04:13 R 0 2.6 condor_dagman -f -
9.0 |-CollectResu 7/10 17:49 0+00:00:02 R 0 0.0 myscript.sh workfi
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:04:43 R 0 2.6 condor_dagman -f -
9.0 YOURLOGIN 7/10 17:49 0+00:00:32 R 0 0.0 myscript.sh workfi
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
9.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:04:43 R 0 2.6 condor_dagman -f -
9.0 |-CollectResu 7/10 17:49 0+00:00:32 R 0 0.0 myscript.sh workfi
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:05:13 R 0 2.6 condor_dagman -f -
9.0 YOURLOGIN 7/10 17:49 0+00:01:02 R 0 0.0 myscript.sh workfi
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
9.0 YOURLOGIN DONE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:05:13 R 0 2.6 condor_dagman -f -
9.0 |-CollectResu 7/10 17:49 0+00:01:02 C 0 0.0 myscript.sh workfi
1 jobs; 0 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:05:43 R 0 2.6 condor_dagman -f -
10.0 YOURLOGIN 7/10 17:50 0+00:00:13 R 0 0.0 myscript.sh Final
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
10.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:05:44 R 0 2.6 condor_dagman -f -
10.0 |-LastNode 7/10 17:50 0+00:00:13 R 0 0.0 myscript.sh Final
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:06:14 R 0 2.6 condor_dagman -f -
10.0 YOURLOGIN 7/10 17:50 0+00:00:43 R 0 0.0 myscript.sh Final
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
10.0 YOURLOGIN ACTIVE fork terminable.ci.uchicago.edu /tmp/username-cond
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 YOURLOGIN 7/10 17:45 0+00:06:14 R 0 2.6 condor_dagman -f -
10.0 |-LastNode 7/10 17:50 0+00:00:43 R 0 0.0 myscript.sh Final
2 jobs; 0 idle, 2 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:35688> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 
running, 0 heldCtrl+C
Watching the logs or the condor_q output,
you'll note that the CollectResults node (workfinal)
wasn't run until both of the WorkerNode nodes (work1
and work2) finished.
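If you would rather confirm the ordering from the DAGMan debug log than by eye, a small filter can help. The following is only a sketch: show_dag_order is a hypothetical helper name, and it assumes the log uses the "Submitting Job ..." and "... completed successfully." wording that mydag.dag.dagman.out shows in this tutorial.

```shell
#!/bin/sh
# show_dag_order FILE: print, in log order, the lines where DAGMan submits a
# node and the lines where a node (or its POST script) completes.
# Hypothetical helper; FILE is a DAGMan debug log such as mydag.dag.dagman.out.
show_dag_order() {
    grep -E 'Submitting Job |completed successfully' "$1"
}

# Example use, once the DAG has run:
#   show_dag_order mydag.dag.dagman.out
```

In the filtered output, the "Submitting Job CollectResults" line should appear only after both WorkerNode "completed successfully" lines.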
$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.setup.error   results.workfinal.error
job.setup.submit      mydag.dag.dagman.log  results.error            results.setup.output  results.workfinal.output
job.work1.submit      mydag.dag.dagman.out  results.finalize.error   results.work1.error   watch_condor_q
job.work2.submit      mydag.dag.lib.out     results.finalize.output  results.work1.output
job.workfinal.submit  mydag.dag.lock        results.log              results.work2.error
mydag.dag             myjob.submit          results.output           results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output
==> results.finalize.output <==
I'm process id 29614 on terminable.ci.uchicago.edu
Thu Jul 10 10:53:58 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.setup.output <==
I'm process id 29337 on terminable.ci.uchicago.edu
Thu Jul 10 10:50:31 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.work1.output <==
I'm process id 29444 on terminable.ci.uchicago.edu
Thu Jul 10 10:51:04 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting

==> results.work2.output <==
I'm process id 29432 on terminable.ci.uchicago.edu
Thu Jul 10 10:51:03 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished. Exiting

==> results.workfinal.output <==
I'm process id 29554 on terminable.ci.uchicago.edu
Thu Jul 10 10:53:27 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
$ cat results.log
000 (005.000.000) 07/10 17:45:24 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: HelloWorld
...
000 (006.000.000) 07/10 17:45:24 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: Setup
...
017 (006.000.000) 07/10 17:45:42 Job submitted to Globus
RM-Contact: gk2:/jobmanager-fork
JM-Contact: https://gk2:2349/914/1057877133/
Can-Restart-JM: 1
...
001 (006.000.000) 07/10 17:45:42 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
017 (005.000.000) 07/10 17:45:42 Job submitted to Globus
RM-Contact: terminable.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://terminable.ci.uchicago.edu:2348/915/1057877133/
Can-Restart-JM: 1
...
001 (005.000.000) 07/10 17:45:42 Job executing on host: gk2
...
005 (005.000.000) 07/10 17:46:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
005 (006.000.000) 07/10 17:46:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (007.000.000) 07/10 17:46:55 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: WorkerNode_1
...
000 (008.000.000) 07/10 17:46:56 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: WorkerNode_Two
...
017 (008.000.000) 07/10 17:47:09 Job submitted to Globus
RM-Contact: terminable.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://terminable.ci.uchicago.edu:2364/1037/1057877219/
Can-Restart-JM: 1
...
001 (008.000.000) 07/10 17:47:09 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
017 (007.000.000) 07/10 17:47:09 Job submitted to Globus
RM-Contact: terminable.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://terminable.ci.uchicago.edu:2367/1040/1057877220/
Can-Restart-JM: 1
...
001 (007.000.000) 07/10 17:47:09 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (007.000.000) 07/10 17:48:17 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
005 (008.000.000) 07/10 17:49:18 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (009.000.000) 07/10 17:49:22 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: CollectResults
...
017 (009.000.000) 07/10 17:49:35 Job submitted to Globus
RM-Contact: terminable.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://terminable.ci.uchicago.edu:2383/1185/1057877366/
Can-Restart-JM: 1
...
001 (009.000.000) 07/10 17:49:35 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (009.000.000) 07/10 17:50:42 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <terminable.ci.uchicago.edu:35688>
DAG Node: LastNode
...
017 (010.000.000) 07/10 17:50:55 Job submitted to Globus
RM-Contact: terminable.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://terminable.ci.uchicago.edu:2392/1247/1057877446/
Can-Restart-JM: 1
...
001 (010.000.000) 07/10 17:50:55 Job executing on host: gt2 terminable.ci.uchicago.edu/jobmanager-fork
...
005 (010.000.000) 07/10 17:52:02 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
$ cat mydag.dag.dagman.out
7/10 17:45:24 ******************************************************
7/10 17:45:24 ** condor_scheduniv_exec.4.0 (CONDOR_DAGMAN) STARTING UP
7/10 17:45:24 ** $CondorVersion: 6.8.4 Apr 22 2006 $
7/10 17:45:24 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 17:45:24 ** PID = 18826
7/10 17:45:24 ******************************************************
7/10 17:45:24 DaemonCore: Command Socket at <terminable.ci.uchicago.edu:35774>
7/10 17:45:24 argv[0] == "condor_scheduniv_exec.4.0"
7/10 17:45:24 argv[1] == "-Debug"
7/10 17:45:24 argv[2] == "3"
7/10 17:45:24 argv[3] == "-Lockfile"
7/10 17:45:24 argv[4] == "mydag.dag.lock"
7/10 17:45:24 argv[5] == "-Condorlog"
7/10 17:45:24 argv[6] == "results.log"
7/10 17:45:24 argv[7] == "-Dag"
7/10 17:45:24 argv[8] == "mydag.dag"
7/10 17:45:24 argv[9] == "-Rescue"
7/10 17:45:24 argv[10] == "mydag.dag.rescue"
7/10 17:45:24 Condor log will be written to results.log
7/10 17:45:24 DAG Lockfile will be written to mydag.dag.lock
7/10 17:45:24 DAG Input file is mydag.dag
7/10 17:45:24 Rescue DAG will be written to mydag.dag.rescue
7/10 17:45:24 Parsing mydag.dag ...
7/10 17:45:24 Dag contains 6 total jobs
7/10 17:45:24 Bootstrapping...
7/10 17:45:24 Number of pre-completed jobs: 0
7/10 17:45:24 Submitting Job HelloWorld ...
7/10 17:45:24 assigned Condor ID (5.0.0)
7/10 17:45:24 Submitting Job Setup ...
7/10 17:45:24 assigned Condor ID (6.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job Setup (6.0.0)
7/10 17:45:25 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job HelloWorld (5.0.0)
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job HelloWorld (5.0.0)
7/10 17:46:55 Job HelloWorld completed successfully.
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job Setup (6.0.0)
7/10 17:46:55 Job Setup completed successfully.
7/10 17:46:55 Submitting Job WorkerNode_1 ...
7/10 17:46:55 assigned Condor ID (7.0.0)
7/10 17:46:55 Submitting Job WorkerNode_Two ...
7/10 17:46:56 assigned Condor ID (8.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:46:56 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Job WorkerNode_1 completed successfully.
7/10 17:48:21 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (8.0.0)
7/10 17:49:21 Job WorkerNode_Two completed successfully.
7/10 17:49:21 Submitting Job CollectResults ...
7/10 17:49:22 assigned Condor ID (9.0.0)
7/10 17:49:22 Event: ULOG_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:22 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:37 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:37 Event: ULOG_EXECUTE for Job CollectResults (9.0.0)
7/10 17:50:42 Event: ULOG_JOB_TERMINATED for Job CollectResults (9.0.0)
7/10 17:50:42 Job CollectResults completed successfully.
7/10 17:50:42 Submitting Job LastNode ...
7/10 17:50:42 assigned Condor ID (10.0.0)
7/10 17:50:42 Event: ULOG_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:42 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:50:57 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:57 Event: ULOG_EXECUTE for Job LastNode (10.0.0)
7/10 17:52:02 Event: ULOG_JOB_TERMINATED for Job LastNode (10.0.0)
7/10 17:52:02 Job LastNode completed successfully.
7/10 17:52:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 17:52:02 All jobs Completed!
7/10 17:52:02 **** condor_scheduniv_exec.4.0 (condor_DAGMAN) EXITING WITH STATUS 0
Clean up your results. Be careful when deleting the mydag.dag.* files: you want to remove only mydag.dag.*, not mydag.dag itself.
$ rm mydag.dag.* results.*
You can try redoing this section with other Grid sites. Modify some of the grid_resource entries in your submit files to point to other servers. A single DAG can send jobs to a variety of sites; Condor-G can manage jobs distributed to many different sites simultaneously.
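One way to retarget a submit file without retyping it is to rewrite only its grid_resource line. The following is a sketch under assumptions: retarget_job is a hypothetical helper, and gatekeeper.example.edu is a placeholder host, not a real site. Substitute a gatekeeper and job manager you are actually allowed to use.

```shell
#!/bin/sh
# retarget_job FILE RESOURCE: replace the grid_resource line of a Condor-G
# submit file with RESOURCE, leaving every other line untouched.
# Hypothetical helper. RESOURCE must not contain "|" or "&", which are
# special in the sed expression below.
retarget_job() {
    file=$1
    resource=$2
    # Write to a temporary copy first so a failed sed cannot truncate FILE.
    sed "s|^grid_resource *=.*|grid_resource=$resource|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Example use (placeholder host name, an assumption for illustration):
#   retarget_job job.work1.submit "gt2 gatekeeper.example.edu/jobmanager-fork"
```

Afterwards, cat the submit file to confirm that only the grid_resource line changed.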
DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG that makes it easy to continue once the problem is fixed.
Let's create a script that will fail so we can see this:
$ cat > myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
Ctrl+D
$ cat myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
$ chmod a+x myscript2.sh
Modify job.work2.submit to run myscript2.sh instead of myscript.sh:
$ rm job.work2.submit
$ cat > job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 terminable.ci.uchicago.edu/jobmanager-fork
arguments=WorkerNode2 60
queue
Ctrl+D
$ cat job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 terminable.ci.uchicago.edu/jobmanager-fork
arguments=WorkerNode2 60
queue
Submit the dag again.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 15.
-----------------------------------------------------------------------
Use watch_condor_q to watch the jobs until they finish.
In separate windows run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.
$ ./watch_condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
15.0 YOURLOGIN 7/10 11:11 0+00:00:04 R 0 2.6 condor_dagman -f -
16.0 YOURLOGIN 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh
17.0 YOURLOGIN 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
16.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /home/YOURLOGIN/condo
17.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /home/YOURLOGIN/condo
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
15.0 YOURLOGIN 7/10 11:11 0+00:00:04 R 0 2.6 condor_dagman -f -
16.0 |-HelloWorld 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh
17.0 |-Setup 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
[Output of watch_condor_q truncated]
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
Ctrl+C
$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.output        results.work2.output
job.setup.submit      mydag.dag.dagman.log  myscript2.sh             results.setup.error   results.workfinal.error
job.work1.submit      mydag.dag.dagman.out  results.error            results.setup.output  results.workfinal.output
job.work2.submit      mydag.dag.lib.out     results.finalize.error   results.work1.error   watch_condor_q
job.workfinal.submit  mydag.dag.lock        results.finalize.output  results.work1.output
mydag.dag             myjob.submit          results.log              results.work2.error
$ cat results.work2.output
I'm process id 29921 on terminable.ci.uchicago.edu
Thu Jul 10 11:12:42 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/87/459c159766cefb36f0d75023de0e35/md5/70/5d82b930ec61460d9c9ca65cbe5a8a/data WorkerNode2 60
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 60
Sleep of 60 seconds finished. Exiting
RESULT: 1 FAILURE
$ cat mydag.dag.dagman.out
7/10 11:11:55 ******************************************************
7/10 11:11:55 ** condor_scheduniv_exec.15.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:11:55 ** $CondorVersion: 6.8.4 Apr 22 2003 $
7/10 11:11:55 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:11:55 ** PID = 27126
7/10 11:11:55 ******************************************************
7/10 11:11:55 DaemonCore: Command Socket at <terminable.ci.uchicago.edu:34769>
7/10 11:11:55 argv[0] == "condor_scheduniv_exec.15.0"
7/10 11:11:55 argv[1] == "-Debug"
7/10 11:11:55 argv[2] == "3"
7/10 11:11:55 argv[3] == "-Lockfile"
7/10 11:11:55 argv[4] == "mydag.dag.lock"
7/10 11:11:55 argv[5] == "-Condorlog"
7/10 11:11:55 argv[6] == "results.log"
7/10 11:11:55 argv[7] == "-Dag"
7/10 11:11:55 argv[8] == "mydag.dag"
7/10 11:11:55 argv[9] == "-Rescue"
7/10 11:11:55 argv[10] == "mydag.dag.rescue"
7/10 11:11:55 Condor log will be written to results.log
7/10 11:11:55 DAG Lockfile will be written to mydag.dag.lock
7/10 11:11:55 DAG Input file is mydag.dag
7/10 11:11:55 Rescue DAG will be written to mydag.dag.rescue
7/10 11:11:55 Parsing mydag.dag ...
7/10 11:11:55 Dag contains 6 total jobs
7/10 11:11:55 Bootstrapping...
7/10 11:11:55 Number of pre-completed jobs: 0
7/10 11:11:55 Submitting Job HelloWorld ...
7/10 11:11:55 assigned Condor ID (16.0.0)
7/10 11:11:55 Submitting Job Setup ...
7/10 11:11:55 assigned Condor ID (17.0.0)
7/10 11:11:56 Event: ULOG_SUBMIT for Job HelloWorld (16.0.0)
7/10 11:11:56 Event: ULOG_SUBMIT for Job Setup (17.0.0)
7/10 11:11:56 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:16 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (16.0.0)
7/10 11:12:16 Event: ULOG_EXECUTE for Job HelloWorld (16.0.0)
7/10 11:12:16 Event: ULOG_GLOBUS_SUBMIT for Job Setup (17.0.0)
7/10 11:12:16 Event: ULOG_EXECUTE for Job Setup (17.0.0)
7/10 11:12:21 Event: ULOG_JOB_TERMINATED for Job HelloWorld (16.0.0)
7/10 11:12:21 Job HelloWorld completed successfully.
7/10 11:12:21 1/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:31 Event: ULOG_JOB_TERMINATED for Job Setup (17.0.0)
7/10 11:12:31 Job Setup completed successfully.
7/10 11:12:31 Submitting Job WorkerNode_1 ...
7/10 11:12:32 assigned Condor ID (18.0.0)
7/10 11:12:32 Submitting Job WorkerNode_Two ...
7/10 11:12:32 assigned Condor ID (19.0.0)
7/10 11:12:32 Event: ULOG_SUBMIT for Job WorkerNode_1 (18.0.0)
7/10 11:12:32 Event: ULOG_SUBMIT for Job WorkerNode_Two (19.0.0)
7/10 11:12:32 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:12:47 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (19.0.0)
7/10 11:12:47 Event: ULOG_EXECUTE for Job WorkerNode_Two (19.0.0)
7/10 11:12:47 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (18.0.0)
7/10 11:12:47 Event: ULOG_EXECUTE for Job WorkerNode_1 (18.0.0)
7/10 11:13:07 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (18.0.0)
7/10 11:13:07 Job WorkerNode_1 completed successfully.
7/10 11:13:07 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:13:57 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (19.0.0)
7/10 11:13:57 Job WorkerNode_Two completed successfully.
7/10 11:13:57 Submitting Job CollectResults ...
7/10 11:13:57 assigned Condor ID (20.0.0)
7/10 11:13:57 Event: ULOG_SUBMIT for Job CollectResults (20.0.0)
7/10 11:13:57 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:14:12 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (20.0.0)
7/10 11:14:12 Event: ULOG_EXECUTE for Job CollectResults (20.0.0)
7/10 11:14:32 Event: ULOG_JOB_TERMINATED for Job CollectResults (20.0.0)
7/10 11:14:32 Job CollectResults completed successfully.
7/10 11:14:32 Submitting Job LastNode ...
7/10 11:14:32 assigned Condor ID (21.0.0)
7/10 11:14:32 Event: ULOG_SUBMIT for Job LastNode (21.0.0)
7/10 11:14:32 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:14:47 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (21.0.0)
7/10 11:14:47 Event: ULOG_EXECUTE for Job LastNode (21.0.0)
7/10 11:15:02 Event: ULOG_JOB_TERMINATED for Job LastNode (21.0.0)
7/10 11:15:02 Job LastNode completed successfully.
7/10 11:15:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:15:02 All jobs Completed!
7/10 11:15:02 **** condor_scheduniv_exec.15.0 (condor_DAGMAN) EXITING WITH STATUS 0
Uh oh, DAGMan ran the remaining nodes based on bad data from node work2. Normally DAGMan checks the return code and treats a non-zero value as a failure, and myscript2.sh does exit with a non-zero status. That would work with regular Condor jobs, but we're using Condor-G. Condor-G relies on Globus, and Globus doesn't return error codes.
If you want DAGMan to notice a failed job and stop the DAG at that point, you'll need to use a POST script to detect the problem. One solution is to wrap your executable in a script that writes the executable's return code to stdout, and have the POST script scan stdout for the status. Or perhaps your executable's normal output already contains enough information to make the decision.
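A wrapper of the kind described above might look like the following sketch. The name run_and_report is hypothetical, and the "RESULT: <code> SUCCESS/FAILURE" format is an assumption modeled on the message that myscript2.sh prints. It is written as a shell function so it can be tried in isolation; saved as its own script, the body would do the same job with the real executable and its arguments passed on the command line.

```shell
#!/bin/sh
# run_and_report CMD [ARGS...]: run CMD, then print its exit status on stdout
# as a "RESULT:" line that a POST script can search for.
# Hypothetical helper; the message format is an assumption for illustration.
run_and_report() {
    if "$@"; then
        status=0
    else
        status=$?          # exit status of the wrapped command
    fi
    if [ "$status" -eq 0 ]; then
        echo "RESULT: 0 SUCCESS"
    else
        echo "RESULT: $status FAILURE"
    fi
    # Pass the status through as well, for systems that do report it.
    return "$status"
}

# Example use:
#   run_and_report ./myscript.sh WorkerNode2 60
```

Because the status travels in the job's stdout, it ends up in the file named by output= in the submit file, where a POST script can read it.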
In this case, our executable emits a well-known message, so a POST script can look for it.
Create a script to check the output:
$ cat > postscript_checker
#! /bin/sh
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null
Ctrl+D
$ cat postscript_checker
#! /bin/sh
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null
$ chmod a+x postscript_checker
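The checker works because grep exits with status 0 when it finds the pattern and with a non-zero status when it does not, and a script's exit status is that of its last command. If you want to see this by hand before adding the script to the DAG, here is a small sketch. check_result is a hypothetical function that mirrors the grep test, and the two sample files are throwaway stand-ins for real job output.

```shell
#!/bin/sh
# check_result FILE: succeed only if FILE contains the success marker.
# Hypothetical helper mirroring the grep test in postscript_checker.
check_result() {
    grep 'RESULT: 0 SUCCESS' "$1" > /dev/null 2>&1
}

# Throwaway sample files standing in for a good and a bad job output.
good=$(mktemp)
bad=$(mktemp)
echo "RESULT: 0 SUCCESS" > "$good"
echo "RESULT: 1 FAILURE" > "$bad"

for f in "$good" "$bad"; do
    if check_result "$f"; then
        echo "node would succeed"
    else
        echo "node would fail"
    fi
done

rm -f "$good" "$bad"
```

The loop prints "node would succeed" for the first file and "node would fail" for the second.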
Modify your mydag.dag to use the new script for the nodes.
$ cat >>mydag.dag
Script POST Setup postscript_checker results.setup.output
Script POST WorkerNode_1 postscript_checker results.work1.output
Script POST WorkerNode_Two postscript_checker results.work2.output
Script POST CollectResults postscript_checker results.workfinal.output
Script POST LastNode postscript_checker results.finalize.output
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Script POST Setup postscript_checker results.setup.output
Script POST WorkerNode_1 postscript_checker results.work1.output
Script POST WorkerNode_Two postscript_checker results.work2.output
Script POST CollectResults postscript_checker results.workfinal.output
Script POST LastNode postscript_checker results.finalize.output
$ ls
job.finalize.submit  job.work1.submit  job.workfinal.submit  myjob.submit  myscript2.sh        watch_condor_q
job.setup.submit     job.work2.submit  mydag.dag             myscript.sh   postscript_checker
Submit the DAG again with the new POST scripts in place.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 22.
-----------------------------------------------------------------------
Watch the job with watch_condor_q.
In separate windows run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.
$ ./watch_condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
22.0 YOURLOGIN 7/10 11:25 0+00:00:03 R 0 2.6 condor_dagman -f -
23.0 YOURLOGIN 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh
24.0 YOURLOGIN 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
23.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /home/YOURLOGIN/condo
24.0 YOURLOGIN UNSUBMITTED fork terminable.ci.uchicago.edu /home/YOURLOGIN/condo
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
22.0 YOURLOGIN 7/10 11:25 0+00:00:03 R 0 2.6 condor_dagman -f -
23.0 |-HelloWorld 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh
24.0 |-Setup 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
[Output of watch_condor_q truncated]
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
Ctrl+C
$ ls
job.finalize.submit   mydag.dag             mydag.dag.rescue    results.error         results.work1.error
job.setup.submit      mydag.dag.condor.sub  myjob.submit        results.log           results.work1.output
job.work1.submit      mydag.dag.dagman.log  myscript.sh         results.output        results.work2.error
job.work2.submit      mydag.dag.dagman.out  myscript2.sh        results.setup.error   results.work2.output
job.workfinal.submit  mydag.dag.lib.out     postscript_checker  results.setup.output  watch_condor_q
$ cat mydag.dag.dagman.out
7/10 11:25:35 ******************************************************
7/10 11:25:35 ** condor_scheduniv_exec.22.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:25:35 ** $CondorVersion: 6.8.4 Apr 22 2003 $
7/10 11:25:35 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:25:35 ** PID = 27251
7/10 11:25:35 ******************************************************
7/10 11:25:35 DaemonCore: Command Socket at <128.105.185.14:34913>
7/10 11:25:35 argv[0] == "condor_scheduniv_exec.22.0"
7/10 11:25:35 argv[1] == "-Debug"
7/10 11:25:35 argv[2] == "3"
7/10 11:25:35 argv[3] == "-Lockfile"
7/10 11:25:35 argv[4] == "mydag.dag.lock"
7/10 11:25:35 argv[5] == "-Condorlog"
7/10 11:25:35 argv[6] == "results.log"
7/10 11:25:35 argv[7] == "-Dag"
7/10 11:25:35 argv[8] == "mydag.dag"
7/10 11:25:35 argv[9] == "-Rescue"
7/10 11:25:35 argv[10] == "mydag.dag.rescue"
7/10 11:25:35 Condor log will be written to results.log
7/10 11:25:35 DAG Lockfile will be written to mydag.dag.lock
7/10 11:25:35 DAG Input file is mydag.dag
7/10 11:25:35 Rescue DAG will be written to mydag.dag.rescue
7/10 11:25:35 Parsing mydag.dag ...
7/10 11:25:35 jobName: Setup
7/10 11:25:35 jobName: WorkerNode_1
7/10 11:25:35 jobName: WorkerNode_Two
7/10 11:25:35 jobName: CollectResults
7/10 11:25:35 jobName: LastNode
7/10 11:25:35 Dag contains 6 total jobs
7/10 11:25:35 Bootstrapping...
7/10 11:25:35 Number of pre-completed jobs: 0
7/10 11:25:35 Submitting Job HelloWorld ...
7/10 11:25:35 assigned Condor ID (23.0.0)
7/10 11:25:35 Submitting Job Setup ...
7/10 11:25:35 assigned Condor ID (24.0.0)
7/10 11:25:36 Event: ULOG_SUBMIT for Job HelloWorld (23.0.0)
7/10 11:25:36 Event: ULOG_SUBMIT for Job Setup (24.0.0)
7/10 11:25:36 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:25:56 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (23.0.0)
7/10 11:25:56 Event: ULOG_EXECUTE for Job HelloWorld (23.0.0)
7/10 11:25:56 Event: ULOG_GLOBUS_SUBMIT for Job Setup (24.0.0)
7/10 11:25:56 Event: ULOG_EXECUTE for Job Setup (24.0.0)
7/10 11:26:01 Event: ULOG_JOB_TERMINATED for Job HelloWorld (23.0.0)
7/10 11:26:01 Job HelloWorld completed successfully.
7/10 11:26:01 1/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:26:11 Event: ULOG_JOB_TERMINATED for Job Setup (24.0.0)
7/10 11:26:11 Job Setup completed successfully.
7/10 11:26:11 Running POST script of Job Setup...
7/10 11:26:11 1/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:26:16 Event: ULOG_POST_SCRIPT_TERMINATED for Job Setup (24.0.0)
7/10 11:26:16 POST Script of Job Setup completed successfully.
7/10 11:26:16 Submitting Job WorkerNode_1 ...
7/10 11:26:16 assigned Condor ID (25.0.0)
7/10 11:26:16 Submitting Job WorkerNode_Two ...
7/10 11:26:17 assigned Condor ID (26.0.0)
7/10 11:26:17 Event: ULOG_SUBMIT for Job WorkerNode_1 (25.0.0)
7/10 11:26:17 Event: ULOG_SUBMIT for Job WorkerNode_Two (26.0.0)
7/10 11:26:17 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 11:26:32 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (25.0.0)
7/10 11:26:32 Event: ULOG_EXECUTE for Job WorkerNode_1 (25.0.0)
7/10 11:26:32 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (26.0.0)
7/10 11:26:32 Event: ULOG_EXECUTE for Job WorkerNode_Two (26.0.0)
7/10 11:26:52 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (25.0.0)
7/10 11:26:52 Job WorkerNode_1 completed successfully.
7/10 11:26:52 Running POST script of Job WorkerNode_1...
7/10 11:26:52 2/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 1 post
7/10 11:26:57 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_1 (25.0.0)
7/10 11:26:57 POST Script of Job WorkerNode_1 completed successfully.
7/10 11:26:57 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:27:42 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (26.0.0)
7/10 11:27:42 Job WorkerNode_Two completed successfully.
7/10 11:27:42 Running POST script of Job WorkerNode_Two...
7/10 11:27:42 3/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:27:47 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_Two (26.0.0)
7/10 11:27:47 POST Script of Job WorkerNode_Two failed with status 1
7/10 11:27:47 3/6 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:27:47 ERROR: the following job(s) failed:
7/10 11:27:47 ---------------------- Job ----------------------
7/10 11:27:47 Node Name: WorkerNode_Two
7/10 11:27:47 NodeID: 3
7/10 11:27:47 Node Status: STATUS_ERROR
7/10 11:27:47 Error: POST Script failed with status 1
7/10 11:27:47 Job Submit File: job.work2.submit
7/10 11:27:47 POST Script: postscript_checker results.work2.output
7/10 11:27:47 Condor Job ID: (26.0.0)
7/10 11:27:47 Q_PARENTS: 1, <END>
7/10 11:27:47 Q_WAITING: <END>
7/10 11:27:47 Q_CHILDREN: 4, <END>
7/10 11:27:47 --------------------------------------- <END>
7/10 11:27:47 Writing Rescue DAG file...
7/10 11:27:47 **** condor_scheduniv_exec.22.0 (condor_DAGMAN) EXITING WITH STATUS 1
DAGMan notices that one of the jobs failed. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved.
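If you would rather not read the whole log by eye, the progress lines and the failure report can be pulled out with a short script. The following is a minimal sketch in Python, not part of Condor itself; it assumes the log lines have the same shape as the mydag.dag.dagman.out output shown above, and the function name summarize_dagman_log is made up for this example.

```python
import re

# Matches DAGMan progress lines such as:
#   7/10 11:27:47 3/6 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
PROGRESS = re.compile(
    r"(\d+)/(\d+) done, (\d+) failed, (\d+) submitted, "
    r"(\d+) ready, (\d+) pre, (\d+) post"
)
# Matches the "Node Name:" line inside DAGMan's failure report.
FAILED_NODE = re.compile(r"Node Name: (\S+)")
FIELDS = ("done", "total", "failed", "submitted", "ready", "pre", "post")


def summarize_dagman_log(text):
    """Return (last progress counts, names of failed nodes) for a log's text."""
    last_progress = None
    failed_nodes = []
    for line in text.splitlines():
        match = PROGRESS.search(line)
        if match:
            # Keep only the most recent progress line.
            last_progress = dict(zip(FIELDS, map(int, match.groups())))
        match = FAILED_NODE.search(line)
        if match:
            failed_nodes.append(match.group(1))
    return last_progress, failed_nodes


# A few lines copied from the log above, used as sample input.
sample = """\
7/10 11:27:47 POST Script of Job WorkerNode_Two failed with status 1
7/10 11:27:47 3/6 done, 1 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:27:47 ERROR: the following job(s) failed:
7/10 11:27:47 Node Name: WorkerNode_Two
"""
progress, failed_nodes = summarize_dagman_log(sample)
print(progress)
print(failed_nodes)
```

To try it on a real run, pass the contents of the log file instead of the sample, for example summarize_dagman_log(open("mydag.dag.dagman.out").read()).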
Look at the rescue DAG. It is structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, DONE nodes will be skipped.
$ cat mydag.dag.rescue
# Rescue DAG file, created after running
# the mydag.dag DAG file
#
# Total number of Nodes: 6
# Nodes premarked DONE: 3
# Nodes that failed: 1
# WorkerNode_Two,<ENDLIST>
JOB HelloWorld myjob.submit DONE
JOB Setup job.setup.submit DONE
SCRIPT POST Setup postscript_checker results.setup.output
JOB WorkerNode_1 job.work1.submit DONE
SCRIPT POST WorkerNode_1 postscript_checker results.work1.output
JOB WorkerNode_Two job.work2.submit
SCRIPT POST WorkerNode_Two postscript_checker results.work2.output
JOB CollectResults job.workfinal.submit
SCRIPT POST CollectResults postscript_checker results.workfinal.output
JOB LastNode job.finalize.submit
SCRIPT POST LastNode postscript_checker results.finalize.output
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 CHILD CollectResults
PARENT WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
We know there is a problem with the work2 step. Let's "fix" it.
$ rm myscript2.sh
$ cp myscript.sh myscript2.sh
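Before resubmitting, it can be useful to confirm which nodes the rescue DAG will actually run. The sketch below is a small Python illustration, not a Condor tool. It assumes JOB lines have the form shown in mydag.dag.rescue above (JOB, node name, submit file, optional DONE), and the helper name split_done_and_pending is hypothetical.

```python
def split_done_and_pending(dag_text):
    """Return (done, pending) lists of node names from the text of a DAG file."""
    done, pending = [], []
    for line in dag_text.splitlines():
        fields = line.split()
        # Only JOB lines declare nodes; comments, SCRIPT and PARENT lines are skipped.
        if len(fields) >= 3 and fields[0] == "JOB":
            name = fields[1]
            # Anything after the submit file name may carry the DONE marker.
            if "DONE" in fields[3:]:
                done.append(name)
            else:
                pending.append(name)
    return done, pending


# Lines copied from the rescue DAG shown above.
sample_dag = """\
# Nodes premarked DONE: 3
JOB HelloWorld myjob.submit DONE
JOB Setup job.setup.submit DONE
SCRIPT POST Setup postscript_checker results.setup.output
JOB WorkerNode_1 job.work1.submit DONE
JOB WorkerNode_Two job.work2.submit
JOB CollectResults job.workfinal.submit
JOB LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
"""
done, pending = split_done_and_pending(sample_dag)
print("skipped:", done)
print("to run: ", pending)
```

For this rescue DAG the pending list should match the three nodes that had not completed: the failed WorkerNode_Two and the two nodes downstream of it.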
Now we can submit our rescue DAG.
(If this run were to fail as well, DAGMan would write another rescue DAG, mydag.dag.rescue.rescue.) In separate windows, run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.rescue.dagman.out to monitor the job's progress.
$ condor_submit_dag mydag.dag.rescue
Checking your DAG input file and all submit files it references.
This might take a while... Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.rescue.condor.sub
Log of DAGMan debugging messages         : mydag.dag.rescue.dagman.out
Log of Condor library debug messages     : mydag.dag.rescue.lib.out
Log of the life of condor_dagman itself  : mydag.dag.rescue.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------
$ ./watch_condor_q
-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER      SUBMITTED   RUN_TIME   ST PRI SIZE CMD
27.0   YOURLOGIN  7/10 11:34  0+00:00:01 R  0   2.6  condor_dagman -f -
28.0   YOURLOGIN  7/10 11:34  0+00:00:00 I  0   0.0  myscript2.sh Worke
2 jobs; 1 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER      STATUS      MANAGER HOST                       EXECUTABLE
28.0   YOURLOGIN  UNSUBMITTED fork    terminable.ci.uchicago.edu /home/YOURLOGIN/condo

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER/NODENAME  SUBMITTED   RUN_TIME   ST PRI SIZE CMD
27.0   YOURLOGIN       7/10 11:34  0+00:00:01 R  0   2.6  condor_dagman -f -
28.0    |-WorkerNode_  7/10 11:34  0+00:00:00 I  0   0.0  myscript2.sh Worke
2 jobs; 1 idle, 1 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER      SUBMITTED   RUN_TIME   ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER      STATUS      MANAGER HOST                       EXECUTABLE

-- Submitter: terminable.ci.uchicago.edu : <128.135.125.193:33785> : terminable.ci.uchicago.edu
ID     OWNER/NODENAME  SUBMITTED   RUN_TIME   ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held

Note: [Output of watch_condor_q truncated]
Ctrl+C
$ ls
job.finalize.submit   mydag.dag.lib.out            myscript2.sh             results.work1.error
job.setup.submit      mydag.dag.rescue             postscript_checker       results.work1.output
job.work1.submit      mydag.dag.rescue.condor.sub  results.error            results.work2.error
job.work2.submit      mydag.dag.rescue.dagman.log  results.finalize.error   results.work2.output
job.workfinal.submit  mydag.dag.rescue.dagman.out  results.finalize.output  results.workfinal.error
mydag.dag             mydag.dag.rescue.lib.out     results.log              results.workfinal.output
mydag.dag.condor.sub  mydag.dag.rescue.lock        results.output           watch_condor_q
mydag.dag.dagman.log  myjob.submit                 results.setup.error
mydag.dag.dagman.out  myscript.sh                  results.setup.output
$ cat mydag.dag.rescue.dagman.out
7/10 11:34:33 ******************************************************
7/10 11:34:33 ** condor_scheduniv_exec.27.0 (CONDOR_DAGMAN) STARTING UP
7/10 11:34:33 ** $CondorVersion: 6.8.4 Apr 22 2003 $
7/10 11:34:33 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 11:34:33 ** PID = 27317
7/10 11:34:33 ******************************************************
7/10 11:34:33 DaemonCore: Command Socket at <128.135.125.193:35032>
7/10 11:34:33 argv[0] == "condor_scheduniv_exec.27.0"
7/10 11:34:33 argv[1] == "-Debug"
7/10 11:34:33 argv[2] == "3"
7/10 11:34:33 argv[3] == "-Lockfile"
7/10 11:34:33 argv[4] == "mydag.dag.rescue.lock"
7/10 11:34:33 argv[5] == "-Condorlog"
7/10 11:34:33 argv[6] == "results.log"
7/10 11:34:33 argv[7] == "-Dag"
7/10 11:34:33 argv[8] == "mydag.dag.rescue"
7/10 11:34:33 argv[9] == "-Rescue"
7/10 11:34:33 argv[10] == "mydag.dag.rescue.rescue"
7/10 11:34:33 Condor log will be written to results.log
7/10 11:34:33 DAG Lockfile will be written to mydag.dag.rescue.lock
7/10 11:34:33 DAG Input file is mydag.dag.rescue
7/10 11:34:33 Rescue DAG will be written to mydag.dag.rescue.rescue
7/10 11:34:33 Parsing mydag.dag.rescue ...
7/10 11:34:33 jobName: Setup
7/10 11:34:33 jobName: WorkerNode_1
7/10 11:34:33 jobName: WorkerNode_Two
7/10 11:34:33 jobName: CollectResults
7/10 11:34:33 jobName: LastNode
7/10 11:34:33 Dag contains 6 total jobs
7/10 11:34:33 Deleting older version of results.log
7/10 11:34:33 Bootstrapping...
7/10 11:34:33 Number of pre-completed jobs: 3
7/10 11:34:33 Submitting Job WorkerNode_Two ...
7/10 11:34:33 assigned Condor ID (28.0.0)
7/10 11:34:34 Event: ULOG_SUBMIT for Job WorkerNode_Two (28.0.0)
7/10 11:34:34 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:34:54 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (28.0.0)
7/10 11:34:54 Event: ULOG_EXECUTE for Job WorkerNode_Two (28.0.0)
7/10 11:35:59 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (28.0.0)
7/10 11:35:59 Job WorkerNode_Two completed successfully.
7/10 11:35:59 Running POST script of Job WorkerNode_Two...
7/10 11:35:59 3/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:36:04 Event: ULOG_POST_SCRIPT_TERMINATED for Job WorkerNode_Two (28.0.0)
7/10 11:36:04 POST Script of Job WorkerNode_Two completed successfully.
7/10 11:36:04 Submitting Job CollectResults ...
7/10 11:36:04 assigned Condor ID (29.0.0)
7/10 11:36:04 Event: ULOG_SUBMIT for Job CollectResults (29.0.0)
7/10 11:36:04 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:36:19 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (29.0.0)
7/10 11:36:19 Event: ULOG_EXECUTE for Job CollectResults (29.0.0)
7/10 11:36:34 Event: ULOG_JOB_TERMINATED for Job CollectResults (29.0.0)
7/10 11:36:34 Job CollectResults completed successfully.
7/10 11:36:34 Running POST script of Job CollectResults...
7/10 11:36:34 4/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:36:39 Event: ULOG_POST_SCRIPT_TERMINATED for Job CollectResults (29.0.0)
7/10 11:36:39 POST Script of Job CollectResults completed successfully.
7/10 11:36:39 Submitting Job LastNode ...
7/10 11:36:39 assigned Condor ID (30.0.0)
7/10 11:36:39 Event: ULOG_SUBMIT for Job LastNode (30.0.0)
7/10 11:36:39 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 11:36:54 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (30.0.0)
7/10 11:36:54 Event: ULOG_EXECUTE for Job LastNode (30.0.0)
7/10 11:37:09 Event: ULOG_JOB_TERMINATED for Job LastNode (30.0.0)
7/10 11:37:09 Job LastNode completed successfully.
7/10 11:37:09 Running POST script of Job LastNode...
7/10 11:37:09 5/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 1 post
7/10 11:37:14 Event: ULOG_POST_SCRIPT_TERMINATED for Job LastNode (30.0.0)
7/10 11:37:14 POST Script of Job LastNode completed successfully.
7/10 11:37:14 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 11:37:14 All jobs Completed!
7/10 11:37:14 **** condor_scheduniv_exec.27.0 (condor_DAGMAN) EXITING WITH STATUS 0
$ cat results.work2.output
I'm process id 30478 on terminable.ci.uchicago.edu
Thu Jul 10 11:34:46 CDT 2003
Running as binary /home/YOURLOGIN/.globus/.gass_cache/local/md5/23/61b50cd9b278330cac68107dd390d6/md5/5e/004f7216b8b846d548357da00985f4/data WorkerNode2 60
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 60
Sleep of 60 seconds finished. Exiting
RESULT: 0 SUCCESS
$ exit
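The job output above ends with a RESULT line, which is convenient for automated checks. As a rough illustration, the Python sketch below extracts that line from a job's output. It is not the tutorial's postscript_checker; it assumes the RESULT line sits on a line of its own in the form "RESULT: <code> <label>", and the function name job_result is invented for this example.

```python
import re

# Assumed format, based on the last line of results.work2.output:
#   RESULT: 0 SUCCESS
RESULT = re.compile(r"^RESULT: (\d+) (\w+)$")


def job_result(output_text):
    """Return (code, label) from the last RESULT line, or None if there is none."""
    found = None
    for line in output_text.splitlines():
        match = RESULT.match(line.strip())
        if match:
            found = (int(match.group(1)), match.group(2))
    return found


# The tail of results.work2.output as shown above.
sample_output = """\
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 60
Sleep of 60 seconds finished. Exiting
RESULT: 0 SUCCESS
"""
result = job_result(sample_output)
print(result)
```

A wrapper could then treat a missing RESULT line or a nonzero code as a failure, for instance by checking each results.*.output file in the submit directory.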
That's it. There is a lot more you can do with Condor-G and DAGMan, but this basic introduction is all you need to know to get started. Good luck!