These exercises introduce you to some simple Grid activities. They will give you the necessary skills to begin using the grid for your own applications.
These notes will guide you through a number of exercises at your own pace. You will be given commands to type, along with the expected output and notes highlighting the key points of each step.
There are lab assistants to help you with problems or to answer any questions that you have. Do not hesitate to talk to them.
These notes have transcripts from a machine called
workshop2.ci.uchicago.edu. You might be
using a different machine, in which case you should be careful to replace
workshop2.ci.uchicago.edu with the name of the machine
you are logged in to.
These exercise notes were prepared by running as user
train99. The lab assistants will give
you your own login name and number. Make sure to use that login name
instead of train99 throughout the
exercises.
You will see various styles of text in the tutorial notes.
Text like this represents output from your computer.
Text like this is input that you should type.
Text like this is a listing of the content of a file, such as a program you will need to type in.
You will be doing all the lab exercises on a set of Linux computers
(or hosts)
named workshop2 and
osg-edu.cs.wisc.edu.
Each host has a fully qualified host name which uniquely identifies it on the internet; for example, workshop2.ci.uchicago.edu.
From these hosts, we will run Grid jobs locally and on real sites on the Open Science Grid.
To access workshop2
from your computer, use secure shell.
On a Windows machine, use the PuTTY program. Download and open PuTTY and enter the hostname of the computer that you will use. PuTTY can be downloaded here.
On a Mac, use the Terminal and ssh command-line tool. Open Terminal and type:
$ ssh train99@workshop2.ci.uchicago.edu
The authenticity of host 'workshop2.ci.uchicago.edu (1.1.1.1)' can't be established.
RSA key fingerprint is 36:74:78:a8:ed:6b:38:96:63:20:01:df:46:9b:59:3b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'workshop2.ci.uchicago.edu,1.1.1.1' (RSA) to the list of known hosts.
train99@workshop2.ci.uchicago.edu's password: PASSWORD # not echoed
workshop2$
Note: Make sure to replace the login name with your login name, as assigned by the instructors.
After the first time you do this, you won't get the "Are you sure..." prompt. Some of you will never see this, as your computers were used for testing this material, and the "yes" reply was already supplied by a tester. So it will look like:
$ ssh train99@workshop2.ci.uchicago.edu
Password: PASSWORD
workshop2$
You should be able to reach the other lab host
osg-edu.cs.wisc.edu in this way too. Another
machine available to you is
workshop4.ci.uchicago.edu.
To start, practice copying text from this page into your terminal window to run a command or set of commands. Copy the pwd command from the box below and paste it into your terminal window to execute it. This is a good way to avoid typing mistakes while entering commands, but make sure to read the command and check that you have replaced any necessary parameters, such as your user name.
Some suggestions for editors (when copy and paste will no longer be sufficient): vi, pico, nano, emacs. Feel free to choose any of these.
$ pwd
/home/train99
Now you should be able to run some execution jobs on the hosts in the lab.
First we'll try a simple 'Hello World' job:
workshop2$ globus-job-run localhost /bin/echo Hello World
Hello World
You've just submitted a job (the Linux command echo) to run on workshop2.ci.uchicago.edu. This is a simple building block for grid execution.
The globus-job-run utility runs commands on remote sites. You must tell this command several pieces of information:
The name of the host on which to run the job. In this example, we specified 'localhost', meaning the host you are using.
The name of the command to execute remotely. This must be a fully qualified path name (i.e., it must start with a "/"). In this example, we specified '/bin/echo'.
Parameters to pass to the command. In this example, we specify a message for echo, the text 'Hello World'.
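The three pieces of information above can be sketched as a small shell helper. This is only an illustration: the function name gram_command is hypothetical and not part of the Globus tools. It prints the globus-job-run command line rather than running it, and it rejects a command that is not given as a fully qualified path.

```shell
#!/bin/sh
# gram_command HOST EXECUTABLE [ARGS...]
# Hypothetical helper (not part of Globus): prints the globus-job-run
# command line for the given host, executable, and parameters.
gram_command() {
    if [ "$#" -lt 2 ]; then
        echo "usage: gram_command HOST EXECUTABLE [ARGS...]" >&2
        return 2
    fi
    host=$1
    exe=$2
    shift 2
    # The remote command must be a fully qualified path name.
    case $exe in
        /*) ;;
        *)  echo "error: '$exe' is not a fully qualified path" >&2
            return 1 ;;
    esac
    cmd="globus-job-run $host $exe"
    [ "$#" -gt 0 ] && cmd="$cmd $*"
    echo "$cmd"
}

gram_command localhost /bin/echo Hello World
```

Once the printed line looks right, you can paste it into your terminal to submit the job.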
Now we will run the Linux command hostname on the remote site to verify that we're talking to the resource we think we are.
Run it locally to make sure you are invoking it correctly.
$ hostname
workshop2
Use the command which to discover the location of the version of hostname that you are using. It will return a fully-qualified path name.
workshop2$ which hostname
/bin/hostname
This tells you that to run hostname via
globus-job-run, use /bin/hostname.
Use which to discover the location of the following commands on the system:
Now run hostname remotely, on osg-edu.cs.wisc.edu, to verify that you really are reaching a remote system:
workshop2$ globus-job-run osg-edu.cs.wisc.edu /bin/hostname
osg-edu.cs.wisc.edu
Next, see what else you can learn about the remote system with this approach.
Discover what user ID your job ran under using id.
Discover what environment variables are set using env.
Discover the load on the remote Grid server using uptime.
Discover the default working directory in which your remote job will run using pwd.
Do an ls of this working directory.
Use df to discover how much storage space exists in this working directory.
Use df to discover how much storage space
exists in the remote /tmp
directory.
GRAM, the Globus component for running remote jobs, supports the concept of a job manager as an adapter to Local Resource Managers (LRMs). Each site can support one or more such job managers. Our lab systems have two job managers: the fork job manager runs a job immediately, and the Condor job manager submits jobs into the Condor batch scheduling system.
Now we will investigate some of the differences between
the fork and Condor jobmanagers. Which do you think will be faster?
Use the command time to test which jobmanager is faster.
The "fork" job manager is very fast: it has low scheduling latency and runs trivial commands very quickly. But it also has very little compute power; it is usually just a single CPU on a front-end computer called the head node. A batch job manager, on the other hand, has a higher scheduling overhead, but usually gives you access to all the computers in a cluster and therefore to a lot more compute power.
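To see why the higher overhead of a batch job manager can still pay off, here is a rough back-of-the-envelope model. Every number in it is made up for illustration (job length, CPU count, and per-job overhead are assumptions, not measurements from the lab hosts), and the function name makespan is hypothetical.

```shell
#!/bin/sh
# Illustrative model only: all numbers below are assumptions, not measurements.
# makespan JOBS CPUS JOB_SECONDS OVERHEAD_SECONDS
# Jobs run in "waves" of CPUS at a time; each wave costs job time plus overhead.
makespan() {
    njobs=$1
    cpus=$2
    job_s=$3
    overhead_s=$4
    waves=$(( (njobs + cpus - 1) / cpus ))   # ceiling division
    echo $(( waves * (job_s + overhead_s) ))
}

# 100 jobs of 60 seconds each.
echo "fork,   1 CPU,   1 s overhead: $(makespan 100 1 60 1) s"
echo "batch, 50 CPUs, 11 s overhead: $(makespan 100 50 60 11) s"
```

Under these assumptions the single-CPU fork run takes 6100 seconds, while the batch run finishes in 142 seconds despite paying far more overhead per job.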
Our lab hosts use the Condor LRM. Other sites sometimes use
other LRMs. For example, Portable Batch System (PBS)
is very common. To submit a job to a site using PBS, you must specify
jobmanager-pbs.
Now try a job through Condor on a different machine:
workshop2$ globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
To time a command, enter time commandname:
workshop2$ time sleep 3
real 0m3.007s
user 0m0.004s
sys 0m0.000s
Use this to time a few trivial Grid jobs to compare Fork and Condor:
workshop2$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
workshop2
real 0m10.678s
user 0m0.090s
sys 0m0.030s
workshop2$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-fork /bin/hostname
workshop2
real 0m0.488s
user 0m0.090s
sys 0m0.020s
Throughout this tutorial we will use a simple application that tests whether a number is prime. (For background on primes, see Wikipedia or The Prime Pages.)
The prime testing application is invoked using the
primetest command. In its simplest use, it takes
a single parameter: the number to test.
For example, this command will test the number 122 for primality.
$ primetest 122
NO - 2 is a factor
As you can see, the number is not prime, because the application determined that 2 is a factor.
This algorithm can take some time. Use time to measure how long the command takes. This time will vary a lot, depending on several factors - for example, the size of the number, the structure of its factors, how many other people are running on the same computer.
Test how long it takes to test the integers: 3, 524287, 524288, and 1500450271.
$ time primetest 524287
The prime testing application uses a simple algorithm to test for primes: each possible factor below the target number is tested. So when primetest is run with input 122, it tests every number from 2 to 121 in sequence. This is a very simple example of a parameter sweep.
We can split this sweep into several separate pieces, and combine the results:
$ primetest 122 2 50
NO - 2 is a factor
$ primetest 122 51 121
NO - 61 is a factor
The first run tested potential factors between 2 and 50, and the second run tested potential factors between 51 and 121.
We can combine the results from several runs of primetest as follows: a number is prime if no run of primetest finds a factor, and a number is not prime if any run of primetest finds a factor. So in the case of 122, we can see that 122 is not prime because at least one of the pieces returned 'NO'.
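The split-and-combine rule can be sketched in a few lines of shell. So that the example runs anywhere, it uses primetest_stub, a hypothetical stand-in that mimics the NUMBER LOW HIGH interface of primetest; the "YES" line it prints when no factor is found is an assumption, since these notes only show the 'NO' output. The sweep function and its piece count are also illustrative.

```shell
#!/bin/sh
# Stand-in for the lab's primetest (assumption: same "NUMBER LOW HIGH"
# interface; the "YES" output for "no factor found" is made up).
primetest_stub() {
    n=$1
    f=$2
    stop=$3
    while [ "$f" -le "$stop" ]; do
        if [ $(( n % f )) -eq 0 ]; then
            echo "NO - $f is a factor"
            return 0
        fi
        f=$(( f + 1 ))
    done
    echo "YES"
}

# sweep NUMBER PIECES: test factors 2..NUMBER-1 in PIECES separate runs
# and combine the results: not prime if any piece reports "NO".
sweep() {
    number=$1
    pieces=$2
    low=2
    last=$(( number - 1 ))
    span=$(( (last - low + pieces) / pieces ))   # candidates per piece, rounded up
    verdict="prime"
    while [ "$low" -le "$last" ]; do
        high=$(( low + span - 1 ))
        [ "$high" -gt "$last" ] && high=$last
        result=$(primetest_stub "$number" "$low" "$high")
        echo "piece $low-$high: $result"
        case $result in
            NO*) verdict="not prime" ;;
        esac
        low=$(( high + 1 ))
    done
    echo "$number is $verdict"
}

sweep 122 3
```

On the lab hosts you could replace primetest_stub with the real primetest, and each piece could then be sent to a different grid site.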
Now test some of the same integers: 3, 524287, 524288, 1500450271, with the sweep divided into several sections.
This prime application is installed at several other sites on the grid. We can use GRAM to execute the application remotely at one of these sites.
Run primetest on osg-edu.cs.wisc.edu using GRAM:
workshop2$ globus-job-run osg-edu.cs.wisc.edu \
/nfs/osgedu/primetest 143
Can you run the same application on osg-edu.cs.wisc.edu? Which steps above do you need to do? What do you need to change from the above to make them work with osg-edu.cs.wisc.edu? What extra information do you need? You can ask the lab instructors for that extra information.
Earlier, we learned about Condor-G and DAGMan. Now we will submit some simple jobs using these components.
Check the Condor queue with condor_q
$ condor_q
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:36236> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
This command lists everything that Condor has been asked to run. Everyone will be using the same Condor installation for these exercises, so you will often see other students' jobs in the queue alongside your own.
Create Your Working Directories
Next, create some directories for you to work in. Make them in your home directory:
$ cd ~
$ mkdir condor-tutorial
$ cd condor-tutorial
$ mkdir submit
Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.
There are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer workshop2.ci.uchicago.edu and running under the "jobmanager-fork" job manager. We're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer.
For more information, see the condor_submit manual.
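As a sketch of how these settings fit together, the hypothetical helper below (submit_text is not a Condor command) prints a submit description containing only the entries described above. The executable path and argument in the example call are placeholders taken from this tutorial; use whatever the instructors give you.

```shell
#!/bin/sh
# submit_text EXECUTABLE ARGUMENT GRID_RESOURCE
# Hypothetical helper: prints a Condor-G submit description using only
# the settings discussed in this section.
submit_text() {
    cat <<EOF
executable=$1
arguments=$2
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 $3
queue
EOF
}

submit_text /sw/national_grids/primetest 143 workshop2.ci.uchicago.edu/jobmanager-fork
```

Redirect the output into a file (for example, by appending "> myjob.submit" to the call) to get something you can hand to condor_submit.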
Move to our scratch submission directory and create the submit file. Verify that it was entered correctly:
$ cd ~/condor-tutorial/submit
USE YOUR FAVOURITE TEXT EDITOR TO ENTER THE FILE CONTENT
$ cat myjob.submit
executable=/sw/national_grids/primetest
arguments=143
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 workshop2.ci.uchicago.edu/jobmanager-fork
queue
$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
Run condor_q to see the progress of your job. You can also run condor_q -globus to see Globus-specific status information. (See the condor_q manual for more information.)
$ condor_q
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 train99 7/10 17:28 0+00:00:00 I 0 0.0 primetest 143
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
1.0 train99 UNSUBMITTED fork osg-edu.cs.wisc.edu /sw/national_grids
In another window, run tail -f on the log file for your job to monitor progress. Re-run tail when you submit one or more jobs throughout this tutorial. You will see how typical Condor-G jobs progress. Use Ctrl+C to stop watching the file.
$ cd ~/condor-tutorial/submit
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <1.1.1.1:35688>
...
017 (001.000.000) 03/24 19:13:30 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:34127/28997/1174763610/
Can-Restart-JM: 1
...
027 (001.000.000) 07/10 17:29:01 Job submitted to grid resource
GridResource: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
GridJobId: gt2 workshop2.ci.uchicago.edu/jobmanager-fork https://workshop2.ci.uchicago.edu:51277/31413/1174756212/
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
When the job is no longer listed in condor_q, or when the log file reports Job terminated, the results can be viewed using condor_history:
$ condor_history
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
1.0 train99 7/10 10:28 0+00:00:00 C ??? /home/train99/cond
When the job completes, verify that the output is as expected. The binary name is different from what you created because of how Globus and Condor-G cooperate to stage your file to the execute computer.
$ ls
myjob.submit myscript.sh* results.error results.log results.output
$ cat results.error
$ cat results.output
NO - 11 is a factor
If you didn't watch results.log with tail -f, you will want to examine the logged information with cat results.log.
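If you would rather check the log from a script than read it by eye, a sketch along these lines may help. It relies only on what the log transcripts in these notes show: a "005 ... Job terminated." event followed by a "Normal termination (return value N)" line. The function names job_finished and return_value are hypothetical, and the sample log is a shortened copy written to a temporary file.

```shell
#!/bin/sh
# job_finished LOGFILE: succeeds if the log contains a "Job terminated"
# event (event code 005 in the transcripts in these notes).
job_finished() {
    grep -q '^005 .*Job terminated' "$1"
}

# return_value LOGFILE: prints N from the last
# "Normal termination (return value N)" line, if there is one.
return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1" | tail -n 1
}

# Try it on a shortened sample log.
log=$(mktemp)
cat > "$log" <<'EOF'
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <1.1.1.1:35688>
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
(1) Normal termination (return value 0)
...
EOF
if job_finished "$log"; then
    echo "finished with return value $(return_value "$log")"
else
    echo "still running"
fi
rm -f "$log"
```

For your own job, call job_finished results.log from the submit directory.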
Create a new submit file:
$ cat > myjob2.submit
executable=/nfs/osgedu/primetest
arguments=143
output=results2.output
error=results2.error
log=results2.log
notification=never
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
queue
Ctrl+D
$ cat myjob2.submit
executable=/nfs/osgedu/primetest
arguments=143
output=results2.output
error=results2.error
log=results2.log
notification=never
universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
queue
Notice that the setting for the grid_resource now refers to condor instead of fork. Globus will submit the job to Condor on osg-edu.cs.wisc.edu instead of running the job directly.
Submit the job to Condor-G:
$ condor_submit myjob2.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
You can monitor the job's progress just like the first job. If you log into osg-edu.cs.wisc.edu in another window, you can see your job in the Condor queue there. Be quick, or the job will finish before you look!
$ ssh osg-edu.cs.wisc.edu
train99@osg-edu.cs.wisc.edu's password:
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
vm1@clu1.phys LINUX INTEL Unclaimed Idle 0.000 9 0+00:03:34
vm2@clu1.phys LINUX INTEL Unclaimed Idle 0.000 9 0+00:03:32
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 100 0 0 100 0 0
Total 100 0 0 100 0 0
$ condor_q
-- Submitter: osg-edu.cs.wisc.edu : <1.1.1.1:36311> : osg-edu.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
11.0 train99 7/10 23:04 0+00:00:00 I 0 0.0 data TestJob 10
1 jobs; 1 idle, 0 running, 0 held
Clean up the results after the second job has finished running:
$ rm results.* results2.*
Now we'll use DAGMan, a tool which will help us run several grid jobs at once.
Create a small shell script to monitor the Condor-G queue. We will use this throughout the rest of the tutorial:
$ cat > watch_condor_q
#! /bin/sh
while true; do
    condor_q train99
    condor_q -globus train99
    sleep 10
done
Ctrl+D
$ cat watch_condor_q
#! /bin/sh
while true; do
    condor_q train99
    condor_q -globus train99
    sleep 10
done
$ chmod a+x watch_condor_q
Create a minimal DAG for DAGMan. This DAG will have a single node.
$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Submit the DAG.
This section requires you to have three windows open. We will submit the DAG in the first window and watch the progress of it and the job in the other two. We will do these in the following order:
In the first window, submit the DAG and then watch condor with watch_condor_q.
In the second window, tail the results log.
In the third window, tail the DAGMan log.
Submit the DAG with condor_submit_dag and watch the run with watch_condor_q. condor_dagman is running as a job and submits your real job on your behalf, without your direct intervention. You might see the C (completed) state as your job finishes, but that often goes by too quickly to notice.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
-----------------------------------------------------------------------
$ ./watch_condor_q
In the second window, watch the job log file as your job runs:
$ tail -f --lines=500 results.log
In the third window, watch DAGMan's log file by running tail -f --lines=500 mydag.dag.dagman.out. We suggest that you re-run this command whenever you submit a DAG during the remainder of this tutorial. This will show you how a typical DAG progresses. Use Ctrl+C to stop watching the file. An example is shown below:
$ cd ~/condor-tutorial/submit
$ tail -f --lines=500 mydag.dag.dagman.out
[...]
11/10 01:06:54 Of 1 nodes total:
11/10 01:06:54  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/10 01:06:54   ===     ===      ===     ===     ===        ===      ===
11/10 01:06:54     1       0        0       0       0          0        0
11/10 01:06:54 All jobs Completed!
11/10 01:06:54 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/10 01:06:54 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/10 01:06:54 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
11/10 01:06:54 Note: 0 total POST script deferrals because of -MaxPost limit (0)
11/10 01:06:54 **** condor_scheduniv_exec.1474.0 (condor_DAGMAN) EXITING WITH STATUS 0
The first window, running watch_condor_q, should look something like the following:
$ ./watch_condor_q
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 train99 7/10 17:33 0+00:00:03 R 0 2.6 condor_dagman -f -
3.0 train99 7/10 17:33 0+00:00:00 I 0 0.0 myscript.sh TestJo
2 jobs; 1 idle, 1 running, 0 held
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 train99 UNSUBMITTED fork workshop2.ci.uchicago.edu /tmp/train99-cond

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 train99 7/10 17:33 0+00:00:33 R 0 2.6 condor_dagman -f -
3.0 train99 7/10 17:33 0+00:00:15 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 train99 ACTIVE fork workshop2.ci.uchicago.edu /home/train99/cond

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 train99 7/10 17:33 0+00:01:03 R 0 2.6 condor_dagman -f -
3.0 train99 7/10 17:33 0+00:00:45 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 train99 ACTIVE fork workshop2.ci.uchicago.edu /tmp/train99-cond

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
Ctrl+C
Verify your results:
$ ls -l
total 12
-rw-r--r-- 1 train99 train99 28 Jul 10 10:35 mydag.dag
-rw-r--r-- 1 train99 train99 523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r-- 1 train99 train99 608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r-- 1 train99 train99 1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r-- 1 train99 train99 29 Jul 10 10:38 mydag.dag.lib.out
-rw------- 1 train99 train99 0 Jul 10 10:36 mydag.dag.lock
-rw-r--r-- 1 train99 train99 175 Jul 9 18:13 myjob.submit
-rwxr-xr-x 1 train99 train99 194 Jul 10 10:36 myscript.sh
-rw-r--r-- 1 train99 train99 31 Jul 10 10:37 results.error
-rw------- 1 train99 train99 833 Jul 10 10:38 results.log
-rw-r--r-- 1 train99 train99 261 Jul 10 10:37 results.output
-rwxr-xr-x 1 train99 train99 81 Jul 10 10:35 watch_condor_q
$ cat results.error
$ cat results.output
NO - 11 is a factor
Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job.
$ ls
mydag.dag            mydag.dag.dagman.log mydag.dag.lib.out myjob.submit results.error results.output
mydag.dag.condor.sub mydag.dag.dagman.out mydag.dag.lock    myscript.sh  results.log   watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe = scheduler
executable = /path/to/condor/bin/condor_dagman
getenv = True
output = mydag.dag.lib.out
error = mydag.dag.lib.out
log = mydag.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <1.1.1.1:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <1.1.1.1:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:
$ cat mydag.dag.dagman.out
Clean up your results. Be careful when deleting mydag.dag.* to not delete mydag.dag. Note the .*!
$ rm mydag.dag.* results.*
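If you are unsure what a pattern will match, preview it with echo or ls before handing it to rm. The sketch below does this in a throwaway directory created with mktemp, so nothing in your tutorial directory is touched; the file names simply mirror the ones used in this exercise.

```shell
#!/bin/sh
# Build a scratch directory with a few files named like the DAGMan output.
demo=$(mktemp -d)
touch "$demo/mydag.dag" "$demo/mydag.dag.lock" \
      "$demo/mydag.dag.dagman.out" "$demo/results.log"

# Expand each pattern inside the scratch directory without deleting anything.
safe=$(cd "$demo" && echo mydag.dag.*)
risky=$(cd "$demo" && echo mydag.dag*)

echo "mydag.dag.* matches: $safe"
echo "mydag.dag*  matches: $risky"

rm -rf "$demo"
```

The first pattern leaves mydag.dag alone; the second, without the dot before the star, would also match mydag.dag itself.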
Typically each node in a DAG will have its own Condor submit file. Create some more submit files by copying our existing file. For simplicity during this tutorial, we'll keep the submit files very similar, notably using the same executable. In real-world use, your submit files and executables can differ.
$ cp myjob.submit job.setup.submit
$ cp myjob.submit job.work1.submit
$ cp myjob.submit job.work2.submit
$ cp myjob.submit job.workfinal.submit
$ cp myjob.submit job.finalize.submit
Change the output and error entries to point to results.NODE.output and results.NODE.error files where NODE is actually the middle word in the submit file (job.NODE.submit).
So job.finalize.submit would include:
output=results.finalize.output
error=results.finalize.error
Here is one possible set of settings for the output entries:
$ grep '^output=' job.*.submit
job.finalize.submit:output=results.finalize.output
job.setup.submit:output=results.setup.output
job.work1.submit:output=results.work1.output
job.work2.submit:output=results.work2.output
job.workfinal.submit:output=results.workfinal.output
This prevents the various nodes from overwriting each other's output.
Do not change the log entries. DAGMan requires that all nodes output their logs in the same location. Condor will ensure that the different jobs will not overwrite each other's entries in the log.
Change the arguments entries so that the first argument is something unique to each node (perhaps the NODE name).
For node work2, change the second argument to 120 so that it looks something like arguments=MyWorkerNode2 120
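Instead of editing each copy by hand, the same changes can be scripted with sed. The following is a minimal sketch, not the tutorial's prescribed method: make_node_submit is a hypothetical helper, the demonstration runs in a scratch directory against a cut-down template, and the argument strings are only examples. It assumes the node name and arguments contain no "|" or "&" characters, which sed would treat specially here.

```shell
#!/bin/sh
# make_node_submit TEMPLATE NODE ARGUMENTS
# Hypothetical helper: derives job.NODE.submit from TEMPLATE, pointing
# output and error at results.NODE.* and replacing the arguments line.
# The log entry is left untouched so that every node shares one log file.
make_node_submit() {
    sed -e "s|^output=.*|output=results.$2.output|" \
        -e "s|^error=.*|error=results.$2.error|" \
        -e "s|^arguments=.*|arguments=$3|" \
        "$1" > "job.$2.submit"
}

# Demonstration in a scratch directory with a cut-down template.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' 'executable=/sw/national_grids/primetest' 'arguments=143' \
    'output=results.output' 'error=results.error' 'log=results.log' \
    'queue' > myjob.submit

make_node_submit myjob.submit setup "Setup 1"
make_node_submit myjob.submit work2 "MyWorkerNode2 120"
grep -E '^(arguments|output|error|log)=' job.work2.submit
```

The grep at the end shows the rewritten arguments, output, and error lines for node work2 alongside the unchanged log line.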
$ cat mydag.dag
Job HelloWorld myjob.submit
$ cat >> mydag.dag
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:
$ rm watch_condor_q
$ cat > watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
    condor_q train99
    echo ....
    echo .... Output from condor_q -globus
    echo ....
    condor_q -globus train99
    echo ....
    echo .... Output from condor_q -dag
    echo ....
    condor_q -dag train99
    sleep 10
done
Ctrl+D
$ chmod a+x watch_condor_q
In separate windows, run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
-----------------------------------------------------------------------
$ ./watch_condor_q
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 adesmet 7/10 17:45 0+00:00:08 R 0 2.6 condor_dagman -f -
5.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo
6.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup
3 jobs; 2 idle, 1 running, 0 held
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
5.0 train99 UNSUBMITTED fork workshop2.ci.uchicago.edu /tmp/username-cond
6.0 train99 UNSUBMITTED fork workshop2.ci.uchicago.edu /tmp/username-cond
[...]
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 train99 7/10 17:45 0+00:03:13 R 0 2.6 condor_dagman -f -
8.0 |-WorkerNode_ 7/10 17:46 0+00:01:28 R 0 0.0 myscript.sh Worker
2 jobs; 0 idle, 2 running, 0 held
[...]
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
Ctrl+C
Watching the logs or the condor_q output,
you'll note that the CollectResults node (workfinal)
wasn't run until both of the WorkerNode nodes (work1
and work2) finished.
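The ordering comes directly from the PARENT and CHILD lines in the DAG file. As a reading aid, the sketch below lists every dependency as a "parent -> child" pair. dag_edges is a hypothetical helper, not a Condor tool, and it assumes the keywords are written in upper case with one dependency statement per line, as they are in this tutorial's DAG file.

```shell
#!/bin/sh
# dag_edges FILE: prints one "parent -> child" line for every dependency
# declared by the PARENT ... CHILD ... lines of a DAG file.
dag_edges() {
    awk '$1 == "PARENT" {
        split_at = 0
        for (i = 2; i <= NF; i++) if ($i == "CHILD") split_at = i
        for (p = 2; p < split_at; p++)
            for (c = split_at + 1; c <= NF; c++)
                print $p " -> " $c
    }' "$1"
}

# Try it on the dependency lines used in this exercise.
dag=$(mktemp)
cat > "$dag" <<'EOF'
Job Setup job.setup.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
EOF
dag_edges "$dag"
rm -f "$dag"
```

Both worker nodes appear as parents of CollectResults, which is why that node waits for the slower of the two.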
$ ls
job.finalize.submit  mydag.dag.condor.sub myscript.sh             results.setup.error  results.workfinal.error
job.setup.submit     mydag.dag.dagman.log results.error           results.setup.output results.workfinal.output
job.work1.submit     mydag.dag.dagman.out results.finalize.error  results.work1.error  watch_condor_q
job.work2.submit     mydag.dag.lib.out    results.finalize.output results.work1.output
job.workfinal.submit mydag.dag.lock       results.log             results.work2.error
mydag.dag            myjob.submit         results.output          results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output
==> results.finalize.output <==
I'm process id 29614 on workshop2.ci.uchicago.edu
Thu Jul 10 10:53:58 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.
Exiting

==> results.setup.output <==
I'm process id 29337 on workshop2.ci.uchicago.edu
Thu Jul 10 10:50:31 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.
Exiting

==> results.work1.output <==
I'm process id 29444 on workshop2.ci.uchicago.edu
Thu Jul 10 10:51:04 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.
Exiting

==> results.work2.output <==
I'm process id 29432 on workshop2.ci.uchicago.edu
Thu Jul 10 10:51:03 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished.
Exiting

==> results.workfinal.output <==
I'm process id 29554 on workshop2.ci.uchicago.edu
Thu Jul 10 10:53:27 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.
Exiting
$ cat results.log
000 (005.000.000) 07/10 17:45:24 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: HelloWorld
...
000 (006.000.000) 07/10 17:45:24 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: Setup
...
017 (006.000.000) 07/10 17:45:42 Job submitted to Globus
RM-Contact: gk2:/jobmanager-fork
JM-Contact: https://gk2:2349/914/1057877133/
Can-Restart-JM: 1
...
001 (006.000.000) 07/10 17:45:42 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
017 (005.000.000) 07/10 17:45:42 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:2348/915/1057877133/
Can-Restart-JM: 1
...
001 (005.000.000) 07/10 17:45:42 Job executing on host: gk2
...
005 (005.000.000) 07/10 17:46:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
005 (006.000.000) 07/10 17:46:50 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (007.000.000) 07/10 17:46:55 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: WorkerNode_1
...
000 (008.000.000) 07/10 17:46:56 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: WorkerNode_Two
...
017 (008.000.000) 07/10 17:47:09 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:2364/1037/1057877219/
Can-Restart-JM: 1
...
001 (008.000.000) 07/10 17:47:09 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
017 (007.000.000) 07/10 17:47:09 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:2367/1040/1057877220/
Can-Restart-JM: 1
...
001 (007.000.000) 07/10 17:47:09 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (007.000.000) 07/10 17:48:17 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
005 (008.000.000) 07/10 17:49:18 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (009.000.000) 07/10 17:49:22 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: CollectResults
...
017 (009.000.000) 07/10 17:49:35 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:2383/1185/1057877366/
Can-Restart-JM: 1
...
001 (009.000.000) 07/10 17:49:35 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (009.000.000) 07/10 17:50:42 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
DAG Node: LastNode
...
017 (010.000.000) 07/10 17:50:55 Job submitted to Globus
RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
JM-Contact: https://workshop2.ci.uchicago.edu:2392/1247/1057877446/
Can-Restart-JM: 1
...
001 (010.000.000) 07/10 17:50:55 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (010.000.000) 07/10 17:52:02 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
$ cat mydag.dag.dagman.out
7/10 17:45:24 ******************************************************
7/10 17:45:24 ** condor_scheduniv_exec.4.0 (CONDOR_DAGMAN) STARTING UP
7/10 17:45:24 ** $CondorVersion: 6.8.4 Apr 22 2006 $
7/10 17:45:24 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 17:45:24 ** PID = 18826
7/10 17:45:24 ******************************************************
7/10 17:45:24 DaemonCore: Command Socket at <workshop2.ci.uchicago.edu:35774>
7/10 17:45:24 argv[0] == "condor_scheduniv_exec.4.0"
7/10 17:45:24 argv[1] == "-Debug"
7/10 17:45:24 argv[2] == "3"
7/10 17:45:24 argv[3] == "-Lockfile"
7/10 17:45:24 argv[4] == "mydag.dag.lock"
7/10 17:45:24 argv[5] == "-Condorlog"
7/10 17:45:24 argv[6] == "results.log"
7/10 17:45:24 argv[7] == "-Dag"
7/10 17:45:24 argv[8] == "mydag.dag"
7/10 17:45:24 argv[9] == "-Rescue"
7/10 17:45:24 argv[10] == "mydag.dag.rescue"
7/10 17:45:24 Condor log will be written to results.log
7/10 17:45:24 DAG Lockfile will be written to mydag.dag.lock
7/10 17:45:24 DAG Input file is mydag.dag
7/10 17:45:24 Rescue DAG will be written to mydag.dag.rescue
7/10 17:45:24 Parsing mydag.dag ...
7/10 17:45:24 Dag contains 6 total jobs
7/10 17:45:24 Bootstrapping...
7/10 17:45:24 Number of pre-completed jobs: 0
7/10 17:45:24 Submitting Job HelloWorld ...
7/10 17:45:24 assigned Condor ID (5.0.0)
7/10 17:45:24 Submitting Job Setup ...
7/10 17:45:24 assigned Condor ID (6.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job Setup (6.0.0)
7/10 17:45:25 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job HelloWorld (5.0.0)
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job HelloWorld (5.0.0)
7/10 17:46:55 Job HelloWorld completed successfully.
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job Setup (6.0.0)
7/10 17:46:55 Job Setup completed successfully.
7/10 17:46:55 Submitting Job WorkerNode_1 ...
7/10 17:46:55 assigned Condor ID (7.0.0)
7/10 17:46:55 Submitting Job WorkerNode_Two ...
7/10 17:46:56 assigned Condor ID (8.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:46:56 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Job WorkerNode_1 completed successfully.
7/10 17:48:21 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (8.0.0)
7/10 17:49:21 Job WorkerNode_Two completed successfully.
7/10 17:49:21 Submitting Job CollectResults ...
7/10 17:49:22 assigned Condor ID (9.0.0)
7/10 17:49:22 Event: ULOG_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:22 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:37 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:37 Event: ULOG_EXECUTE for Job CollectResults (9.0.0)
7/10 17:50:42 Event: ULOG_JOB_TERMINATED for Job CollectResults (9.0.0)
7/10 17:50:42 Job CollectResults completed successfully.
7/10 17:50:42 Submitting Job LastNode ...
7/10 17:50:42 assigned Condor ID (10.0.0)
7/10 17:50:42 Event: ULOG_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:42 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:50:57 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:57 Event: ULOG_EXECUTE for Job LastNode (10.0.0)
7/10 17:52:02 Event: ULOG_JOB_TERMINATED for Job LastNode (10.0.0)
7/10 17:52:02 Job LastNode completed successfully.
7/10 17:52:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 17:52:02 All jobs Completed!
7/10 17:52:02 **** condor_scheduniv_exec.4.0 (condor_DAGMAN) EXITING WITH STATUS 0
Clean up your results. Be careful when deleting the mydag.dag.* files: you want to keep mydag.dag itself and remove only the files that match mydag.dag.*.
$ rm mydag.dag.* results.*
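If you want to convince yourself that the pattern is safe before running it on real results, the following is a minimal sketch that repeats the cleanup in a throwaway directory. The directory and the empty files are created only for this illustration and are not part of the exercise.

```shell
# Work in a temporary directory so that no real results are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/mydag.dag" \
      "$demo_dir/mydag.dag.lock" \
      "$demo_dir/mydag.dag.dagman.out" \
      "$demo_dir/results.log"

# Preview what the patterns expand to before deleting anything.
(cd "$demo_dir" && ls -d mydag.dag.* results.*)

# The pattern needs a literal "." after "mydag.dag", so mydag.dag itself is not matched.
(cd "$demo_dir" && rm mydag.dag.* results.*)

remaining=$(ls "$demo_dir")
echo "left after cleanup: $remaining"

# Remove the throwaway directory again.
rm -r "$demo_dir"
```

Listing the matches with ls first is a cheap habit whenever a wildcard is passed to rm.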
Make a working directory for this exercise. For the rest of this exercise, all your work should be done in there.
$ mkdir dataex
$ cd dataex
Next create some files of different sizes, to use for exercises:
$ dd if=/dev/zero of=smallfile-train99 bs=1M count=10
$ dd if=/dev/zero of=mediumfile-train99 bs=1M count=50
$ dd if=/dev/zero of=largefile-train99 bs=1M count=200
$ ls -sh
total 261M
201M largefile-train99
 51M mediumfile-train99
 11M smallfile-train99
Now try transferring a file to a remote site.
First you will need some scratch space on the remote system. You can create a working directory in your remote home directory.
workshop2.ci.uchicago.edu$ globus-job-run osg-edu.cs.wisc.edu /bin/mkdir /nfs/osgedu/clemson-train99
Now copy the file over to this directory:
workshop2.ci.uchicago.edu$ globus-url-copy -vb file:///home/train99/dataex/smallfile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
Source: file:///home/train99/dataex/smallfile
Dest: gsiftp://osg-edu.cs.wisc.edu/home//gpfs1/osg_data/osgedu/train99
largefile-train99 -> ex1
208666624 bytes 1.41 MB/sec avg 1.43 MB/sec inst
You will probably find that the transfer rate is much lower than when copying to local machines.
You can try copying to other sites in addition to osg-edu.cs.wisc.edu. Remember that you might need to make a scratch directory on each one, and that the place for this will be different for each site.
Use the -vb flag to see how fast the transfer runs when copying the large file. Since this is a transfer over a local network that should not be too busy, it should be fairly quick:
$ globus-url-copy -vb file:///home/train99/dataex/largefile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
Source: file:///home/train99/dataex/
Dest: gsiftp://osg-edu.cs.wisc.edu/home/train99/
largefile-train99 -> ex1
207618048 bytes 8.81 MB/sec avg 9.09 MB/sec inst
A quick reminder on URL formats: we've seen two kinds of URLs so far.
file:///home/train99/dataex/largefile - a file called largefile on the local file system, in the directory /home/train99/dataex/.
gsiftp://osg-edu.cs.wisc.edu/scratch/train99/ - a directory accessible via gsiftp on the host called osg-edu.cs.wisc.edu in directory /scratch/train99.
Try using 4 parallel data streams by adding the -p flag with an argument of 4.
Use the following globus-url-copy command to transfer the file from workshop2.ci.uchicago.edu to osg-edu.cs.wisc.edu:
$ globus-url-copy -p 4 -vb file:///home/train99/dataex/smallfile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
Experiment with transferring different file sizes and numbers of parallel streams, to both local and remote sites, and see how the speed varies.
Next try a third-party transfer. You do this by specifying two gsiftp URLs, instead of one gsiftp URL and one file URL.
globus-url-copy will control the transfers but data will not pass through the local machine. Instead, it will go directly between the source and destination machines.
Transfer a file between two remote sites, and see if it is faster than if you had transferred it to workshop2.ci.uchicago.edu and then back out again.
Try to work out the command line for this yourself: use two gsiftp URLs instead of a file URL and a gsiftp URL.
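If you get stuck, the sketch below shows the general shape of such a command. It only assembles and prints the command line; it does not run a transfer. The host names site-a.example.org and site-b.example.org and both paths are hypothetical placeholders, so substitute sites and scratch directories that you can actually use.

```shell
# Hypothetical hosts and paths: replace them with sites where you have scratch space.
src_host="site-a.example.org"
dst_host="site-b.example.org"
src_path="/scratch/train99/largefile-train99"
dst_path="/scratch/train99/largefile-copy"

# Both URLs use the gsiftp scheme, so the data goes directly between the two servers.
third_party_cmd="globus-url-copy -vb gsiftp://${src_host}${src_path} gsiftp://${dst_host}${dst_path}"

# Print the command for review instead of running it.
echo "$third_party_cmd"
```

Once the printed command looks right, paste it at the prompt and compare the reported rate with the two-step copy through workshop2.ci.uchicago.edu.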
Next use RFT, the reliable file transfer service, to transfer a block of files between two sites.
First, create a transfer job file, which lists some RFT
parameters and all of the files to transfer. You can get an example from
workshop2.ci.uchicago.edu:/sw/misc/example.rft. Read
through this and change the URLs (the site names and the files -- pay
attention) at the end to refer to your files.
The RFT command and transfer job file reference is available here.
This example lists three transfers: largefile will be transferred three times, once to osg-edu.cs.wisc.edu, once to osg-edu.cs.wisc.edu, and once to another host on the grid.
You can launch it as follows. The client will periodically output the transfer status. You can watch transfers move from the Pending state to the Active state and then to the Finished state.
$ cp /sw/misc/example.rft rft.xfr
$ vi rft.xfr
... make your changes ...
$ rft -h workshop2.ci.uchicago.edu -f ./rft.xfr
Number of transfers in this request: 3
Subscribed for overall status
Termination time to set: 60 minutes
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 0/1/0/0/2
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 1/0/0/0/2
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 1/1/0/0/1
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 2/0/0/0/1
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 2/1/0/0/0
Overall status of transfer: Finished/Active/Failed/Retrying/Pending 3/0/0/0/0
All Transfers are completed
Initially all transfers are in the Pending state. They move to the Active state and then, if all goes well, to the Finished state. A transfer that fails goes to the Failed state instead.
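The five numbers in each status line follow the order Finished/Active/Failed/Retrying/Pending. As a minimal sketch, assuming the status is available as a single line like those in the transcript above, the counters can be split into named shell variables as follows. The variable names are chosen for this illustration only.

```shell
# One status line copied from the transcript above.
status_line="Overall status of transfer: Finished/Active/Failed/Retrying/Pending 1/1/0/0/1"

# Keep only the text after the last space, i.e. the counters "1/1/0/0/1".
counts=${status_line##* }

# Split the counters on "/" into one variable per state.
IFS=/ read -r finished active failed retrying pending <<EOF
$counts
EOF

total=$((finished + active + failed + retrying + pending))
echo "finished=$finished active=$active failed=$failed retrying=$retrying pending=$pending total=$total"
```

The total should always equal the number of transfers in the request, which is 3 in this example.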
The transfer file has a number of options, documented in-line. You can experiment changing them. Interesting ones to try:
Add more URLs to transfer
Transfer between two remote sites
Use parallel streams
Increase the transfer concurrency
In particular you should check that you understand the difference between parallel streams (the number of streams used when transferring one file) and concurrency (the number of files that can be transferred at once).
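To make the distinction concrete, here is a small worked example. The two values are hypothetical settings chosen for the arithmetic, not defaults of the transfer job file.

```shell
parallel_streams=4   # streams used for each single file (hypothetical setting)
concurrency=3        # files transferred at the same time (hypothetical setting)

# Each of the concurrent files gets its own set of parallel streams.
max_open_streams=$((parallel_streams * concurrency))
echo "up to $max_open_streams data streams open at once"
```

Raising either value increases the load on both endpoints, so change one at a time when you compare transfer rates.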
The above sections have dealt with moving data around, and always made the assumption that you knew where the files you wanted were located.
Next we will deal with the Replica Location Service (RLS).
$ globus-rls-admin -p rls://workshop2.ci.uchicago.edu
ping rls://workshop2.ci.uchicago.edu: 0 seconds
First perform a simple query for an example logical filename that has been placed in the RLS by the instructors:
$ globus-rls-cli rls://workshop2.ci.uchicago.edu
rls> query lrc lfn example
example: gsiftp://workshop2.ci.uchicago.edu/scratch/example
example: gsiftp://osg-edu.cs.wisc.edu/scratch/example
This queries for a logical filename example. The results show that this file can be retrieved via either of two URLs (one in scratch space on workshop2.ci.uchicago.edu, and one in scratch space on osg-edu.cs.wisc.edu).
Now try querying for logical filename another-example.
You can also publish your own logical filename into the RLS, with mappings to physical files, using the create command:
rls> create train99-first-lfn gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99
This creates an LFN called train99-first-lfn and then adds a mapping to gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99.
rls> query lrc lfn train99-first-lfn
train99-first-lfn: gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99
Now copy largefile to another place (on another gridlab machine or on one of the remote sites), and register it into the RLS, with the same LFN. You will need to use the add command instead of the create command, because the LFN already exists and you just need to add a new mapping.
Get a neighbour to query the RLS for your logical filename, and see that the mappings you have made are public for everyone to see.
So far, you have only been using the RLS server on workshop2.ci.uchicago.edu. There are servers running on other machines.
Use globus-rls-admin to ping the RLS server on osg-edu.cs.wisc.edu and check that it is online.
Then, connect to one of the other servers using globus-rls-cli and query for the example LFN that we used above. You should see that there are some other locations from which you can get the example file.
Try adding your own LFN into one of the other servers, using globus-rls-cli.
Next use the -S option to check the status/statistics of each of the two servers. You should see output similar to that below:
$ globus-rls-admin -S rls://workshop2.ci.uchicago.edu
Version: 2.1.5
Uptime: 00:28:15
LRC stats
update method: lfnlist
update method: bloomfilter
updates bloomfilter: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:45
lfnlist update interval: 86400
bloomfilter update interval: 900
numlfn: 1
numpfn: 1
nummap: 1
RLI stats
updated by: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:35
updated via bloomfilters
$ globus-rls-admin -S rls://gk2
Version: 2.1.5
Uptime: 00:32:33
LRC stats
update method: lfnlist
update method: bloomfilter
updates bloomfilter: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:40
lfnlist update interval: 86400
bloomfilter update interval: 900
numlfn: 2
numpfn: 2
nummap: 2
RLI stats
updated by: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:49
updated via bloomfilters
Make a working directory for this exercise. For the rest of this exercise, all your work should be done in there.
$ mkdir srmex
$ cd srmex
There are a few environment variables already set for you.
SRM_HOME : srm client installation directory; /sw/srmclient2
SRMEP : SRM service endpoint; srm://gwdca04.fnal.gov:8443/srm/managerv2
SRMPATH : Working directory on SRM storage; /pnfs/fnal.gov/data/osgedu
MYNAME : your login
Next create a file to use for exercises:
$ dd if=/dev/zero of=smallfile-$MYNAME bs=1M count=2
$ ls -l
-rw-r--r-- 1 train03 train03 2097152 2008-01-12 18:15 smallfile-train03
You have already used globus-url-copy to move files from your local machine to one of the designated target machines, and from a remote gridftp server to your local machine.
Use srm-ping to find out the status of the SRM server at $SRMEP.
$ srm-ping $SRMEP
This returns the SRM version number, similar to the following:
Ping versionInfo=v2.2
Extra information
Key=backend_type
Value=dCache
Key=backend_version
Value=production-1-8-0-9
A file transfer into SRM managed storage goes through several protocols, including a gridftp file transfer. Internally, the client communicates with the SRM server through several interfaces: srmPrepareToPut to submit your request, srmStatusOfPutRequest to check its status, the gridftp file transfer itself, and srmPutDone to finalize the state of your file transfer.
$ srm-copy file:////home/train99/srmex/smallfile-$MYNAME \
$SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
Upon successful completion, this returns a summary similar to the following:
SRM-CLIENT*REQUESTTYPE=put
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782625
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= file:////home/train99/srmex/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_SUCCESS
SRM-CLIENT*EXPLANATION[0]=Done
A quick reminder on URL formats:
We've seen two kinds of URLs so far.
file:////home/train99/srmex/smallfile - a file called smallfile on the local file system, in directory /home/train99/srmex/. The appended $MYNAME is only to make the filename unique in this grid school.
srm://gwdca04.fnal.gov:8443/srm/managerv2\?SFN=/pnfs/fnal.gov/data/osgedu/smallfile-train99 - a site URL (SURL) for a file named smallfile-train99 on the SRM running on host gwdca04.fnal.gov, port 8443, with the web service handle /srm/managerv2, in directory /pnfs/fnal.gov/data/osgedu. SFN stands for Site File Name.
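The SURL is simply the service endpoint, the literal text ?SFN=, and the path on the storage joined together. The sketch below assembles one from its parts. The lowercase variables are stand-ins defined only for this illustration, with the values listed earlier for SRMEP, SRMPATH and MYNAME; on the lab machines you would use the variables that are already set.

```shell
# Illustration-only copies of the workshop values (SRMEP, SRMPATH, MYNAME).
srm_endpoint="srm://gwdca04.fnal.gov:8443/srm/managerv2"
srm_path="/pnfs/fnal.gov/data/osgedu"
login_name="train99"

# Inside double quotes the "?" needs no backslash, because the shell
# does not attempt filename matching on quoted text.
surl="${srm_endpoint}?SFN=${srm_path}/smallfile-${login_name}"
echo "$surl"
```

The backslash in the commands of this exercise serves the same purpose as the quotes here: it stops the shell from treating ? as a wildcard.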
Now try to find out the properties of the file that you just put into SRM.
$ srm-ls $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
Upon successful completion, this returns a summary similar to the following:
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*REQUEST_EXPLANATION=srm-ls completed normally
SRM-CLIENT*SURL=/pnfs/fnal.gov/data/osgedu/smallfile-alex
SRM-CLIENT*BYTES=2097152
SRM-CLIENT*FILETYPE=FILE
SRM-CLIENT*STORAGETYPE=PERMANENT
SRM-CLIENT*FILE_STATUS=SRM_SUCCESS
SRM-CLIENT*OWNERPERMISSION=7166
SRM-CLIENT*LIFETIMELEFT=-1
SRM-CLIENT*LIFETIMEASSIGNED=-1
SRM-CLIENT*CHECKSUMTYPE=adler32
SRM-CLIENT*CHECKSUMVALUE=01e00001
SRM-CLIENT*FILELOCALITY=ONLINE
SRM-CLIENT*OWNERPERMISSION.USERID=7166
SRM-CLIENT*OWNERPERMISSION.MODE=RW
SRM-CLIENT*GROUPPERMISSION.GROUPID=9803
SRM-CLIENT*GROUPPERMISSION.MODE=R
SRM-CLIENT*OTHERPERMISSION=R
SRM-CLIENT*RETENTIONPOLICY=CUSTODIAL
SRM-CLIENT*ACCESSLATENCY=ONLINE
SRM-CLIENT*LASTACCESSED=2008-1-12-18-18-39
SRM-CLIENT*CREATEDATTIME=2008-1-12-18-18-39
Now get the file that you just put into SRM and browsed, copying it from the SRM managed storage to your local machine. Internally, the client communicates with the SRM server through several interfaces: srmPrepareToGet to submit your request, srmStatusOfGetRequest to check its status, the gridftp file transfer itself, and srmReleaseFiles to release the file after your transfer.
$ srm-copy $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME \
file:////home/train99/srmex/my-smallfile
Upon successful completion, this returns a summary similar to the following:
SRM-CLIENT*REQUESTTYPE=get
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782626
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= file:////home/train99/srmex/my-smallfile
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_FILE_PINNED
SRM-CLIENT*EXPLANATION[0]=Done
After srm-copy is completed, find out the file size at the target on your local machine:
$ ls -l /home/train99/srmex/my-smallfile
-rw-r--r-- 1 train99 train99 2097152 2008-01-12 19:29 my-smallfile
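Rather than comparing sizes by eye, you can let the shell do it. The following is a minimal sketch that runs entirely in a temporary directory: the cp command merely stands in for the put and get round trip through SRM, and the file names are used only for this illustration.

```shell
work_dir=$(mktemp -d)

# 2048 blocks of 1024 bytes: the same 2097152 bytes as the exercise file.
dd if=/dev/zero of="$work_dir/smallfile" bs=1024 count=2048 2>/dev/null

# Stand-in for the srm-copy round trip; no grid service is contacted here.
cp "$work_dir/smallfile" "$work_dir/my-smallfile"

# wc -c reports the size in bytes; tr strips the padding some systems add.
src_size=$(wc -c < "$work_dir/smallfile" | tr -d ' ')
dst_size=$(wc -c < "$work_dir/my-smallfile" | tr -d ' ')

if [ "$src_size" -eq "$dst_size" ]; then
    sizes_match=yes
else
    sizes_match=no
fi
echo "source=$src_size bytes, copy=$dst_size bytes, match=$sizes_match"

rm -r "$work_dir"
```

A matching size does not prove the contents are identical, but a mismatch is a quick sign that a transfer was cut short.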
Now remove the file that you put into the SRM managed storage.
$ srm-rm $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
Upon successful completion, this returns a summary similar to the following:
SRM-DIR: Total files to remove: 1
status=SRM_SUCCESS
explanation=successfully removed files
surl=$SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
After srm-rm returns successfully, find out the file properties of the same SURL on the SRM with srm-ls. You should see that the SURL is invalid.
Now try to create a directory in SRM managed storage.
$ srm-mkdir $SRMEP\?SFN=$SRMPATH/$MYNAME
This will create a directory under the SRM that you can use in your SURLs. Upon successful completion, this returns a summary similar to the following:
SRM-DIR: Sat Jan 12 19:04:09 CST 2008 Calling SrmMkdir
status=SRM_SUCCESS
explanation=success
Browse the directory with srm-ls to see what property information SRM returns for it.
Now try to remove the directory from SRM.
$ srm-rmdir $SRMEP\?SFN=$SRMPATH/$MYNAME
This will remove a directory under the SRM. Upon successful completion, this returns a summary similar to the following:
SRM-DIR: Sat Jan 12 19:06:34 CST 2008 Calling SrmRmdir
status=SRM_SUCCESS
explanation=success
Experiment with putting and getting files with different file sizes and numbers of parallel streams to and from the remote SRM site, and see the differences. When you use 4 parallel data streams by adding the -parallelism option with an argument of 4, the client operation goes through the same protocol, and the parallel streams are used in the gridftp file transfer. Larger files would make a significant difference in file transfer performance.
Experiment with directory structure in your path.
Note: Remember to remove those files and directories that you created afterwards.
Now, let's make a space reservation for 5M bytes of total space, 4M bytes of guaranteed space and lifetime of 900 seconds:
$ srm-sp-reserve -serviceurl $SRMEP -size 5000000 -gsize 4000000 -lifetime 900
Upon successful completion, this returns a summary similar to the following:
SRM-SPACE: Status Code for spaceStatusRequest SRM_SUCCESS
SpaceToken=258138
TotalReservedSpaceSize=4000000
Retention Policy=REPLICA
Access Latency=ONLINE
Upon successful space reservation, the output shows the space token, which is used in the next exercises (258138 in the example above; a token is not necessarily numeric, and different storage systems may return different string formats). Note that your reserved space was returned as 4MB. Set the returned space token as an environment variable so that you can reuse it later:
$ export SPTOKEN=258138
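Instead of copying the token by hand, you could extract it from saved output. The sketch below works on a copy of the sample output shown above, stored in a variable; in practice you would capture the real command output, and the exact layout may differ between SRM implementations, so treat the pattern as an assumption.

```shell
# Sample output copied from the transcript above.
reserve_output='SRM-SPACE: Status Code for spaceStatusRequest SRM_SUCCESS
SpaceToken=258138
TotalReservedSpaceSize=4000000
Retention Policy=REPLICA
Access Latency=ONLINE'

# Print only what follows "SpaceToken=", so non-numeric tokens work too.
SPTOKEN=$(printf '%s\n' "$reserve_output" | sed -n 's/^[[:space:]]*SpaceToken=//p')
export SPTOKEN
echo "SPTOKEN=$SPTOKEN"
```

Matching on the key name rather than on digits keeps the extraction usable with storage systems that return other token formats.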
Finding out space properties from SRM
Now, let's find out the space information with the space token that you just received above:
$ srm-sp-info -serviceurl $SRMEP -spacetoken $SPTOKEN
Upon successful completion, this returns a summary similar to the following:
SRM-SPACE: ....space token details ....
status=SRM_SUCCESS
SpaceToken=258138
TotalSize=4000000
Owner=VoGroup=osgedu VoRole=null
LifetimeAssigned=900
LifetimeLeft=463
UnusedSize=4000000
GuaranteedSize=4000000
RetentionPolicy=REPLICA
AccessLatency=ONLINE
status=SRM_SUCCESS
explanation=ok
Retrieving space tokens from SRM
Suppose you have lost your space token. Let's find out how to retrieve the space tokens that belong to you:
$ srm-sp-tokens -serviceurl $SRMEP
Upon successful completion, this returns a summary similar to the following:
SRM-SPACE: ...................................
Status=SRM_SUCCESS
Explanation=OK
SRM-SPACE (0)SpaceToken=258138
This shows all the space tokens that belong to your grid identity and its mapping on the server.
Some time has passed since the space reservation above, and the lifetime of the reserved space may be close to expiring. Now, let's update the lifetime of the space as well as its size. We'll use 7MB of total space with 6MB of guaranteed space, and make the lifetime 950 seconds:
$ srm-sp-update -serviceurl $SRMEP -spacetoken $SPTOKEN -size 7000000 -gsize 6000000 -lifetime 950
On this server, the request returns a summary similar to the following, because the target SRM storage does not support this functionality.
SRM-SPACE: Sat Jan 12 19:09:55 CST 2008 Calling updateSpace request
status=SRM_NOT_SUPPORTED
explanation=can not find a handler, not implemented
Request token=null
However, when the SRM storage supports the functionality and the request is successful, this returns a summary similar to the following.
SRM-SPACE: Sat Jan 12 21:22:50 PST 2008 Calling updateSpace request
status=SRM_SUCCESS
Request token=null
lifetime=950
Min=7000000
Max=7000000
Your space token is the same as before, and upon successful completion the lifetime and size of your space should be updated. Use srm-sp-info to retrieve the space information from the SRM and verify the updated values.
Now let's put a file into your reserved space using the space token. This client operation communicates with the SRM server, same as before. However, because of your space token, your file will be written into the space that you have reserved.
$ srm-copy file:////home/train99/srmex/smallfile-$MYNAME \
$SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME \
-spacetoken $SPTOKEN
Upon successful completion, this returns a summary similar to the following:
SRM-CLIENT*REQUESTTYPE=put
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782603
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= file:////home/train99/srmex/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-space-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_SUCCESS
SRM-CLIENT*EXPLANATION[0]=Done
After successful completion, find out the file properties with srm-ls.
$ srm-ls $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME
Upon successful completion, this returns a summary similar to the following:
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*REQUEST_EXPLANATION=srm-ls completed normally
SRM-CLIENT*SURL=/pnfs/fnal.gov/data/osgedu/smallfile-space-alex
SRM-CLIENT*BYTES=2097152
SRM-CLIENT*FILETYPE=FILE
SRM-CLIENT*STORAGETYPE=PERMANENT
SRM-CLIENT*FILE_STATUS=SRM_SUCCESS
SRM-CLIENT*OWNERPERMISSION=7166
SRM-CLIENT*LIFETIMELEFT=-1
SRM-CLIENT*LIFETIMEASSIGNED=-1
SRM-CLIENT*CHECKSUMTYPE=adler32
SRM-CLIENT*CHECKSUMVALUE=01e00001
SRM-CLIENT*FILELOCALITY=ONLINE
SRM-CLIENT*OWNERPERMISSION.USERID=7166
SRM-CLIENT*OWNERPERMISSION.MODE=RW
SRM-CLIENT*GROUPPERMISSION.GROUPID=9803
SRM-CLIENT*GROUPPERMISSION.MODE=R
SRM-CLIENT*OTHERPERMISSION=R
SRM-CLIENT*SPACETOKENS(0)=258138
SRM-CLIENT*RETENTIONPOLICY=CUSTODIAL
SRM-CLIENT*ACCESSLATENCY=ONLINE
SRM-CLIENT*LASTACCESSED=2008-1-12-19-16-37
SRM-CLIENT*CREATEDATTIME=2008-1-12-19-16-37
Note from the previous srm-ls output that this time it shows the space token you used when putting your file into the SRM managed storage.
Now let's release the reserved space using the space token.
$ srm-sp-release -serviceurl $SRMEP -spacetoken $SPTOKEN
Upon successful completion, this returns a summary similar to the following:
SRM-SPACE: Releasing space for token=258138
status=SRM_SUCCESS
explanation=Space released
This operation may fail if you have any files in the space associated with the space token. In that case, remove the files with srm-rm and try releasing the space again.
$ srm-rm $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME
After successfully releasing your reserved space, find out the space properties with srm-sp-info.
Experiment with reserving spaces of different sizes and lifetimes, and with putting your files into the reserved spaces using the space token. Experiment with updating the reserved space after you put your files into it. Experiment with directory structure in your SURL.
Note: Remember to remove those files and directories that you created afterwards. Also remember to release those spaces that you reserved if still active.
This exercise provides hands-on experience with the tools used to set up and use the Grid Security Infrastructure (GSI) for working on the grid. The first few sections delve into certificates and proxies and demonstrate how pre-configured credentials can be used to run some grid-enabled programs. (more information)
In order to do things (like submit jobs or transfer data) on the grid, you need a grid proxy. A grid proxy contains everything necessary to authenticate you to grid resources. You will do more with grid security in the security lab, later on.
We have given each training account a proxy that will work on many grid systems. You can check this with the grid-proxy-info command.
workshop2.ci.uchicago.edu$ grid-proxy-info
subject : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564/CN=1558914057
issuer : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564
identity : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564
type : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u1048
timeleft : 98:42:47 (4.1 days)
Look at the timeleft field. This tells you how long this proxy will remain valid. Check that there is some time left on your proxy. (When this proxy has expired, you will no longer be able to use the grid, and you will have to get a new proxy.)
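If you want a script to make this check for you, the hours:minutes:seconds value can be converted to seconds and compared with a threshold. The sketch below works on a value copied from the output above; the function name timeleft_to_seconds and the one-hour threshold are assumptions made for this illustration.

```shell
# Convert an H:M:S string, as shown in the timeleft field, to seconds.
# awk reads each field as a decimal number, so values such as "08" are safe.
timeleft_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Value copied from the sample output above (without the "(4.1 days)" note).
seconds_left=$(timeleft_to_seconds "98:42:47")

min_seconds=3600   # hypothetical threshold: at least one hour remaining
if [ "$seconds_left" -ge "$min_seconds" ]; then
    proxy_ok=yes
else
    proxy_ok=no
fi
echo "seconds left: $seconds_left, enough time: $proxy_ok"
```

Pick a threshold that comfortably covers the longest job or transfer you plan to run, since work that outlives the proxy will stop being authorized.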
The security details necessary for you to access the grid are stored in a file called a 'grid proxy'. In normal grid usage, you would use your own credentials and a command called grid-proxy-init to make your own proxy. For this tutorial, the instructors made a proxy for you before the tutorial started.
Use grid-proxy-info to show information about your proxy.
Use the -all parameter to display all the information about your proxy:
[train99 ~]$ grid-proxy-info -all
subject : /O=Grid/OU=OSG/CN=Training User 99/CN=203360020
issuer : /O=Grid/OU=OSG/CN=Training User 99
identity : /O=Grid/OU=OSG/CN=Training User 99
type : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u539
timeleft : 11:58:58
Grid Proxy Details
subject - The distinguished name (DN) from the certificate, with a unique string of numbers appended.
issuer - The distinguished name of the user certificate itself.
path - The file system location where your proxy is stored.
timeleft - How much longer the proxy will be valid, in hours, minutes and seconds.
As you can see, the issuer of the proxy certificate is the user certificate. This shows the chain of trust: CA -> user certificate -> proxy certificate.
The proxy contains a private key generated for the proxy and the corresponding public key, and it is signed, like a certificate, by the user certificate.
Now list the contents of the proxy using grid-cert-info, specifying the full path to your proxy.
$ grid-cert-info -file /path/to/proxy/proxyFileName
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 203360020 (0xc1f0714)
Signature Algorithm: md5WithRSAEncryption
Issuer: C=US, O=SDSC, OU=SDSC, CN=Account Train31/UID=train31
Validity
Not Before: Jun 23 14:55:10 2006 GMT
Not After : Jun 24 03:00:10 2006 GMT
Subject: C=US, O=SDSC, OU=SDSC, CN=Account Train31/UID=train31, CN=203360020
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
RSA Public Key: (512 bit)
Modulus (512 bit):
00:b8:75:e3:a4:3c:31:9e:b9:71:e8:b0:4e:fc:18:
69:e6:79:15:90:f4:0f:49:20:f0:e3:62:9f:e2:92:
d0:96:4c:9b:b5:97:12:b3:bd:87:c7:8c:2f:bb:b0:
fe:79:8c:3d:61:5e:49:f6:c1:46:e1:1e:08:d1:d7:
89:a0:e3:8a:f3
Exponent: 65537 (0x10001)
X509v3 extensions:
1.3.6.1.4.1.3536.1.222: critical
0.0
..+.......
Signature Algorithm: md5WithRSAEncryption
45:05:52:c7:9f:a7:35:32:d9:a8:be:58:92:a7:b0:61:e4:7a:
2a:a2:36:0f:eb:65:0e:0f:ca:40:3d:0e:27:8b:38:14:a6:af:
51:7d:28:2f:ac:3e:3e:05:7b:ea:d6:0e:fc:78:7d:eb:60:80:
6a:74:43:64:ef:ca:e8:25:fe:d3:07:a9:4d:e0:54:4a:75:9f:
c9:8e:9a:1e:82:19:a4:fc:72:a3:6f:0d:de:33:57:d8:f8:cd:
da:d2:bc:8a:ee:48:34:4b:00:3e:7e:b7:5e:66:fa:2e:5c:22:
4a:50:98:02:32:c6:e3:a9:07:b7:bb:e6:4d:02:e8:6c:d4:48:
5e:55:ec:ed:a9:38:ee:b8:33:60:88:c1:ab:38:ce:d8:53:a3:
ac:c3:a2:c1:d8:1e:95:5b:e5:3a:3f:d1:e0:51:c2:5e:82:e0:
a4:48:d3:e6:82:66:56:d9:6b:e0:a5:1e:85:4d:3d:d7:e0:4e:
03:ce:f7:5a:63:cd:5c:9a:38:96:59:0f:92:11:6b:eb:ed:34:
1a:55:73:e1:c0:b0:91:ea:b4:1e:3b:8d:0f:2d:53:83:10:98:
44:19:ac:39:6d:1a:6b:37:90:60:6a:35:9b:c6:41:2e:5a:ef:
ae:54:6c:9e:51:b8:68:c2:97:83:2f:72:25:df:90:b9:bc:31:
92:23:45:77
[train99 ~]$
The contents are similar to your user certificate, but there are some differences; for example, the issuer is the DN of the user certificate, rather than of the certificate authority.
grid-cert-info is useful to see how long your proxy certificate will last (the Not Before and Not After lines under Validity).
Globus services (for example, GRAM and GridFTP) use a
grid mapfile located in /etc/grid-security/grid-mapfile on
each server.
This file has restricted write access, but the file can be read by anyone.
You can look at the grid mapfile on workshop2 like this:
[train99@workshop2.ci.uchicago.edu ~]$ cat /etc/grid-security/grid-mapfile
#
# Automatically generated by gx-gen-mapfile (gx-map 0.5.1)
# at Fri 2006-06-23 15:26:02 UTC on workshop2.ci.uchicago.edu.
# DO NOT EDIT THIS FILE. ANY CHANGES YOU MAKE WILL BE LOST ON THE NEXT UPDATE.
#
"/C=US/O=Globus Alliance/OU=User/CN=101497d3dcd.3dcd5aef" ranantha
"/C=US/O=Globus Alliance/OU=User/CN=10bd8f410f6.5f0086b4" benc
"/C=US/O=Globus Alliance/OU=User/CN=10bf234e01a.ac286cfa" ranantha
"/C=US/O=SDSC/OU=SDSC/CN=Account Train10/UID=train10" train10
"/C=US/O=SDSC/OU=SDSC/CN=Account Train11/UID=train11" train11
"/C=US/O=SDSC/OU=SDSC/CN=Account Train12/UID=train12" train12
...
"/C=US/O=SDSC/OU=SDSC/CN=Account Train58/UID=train58" train58
"/C=US/O=SDSC/OU=SDSC/CN=Account Train59/UID=train59" train59
"/C=US/O=SDSC/OU=SDSC/CN=Account Train60/UID=train60" train60
"/DC=org/DC=doegrids/OU=People/CN=Gaurang Mehta 998137" gmehta
Grid mapfiles can be created by system administrators by hand or using a number of tools. In this workshop, the grid mapfile is maintained by a tool called gx-map.
Only the listed DNs are allowed to access Globus services running on workshop2.
Each entry is a mapping from DN to username. For example, DN /O=Grid/OU=OSG/CN=Training User 99 is mapped to username train99.
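To see which username a particular DN maps to, you can search the mapfile from a script. The following is a minimal sketch under the assumption that every entry has the form shown in the listing above (a quoted DN followed by one username); the function name lookup_dn is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the username that a DN is mapped to.
# Arguments: DN, then the path to a grid mapfile.
# Returns a non-zero exit status if the DN is not listed.
lookup_dn() {
    awk -F'"' -v dn="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $2 == dn {                           # $2 is the text between the quotes
            user = $3
            gsub(/^[ \t]+|[ \t]+$/, "", user)
            print user
            found = 1
            exit
        }
        END { exit !found }
    ' "$2"
}

# Example with two entries copied from the listing above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# sample entries
"/C=US/O=SDSC/OU=SDSC/CN=Account Train10/UID=train10" train10
"/DC=org/DC=doegrids/OU=People/CN=Gaurang Mehta 998137" gmehta
EOF
lookup_dn "/DC=org/DC=doegrids/OU=People/CN=Gaurang Mehta 998137" "$demo"
rm -f "$demo"
```

On workshop2 you would pass /etc/grid-security/grid-mapfile as the second argument, together with the identity line reported by grid-proxy-info.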
So far you've run jobs on various grid sites, but they have all been sites specifically chosen and prepared by the instructors. Now we will try running on some other sites on the grid. There are a few issues to cover - some in this section and some in the next.
There are a number of machines that you can probably submit to: some on Open Science Grid, some on TeraGrid and some tutorial-specific machines. You need to use a different mechanism to discover the machines in each of these grids.
VORS (Virtual Organization Resource Selector) is one of the monitoring tools available on the Open Science Grid. It can be used to get a good view of the Open Science Grid. For instance, there is a map of sites in the OSGEDU VO.
You can use this to check the status of many OSG sites on this list, to find out which sites are working and which sites support the OSGEDU VO which you are part of.
Let's test a site, using GRAM2:
$ globus-job-run osg-edu.cs.wisc.edu /bin/date
Sun Jul 10 23:25:25 CDT 2005
Try to copy your application to the site.
If you plan to use any of the OSG sites, and you are authorized to do so (GRAM Authentication test successful), go to the VORS page again.
Pick a site (one that supports OSGEDU). Click on that site.
Under that listing, the entry labelled $APP location is the APPDIR you will be using.
The APPDIR is where you should copy your applications. (After making a separate directory, of course. You don't want your application to be messed up by other students, do you?)
Now you can go ahead and
Create your workspace in the APPDIR
"Stage-in" your application with globus-url-copy
Execute your application
Remember to replace SITE, APPDIR and YOURUSERNAME with values that are appropriate for you.
$ globus-job-run SITE /bin/mkdir APPDIR/YOURUSERNAME
$ globus-url-copy file://`pwd`/prime gsiftp://SITE/APPDIR/YOURUSERNAME/prime
$ globus-job-run SITE /bin/chmod +x APPDIR/YOURUSERNAME/prime
$ globus-job-run SITE APPDIR/YOURUSERNAME/prime 200 2 200
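If you repeat these steps for several sites, it can help to generate the command lines instead of typing them. The following is a minimal sketch that only prints the four commands so that you can review them first; the function name stage_cmds and the example path /path/to/app are made up for illustration and are not a real $APP location.

```shell
#!/bin/sh
# Hypothetical helper: print the mkdir, copy, chmod and run commands for one site.
# Arguments: SITE APPDIR YOURUSERNAME (the placeholders used in the text).
# Nothing is executed here; the commands are only written to standard output.
stage_cmds() {
    site=$1 appdir=$2 user=$3
    echo "globus-job-run $site /bin/mkdir $appdir/$user"
    echo "globus-url-copy file://$PWD/prime gsiftp://$site/$appdir/$user/prime"
    echo "globus-job-run $site /bin/chmod +x $appdir/$user/prime"
    echo "globus-job-run $site $appdir/$user/prime 200 2 200"
}

# Example: /path/to/app is a placeholder, not a real application directory.
stage_cmds osg-edu.cs.wisc.edu /path/to/app train99
```

Once the printed commands look right, you can run them by piping the output to the shell, for example stage_cmds SITE APPDIR YOURUSERNAME | sh -e, which stops at the first command that fails.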
Condor-G allows one to use Condor tools to submit jobs to Globus resources. You can think of it as a sophisticated globus-job-run. One of the useful features is the ability to submit and monitor multiple jobs to grid resources.
In the Condor world, one has to write a submission file that describes the application you are submitting. A sample submission file is below.
########################################
#
# A sample condor submission file
#
########################################
executable = prime
universe = vanilla
output = prime.out
error = prime.error
log = prime.log
arguments = 107 2 107
queue
Note the universe variable. If the universe is vanilla, the job is in fact executed on the submitting site itself. The Condor installation on workshop2.ci.uchicago.edu is configured to run jobs locally.
To submit jobs to a remote resource, the universe should be set to grid. Let's see a submission file for the grid universe.
########################################
#
# A sample Condor-G submission file
#
########################################
executable = APPDIR/YOURUSERNAME/prime
transfer_executable = false
universe = grid
grid_resource = gt2 SITE/jobmanager
log = prime.log
arguments = 100 2 100
output = prime.out
queue
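The placeholders in this file have to be replaced before you submit it. You can do that in a text editor, or with a small filter like the following sketch. The function name fill_placeholders and the example path /path/to/app are made up for illustration; the sketch assumes that none of the values contain the characters "|", "&" or "\", and that they do not themselves contain the placeholder words.

```shell
#!/bin/sh
# Hypothetical helper: replace the SITE, APPDIR and YOURUSERNAME placeholders
# in a submission file read from standard input.
# Arguments: site, application directory, username.
fill_placeholders() {
    sed -e "s|YOURUSERNAME|$3|g" \
        -e "s|APPDIR|$2|g" \
        -e "s|SITE|$1|g"
}

# Example on the two lines of the file that contain placeholders.
printf '%s\n' \
    'executable = APPDIR/YOURUSERNAME/prime' \
    'grid_resource = gt2 SITE/jobmanager' |
    fill_placeholders osg-edu.cs.wisc.edu /path/to/app train99
```

To fill in a whole file, redirect it through the function, for example fill_placeholders SITE APPDIR YOURUSERNAME < template.sub > example.sub, where template.sub is a hypothetical name for a copy of the listing above.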
Submit your prime application using this submission file to a site. You can monitor your application using condor_q.
$ condor_submit example.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 33.
$ condor_q
-- Submitter: workshop2.ci.uchicago.edu : <206.76.233.104:36236> : workshop2.ci.uchicago.edu
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 9.0 train99 7/10 23:39 0+00:00:00 I 0 0.0 prime
1 jobs; 1 idle, 0 running, 0 held
You can also use condor_q train99 to list only the jobs submitted by user train99.
You can try various options like -long and -globus with condor_q to see more details.
There are various ways to use Condor-G to submit multiple jobs to multiple sites:
Write multiple submission files and change the attributes manually or with a script. This is clumsy and difficult to manage.
Write a single common submission file and dynamically change only the attributes that need to be changed.
First, we must identify the attributes that need to be changed for different instantiations of the application:
range of divisors
site names
application directories
output file names
How do we do this? By passing parameters to condor_submit:
$ condor_submit -a "arguments = $num $start $end" -a "grid_resource = gt2 $site/jobmanager" ...
The strings that you specify with the -a option are added to the submission file you specify.
Write a common submission file and submit three instantiations of your prime application to three sites. Note that you have to use different output file names for each instantiation.
Create a submission file named example.sub with the following contents.
####################
#
# Submission file for prime number finder
#
####################
transfer_executable = false
universe = grid
log = prime.log
queue
Note that we do not specify the site and arguments.
Create a directory for the output files:
$ mkdir output
Submit a job using the submission file by passing arguments to condor_submit:
$ condor_submit -a "arguments = 1000 2 1000" -a "output = output/1.out" -a "grid_resource = gt2 ufgrid05.phys.ufl.edu/jobmanager" -a "executable = APPDIR/YOURUSERNAME/prime" example.sub
Submit multiple jobs to multiple sites. Note that you have to copy your executable to each site if it is not there already. Use VORS to find the $APP location for each site.
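One way to prepare the submissions for several sites is to generate the condor_submit command lines in a loop. The following is a minimal sketch that only prints the commands. The function name submit_cmds and the path /path/to/app are made up for illustration. It assumes that prime takes its arguments as "NUMBER START END" as in the examples above, that the range of divisors can simply be split evenly between the sites, and that every site uses the same APPDIR, which may not be true for real sites.

```shell
#!/bin/sh
# Hypothetical helper: split the divisor range 2..NUMBER evenly across the
# given sites and print one condor_submit command line per site.
# Arguments: NUMBER APPDIR YOURUSERNAME SITE...
# Assumes the same APPDIR on every site and at least one site.
submit_cmds() {
    num=$1 appdir=$2 user=$3
    shift 3
    if [ "$#" -eq 0 ]; then
        echo "submit_cmds: no sites given" >&2
        return 1
    fi
    chunk=$(( ($num - 1 + $# - 1) / $# ))   # divisors per site, rounded up
    start=2
    i=1
    for site in "$@"; do
        end=$(( $start + $chunk - 1 ))
        if [ "$end" -gt "$num" ]; then
            end=$num
        fi
        # Each job gets its own divisor range, site and output file.
        printf 'condor_submit -a "arguments = %s %s %s" -a "output = output/%s.out" -a "grid_resource = gt2 %s/jobmanager" -a "executable = %s/%s/prime" example.sub\n' \
            "$num" "$start" "$end" "$i" "$site" "$appdir" "$user"
        start=$(( $end + 1 ))
        i=$(( $i + 1 ))
    done
}

# Example with the two site names that appear in these notes.
submit_cmds 1000 /path/to/app train99 ufgrid05.phys.ufl.edu osg-edu.cs.wisc.edu
```

After checking the printed lines, you can run them by piping the output to sh. Because each job writes to a different file under output/, the results of the separate divisor ranges stay apart.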
A simple grep through all your output files should tell you whether the number is a prime or not.
$ grep NO output/*
These notes were produced from the Open Science Grid Education, Outreach and Training group SVN repository, at this location and revision:
Path: .
URL: https://svn.ci.uchicago.edu/svn/osgedu/schools/2008/clemson
Repository Root: https://svn.ci.uchicago.edu/svn/osgedu
Repository UUID: b4a0e4a1-be33-0410-93ba-8605a86001b8
Revision: 376
Node Kind: directory
Schedule: normal
Last Changed Author: benc
Last Changed Rev: 368
Last Changed Date: 2008-05-19 12:19:22 -0500 (Mon, 19 May 2008)