CI-Day Grid Computing Lab, Clemson University, May 19th 2008


Table of Contents

I. Introducing the Grid
II. Job submission with GRAM
III. Condor components for job management
IV. Data Management
V. Security and Certificates on the Grid
VI. Running applications on other Grid sites

Part I. Introducing the Grid

These exercises introduce you to some simple Grid activities. They will give you the necessary skills to begin using the grid for your own applications.

About these notes

These notes will guide you through a number of exercises at your own pace. You will be given commands to type, along with the expected output and notes highlighting the key points of each step.

There are lab assistants to help you with problems or to answer any questions that you have. Do not hesitate to talk to them.

These notes have transcripts from a machine called workshop2.ci.uchicago.edu. You might be using a different machine, in which case you should be careful to replace workshop2.ci.uchicago.edu with the name of the machine you are logged in to.

The exercise notes were prepared while logged in as user train99. The lab assistants will give you your own login name and number. Use that login name in place of train99 throughout the exercises.

You will see various styles of text in the tutorial notes.

Text like this represents output from your computer.

Text like this is input that you should type.

Text like this is a listing of the content of a file, such as a program
you will need to type in.

Note

Some notes are highlighted to draw your attention to them. You should pay special attention to text like this.

Caution

Warnings like this mark places where even more attention is required, because harmful mistakes often occur there.

Connecting to the Linux training hosts

You will be doing all the lab exercises on a set of Linux computers (or hosts) named workshop2 and osg-edu.cs.wisc.edu.

Each host has a fully qualified host name which uniquely identifies it on the internet; for example, workshop2.ci.uchicago.edu.

From these hosts, we will run Grid jobs locally and on real sites on the Open Science Grid.

To access workshop2 from your computer, use secure shell.

SSH from a Windows laptop

On a Windows machine, use the PuTTY program. Download and open PuTTY, then enter the hostname of the computer that you will use.

SSH from a Linux or Macintosh laptop

On Linux or a Mac, use a terminal window and the ssh command-line tool. Open a terminal and type:

$ ssh train99@workshop2.ci.uchicago.edu

Note

Make sure to replace the login name with your login name, as assigned by the instructors.

The authenticity of host 'workshop2.ci.uchicago.edu (1.1.1.1)' can't be established.
RSA key fingerprint is 36:74:78:a8:ed:6b:38:96:63:20:01:df:46:9b:59:3b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'workshop2.ci.uchicago.edu,1.1.1.1' (RSA) to the list of known hosts.
train99@workshop2.ci.uchicago.edu's password: PASSWORD # not echoed
workshop2$

After the first time you do this, you won't get the "Are you sure..." prompt. Some of you will never see this, as your computers were used for testing this material, and the "yes" reply was already supplied by a tester. So it will look like:

$ ssh train99@workshop2.ci.uchicago.edu
Password: PASSWORD 
workshop2$

You should be able to reach the other lab host osg-edu.cs.wisc.edu in this way too. Another machine available to you is workshop4.ci.uchicago.edu.

Cut-and-paste practice

To start, practice copying text from this page and pasting it into your terminal window to run a command or set of commands. Copy the pwd command from the box below and paste it into your terminal window to execute it. This is a good way to avoid typing mistakes while entering commands, but make sure to read the command and check that you have replaced any necessary parameters, such as your user name.

Some suggestions for editors (when copy and paste will no longer be sufficient): vi, pico, nano, emacs. Feel free to choose any of these.

$ pwd
/home/train99

Part II. Job submission with GRAM

Running Grid Jobs with Globus Commands

Now you should be able to run some execution jobs on the hosts in the lab.

First we'll try a simple 'Hello World' job:

workshop2$ globus-job-run localhost /bin/echo Hello World
Hello World

You've just submitted a job (the Linux command echo) to run on workshop2.ci.uchicago.edu. This is a simple building block for grid execution.

The globus-job-run utility runs commands on remote sites. You must give this command several pieces of information:

  1. The name of the host on which to run the job. In this example, we specified 'localhost', meaning the host you are using.

  2. The name of the command to execute remotely. This must be a fully qualified path name (i.e., it must start with a "/"). In this example, we specified '/bin/echo'.

  3. Parameters to pass to the command. In this example, we specify a message for echo, the text 'Hello World'.

Now we will run the Linux command hostname on the remote site to verify that we're talking to the resource we think we are.

  1. Run it locally to make sure you are invoking it correctly.

    $ hostname
    workshop2
    
  2. Use the command which to discover the location of the version of hostname that you are using. It will return a fully-qualified path name.

    workshop2$ which hostname
    /bin/hostname
    
  3. This tells you that to run hostname via globus-job-run, use /bin/hostname.

  4. Use which to discover the location of the following commands on the system:

    • id
    • env
    • ps
    • uptime
  5. Now run hostname remotely, on osg-edu.cs.wisc.edu, to verify that you really are reaching a remote system:

    workshop2$ globus-job-run osg-edu.cs.wisc.edu /bin/hostname
    osg-edu.cs.wisc.edu
    

Next, see what else you can learn about the remote system with this approach.

  1. Discover what user ID your job ran under using id.

  2. Discover what environment variables are set using env.

  3. Discover the load on the remote Grid server using uptime.

  4. Discover the default working directory in which your remote job will run using pwd.

    1. Do an ls of this working directory.

    2. Use df to discover how much storage space exists in this working directory.

  5. Use df to discover how much storage space exists in the remote /tmp directory.
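
The exercises above all repeat one pattern: find a command's fully qualified path, then pass that path to globus-job-run. The following is a minimal shell sketch of that pattern. The helper name grid_cmd is hypothetical, and it only prints the command line instead of running it, so you can inspect the result before pasting it into your terminal. It resolves the path on the local host with the POSIX command -v (similar to which), which assumes the remote host keeps the tool in the same location.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a globus-job-run command line.
# The tool's absolute path is resolved on the LOCAL host with command -v;
# this assumes the remote host installs the tool at the same path.
grid_cmd() {
    host=$1
    tool=$2
    shift 2
    path=$(command -v "$tool" || true)
    case $path in
        /*) ;;    # fully qualified path: acceptable to globus-job-run
        *)  echo "no absolute path for: $tool" >&2; return 1 ;;
    esac
    # Append any remaining arguments (for example, a directory for df).
    echo "globus-job-run $host $path${*:+ $*}"
}

grid_cmd osg-edu.cs.wisc.edu id
grid_cmd osg-edu.cs.wisc.edu df /tmp
```

Shell built-ins such as pwd may not resolve to a path this way; for those, use which as described above.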

Immediate and Batch Job Managers

GRAM, the Globus component for running remote jobs, supports the concept of a job manager as an adapter to Local Resource Managers (LRMs). Each site can support one or more such job managers. Our lab systems have two: the fork job manager runs a job immediately, and the Condor job manager submits jobs into the Condor batch scheduling system.

Now we will investigate some of the differences between the fork and Condor jobmanagers. Which do you think will be faster? Use the command time to test which jobmanager is faster.

The "fork" job manager is very fast: it has low scheduling latency and runs trivial commands very quickly. But it also has very little compute power; it's usually just a single CPU on a front-end computer called the head node. A batch job manager, on the other hand, has a higher scheduling overhead, but usually gives you access to all the computers in a cluster and therefore to a lot more compute power.

Our lab hosts use the Condor LRM. Other sites sometimes use other LRMs; for example, the Portable Batch System (PBS) is very common. To submit a job to a site using PBS, you must specify jobmanager-pbs.

Now try a job through Condor on a different machine:

workshop2$ globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname

To time a command, enter time commandname:

workshop2$ time sleep 3
real    0m3.007s
user    0m0.004s
sys     0m0.000s

Use this to time a few trivial Grid jobs to compare Fork and Condor:

workshop2$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
workshop2

real    0m10.678s
user    0m0.090s
sys     0m0.030s

workshop2$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-fork /bin/hostname
workshop2

real    0m0.488s
user    0m0.090s
sys     0m0.020s

Introduction to the example prime number finding application

Throughout this tutorial we will use a simple application that tests whether a number is prime. (For background on prime numbers, see Wikipedia or The Prime Pages.)

The prime testing application is invoked using the primetest command. In its simplest use, it takes a single parameter: the number to test.

For example, this command will test the number 122 for primality.

$ primetest 122
NO - 2 is a factor
  

As you can see, the number is not prime, because the application determined that 2 is a factor.

Timing the prime application

This algorithm can take some time. Use time to measure how long the command takes. This time will vary a lot, depending on several factors: for example, the size of the number, the structure of its factors, and how many other people are running jobs on the same computer.

Measure how long it takes to test the integers 3, 524287, 524288, and 1500450271.

$  time primetest 524287

Running the prime test with multiple invocations

The prime testing application uses a simple algorithm to test for primes: each possible factor below the target number is tested. So when primetest is run with input 122, it tests every number between 2 and 121 in sequence. This is a very simple example of a parameter sweep.

We can split this sweep into several separate pieces, and combine the results:

$ primetest 122 2 50
NO - 2 is a factor
$ primetest 122 51 121
NO - 61 is a factor
  

The first run tested potential factors between 2 and 50, and the second run tested potential factors between 51 and 121.

We can combine the results from several runs of primetest as follows: A number is prime if no run of primetest finds a factor. A number is not prime if any run of primetest finds a factor. So in the case of 122, we can see that 122 is not prime because at least one of the pieces returned 'NO'.
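
The combining rule can be sketched in plain shell. The sweep function below is a hypothetical stand-in for one piece of the parameter sweep, not the lab's primetest binary; it only imitates the 'NO - ... is a factor' message shown above. The is_prime_split function (also a hypothetical name) runs two pieces and declares the number prime only if neither reports a factor. Keep the numbers small here, since a shell loop is slow.

```shell
#!/bin/sh
# Hypothetical stand-in for one piece of the sweep (not the lab's primetest
# binary): try candidate factors LO..HI of N and report the first one found.
sweep() {
    n=$1; f=$2; hi=$3
    if [ "$hi" -ge "$n" ]; then
        hi=$((n - 1))            # never try N itself as a factor
    fi
    while [ "$f" -le "$hi" ]; do
        if [ $((n % f)) -eq 0 ]; then
            echo "NO - $f is a factor"
            return 0
        fi
        f=$((f + 1))
    done
}

# Run the sweep as two pieces split at SPLIT and combine the results:
# N (assumed to be 2 or larger) is prime only if no piece reports a factor.
is_prime_split() {
    pieces=$(sweep "$1" 2 "$2"; sweep "$1" $(($2 + 1)) $(($1 - 1)))
    if [ -z "$pieces" ]; then
        echo "$1 is prime"
    else
        echo "$1 is not prime"
        echo "$pieces"
    fi
}

is_prime_split 122 50
is_prime_split 127 50
```

For 122 split at 50, the two pieces report the factors 2 and 61, matching the two primetest runs above. On the Grid, each piece would be a separate job and only the combining step would run locally.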

Now test some of the same integers: 3, 524287, 524288, 1500450271, with the sweep divided into several sections.

Running the prime application remotely

This prime application is installed at several other sites on the grid. We can use GRAM to execute that application remotely at one of these sites.

Run primetest on osg-edu.cs.wisc.edu using GRAM:

workshop2$ globus-job-run osg-edu.cs.wisc.edu \
  /nfs/osgedu/primetest 143

Can you run the same application on another Grid site? Which of the steps above do you need to repeat? What do you need to change to make them work with the other site? What extra information do you need? You can ask the lab instructors for that extra information.

Part III. Condor components for job management

Earlier, we learned about Condor-G and DAGMan. Now we will submit some simple jobs using these components.

Getting Set Up

  1. Check the Condor queue with condor_q

    $ condor_q
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:36236> : workshop2.ci.uchicago.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    
    0 jobs; 0 idle, 0 running, 0 held

    This command lists everything that Condor has been asked to run. Everyone will be using the same Condor installation for these exercises, so you will often see other students' jobs in the queue alongside your own.

  2. Create Your Working Directories

    Next, create some directories for you to work in. Make them in your home directory:

    $ cd ~
    $ mkdir condor-tutorial
    $ cd condor-tutorial
    $ mkdir submit

Submit a Simple Grid Job with Condor-G

Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.

There are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer workshop2.ci.uchicago.edu and running under the "jobmanager-fork" job manager. We're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer.

For more information, see the condor_submit manual.

Create the Submit File

Move to our scratch submission directory and create the submit file. Verify that it was entered correctly:

$ cd ~/condor-tutorial/submit
USE YOUR FAVOURITE TEXT EDITOR TO ENTER THE FILE CONTENT
$ cat myjob.submit
executable=/sw/national_grids/primetest
arguments=143
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 workshop2.ci.uchicago.edu/jobmanager-fork
queue

Submit your test job to Condor-G

$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

Run condor_q to see the progress of your job. You can also run condor_q -globus to see Globus-specific status information. (See the condor_q manual for more information.)

$ condor_q

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
   ID    OWNER               SUBMITTED     RUN_TIME  ST PRI SIZE CMD
   1.0   train99         7/10 17:28   0+00:00:00 I  0   0.0  primetest 143

1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus


-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER             STATUS   MANAGER  HOST                EXECUTABLE
   1.0   train99       UNSUBMITTED fork  workshop2.ci.uchicago.edu   /sw/national_grids

Monitoring Progress with tail

In another window, run tail -f on the log file for your job to monitor progress. Re-run tail when you submit one or more jobs throughout this tutorial. You will see how typical Condor-G jobs progress. Use Ctrl+C to stop watching the file.

$ cd ~/condor-tutorial/submit
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <1.1.1.1:35688>
...
017 (001.000.000) 03/24 19:13:30 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:34127/28997/1174763610/
    Can-Restart-JM: 1
...
027 (001.000.000) 07/10 17:29:01 Job submitted to grid resource
    GridResource: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
    GridJobId: gt2 workshop2.ci.uchicago.edu/jobmanager-fork https://workshop2.ci.uchicago.edu:51277/31413/1174756212/
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

Verifying completed jobs

When the job is no longer listed in condor_q, or when the log file reports Job terminated, the results can be viewed using condor_history:

$ condor_history
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   train99         7/10 10:28   0+00:00:00 C   ???        /home/train99/cond

When the job completes, verify that the output is as expected. The binary name is different from what you created because of how Globus and Condor-G cooperate to stage your file to the execute computer.

$ ls
myjob.submit  myscript.sh*  results.error  results.log   results.output
$ cat results.error
$ cat results.output 
NO - 11 is a factor

If you didn't watch results.log with tail -f, you will want to examine the logged information with cat results.log.
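
Instead of reading the whole log, you can check it for the termination event. The sketch below assumes the log format shown in the transcript above, where a finished job produces an event line starting with 005 and a line giving the return value. The function names job_terminated and job_return_value are hypothetical, and the sketch assumes the log holds a single job. It demonstrates itself on a small sample file rather than on a real results.log.

```shell
#!/bin/sh
# Succeed if a Condor user log contains a termination event
# (event code 005, as in the transcript above).
job_terminated() {
    grep -q '^005 .*Job terminated' "$1"
}

# Print the return value recorded for a normal termination, if any.
job_return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1"
}

# Demonstration on a short sample in the same format as results.log.
log=$(mktemp)
cat > "$log" <<'EOF'
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <1.1.1.1:35688>
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
...
EOF

if job_terminated "$log"; then
    echo "finished, return value $(job_return_value "$log")"
fi
rm -f "$log"
```

To use it on your own job, call job_terminated results.log from the submit directory.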

Submitting a job to other hosts

  1. Create a new submit file:

    $ cat > myjob2.submit
    executable=/nfs/osgedu/primetest
    arguments=143
    output=results2.output
    error=results2.error
    log=results2.log
    notification=never
    universe=grid
    grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
    queue
    Ctrl+D
    $ cat myjob2.submit
    executable=/nfs/osgedu/primetest
    arguments=143
    output=results2.output
    error=results2.error
    log=results2.log
    notification=never
    universe=grid
    grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
    queue

    Notice that the setting for the grid_resource now refers to condor instead of fork. Globus will submit the job to Condor on osg-edu.cs.wisc.edu instead of running the job directly.

  2. Submit the job to Condor-G:

    $ condor_submit myjob2.submit
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 2.

    You can monitor the job's progress just like the first job. If you log into osg-edu.cs.wisc.edu in another window, you can see your job in the Condor queue there. Be quick, or the job will finish before you look!

    $ ssh osg-edu.cs.wisc.edu
    train99@osg-edu.cs.wisc.edu's password: 
    $ condor_status
    
    Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
    
    vm1@clu1.phys LINUX       INTEL  Unclaimed  Idle       0.000     9  0+00:03:34
    vm2@clu1.phys LINUX       INTEL  Unclaimed  Idle       0.000     9  0+00:03:32
    
                         Machines Owner Claimed Unclaimed Matched Preempting
    
             INTEL/LINUX      100     0       0       100       0          0
    
                   Total      100     0       0       100       0          0
    $ condor_q
    
    -- Submitter: osg-edu.cs.wisc.edu : <1.1.1.1:36311> : osg-edu.cs.wisc.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
      11.0   train99         7/10 23:04   0+00:00:00 I  0   0.0  data TestJob 10
    
    1 jobs; 1 idle, 0 running, 0 held
  3. Clean up the results after the second job has finished running:

    $ rm results.* results2.*
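
The two submit files used so far differ only in the executable path, the output file names and the grid_resource line. As a sketch of that observation, the hypothetical helper make_submit below prints a submit description for a given host, job manager, executable and output prefix. It writes to standard output, so nothing is submitted; redirect it to a file if you want to pass the result to condor_submit.

```shell
#!/bin/sh
# Hypothetical helper: print a Condor-G submit description like the ones
# above. Arguments: host, job manager, executable path, output file prefix.
make_submit() {
    cat <<EOF
executable=$3
arguments=143
output=$4.output
error=$4.error
log=$4.log
notification=never
universe=grid
grid_resource=gt2 $1/jobmanager-$2
queue
EOF
}

# Reproduces the content of myjob2.submit from step 1.
make_submit osg-edu.cs.wisc.edu condor /nfs/osgedu/primetest results2
```

Calling it with workshop2.ci.uchicago.edu, fork, /sw/national_grids/primetest and results gives the content of the first submit file, myjob.submit.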

A Simple DAG

Now we'll use DAGMan, a tool that will help us run several grid jobs at once.

  1. Create a small shell script to monitor the Condor-G queue. We will use this throughout the rest of the tutorial:

    $ cat > watch_condor_q
    #! /bin/sh
    while true; do
         condor_q train99
         condor_q -globus train99
         sleep 10
    done
    Ctrl+D
    $ cat watch_condor_q
    #! /bin/sh
    while true; do
         condor_q train99
         condor_q -globus train99
         sleep 10
    done
    $ chmod a+x watch_condor_q 
    

  2. Create a minimal DAG for DAGMan. This DAG will have a single node.

    $ cat > mydag.dag
    Job HelloWorld myjob.submit
    Ctrl+D
    $ cat mydag.dag
    Job HelloWorld myjob.submit

  3. Submit the DAG.

    This section requires you to have three windows open. We will submit the DAG in the first window and watch the progress of it and the job in the other two. We will do these in the following order:

    1. In the first window, submit the DAG and then watch condor with watch_condor_q.

    2. In the second window, tail the results log.

    3. In the third window, tail the DAGMan log.

    Submit the DAG with condor_submit_dag and watch the run with watch_condor_q. condor_dagman is running as a job and submits your real job on your behalf, without your direct intervention. You might see the C (completed) state as your job finishes, but that often goes by too quickly to notice.

    $ condor_submit_dag mydag.dag
    
    Checking your DAG input file and all submit files it references.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor   : mydag.dag.condor.sub
    Log of DAGMan debugging messages         : mydag.dag.dagman.out
    Log of Condor library debug messages     : mydag.dag.lib.out
    Log of the life of condor_dagman itself  : mydag.dag.dagman.log
    
    Condor Log file for all jobs of this DAG : results.log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 2.
    -----------------------------------------------------------------------
    $ ./watch_condor_q 
  4. In the second window, watch the job log file as your job runs:

    $ tail -f --lines=500 results.log

  5. In the third window, watch DAGMan's log file by running tail -f --lines=500 mydag.dag.dagman.out. We suggest that you re-run this command whenever you submit a DAG during the remainder of this tutorial. This will show you how a typical DAG progresses. Use Ctrl+C to stop watching the file. An example is shown below:

    $ cd ~/condor-tutorial/submit
    $ tail -f --lines=500 mydag.dag.dagman.out
    [...]
    11/10 01:06:54 Of 1 nodes total:
    11/10 01:06:54  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    11/10 01:06:54   ===     ===      ===     ===     ===        ===      ===
    11/10 01:06:54     1       0        0       0       0          0        0
    11/10 01:06:54 All jobs Completed!
    11/10 01:06:54 Note: 0 total job deferrals because of -MaxJobs limit (0)
    11/10 01:06:54 Note: 0 total job deferrals because of -MaxIdle limit (0)
    11/10 01:06:54 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
    11/10 01:06:54 Note: 0 total POST script deferrals because of -MaxPost limit (0)
    11/10 01:06:54 **** condor_scheduniv_exec.1474.0 (condor_DAGMAN) EXITING WITH STATUS 0
    

    The first window, running watch_condor_q, should look something like the following:

    $ ./watch_condor_q 
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:00:03 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:00 I  0   0.0  myscript.sh TestJo
    
    2 jobs; 1 idle, 1 running, 0 held
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       UNSUBMITTED fork     workshop2.ci.uchicago.edu   /tmp/train99-cond
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:00:33 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:15 R  0   0.0  myscript.sh TestJo
    
    2 jobs; 0 idle, 2 running, 0 held
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       ACTIVE fork     workshop2.ci.uchicago.edu   /home/train99/cond
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:01:03 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:45 R  0   0.0  myscript.sh TestJo
    
    2 jobs; 0 idle, 2 running, 0 held
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       ACTIVE fork     workshop2.ci.uchicago.edu   /tmp/train99-cond
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    
    0 jobs; 0 idle, 0 running, 0 held
    
    
    -- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
    
    
    Ctrl+C

  6. Verify your results:

    $ ls -l
    total 12
    -rw-r--r--    1 train99  train99        28 Jul 10 10:35 mydag.dag
    -rw-r--r--    1 train99  train99       523 Jul 10 10:36 mydag.dag.condor.sub
    -rw-r--r--    1 train99  train99       608 Jul 10 10:38 mydag.dag.dagman.log
    -rw-r--r--    1 train99  train99      1860 Jul 10 10:38 mydag.dag.dagman.out
    -rw-r--r--    1 train99  train99        29 Jul 10 10:38 mydag.dag.lib.out
    -rw-------    1 train99  train99         0 Jul 10 10:36 mydag.dag.lock
    -rw-r--r--    1 train99  train99       175 Jul  9 18:13 myjob.submit
    -rwxr-xr-x    1 train99  train99       194 Jul 10 10:36 myscript.sh
    -rw-r--r--    1 train99  train99        31 Jul 10 10:37 results.error
    -rw-------    1 train99  train99       833 Jul 10 10:38 results.log
    -rw-r--r--    1 train99  train99       261 Jul 10 10:37 results.output
    -rwxr-xr-x    1 train99  train99        81 Jul 10 10:35 watch_condor_q
    $ cat results.error 
    $ cat results.output 
    NO - 11 is a factor
    

    Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job.

    $ ls
    mydag.dag         mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
    mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
    $ cat mydag.dag.condor.sub
    # Filename: mydag.dag.condor.sub
    # Generated by condor_submit_dag mydag.dag
    universe   = scheduler
    executable   = /path/to/condor/bin/condor_dagman
    getenv      = True
    output      = mydag.dag.lib.out
    error      = mydag.dag.lib.out
    log      = mydag.dag.dagman.log
    remove_kill_sig   = SIGUSR1
    arguments   = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
    environment   = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
    queue
    $ cat mydag.dag.dagman.log
    000 (006.000.000) 07/10 10:36:43 Job submitted from host: <1.1.1.1:33785>
    ...
    001 (006.000.000) 07/10 10:36:44 Job executing on host: <1.1.1.1:33785>
    
    ...
    005 (006.000.000) 07/10 10:38:10 Job terminated.
       (1) Normal termination (return value 0)
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
    ...

    If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:

    $ cat mydag.dag.dagman.out

  7. Clean up your results. When deleting mydag.dag.*, be careful not to delete mydag.dag itself. Note the .*!

    $ rm mydag.dag.* results.*

Running a job with a more complex DAG

Typically each node in a DAG will have its own Condor submit file. Create some more submit files by copying our existing file. For simplicity during this tutorial, we'll keep the submit files very similar, notably using the same executable. In real-world use, your submit files and executables can differ.

$ cp myjob.submit job.setup.submit
$ cp myjob.submit job.work1.submit
$ cp myjob.submit job.work2.submit
$ cp myjob.submit job.workfinal.submit
$ cp myjob.submit job.finalize.submit

Edit the various submit files.

Change the output and error entries to point to results.NODE.output and results.NODE.error files, where NODE is the middle word in the submit file name (job.NODE.submit). So job.finalize.submit would include:

output=results.finalize.output
error=results.finalize.error

Here is one possible set of settings for the output entries:

$ grep '^output=' job.*.submit
job.finalize.submit:output=results.finalize.output
job.setup.submit:output=results.setup.output
job.work1.submit:output=results.work1.output
job.work2.submit:output=results.work2.output
job.workfinal.submit:output=results.workfinal.output

This prevents the various nodes from overwriting each other's output.

Do not change the log entries. DAGMan requires that all nodes output their logs in the same location. Condor will ensure that the different jobs will not overwrite each other's entries in the log.

Change the arguments entries so that the first argument is something unique to each node (perhaps the NODE name).

For node work2, change the second argument to 120 so that it looks something like arguments=MyWorkerNode2 120
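
If you prefer not to edit five files by hand, the output and error changes can be scripted. The sketch below uses sed to derive each job.NODE.submit from myjob.submit, leaving the log entry untouched as required above. The function name make_node_submit is hypothetical, and the demonstration runs in a scratch directory with its own copy of myjob.submit so that it cannot overwrite your real files. It does not change the arguments entries; those still need the edits described above.

```shell
#!/bin/sh
# Sketch: derive job.NODE.submit from myjob.submit in the current directory,
# rewriting only the output= and error= entries. log= is left unchanged.
make_node_submit() {
    node=$1
    sed -e "s|^output=.*|output=results.$node.output|" \
        -e "s|^error=.*|error=results.$node.error|" \
        myjob.submit > "job.$node.submit"
}

# Demonstration in a scratch directory with a copy of the tutorial's file.
cd "$(mktemp -d)" || exit 1
cat > myjob.submit <<'EOF'
executable=/sw/national_grids/primetest
arguments=143
output=results.output
error=results.error
log=results.log
notification=never
universe=grid
grid_resource=gt2 workshop2.ci.uchicago.edu/jobmanager-fork
queue
EOF

for name in setup work1 work2 workfinal finalize; do
    make_node_submit "$name"
done
grep '^output=' job.*.submit
```

The final grep is the same check used below to confirm the output entries.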

Add the new nodes to your DAG:

$ cat mydag.dag 
Job HelloWorld myjob.submit
$ cat >> mydag.dag
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
Ctrl+D
$ cat mydag.dag
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
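
To check the dependencies before submitting, you can ask for one valid run order. The sketch below is not part of Condor: it uses awk to turn each PARENT ... CHILD ... line into parent/child pairs and feeds them to the standard tsort utility, which prints the nodes so that every parent appears before its children. The function name dag_order is hypothetical, and the sketch only understands the Job and PARENT/CHILD lines used in this tutorial.

```shell
#!/bin/sh
# Sketch: print one valid run order for the nodes of a DAGMan input file.
dag_order() {
    awk '
        toupper($1) == "JOB" { print $2, $2 }   # identical pair: node exists
        toupper($1) == "PARENT" {
            sep = 0
            for (i = 2; i <= NF; i++)
                if (toupper($i) == "CHILD") sep = i
            # every parent before CHILD must precede every child after it
            for (p = 2; p < sep; p++)
                for (c = sep + 1; c <= NF; c++)
                    print $p, $c
        }
    ' "$1" | tsort
}

# Demonstration on a copy of the DAG shown above.
dag=$(mktemp)
cat > "$dag" <<'EOF'
Job HelloWorld myjob.submit
Job Setup job.setup.submit
Job WorkerNode_1 job.work1.submit
Job WorkerNode_Two job.work2.submit
Job CollectResults job.workfinal.submit
Job LastNode job.finalize.submit
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
EOF
dag_order "$dag"
rm -f "$dag"
```

Several orders are valid for this DAG, since HelloWorld has no dependencies and the two worker nodes are independent of each other. DAGMan itself may run such nodes at the same time.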

Change watch_condor_q script

condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:

$ rm watch_condor_q
$ cat > watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
    condor_q train99
    echo ....
    echo .... Output from condor_q -globus
    echo ....
    condor_q -globus train99
    echo ....
    echo .... Output from condor_q -dag
    echo ....
    condor_q -dag train99
    sleep 10
done
Ctrl+D
$ chmod a+x watch_condor_q 

Submit your new DAG and monitor it.

In separate windows, run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.

$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while... 
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log

Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
-----------------------------------------------------------------------
$ ./watch_condor_q

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:08 R  0   2.6  condor_dagman -f -
   5.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup 

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   5.0   train99       UNSUBMITTED fork     workshop2.ci.uchicago.edu   /tmp/username-cond
   6.0   train99       UNSUBMITTED fork     workshop2.ci.uchicago.edu   /tmp/username-cond

[...]

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:03:13 R  0   2.6  condor_dagman -f -
   8.0    |-WorkerNode_  7/10 17:46   0+00:01:28 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held

[...]

-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        


-- Submitter: workshop2.ci.uchicago.edu : <1.1.1.1:35688> : workshop2.ci.uchicago.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Ctrl+C

Watching the logs or the condor_q output, you'll note that the CollectResults node (workfinal) wasn't run until both of the WorkerNode nodes (work1 and work2) finished.

Examine your results

$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh           results.setup.error   results.workfinal.error
job.setup.submit      mydag.dag.dagman.log  results.error        results.setup.output  results.workfinal.output
job.work1.submit      mydag.dag.dagman.out  results.finalize.error   results.work1.error   watch_condor_q
job.work2.submit      mydag.dag.lib.out     results.finalize.output  results.work1.output
job.workfinal.submit  mydag.dag.lock       results.log           results.work2.error
mydag.dag         myjob.submit       results.output        results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output
==> results.finalize.output <==
I'm process id 29614 on workshop2.ci.uchicago.edu
Thu Jul 10 10:53:58 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.setup.output <==
I'm process id 29337 on workshop2.ci.uchicago.edu
Thu Jul 10 10:50:31 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.work1.output <==
I'm process id 29444 on workshop2.ci.uchicago.edu
Thu Jul 10 10:51:04 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.work2.output <==
I'm process id 29432 on workshop2.ci.uchicago.edu
Thu Jul 10 10:51:03 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished.  Exiting

==> results.workfinal.output <==
I'm process id 29554 on workshop2.ci.uchicago.edu
Thu Jul 10 10:53:27 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

Examine your log

$ cat results.log
000 (005.000.000) 07/10 17:45:24 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: HelloWorld
...
000 (006.000.000) 07/10 17:45:24 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: Setup
...
017 (006.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: gk2:/jobmanager-fork
    JM-Contact: https://gk2:2349/914/1057877133/
    Can-Restart-JM: 1
...
001 (006.000.000) 07/10 17:45:42 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...

017 (005.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:2348/915/1057877133/
    Can-Restart-JM: 1
...
001 (005.000.000) 07/10 17:45:42 Job executing on host: gk2
...
005 (005.000.000) 07/10 17:46:50 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
005 (006.000.000) 07/10 17:46:50 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (007.000.000) 07/10 17:46:55 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: WorkerNode_1
...
000 (008.000.000) 07/10 17:46:56 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: WorkerNode_Two
...
017 (008.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:2364/1037/1057877219/
    Can-Restart-JM: 1
...
001 (008.000.000) 07/10 17:47:09 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
017 (007.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:2367/1040/1057877220/
    Can-Restart-JM: 1
...
001 (007.000.000) 07/10 17:47:09 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (007.000.000) 07/10 17:48:17 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
005 (008.000.000) 07/10 17:49:18 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (009.000.000) 07/10 17:49:22 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: CollectResults
...
017 (009.000.000) 07/10 17:49:35 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:2383/1185/1057877366/
    Can-Restart-JM: 1
...
001 (009.000.000) 07/10 17:49:35 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (009.000.000) 07/10 17:50:42 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <workshop2.ci.uchicago.edu:35688>
    DAG Node: LastNode
...
017 (010.000.000) 07/10 17:50:55 Job submitted to Globus
    RM-Contact: workshop2.ci.uchicago.edu:/jobmanager-fork
    JM-Contact: https://workshop2.ci.uchicago.edu:2392/1247/1057877446/
    Can-Restart-JM: 1
...
001 (010.000.000) 07/10 17:50:55 Job executing on host: gt2 workshop2.ci.uchicago.edu/jobmanager-fork
...
005 (010.000.000) 07/10 17:52:02 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
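
To confirm the ordering from the log rather than by eye, you can reduce results.log to its submit (000) and terminate (005) events. The following is only a minimal sketch: summarize_log is a made-up helper name, not a Condor tool, and it assumes the log layout shown above, where each event line starts with a three-digit event code followed by the job ID in parentheses.

```shell
# Hypothetical helper (not a Condor tool): print one line per submit (000)
# or terminate (005) event found in a Condor user log such as results.log.
summarize_log() {
    awk '$1 == "000" || $1 == "005" {
        cluster = substr($2, 2, 3) + 0       # "(009.000.000)" -> 9
        event = ($1 == "000") ? "submitted" : "terminated"
        print $4, "cluster", cluster, event  # $4 is the HH:MM:SS timestamp
    }' "$1"
}
```

Run it as summarize_log results.log. In the log above, cluster 9 (CollectResults) is submitted at 17:49:22, after clusters 7 and 8 (the two WorkerNode jobs) terminated at 17:48:17 and 17:49:18.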

Examine the DAGMan log

$ cat mydag.dag.dagman.out
7/10 17:45:24 ******************************************************
7/10 17:45:24 ** condor_scheduniv_exec.4.0 (CONDOR_DAGMAN) STARTING UP
7/10 17:45:24 ** $CondorVersion: 6.8.4 Apr 22 2006 $
7/10 17:45:24 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 17:45:24 ** PID = 18826
7/10 17:45:24 ******************************************************
7/10 17:45:24 DaemonCore: Command Socket at <workshop2.ci.uchicago.edu:35774>
7/10 17:45:24 argv[0] == "condor_scheduniv_exec.4.0"
7/10 17:45:24 argv[1] == "-Debug"
7/10 17:45:24 argv[2] == "3"
7/10 17:45:24 argv[3] == "-Lockfile"
7/10 17:45:24 argv[4] == "mydag.dag.lock"
7/10 17:45:24 argv[5] == "-Condorlog"
7/10 17:45:24 argv[6] == "results.log"
7/10 17:45:24 argv[7] == "-Dag"
7/10 17:45:24 argv[8] == "mydag.dag"
7/10 17:45:24 argv[9] == "-Rescue"
7/10 17:45:24 argv[10] == "mydag.dag.rescue"
7/10 17:45:24 Condor log will be written to results.log
7/10 17:45:24 DAG Lockfile will be written to mydag.dag.lock
7/10 17:45:24 DAG Input file is mydag.dag
7/10 17:45:24 Rescue DAG will be written to mydag.dag.rescue
7/10 17:45:24 Parsing mydag.dag ...
7/10 17:45:24 Dag contains 6 total jobs
7/10 17:45:24 Bootstrapping...
7/10 17:45:24 Number of pre-completed jobs: 0
7/10 17:45:24 Submitting Job HelloWorld ...
7/10 17:45:24    assigned Condor ID (5.0.0)
7/10 17:45:24 Submitting Job Setup ...
7/10 17:45:24    assigned Condor ID (6.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job Setup (6.0.0)
7/10 17:45:25 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job HelloWorld (5.0.0)
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job HelloWorld (5.0.0)
7/10 17:46:55 Job HelloWorld completed successfully.
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job Setup (6.0.0)
7/10 17:46:55 Job Setup completed successfully.
7/10 17:46:55 Submitting Job WorkerNode_1 ...
7/10 17:46:55    assigned Condor ID (7.0.0)
7/10 17:46:55 Submitting Job WorkerNode_Two ...
7/10 17:46:56    assigned Condor ID (8.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:46:56 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Job WorkerNode_1 completed successfully.
7/10 17:48:21 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (8.0.0)
7/10 17:49:21 Job WorkerNode_Two completed successfully.
7/10 17:49:21 Submitting Job CollectResults ...
7/10 17:49:22    assigned Condor ID (9.0.0)
7/10 17:49:22 Event: ULOG_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:22 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:37 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:37 Event: ULOG_EXECUTE for Job CollectResults (9.0.0)
7/10 17:50:42 Event: ULOG_JOB_TERMINATED for Job CollectResults (9.0.0)
7/10 17:50:42 Job CollectResults completed successfully.
7/10 17:50:42 Submitting Job LastNode ...
7/10 17:50:42    assigned Condor ID (10.0.0)
7/10 17:50:42 Event: ULOG_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:42 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:50:57 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:57 Event: ULOG_EXECUTE for Job LastNode (10.0.0)
7/10 17:52:02 Event: ULOG_JOB_TERMINATED for Job LastNode (10.0.0)
7/10 17:52:02 Job LastNode completed successfully.
7/10 17:52:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 17:52:02 All jobs Completed!
7/10 17:52:02 **** condor_scheduniv_exec.4.0 (condor_DAGMAN) EXITING WITH STATUS 0

Clean up your results. Be careful when deleting the mydag.dag.* files: you want to delete only the files matching mydag.dag.*, not mydag.dag itself.

$ rm mydag.dag.* results.*
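
If you want to convince yourself that this pattern is safe before running it on your real files, you can try it in a throwaway directory first. This is just an illustrative sketch: glob_demo is a made-up function name, and the file names are a small subset of the ones listed earlier.

```shell
# Demonstrate, in a throwaway directory, that the pattern mydag.dag.* needs
# the extra "." and therefore never matches mydag.dag itself.
glob_demo() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch mydag.dag mydag.dag.dagman.out mydag.dag.lock results.log results.setup.output
        rm mydag.dag.* results.*
        ls                      # only mydag.dag is left
    )
    rm -rf "$dir"
}
glob_demo
```

The only file name printed is mydag.dag, because every other file matched one of the two patterns.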

Part IV. Data Management

Getting set up

Make a working directory for this exercise. For the rest of this exercise, all your work should be done in there.

$ mkdir dataex
$ cd dataex

Next create some files of different sizes, to use for exercises:

$ dd if=/dev/zero of=smallfile-train99 bs=1M count=10
$ dd if=/dev/zero of=mediumfile-train99 bs=1M count=50
$ dd if=/dev/zero of=largefile-train99 bs=1M count=200
$ ls -sh
total 261M
201M largefile-train99   51M mediumfile-train99   11M smallfile-train99

Moving Files with GridFTP

Transfers to a remote site

Now try transferring a file to a remote site.

First you will need some scratch space on the remote system. You can create a working directory in your remote home directory.

workshop2.ci.uchicago.edu$ globus-job-run osg-edu.cs.wisc.edu /bin/mkdir /nfs/osgedu/clemson-train99

Now copy the file over to this directory:

workshop2.ci.uchicago.edu$ globus-url-copy -vb file:///home/train99/dataex/smallfile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
Source: file:///home/train99/dataex/
Dest:   gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/
  smallfile-train99  ->  ex1
     10485760 bytes         1.41 MB/sec avg         1.43 MB/sec inst

You will probably find that the transfer rate is much lower than when copying to local machines.

You can try copying to other sites in addition to osg-edu.cs.wisc.edu. Remember that you might need to make a scratch directory on each one, and that the place for this will be different for each site.

Measuring transfer speed

Use the -vb flag when copying the large file to see how fast the transfer is happening. If the network between the two hosts is not too busy, it should be fairly quick:

$ globus-url-copy -vb file:///home/train99/dataex/largefile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
Source: file:///home/train99/dataex/
Dest:   gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/
  largefile-train99  ->  ex1
    207618048 bytes         8.81 MB/sec avg         9.09 MB/sec inst
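
The average rate reported by -vb lets you estimate how long a transfer of a given size should take. The sketch below is an illustration only: estimate_seconds is a made-up helper, and it assumes that 1 MB in the -vb output means 1048576 bytes, which these notes do not state.

```shell
# Hypothetical helper: estimate the seconds needed to move a file of a given
# size (in bytes) at a given average rate (in MB/sec, as printed by -vb).
# Assumption: 1 MB means 1048576 bytes.
estimate_seconds() {
    awk -v bytes="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", bytes / 1048576 / rate }'
}
estimate_seconds 209715200 8.81
```

For the 200 MB largefile-train99 (209715200 bytes) at the 8.81 MB/sec average shown above, this prints 22.7.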

URL formats

A quick reminder on URL formats: we've seen two kinds of URLs so far.

  • file:///home/train99/dataex/largefile-train99 - a file called largefile-train99 on the local file system, in the directory /home/train99/dataex/.

  • gsiftp://osg-edu.cs.wisc.edu/scratch/train99/ - the directory /scratch/train99, accessible via gsiftp on the host called osg-edu.cs.wisc.edu.
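
Both URL forms are a scheme followed by an optional host and an absolute path. The sketch below assembles them from their parts. file_url and gsiftp_url are made-up helper names used only for illustration.

```shell
# Hypothetical helpers that assemble the two URL forms from their parts.
# Both expect an absolute path (one that starts with "/").
file_url()   { printf 'file://%s\n' "$1"; }
gsiftp_url() { printf 'gsiftp://%s%s\n' "$1" "$2"; }

file_url /home/train99/dataex/largefile-train99
gsiftp_url osg-edu.cs.wisc.edu /scratch/train99/
```

This prints the two URLs from the list above. It also shows where the three slashes in a file URL come from: two belong to file:// and the third is the leading slash of the absolute path, with no host name in between.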

Parallel streams

Try using 4 parallel data streams by adding the -p flag with an argument of 4. Use the following globus-url-copy command to transfer the file from workshop2.ci.uchicago.edu to osg-edu.cs.wisc.edu:

$ globus-url-copy -p 4 -vb file:///home/train99/dataex/smallfile-train99 gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1

Experiment with transferring different file sizes and numbers of parallel streams, to both local and remote sites and see how the speed varies.
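
One way to organize this experiment is to generate the commands for several stream counts and review them before running anything. The following is a sketch under stated assumptions: print_stream_tests is a made-up function, the stream counts 1, 2, 4 and 8 are arbitrary choices, and the source and destination are the ones used earlier in these notes.

```shell
# Print (not run) one globus-url-copy command per stream count so that the
# commands can be reviewed first. The stream counts are arbitrary examples.
src=file:///home/train99/dataex/largefile-train99
dst=gsiftp://osg-edu.cs.wisc.edu/nfs/osgedu/clemson-train99/ex1
print_stream_tests() {
    for streams in 1 2 4 8; do
        echo "globus-url-copy -p $streams -vb $src $dst"
    done
}
print_stream_tests
```

Once the printed commands look right, you can copy them into your terminal one at a time and compare the rates that -vb reports.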

Third party transfers

Next try a third-party transfer. You do this by specifying two gsiftp URLs, instead of one gsiftp URL and one file URL.

globus-url-copy will control the transfers but data will not pass through the local machine. Instead, it will go directly between the source and destination machines.

Transfer a file between two remote sites, and see if it is faster than if you had transferred it to workshop2.ci.uchicago.edu and then back out again.

Try to make up a command line for this yourself. Remember to use two gsiftp URLs instead of a file URL and a gsiftp URL.

Reliable File Transfer (RFT)

Next use RFT, the reliable file transfer service, to transfer a block of files between two sites.

First, create a transfer job file, which lists some RFT parameters and all of the files to transfer. You can get an example from workshop2.ci.uchicago.edu:/sw/misc/example.rft. Read through this and change the URLs (the site names and the files -- pay attention) at the end to refer to your files.

The RFT command and transfer job file reference is available in the Globus Toolkit RFT documentation.

This example lists three transfers: largefile will be transferred three times, twice to osg-edu.cs.wisc.edu and once to another host on the grid.

You can launch it as follows. The client will periodically output the transfer status. You can watch jobs move from the Pending state to the Active state and then to the Finished state.

$ cp /sw/misc/example.rft rft.xfr
$ vi rft.xfr
... make your changes ...
$ rft -h workshop2.ci.uchicago.edu -f ./rft.xfr 
Number of transfers in this request: 3
Subscribed for overall status
Termination time to set: 60 minutes

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
0/1/0/0/2

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
1/0/0/0/2

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
1/1/0/0/1

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
2/0/0/0/1

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
2/1/0/0/0

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
3/0/0/0/0
All Transfers are completed

Initially all jobs start in the Pending state, move to the Active state and then, hopefully, to the Finished state. If a transfer fails, it goes to the Failed state instead.
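
The status lines are easier to read with labels attached. The sketch below splits one status line into its five counters, in the order given by the Finished/Active/Failed/Retrying/Pending header. rft_status is a made-up helper name, not part of the rft client.

```shell
# Hypothetical helper: label the five counters of an RFT status line,
# which are printed in the order Finished/Active/Failed/Retrying/Pending.
rft_status() {
    echo "$1" | {
        IFS=/ read -r finished active failed retrying pending
        echo "finished=$finished active=$active failed=$failed retrying=$retrying pending=$pending"
    }
}
rft_status 2/1/0/0/0
```

For the status line 2/1/0/0/0 from the transcript above, this prints finished=2 active=1 failed=0 retrying=0 pending=0.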

The transfer file has a number of options, documented in-line. You can experiment with changing them. Interesting ones to try:

  • Add more URLs to transfer

  • Transfer between two remote sites

  • Use parallel streams

  • Increase the transfer concurrency

In particular you should check that you understand the difference between parallel streams (the number of streams used when transferring one file) and concurrency (the number of files that can be transferred at once).
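
The two settings multiply. As a small worked example, assuming every file in flight uses the full number of parallel streams, the upper bound on simultaneous data streams is the concurrency times the number of parallel streams. max_streams is a made-up name for illustration.

```shell
# Upper bound on simultaneous data streams: files in flight (concurrency)
# times streams per file (parallel streams).
max_streams() { echo $(( $1 * $2 )); }
max_streams 3 4    # concurrency 3, 4 parallel streams per file
```

With a concurrency of 3 and 4 parallel streams per file, up to 12 data streams can be open at once.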

Finding Replicas with RLS

The above sections have dealt with moving data around, and always made the assumption that you knew where the files you wanted were located.

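Next we will deal with the Replica Location Service (RLS), which keeps track of where copies (replicas) of a file are stored by mapping a logical filename (LFN) to one or more physical locations.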

Check that we can connect to an RLS server.

$ globus-rls-admin -p rls://workshop2.ci.uchicago.edu
ping rls://workshop2.ci.uchicago.edu: 0 seconds

Querying an RLS server

First perform a simple query for an example logical filename that has been placed in the RLS by the instructors:

$ globus-rls-cli rls://workshop2.ci.uchicago.edu
rls> query lrc lfn example
  example: gsiftp://workshop2.ci.uchicago.edu/scratch/example
  example: gsiftp://osg-edu.cs.wisc.edu/scratch/example

This queries for a logical filename example. The results show that this file can be retrieved via either of two URLs (one in scratch space on workshop2.ci.uchicago.edu, and one in scratch space on osg-edu.cs.wisc.edu).
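
If you save the query output to a file, a short filter can pull out just the physical URLs so that you can pass them to globus-url-copy. This is a minimal sketch: pfns_for is a made-up helper, and it assumes the output layout shown above, with the logical filename, a colon, and then the URL on each line.

```shell
# Hypothetical helper: read "query lrc lfn" output on stdin and print only
# the physical URLs recorded for the given logical filename.
pfns_for() {
    awk -v lfn="$1:" '$1 == lfn { print $2 }'
}
pfns_for example <<'EOF'
  example: gsiftp://workshop2.ci.uchicago.edu/scratch/example
  example: gsiftp://osg-edu.cs.wisc.edu/scratch/example
EOF
```

This prints the two gsiftp URLs, one per line, without the leading logical filename.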

Now try querying for logical filename another-example.

Adding mappings to the RLS

You can also publish your own logical filename into the RLS, with mappings to physical files, using the create command:

rls> create train99-first-lfn gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99

This creates an LFN called train99-first-lfn and then adds a mapping to gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99.

Note

This does not check that the gsiftp URL is valid. If you enter the wrong information here, then RLS will report the wrong information when you later query it.

rls> query lrc lfn train99-first-lfn
  train99-first-lfn: gsiftp://workshop2.ci.uchicago.edu/home/train99/dataex/largefile-train99

Now copy largefile to another place (on another gridlab machine or on one of the remote sites), and register it into the RLS, with the same LFN. You will need to use the add command instead of the create command, because the LFN already exists and you just need to add a new mapping.

Get a neighbour to query the RLS for your logical filename, and see that the mappings you have made are public for everyone to see.

Multiple RLS servers

Other LRCs

So far, you have only been using the RLS server on workshop2.ci.uchicago.edu. There are servers running on other machines.

Use globus-rls-admin to ping the RLS server on osg-edu.cs.wisc.edu and check that it is online.

Then, connect to one of the other servers using globus-rls-cli and query for the example LFN that we used above. You should see that there are some other locations from which you can get the example file.

Try adding your own LFN into one of the other servers, using globus-rls-cli.

RLS server statistics

Next use the -S option to check the status/statistics of each of the two servers. You should see output similar to that below:

$ globus-rls-admin -S rls://workshop2.ci.uchicago.edu 
Version:    2.1.5
Uptime:     00:28:15
LRC stats
  update method: lfnlist
  update method: bloomfilter
  updates bloomfilter: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:45
  lfnlist update interval: 86400
  bloomfilter update interval: 900
  numlfn: 1
  numpfn: 1
  nummap: 1
RLI stats
  updated by: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:35
  updated via bloomfilters
$ globus-rls-admin -S rls://osg-edu.cs.wisc.edu
Version:    2.1.5
Uptime:     00:32:33
LRC stats
  update method: lfnlist
  update method: bloomfilter
  updates bloomfilter: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:40
  lfnlist update interval: 86400
  bloomfilter update interval: 900
  numlfn: 2
  numpfn: 2
  nummap: 2
RLI stats
  updated by: rls://osg-edu.cs.wisc.edu:39281 last 06/21/04 22:44:49
  updated via bloomfilters
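
To compare servers quickly, you can reduce the statistics to the three catalog counters. The sketch below is illustrative only: rls_counts is a made-up helper, and it assumes the numlfn, numpfn and nummap lines are laid out as in the output above.

```shell
# Hypothetical helper: read "globus-rls-admin -S" output on stdin and print
# the catalog counters (numlfn, numpfn, nummap) on one line.
rls_counts() {
    awk '$1 ~ /^num(lfn|pfn|map):$/ {
        sub(/:$/, "", $1)               # "numlfn:" -> "numlfn"
        printf "%s%s=%s", sep, $1, $2
        sep = " "
    }
    END { print "" }'
}
rls_counts <<'EOF'
LRC stats
  lfnlist update interval: 86400
  numlfn: 2
  numpfn: 2
  nummap: 2
EOF
```

For the second server above, this prints numlfn=2 numpfn=2 nummap=2.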

Treasure hunt

There are three files with logical filenames treasure1, treasure2 and treasure3 stored on the grid. Use RLS to find them and RFT to move them into your home directory.

SRM

Getting set up

Make a working directory for this exercise. For the rest of this exercise, all your work should be done in there.

$ mkdir srmex
$ cd srmex

There are a few environment variables already set for you.

SRM_HOME : srm client installation directory; /sw/srmclient2
SRMEP : SRM service endpoint; srm://gwdca04.fnal.gov:8443/srm/managerv2
SRMPATH : Working directory on SRM storage; /pnfs/fnal.gov/data/osgedu
MYNAME : your login

Next create a file to use for exercises:

$ dd if=/dev/zero of=smallfile-$MYNAME bs=1M count=2
$ ls -l
-rw-r--r-- 1 train99 train99 2097152 2008-01-12 18:15 smallfile-train99

Assumptions

You have already used globus-url-copy to move files from your local machine to one of the designated target machines, and from a remote GridFTP server to your local machine.

Basic operations

Checking the status of SRM

Use srm-ping to find out the status of the SRM server at $SRMEP.

$ srm-ping $SRMEP

This returns the SRM version number and some information about the storage backend, similar to the following:

Ping versionInfo=v2.2
Extra information
        Key=backend_type
        Value=dCache
        Key=backend_version
        Value=production-1-8-0-9

Putting a file into SRM managed storage

A file transfer into SRM managed storage involves several protocols, including a GridFTP file transfer. Internally, the client operation communicates with the SRM server through several interfaces: srmPrepareToPut to submit your put request, srmStatusOfPutRequest to check the status of the request, the GridFTP file transfer itself, and srmPutDone to finalize the state of your file transfer.

$ srm-copy file:////home/train99/srmex/smallfile-$MYNAME \
           $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME

Upon successful completion, this returns a summary similar to the following:

SRM-CLIENT*REQUESTTYPE=put
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782625
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= file:////home/train99/srmex/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_SUCCESS
SRM-CLIENT*EXPLANATION[0]=Done
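
If you save a summary like this to a file, you can pull out a single value, for example to check the request status in a script. This is a minimal sketch: srm_field is a made-up helper, not part of the SRM client, and it assumes every summary line has the form SRM-CLIENT*KEY=value as shown above.

```shell
# Hypothetical helper: read an SRM-CLIENT summary on stdin and print the
# value stored under one key, for example REQUEST_STATUS or ACTUALSIZE[0].
srm_field() {
    awk -v key="SRM-CLIENT*$1=" 'index($0, key) == 1 { print substr($0, length(key) + 1) }'
}
srm_field REQUEST_STATUS <<'EOF'
SRM-CLIENT*REQUESTTYPE=put
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*ACTUALSIZE[0]=2097152
EOF
```

This prints SRM_SUCCESS. The key is matched as a literal prefix rather than as a pattern, so the * and the [0] in the key names need no escaping.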

URL formats

A quick reminder on URL formats:

We've seen two kinds of URLs so far.

  • file:////home/train99/srmex/smallfile-$MYNAME - a file called smallfile-$MYNAME on the local file system, in directory /home/train99/srmex/. The $MYNAME suffix is only there to make the filename unique in this grid school.

  • srm://gwdca04.fnal.gov:8443/srm/managerv2\?SFN=/pnfs/fnal.gov/data/osgedu/smallfile-train99 - a SiteURL (SURL) for a file named smallfile-train99 in directory /pnfs/fnal.gov/data/osgedu, on the SRM running on the host called gwdca04.fnal.gov, port 8443, with the web service handle /srm/managerv2. SFN stands for Site File Name.
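
A SURL is the service endpoint, the literal text ?SFN=, and the path of the file on the storage. The sketch below joins the three parts. make_surl is a made-up helper used only for illustration, and the values passed to it are the ones listed for SRMEP and SRMPATH above.

```shell
# Hypothetical helper: join endpoint, storage directory and file name into a
# SURL. Inside the quoted format string the "?" needs no backslash.
make_surl() { printf '%s?SFN=%s/%s\n' "$1" "$2" "$3"; }
make_surl srm://gwdca04.fnal.gov:8443/srm/managerv2 /pnfs/fnal.gov/data/osgedu smallfile-train99
```

The printed SURL contains a plain ? with no backslash. The backslash in the commands in these notes is not part of the URL: it only stops the shell from treating an unquoted ? as a wildcard character.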

Browsing a file in SRM managed storage

Now try to find out the properties of the file that you just put into SRM.

$ srm-ls $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME

Upon successful completion, this returns a summary similar to the following:

SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*REQUEST_EXPLANATION=srm-ls completed normally
SRM-CLIENT*SURL=/pnfs/fnal.gov/data/osgedu/smallfile-alex
SRM-CLIENT*BYTES=2097152
SRM-CLIENT*FILETYPE=FILE
SRM-CLIENT*STORAGETYPE=PERMANENT
SRM-CLIENT*FILE_STATUS=SRM_SUCCESS
SRM-CLIENT*OWNERPERMISSION=7166
SRM-CLIENT*LIFETIMELEFT=-1
SRM-CLIENT*LIFETIMEASSIGNED=-1
SRM-CLIENT*CHECKSUMTYPE=adler32
SRM-CLIENT*CHECKSUMVALUE=01e00001
SRM-CLIENT*FILELOCALITY=ONLINE
SRM-CLIENT*OWNERPERMISSION.USERID=7166
SRM-CLIENT*OWNERPERMISSION.MODE=RW
SRM-CLIENT*GROUPPERMISSION.GROUPID=9803
SRM-CLIENT*GROUPPERMISSION.MODE=R
SRM-CLIENT*OTHERPERMISSION=R
SRM-CLIENT*RETENTIONPOLICY=CUSTODIAL
SRM-CLIENT*ACCESSLATENCY=ONLINE
SRM-CLIENT*LASTACCESSED=2008-1-12-18-18-39
SRM-CLIENT*CREATEDATTIME=2008-1-12-18-18-39

Getting a file from SRM managed storage

Now try to copy the file that you just put into SRM, and then browsed, from the SRM managed storage to your local machine. Internally, this client operation communicates with the SRM server through several interfaces: srmPrepareToGet to submit your get request, srmStatusOfGetRequest to check the status of the request, the GridFTP file transfer itself, and srmReleaseFiles to release the file after your transfer.

$ srm-copy $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME \
           file:////home/train99/srmex/my-smallfile

Upon successful completion, this returns a summary similar to the following:

SRM-CLIENT*REQUESTTYPE=get
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782626
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= file:////home/train99/srmex/my-smallfile
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_FILE_PINNED
SRM-CLIENT*EXPLANATION[0]=Done

After srm-copy is completed, find out the file size at the target on your local machine:

$ ls -l /home/train99/srmex/my-smallfile
-rw-r--r-- 1 train99 train99 2097152 2008-01-12 19:29 /home/train99/srmex/my-smallfile
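
Instead of reading the size from the ls output, you can compare it with the expected value directly. This is a sketch only: size_matches is a made-up helper, and the demonstration uses a temporary file created on the spot in place of the file you fetched from SRM.

```shell
# Hypothetical helper: succeed only if a file has exactly the expected size.
size_matches() {
    actual=$(wc -c < "$1") || return 1
    [ "$((actual))" -eq "$2" ]
}

# Demonstration on a temporary 2 MB file (2 blocks of 1048576 bytes).
tmp=$(mktemp) || exit 1
dd if=/dev/zero of="$tmp" bs=1048576 count=2 2>/dev/null
size_matches "$tmp" 2097152 && echo "size OK"
rm -f "$tmp"
```

For your own transfer you would call size_matches /home/train99/srmex/my-smallfile 2097152, using the ACTUALSIZE value from the srm-copy summary as the expected size.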

Removing a file in SRM managed storage

Now try to remove the file that you put into the SRM managed storage.

$ srm-rm $SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME

Upon successful completion, this returns a summary similar to the following:

SRM-DIR: Total files to remove: 1
        status=SRM_SUCCESS
        explanation=successfully removed files
        surl=$SRMEP\?SFN=$SRMPATH/smallfile-$MYNAME

After srm-rm returns successfully, find out the file properties of the same SURL on the SRM with srm-ls. You should see that the SURL is invalid.

Creating and removing a directory in SRM managed storage

Now try to create a directory in SRM managed storage.

$ srm-mkdir $SRMEP\?SFN=$SRMPATH/$MYNAME

This will create a directory under the SRM that you can use in your SURLs. Upon successful completion, this returns a summary similar to the following:

SRM-DIR: Sat Jan 12 19:04:09 CST 2008 Calling SrmMkdir
        status=SRM_SUCCESS
        explanation=success

Browse the directory with srm-ls to see what property information SRM returns for it.

Now try to remove the directory from SRM.

$ srm-rmdir $SRMEP\?SFN=$SRMPATH/$MYNAME

This will remove a directory under the SRM. Upon successful completion, this returns a summary similar to the following:

SRM-DIR: Sat Jan 12 19:06:34 CST 2008 Calling SrmRmdir
        status=SRM_SUCCESS
        explanation=success

Summary of basic operations

Experiment with putting and getting files of different sizes, with different numbers of parallel streams, to and from the remote SRM site, and compare the results. For example, to use 4 parallel data streams, add the -parallelism option with an argument of 4. The client operation goes through the same protocol, and the parallel streams are used in the GridFTP file transfer. The difference in transfer performance should be more noticeable with larger files.

Experiment with directory structure in your path.

Note: Remember to remove the files and directories that you created once you are done.

Space management and related operations

Reserving a space in SRM for opportunistic use

Now, let's make a space reservation for 5 MB (5000000 bytes) of total space, 4 MB (4000000 bytes) of guaranteed space and a lifetime of 900 seconds:

$ srm-sp-reserve -serviceurl $SRMEP -size 5000000 -gsize 4000000 -lifetime 900

Upon successful completion, this returns a summary similar to the following:

SRM-SPACE: Status Code for spaceStatusRequest SRM_SUCCESS
        SpaceToken=258138
        TotalReservedSpaceSize=4000000
        Retention Policy=REPLICA
        Access Latency=ONLINE

On success, the output shows the space token, which will be used in the next exercises (258138 in the output above; tokens are not always numeric, and different storage systems may return different string formats). Note that the reserved space was returned as 4 MB, matching the guaranteed size you requested. Set the returned space token as an environment variable so that you can reuse it later:

$ export SPTOKEN=258138
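
Rather than copying the token by hand, you can extract it from the saved output of srm-sp-reserve. The sketch below is illustrative: space_token is a made-up helper, and it assumes the token appears on a line of the form SpaceToken=value as in the output above.

```shell
# Hypothetical helper: read srm-sp-reserve output on stdin and print the
# value of the SpaceToken line.
space_token() {
    sed -n 's/^[[:space:]]*SpaceToken=//p'
}
SPTOKEN=$(printf '%s\n' \
    'SRM-SPACE: Status Code for spaceStatusRequest SRM_SUCCESS' \
    '        SpaceToken=258138' \
    '        TotalReservedSpaceSize=4000000' | space_token)
export SPTOKEN
echo "$SPTOKEN"
```

With the sample output above this prints 258138. Because the token is handled as a plain string, the same filter also works for storage systems that return non-numeric tokens.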

Finding out space properties from SRM

Now, let's find out the space information with the space token that you just received above:

$ srm-sp-info -serviceurl $SRMEP -spacetoken $SPTOKEN

Upon successful completion, this returns a summary similar to the following:

SRM-SPACE:  ....space token details ....
        status=SRM_SUCCESS
        SpaceToken=258138
        TotalSize=4000000
        Owner=VoGroup=osgedu VoRole=null
        LifetimeAssigned=900
        LifetimeLeft=463
        UnusedSize=4000000
        GuaranteedSize=4000000
        RetentionPolicy=REPLICA
        AccessLatency=ONLINE
        status=SRM_SUCCESS
        explanation=ok
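
Since a reservation expires when its lifetime runs out, it can be useful to check LifetimeLeft against a threshold. This is a minimal sketch: space_expiring is a made-up helper, the threshold of 600 seconds is an arbitrary example, and the input is assumed to look like the srm-sp-info output above.

```shell
# Hypothetical helper: read srm-sp-info output on stdin and report whether
# the reservation has fewer than the given number of seconds left.
space_expiring() {
    awk -F= -v limit="$1" '$1 ~ /LifetimeLeft$/ {
        state = ($2 + 0 < limit + 0) ? "renew soon" : "ok"
        print state " (" $2 " seconds left)"
    }'
}
printf '%s\n' '        LifetimeAssigned=900' '        LifetimeLeft=463' | space_expiring 600
```

With 463 seconds left and a threshold of 600, this prints renew soon (463 seconds left).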

Retrieving space tokens from SRM

Suppose you lost your space token. Let's find out how to retrieve the space tokens that belong to you:

$ srm-sp-tokens -serviceurl $SRMEP 

Upon successful completion, this returns a summary similar to the following:

SRM-SPACE: ...................................
        Status=SRM_SUCCESS
        Explanation=OK
SRM-SPACE (0)SpaceToken=258138

This shows all the space tokens that belong to your grid identity and its mapping on the server.

Updating a space in SRM

Some time has passed since the space reservation above, and the reserved space may be close to expiring. Now, let's update both the lifetime and the size of the space. We'll use 7MB of total space with 6MB of guaranteed space, and make the lifetime 950 seconds:

$ srm-sp-update -serviceurl $SRMEP -spacetoken $SPTOKEN -size 7000000 -gsize 6000000 -lifetime 950

Because the target SRM storage does not support this functionality, the command returns a summary similar to the following:

SRM-SPACE: Sat Jan 12 19:09:55 CST 2008 Calling updateSpace request
        status=SRM_NOT_SUPPORTED
        explanation=can not find a handler, not implemented
        Request token=null

However, when the SRM storage supports the functionality and the request is successful, this returns a summary similar to the following.

SRM-SPACE: Sat Jan 12 21:22:50 PST 2008 Calling updateSpace request
        status=SRM_SUCCESS
        Request token=null
        lifetime=950
        Min=7000000
        Max=7000000

Your space token is the same as before, and upon successful completion, the lifetime and size of your space should be updated. Use srm-sp-info to query the SRM for the space information and verify the updated values.

Putting a file into the reserved space in SRM

Now let's put a file into your reserved space using the space token. This client operation communicates with the SRM server, same as before. However, because of your space token, your file will be written into the space that you have reserved.

$ srm-copy file:////home/train99/srmex/smallfile-$MYNAME \
           $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME \
           -spacetoken $SPTOKEN

Upon successful completion, this returns a summary similar to the following:

SRM-CLIENT*REQUESTTYPE=put
SRM-CLIENT*TOTALFILES=1
SRM-CLIENT*TOTAL_SUCCESS=1
SRM-CLIENT*TOTAL_FAILED=0
SRM-CLIENT*REQUEST_TOKEN=-2146782603
SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*SOURCEURL[0]= file:////home/train99/srmex/smallfile-$MYNAME
SRM-CLIENT*TARGETURL[0]= $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME
SRM-CLIENT*TRANSFERURL[0]=gsiftp://gwdca03.fnal.gov:2811///smallfile-space-alex
SRM-CLIENT*ACTUALSIZE[0]=2097152
SRM-CLIENT*FILE_STATUS[0]=SRM_SUCCESS
SRM-CLIENT*EXPLANATION[0]=Done

After successful completion, find out the file properties with srm-ls.

$ srm-ls $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME

Upon successful completion, this returns a summary similar to the following:

SRM-CLIENT*REQUEST_STATUS=SRM_SUCCESS
SRM-CLIENT*REQUEST_EXPLANATION=srm-ls completed normally
SRM-CLIENT*SURL=/pnfs/fnal.gov/data/osgedu/smallfile-space-alex
SRM-CLIENT*BYTES=2097152
SRM-CLIENT*FILETYPE=FILE
SRM-CLIENT*STORAGETYPE=PERMANENT
SRM-CLIENT*FILE_STATUS=SRM_SUCCESS
SRM-CLIENT*OWNERPERMISSION=7166
SRM-CLIENT*LIFETIMELEFT=-1
SRM-CLIENT*LIFETIMEASSIGNED=-1
SRM-CLIENT*CHECKSUMTYPE=adler32
SRM-CLIENT*CHECKSUMVALUE=01e00001
SRM-CLIENT*FILELOCALITY=ONLINE
SRM-CLIENT*OWNERPERMISSION.USERID=7166
SRM-CLIENT*OWNERPERMISSION.MODE=RW
SRM-CLIENT*GROUPPERMISSION.GROUPID=9803
SRM-CLIENT*GROUPPERMISSION.MODE=R
SRM-CLIENT*OTHERPERMISSION=R
SRM-CLIENT*SPACETOKENS(0)=258138
SRM-CLIENT*RETENTIONPOLICY=CUSTODIAL
SRM-CLIENT*ACCESSLATENCY=ONLINE
SRM-CLIENT*LASTACCESSED=2008-1-12-19-16-37
SRM-CLIENT*CREATEDATTIME=2008-1-12-19-16-37

Note from the previous srm-ls output that this time it shows the space token you used when putting your file into the SRM managed storage.
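
The srm-ls listing also reports CHECKSUMTYPE=adler32 and a CHECKSUMVALUE. If you want to compare that value with your local copy, an Adler-32 checksum can be computed with standard tools. The function below is only a sketch (adler32 is a name chosen here, not a command provided by the SRM clients), and it is slow for very large files.

```shell
# Sketch: compute the Adler-32 checksum of a local file as 8 hex digits.
# od -v prints every byte as a decimal number; awk keeps the two running sums.
adler32() {
    od -An -v -tu1 "$1" | awk '
        BEGIN { a = 1; b = 0 }
        { for (i = 1; i <= NF; i++) { a = (a + $i) % 65521; b = (b + a) % 65521 } }
        END { printf "%04x%04x\n", b, a }'
}

# Possible use:
#   adler32 /home/train99/srmex/smallfile-$MYNAME
```

For example, a file of 2097152 zero bytes yields 01e00001. If the local value differs from CHECKSUMVALUE, the local and remote copies are not identical.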

Releasing the reserved space from SRM

Now let's release the reserved space using the space token.

$ srm-sp-release -serviceurl $SRMEP -spacetoken $SPTOKEN

Upon successful completion, this returns a summary similar to the following:

SRM-SPACE: Releasing space for token=258138
        status=SRM_SUCCESS
        explanation=Space released

This operation may fail if there are still files in the space associated with the space token. In that case, remove the files with srm-rm and then try releasing the space again.

$ srm-rm  $SRMEP\?SFN=$SRMPATH/smallfile-space-$MYNAME

After successfully releasing your reserved space, check the space properties with srm-sp-info.

Summary of space management operations

Experiment with reserving spaces of different sizes and lifetimes, and with putting your files into the reserved spaces using the space token. Experiment with updating a reserved space after you have put files into it. Experiment with directory structure in your SURL.

Note: Remember to remove those files and directories that you created afterwards. Also remember to release any spaces that you reserved if they are still active.

Part V. Security and Certificates on the Grid

This exercise provides hands-on experience in using various tools to set up and use the Grid Security Infrastructure (GSI) for working on the grid. The first few sections delve into certificates and proxies and demonstrate how pre-configured credentials can be used to run some grid-enabled programs.

Proxies

In order to do things (like submit jobs or transfer data) on the grid, you need a grid proxy. A grid proxy contains everything necessary to authenticate you to grid resources. You will do more with grid security in the security lab, later on.

We have given each training account a proxy that will work on many grid systems. You can check this with the grid-proxy-info command.

workshop2.ci.uchicago.edu$ grid-proxy-info
subject  : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564/CN=1558914057
issuer   : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564
identity : /DC=org/DC=doegrids/OU=People/CN=OSG Education student 37 789564
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1048
timeleft : 98:42:47  (4.1 days)

Look at the timeleft field. This tells you how much longer this proxy will be valid. Check that there is some time left on your proxy. (When this proxy has expired, you will no longer be able to use the grid, and you will have to get a new proxy.)
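
If you want a script to check the remaining time for you, the timeleft line can be converted to seconds. This sketch assumes the H:M:S layout shown above; proxy_seconds_left is a hypothetical helper name. Your installation of grid-proxy-info may also accept a -timeleft option that prints the seconds directly; check its help output on your host.

```shell
# Sketch: convert the "timeleft" line of grid-proxy-info output (on stdin)
# into a number of seconds.
proxy_seconds_left() {
    awk '$1 == "timeleft" {
        split($3, t, ":")                     # t[1]=hours t[2]=minutes t[3]=seconds
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit
    }'
}

# Possible use:
#   left=$(grid-proxy-info | proxy_seconds_left)
#   [ "$left" -lt 3600 ] && echo "Your proxy expires within the hour"
```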

Grid Proxy Certificates

The security details necessary for you to access the grid are stored in a file called a 'grid proxy'. In normal grid usage, you would use your own credentials and a command called grid-proxy-init to make your own proxy. For this tutorial, the instructors made a proxy for you before the tutorial started.

Contents of a Grid Proxy

Use grid-proxy-info to show information about your proxy.

Use the -all parameter to display all the information about your proxy:

[train99 ~]$ grid-proxy-info -all
subject  : /O=Grid/OU=OSG/CN=Training User 99/CN=203360020
issuer   : /O=Grid/OU=OSG/CN=Training User 99
identity : /O=Grid/OU=OSG/CN=Training User 99
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u539
timeleft : 11:58:58

Grid Proxy Details

subject

The distinguished name (DN) from the certificate, with a unique string of numbers appended.

issuer

The distinguished name of the user certificate itself.

path

The file system location where your proxy is stored.

timeleft

How much longer the proxy will be valid, in hours, minutes and seconds.

As you can see, the issuer of the proxy certificate is the user certificate. This shows the chain of trust: CA -> user certificate -> proxy certificate.

The proxy contains the private key generated for the proxy and the corresponding public key, and it is signed, like a certificate, by the user certificate.
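
The relationship between the subject and issuer fields can be checked mechanically: the subject should be the issuer DN with one extra /CN= component. The following sketch works on the text printed by grid-proxy-info; is_proxy_of_issuer is an invented name, and the check only compares strings, it does not verify any signature.

```shell
# Sketch: succeed when the "subject" DN equals the "issuer" DN plus "/CN=...".
# Reads grid-proxy-info output on stdin.
is_proxy_of_issuer() {
    awk '
        $1 == "subject" { sub(/^subject[ \t]*:[ \t]*/, ""); subject = $0 }
        $1 == "issuer"  { sub(/^issuer[ \t]*:[ \t]*/, "");  issuer = $0 }
        END {
            prefix = issuer "/CN="
            ok = (issuer != "" && substr(subject, 1, length(prefix)) == prefix)
            exit (ok ? 0 : 1)
        }'
}

# Possible use:
#   grid-proxy-info -all | is_proxy_of_issuer && echo "subject extends issuer"
```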

Now list the contents of the proxy using grid-cert-info, specifying the full path to your proxy.

$ grid-cert-info -file /path/to/proxy/proxyFileName
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 203360020 (0xc1f0714)
        Signature Algorithm: md5WithRSAEncryption
        Issuer: C=US, O=SDSC, OU=SDSC, CN=Account Train31/UID=train31
        Validity
            Not Before: Jun 23 14:55:10 2006 GMT
            Not After : Jun 24 03:00:10 2006 GMT
        Subject: C=US, O=SDSC, OU=SDSC, CN=Account Train31/UID=train31, CN=203360020
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
            RSA Public Key: (512 bit)
                Modulus (512 bit):
                    00:b8:75:e3:a4:3c:31:9e:b9:71:e8:b0:4e:fc:18:
                    69:e6:79:15:90:f4:0f:49:20:f0:e3:62:9f:e2:92:
                    d0:96:4c:9b:b5:97:12:b3:bd:87:c7:8c:2f:bb:b0:
                    fe:79:8c:3d:61:5e:49:f6:c1:46:e1:1e:08:d1:d7:
                    89:a0:e3:8a:f3
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            1.3.6.1.4.1.3536.1.222: critical
                0.0
..+.......
    Signature Algorithm: md5WithRSAEncryption
        45:05:52:c7:9f:a7:35:32:d9:a8:be:58:92:a7:b0:61:e4:7a:
        2a:a2:36:0f:eb:65:0e:0f:ca:40:3d:0e:27:8b:38:14:a6:af:
        51:7d:28:2f:ac:3e:3e:05:7b:ea:d6:0e:fc:78:7d:eb:60:80:
        6a:74:43:64:ef:ca:e8:25:fe:d3:07:a9:4d:e0:54:4a:75:9f:
        c9:8e:9a:1e:82:19:a4:fc:72:a3:6f:0d:de:33:57:d8:f8:cd:
        da:d2:bc:8a:ee:48:34:4b:00:3e:7e:b7:5e:66:fa:2e:5c:22:
        4a:50:98:02:32:c6:e3:a9:07:b7:bb:e6:4d:02:e8:6c:d4:48:
        5e:55:ec:ed:a9:38:ee:b8:33:60:88:c1:ab:38:ce:d8:53:a3:
        ac:c3:a2:c1:d8:1e:95:5b:e5:3a:3f:d1:e0:51:c2:5e:82:e0:
        a4:48:d3:e6:82:66:56:d9:6b:e0:a5:1e:85:4d:3d:d7:e0:4e:
        03:ce:f7:5a:63:cd:5c:9a:38:96:59:0f:92:11:6b:eb:ed:34:
        1a:55:73:e1:c0:b0:91:ea:b4:1e:3b:8d:0f:2d:53:83:10:98:
        44:19:ac:39:6d:1a:6b:37:90:60:6a:35:9b:c6:41:2e:5a:ef:
        ae:54:6c:9e:51:b8:68:c2:97:83:2f:72:25:df:90:b9:bc:31:
        92:23:45:77
[train99 ~]$ 

The contents are similar to your user certificate, but there are some differences; for example, the issuer is the DN of the user certificate, rather than of the certificate authority.

grid-cert-info is useful to see how long your proxy certificate will last (the Not Before and Not After lines under Validity).

Contents of the Grid Mapfile

Globus services (for example, GRAM and GridFTP) use a grid mapfile located in /etc/grid-security/grid-mapfile on each server.

This file has restricted write access, but the file can be read by anyone.

You can look at the grid mapfile on workshop2 like this:

[train99@workshop2.ci.uchicago.edu ~]$ cat /etc/grid-security/grid-mapfile 
#
# Automatically generated by gx-gen-mapfile (gx-map 0.5.1)
# at Fri 2006-06-23 15:26:02 UTC on workshop2.ci.uchicago.edu.
# DO NOT EDIT THIS FILE.  ANY CHANGES YOU MAKE WILL BE LOST ON THE NEXT UPDATE.
#
"/C=US/O=Globus Alliance/OU=User/CN=101497d3dcd.3dcd5aef" ranantha
"/C=US/O=Globus Alliance/OU=User/CN=10bd8f410f6.5f0086b4" benc
"/C=US/O=Globus Alliance/OU=User/CN=10bf234e01a.ac286cfa" ranantha
"/C=US/O=SDSC/OU=SDSC/CN=Account Train10/UID=train10" train10
"/C=US/O=SDSC/OU=SDSC/CN=Account Train11/UID=train11" train11
"/C=US/O=SDSC/OU=SDSC/CN=Account Train12/UID=train12" train12
...
"/C=US/O=SDSC/OU=SDSC/CN=Account Train58/UID=train58" train58
"/C=US/O=SDSC/OU=SDSC/CN=Account Train59/UID=train59" train59
"/C=US/O=SDSC/OU=SDSC/CN=Account Train60/UID=train60" train60
"/DC=org/DC=doegrids/OU=People/CN=Gaurang Mehta 998137" gmehta

Grid mapfiles can be created by system administrators by hand or using a number of tools. In this workshop, the grid mapfile is maintained by a tool called gx-map.

Only the listed DNs are allowed to access Globus services running on workshop2.

Each entry is a mapping from a DN to a username. For example, the DN /O=Grid/OU=OSG/CN=Training User 99 is mapped to the username train99.
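
To find out which local account a particular DN maps to, you can search the file for that DN. The helper below is a sketch that assumes the simple one-line format shown above (a quoted DN followed by a username); gridmap_user is a hypothetical name.

```shell
# Sketch: print the local username(s) that a DN maps to in a grid mapfile.
# Each entry looks like:  "DN in double quotes" username
gridmap_user() {
    dn=$1
    mapfile=${2:-/etc/grid-security/grid-mapfile}
    awk -F'"' -v dn="$dn" '
        /^[ \t]*#/ { next }                                   # skip comment lines
        $2 == dn   { gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3 }
    ' "$mapfile"
}

# Possible use:
#   gridmap_user "/C=US/O=SDSC/OU=SDSC/CN=Account Train10/UID=train10"
```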

Part VI. Running applications on other Grid sites

Objectives

So far you've run jobs on various grid sites, but they have all been sites specifically chosen and prepared by the instructors. Now we will try running on some other sites on the grid. There are a few issues to cover - some in this section and some in the next.

Finding Sites

There are a number of machines that you can probably submit to: some on Open Science Grid, some on TeraGrid and some tutorial-specific machines. You need to use a different mechanism to discover the machines in each of these grids.

Finding sites on the Open Science Grid

VORS (Virtual Organization Resource Selector) is one of the monitoring tools available on the Open Science Grid. It can be used to get a good view of the Open Science Grid. For instance, there is a map of sites in the OSGEDU VO.

You can use this to check the status of many OSG sites on this list, to find out which sites are working and which sites support the OSGEDU VO which you are part of.

Checking authentication

Use globusrun to verify that you are authorized to use a site and can authenticate to it:

$ globusrun -a -r osg-edu.cs.wisc.edu/jobmanager-fork
GRAM Authentication test successful

Testing a Site

Let's test a site, using GRAM2:

$ globus-job-run osg-edu.cs.wisc.edu /bin/date
Sun Jul 10 23:25:25 CDT 2005

Try to copy your application to the site.

Caution

You don't know where to copy the files. Is there any temporary directory available? And if so, how do you find it?

If you plan to use any of the OSG sites, and you are authorized to do so (GRAM Authentication test successful), go to the VORS page again.

Pick a site (one that supports OSGEDU). Click on that site.

Under that listing, the entry labelled $APP location is the APPDIR you will be using.

The APPDIR is where you should copy your applications. (After making a separate directory, of course. You don't want your application to be messed up by other students, do you?)

Running a job

Now you can go ahead and

  1. Create your workspace in the APPDIR

  2. "Stage-in" your application with globus-url-copy

  3. Execute your application

Remember to replace SITE, APPDIR and YOURUSERNAME with values that are appropriate for you.

$ globus-job-run SITE /bin/mkdir APPDIR/YOURUSERNAME
$ globus-url-copy file://`pwd`/prime gsiftp://SITE/APPDIR/YOURUSERNAME/prime
$ globus-job-run SITE /bin/chmod +x APPDIR/YOURUSERNAME/prime
$ globus-job-run SITE APPDIR/YOURUSERNAME/prime 200 2 200
NO
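
If you repeat these steps for several sites, it may help to wrap them in a function. The following is a sketch rather than a finished tool: stage_and_run, run_step and the DRY_RUN switch are names invented for this example, and it assumes the prime executable is in your current directory, as in the commands above.

```shell
# Sketch: the four steps above wrapped in one function.
# DRY_RUN is a hypothetical switch: when set to 1, commands are printed, not run.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_and_run() {
    site=$1; appdir=$2; user=$3
    shift 3                                  # remaining arguments go to prime
    run_step globus-job-run "$site" /bin/mkdir "$appdir/$user" &&
    run_step globus-url-copy "file://$(pwd)/prime" "gsiftp://$site/$appdir/$user/prime" &&
    run_step globus-job-run "$site" /bin/chmod +x "$appdir/$user/prime" &&
    run_step globus-job-run "$site" "$appdir/$user/prime" "$@"
}

# Possible use (preview first, then run for real):
#   DRY_RUN=1 stage_and_run SITE APPDIR YOURUSERNAME 200 2 200
#   stage_and_run SITE APPDIR YOURUSERNAME 200 2 200
```

Because the steps are chained with &&, a failure in one step (for example, the directory cannot be created) stops the later steps from running.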

Using Condor-G to submit jobs to OSG and TeraGrid

Condor-G allows you to use Condor tools to submit jobs to Globus resources. You can think of it as a sophisticated globus-job-run. One of its useful features is the ability to submit multiple jobs to grid resources and monitor them.

Submitting a single job using Condor-G

In the Condor world, you write a submission file that describes the application you are submitting. A sample submission file is below.

########################################
#                       
#  A sample condor submission file
#                                        
########################################
                                         
executable =  prime
universe = vanilla

output  = prime.out                
error   = prime.error             
log     = prime.log  
arguments = 107 2 107

queue

Note the universe variable. If the universe is vanilla, the job is in fact executed on the submitting site itself. Condor on workshop2.ci.uchicago.edu is configured to run jobs locally.

To submit jobs to a remote resource, the universe should be set to grid. Let's see a submission file for the grid universe.

########################################
#                       
#  A sample Condor-G submission file
#                                        
########################################

executable = APPDIR/YOURUSERNAME/prime

transfer_executable = false
universe       = grid
grid_resource = gt2 SITE/jobmanager
log            = prime.log
arguments      = 100 2 100
output = prime.out

queue

Submit your prime application using this submission file to a site. You can monitor your application using condor_q.

$ condor_submit example.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 33.

$ condor_q

-- Submitter: workshop2.ci.uchicago.edu : <206.76.233.104:36236> : workshop2.ci.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   9.0   train99      7/10 23:39   0+00:00:00 I  0   0.0  prime             

1 jobs; 1 idle, 0 running, 0 held

Tip

If condor_q lists too many entries, you can use the job ID to refer to your job. In the above example, this is 33, taken from the line "1 job(s) submitted to cluster 33." returned by condor_submit. Now you can just run condor_q 33. You can also try condor_q train99 to list only the jobs submitted by user train99.

You can try various options like -long and -globus with condor_q to see more details.
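
In a script, the cluster number can be picked out of the condor_submit output so that you do not have to read it off the screen. This sketch assumes the "submitted to cluster N." message shown above; submitted_cluster is a hypothetical helper name.

```shell
# Sketch: extract the cluster number from condor_submit output read on stdin.
submitted_cluster() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Possible use:
#   cluster=$(condor_submit example.sub | submitted_cluster)
#   condor_q "$cluster"
```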

Checking output

Check the output file prime.out.

Submitting multiple jobs to multiple sites using Condor-G

There are various ways to use Condor-G to submit multiple jobs to multiple sites:

  • Write multiple submission files and change attributes manually or using a script. This is clumsy and difficult to manage.

  • Write a single common submission file and dynamically change only the attributes that need to be changed.

First, we must identify the attributes that need to be changed for different instantiations of the application:

  • range of divisors

  • site names

  • application directories

  • output file names

How do we do this? By passing parameters to condor_submit:

$ condor_submit -a "arguments = $num $start $end" -a "grid_resource = gt2 $site/jobmanager" ...

The strings that you specify with the -a option are added to the commands read from the submission file you specify; the file itself is not modified.

Write a common submission file and submit three instantiations of your prime application to three sites. Note that you have to use different output file names for each instantiation.

Create a submission file named example.sub with the following contents:

####################
#
# Submission file for prime number finder
#
####################

transfer_executable = false
universe       = grid
log            = prime.log
queue

Note that we do not specify the site and arguments.

Create a directory for the output files:

$ mkdir output

Submit a job using the submission file by passing arguments to condor_submit:

$ condor_submit -a "arguments = 1000 2 1000" -a "output = output/1.out" -a "grid_resource = gt2 ufgrid05.phys.ufl.edu/jobmanager" -a "executable = APPDIR/YOURUSERNAME/prime" example.sub

Submit multiple jobs to multiple sites. Note that you have to copy your executable to each site if it is not there already. Use VORS to find the APP variables.
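
The attributes that change per job (site, application directory, divisor range, output file) can be driven from a short list. The loop below is a sketch under several assumptions: submit_to_sites and the CONDOR_SUBMIT override are invented names, each input line is expected to hold "site appdir start end", and example.sub is the common submission file created above.

```shell
# Sketch: submit one prime job per "site appdir start end" line read from stdin.
# CONDOR_SUBMIT is a hypothetical override; CONDOR_SUBMIT=echo previews the commands.
submit_to_sites() {
    user=$1     # your login name, e.g. train99
    num=$2      # the number to test for primality
    n=0
    while read -r site appdir start end; do
        [ -n "$site" ] || continue          # skip blank lines
        n=$((n + 1))                        # numbers the output files 1.out, 2.out, ...
        "${CONDOR_SUBMIT:-condor_submit}" \
            -a "arguments = $num $start $end" \
            -a "output = output/$n.out" \
            -a "grid_resource = gt2 $site/jobmanager" \
            -a "executable = $appdir/$user/prime" \
            example.sub < /dev/null
    done
}

# Possible use, with one line per site in a file you create (sites.txt is hypothetical):
#   submit_to_sites train99 1000 < sites.txt
```

Each job gets its own output file under output/, which satisfies the requirement that instantiations use different output file names.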

Inspecting the output

A simple grep through all your output files should tell whether the number is a prime or not.

$ grep NO output/*

Clean-up

If you no longer plan to use the executable that you copied to the remote site(s) with globus-url-copy, please clean up your workspace(s):

$ globus-job-run SITE /bin/rm -rf APPDIR/YOURUSERNAME 

Putting it all together (Optional)

Write a script to submit jobs to multiple sites automatically.

Perhaps your script would contain something like this:

$ ./script.pl
Usage: ./script.pl <task number>
    1 - Make dir
    2 - Copy exes
    3 - Run prime jobs
    4 - Grep output
    5 - Remove dir

About these notes

These notes were produced from the Open Science Grid Education, Outreach and Training group SVN repository, at this location and revision:

Path: .
URL: https://svn.ci.uchicago.edu/svn/osgedu/schools/2008/clemson
Repository Root: https://svn.ci.uchicago.edu/svn/osgedu
Repository UUID: b4a0e4a1-be33-0410-93ba-8605a86001b8
Revision: 376
Node Kind: directory
Schedule: normal
Last Changed Author: benc
Last Changed Rev: 368
Last Changed Date: 2008-05-19 12:19:22 -0500 (Mon, 19 May 2008)