Part II. Grid job submission using GRAM

Running Grid Jobs with Globus Commands

Now, if everything is set up correctly, you should be able to run Grid jobs on the hosts in the lab Grid. First, try to execute a simple hello world job, like this:

$ globus-job-run osg-edu.cs.wisc.edu /bin/echo Hello World
Hello World
  

You've just submitted a job (the Linux command echo) to run on osg-edu.cs.wisc.edu.

The globus-job-run utility runs commands on remote sites.

It expects each command to be given as a fully qualified path name (i.e., it must start with a "/"). Let's say we want to run the Linux command hostname on the remote site to verify that we're talking to the resource we think we are.

  1. Run it locally to make sure you are invoking it correctly.

    $ hostname
    terminable.ci.uchicago.edu
          
  2. Use the command which to discover the location of the version of hostname that you are using. It will return a fully-qualified path name.

    $ which hostname
    /bin/hostname
    
  3. This tells you that to run hostname via globus-job-run, use /bin/hostname.

  4. Use which to discover the location of the following commands on the system:

    • id
    • env
    • ps
    • uptime

  5. Now run hostname remotely, on osg-edu.cs.wisc.edu, to verify that you really are reaching a remote system:

    $ globus-job-run osg-edu.cs.wisc.edu /bin/hostname
    osg-edu.cs.wisc.edu
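
The lookup in step 4 can also be scripted. The following is a minimal sketch and not part of the Globus toolkit: remote_line and GRID_HOST are hypothetical names, and it assumes that each command lives at the same path on the remote host as it does locally, which is common for system commands but not guaranteed. It only prints the globus-job-run lines, so it can be run on any machine.

```shell
#!/bin/sh
# Hypothetical helper: print the globus-job-run line for a command, using
# the absolute path discovered on the local machine.
# Assumption: the remote host keeps the command at the same path.
GRID_HOST=${GRID_HOST:-osg-edu.cs.wisc.edu}

remote_line() (
    # "command -v" is the portable equivalent of "which"
    path=$(command -v "$1") || { echo "# $1: not found locally"; exit 0; }
    case $path in
        /*) echo "globus-job-run $GRID_HOST $path" ;;
        *)  echo "# $1: shell builtin, no absolute path" ;;
    esac
)

for cmd in id env ps uptime; do
    remote_line "$cmd"
done
```

Each printed line can then be pasted at the prompt to run that command on the remote site.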
          

Next, see what else you can learn about the remote system with this approach.

  1. Discover what user ID your job ran under using id.

  2. Discover what environment variables are set using env.

  3. Discover the load on the remote Grid server using uptime.

  4. Discover the default working directory in which your remote job will run using pwd.

    1. Do an ls of this working directory.

    2. Use df to discover how much storage space exists in this working directory.

  5. Use df to discover how much storage space exists in the remote /tmp directory.

    • Can you create a file on the remote system?

    • Can you remove it?

Running Under a Remote Shell

Fully qualified path names are necessary when running commands under globus-job-run because, by default, it does not start a UNIX shell on the remote system. It is the shell that searches for commands in your $PATH and provides many other features, such as input/output redirection (e.g., >foo), pipes (e.g., cmd1|cmd2), and $VAR substitution. However, you can tell globus-job-run to run a shell on the remote Grid site and pass a command string to that remote shell.

Try for example:

$ globus-job-run osg-edu.cs.wisc.edu /bin/sh -c   \
"grep osgedu /etc/passwd | wc -l"
1
  

Experiment for a few minutes with this to try a few shell commands and pipelines.
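
To see what the shell contributes, the difference can be reproduced locally without any Grid software. This is only an illustrative sketch: run_direct and run_in_shell are hypothetical helper names, and run_direct merely imitates a command started without a shell by passing its arguments through unparsed.

```shell
#!/bin/sh
# Run a command with its arguments passed through as-is (no shell parsing).
run_direct() { "$@"; }

# Hand a single command string to /bin/sh, which parses pipes, redirection
# and $VAR references before running anything.
run_in_shell() { /bin/sh -c "$1"; }

# Without a shell, "|" is just another argument to echo:
run_direct /bin/echo one two three '|' wc -w

# With a shell, the same text becomes a two-stage pipeline that counts words:
run_in_shell "/bin/echo one two three | wc -w"

# Quoting decides which shell expands a variable. demo_greeting is set only
# in this (outer) shell and is not exported to the inner one.
demo_greeting=hello
run_in_shell "echo outer: $demo_greeting"    # expanded before sh -c runs
run_in_shell 'echo inner: $demo_greeting'    # expanded by the inner shell
```

The same quoting rule applies to the string given after /bin/sh -c on a globus-job-run command line: double quotes let your local shell expand variables first, while single quotes leave them for the remote shell.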

Common Linux system commands are typically, but not always, found in the same directory on all Linux systems, so you can usually expect the path to be the same on every system. However, application-specific software is usually installed in different places on each system. Virtual organizations often establish conventions for such things, which will be covered in the national grids chapter later.

Immediate and Batch Job Managers

GRAM, the Globus component for running remote jobs, supports the concept of a job manager as an adapter to Local Resource Managers (LRMs). Each site, or collection of resources, can support one or more such job managers. osg-edu.cs.wisc.edu supports two job managers: the fork job manager, which runs a job immediately through the UNIX fork() interface, and the Condor job manager, which is an interface to the Condor batch scheduling system.

Now we will investigate some of the differences between the fork and Condor job managers. Which do you think will be faster? Use the time command to find out.

To time a command, enter time commandname:

terminable$ time sleep 3
real    0m3.007s
user    0m0.004s
sys     0m0.000s

Use this to time a few trivial Grid jobs to compare Fork and Condor:

terminable$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
terminable$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-fork /bin/hostname

The fork job manager is very fast: it has low scheduling latency and runs trivial commands very quickly. But it also has very little compute power, since it's usually just a single CPU on a cluster-controlling computer called the head node. A batch job manager (such as the Condor job manager), on the other hand, has a higher scheduling overhead but gives you access to all the computers in a cluster and the opportunity to do real parallel computing.

While our lab hosts use the Condor scheduler, other systems use other schedulers. For example, systems using the Portable Batch System (PBS) require that you specify jobmanager-pbs. You'll get a chance to try this later in this lab.

Look at the hostnames that are returned by the job execution commands above. jobmanager-fork will always return osg-edu.cs.wisc.edu because fork jobs always run on the head node of the cluster. jobmanager-condor jobs will often run on other computers within the cluster, so you may see a different hostname.

Next try starting five such jobs on gridlab2 at once: put an "&" at the end of the line, and either use cut-and-paste, or shell command history, or a simple shell script to run five of these commands at once.
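
One way to write the "simple shell script" is sketched below. run_many is a hypothetical helper name, and /bin/echo is used as a local stand-in so that the sketch runs on any machine; on the lab machine you would pass a globus-job-run command line instead.

```shell
#!/bin/sh
# Launch N copies of a command in the background, then wait for all of them.
# Usage: run_many N command [args...]
run_many() (
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &            # "&" puts each copy in the background
        i=$((i + 1))
    done
    wait                  # block until every background copy has finished
)

# Local stand-in for a Grid job:
run_many 5 /bin/echo started
```

For example, run_many 5 globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname would submit five copies of the timing job used earlier. Because the copies run concurrently, their output lines may appear in any order.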

Introduction to the example prime number finding application

Throughout this tutorial we will use a simple application that tests whether a number is prime (see Wikipedia and The Prime Pages for background).

The prime testing application is invoked using the primetest command. In its simplest use, it takes a single parameter: the number to test.

For example, this command will test the number 122 for primality.

$ /home/benc//primetest 122
NO - 2 is a factor
  

As you can see, the number is not prime, because the application determined that 2 is a factor.

Timing the prime application

This algorithm can take some time. Use time to measure how long the command takes. The result will vary a lot, depending on several factors: for example, the size of the number, the structure of its factors, and how many other people are running jobs on the same computer.

Measure how long it takes to test the integers 3, 524287, 524288, and 1500450271.

$ time /home/benc//primetest 524287

Running the prime test with multiple invocations

The prime testing application uses a simple algorithm to test for primes: each possible factor below the target number is tested. So when primetest is run with input 122, it tests every number between 2 and 121 in sequence. This is a very simple example of a parameter sweep.

We can split this sweep into several separate pieces, and combine the results:

$ /home/benc//primetest 122 2 50
NO - 2 is a factor
$ /home/benc//primetest 122 51 121
NO - 61 is a factor
  

The first run tested potential factors between 2 and 50, and the second run tested potential factors between 51 and 121.

We can combine the results from several runs of primetest as follows: A number is prime if no run of primetest finds a factor. A number is not prime if any run of primetest finds a factor. So in the case of 122, we can see that 122 is not prime because at least one of the pieces returned 'NO'.
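
The split-and-combine rule can be sketched in a few lines of shell. This is not the real primetest: primetest_sketch is a hypothetical local stand-in that mimics the "NO - F is a factor" line shown above, and its "no factor found" message is an assumption, since the real program's output for that case does not appear in this tutorial. split_test is likewise a hypothetical name, and the sketch assumes the number is at least 2.

```shell
#!/bin/sh
# Hypothetical stand-in for: primetest N LO HI
# Prints "NO - F is a factor" for the first factor F found in LO..HI.
primetest_sketch() (
    n=$1; f=$2; hi=$3
    while [ "$f" -le "$hi" ]; do
        if [ $((n % f)) -eq 0 ]; then
            echo "NO - $f is a factor"
            exit 0
        fi
        f=$((f + 1))
    done
    echo "no factor found in $2..$3"   # assumed wording, see the note above
)

# Split the candidate factors 2..N-1 into PIECES ranges, test each range,
# and combine: any "NO" means not prime, otherwise prime.
# Usage: split_test N PIECES   (assumes N >= 2)
split_test() (
    n=$1; pieces=$2
    lo=2; last=$((n - 1))
    span=$(( (last - lo + pieces) / pieces ))   # range size, rounded up
    verdict="prime"
    while [ "$lo" -le "$last" ]; do
        hi=$((lo + span - 1))
        if [ "$hi" -gt "$last" ]; then hi=$last; fi
        case $(primetest_sketch "$n" "$lo" "$hi") in
            NO*) verdict="not prime" ;;
        esac
        lo=$((hi + 1))
    done
    echo "$n: $verdict"
)

split_test 122 3    # prints "122: not prime"
split_test 13 4     # prints "13: prime"
```

Here the pieces run one after another. Since each piece is independent of the others, they could equally be run as separate jobs and their output combined afterwards with the same rule.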

Now test some of the same integers: 3, 524287, 524288, 1500450271, with the sweep divided into several sections.

Running the primes application remotely

The prime application is installed on osg-edu.cs.wisc.edu in the directory /nfs/osg-app/osgedu/. We can use GRAM to execute that application remotely.

Run primetest on osg-edu.cs.wisc.edu using GRAM:

$ globus-job-run osg-edu.cs.wisc.edu /nfs/osg-app/osgedu/primetest 143
NO - 11 is a factor

Can you run the same application on a different grid site? Which of the steps above do you need to repeat? What do you need to change to make them work with that site? What extra information do you need? In a later section, we'll learn how to do this.