Now, if everything is set correctly, you should be able to run Grid jobs on the hosts in the lab Grid. First, try to execute a simple hello world job, like this:
$ globus-job-run osg-edu.cs.wisc.edu /bin/echo Hello World
Hello World
You've just submitted a job (the Linux command echo) to run on osg-edu.cs.wisc.edu.
The globus-job-run utility runs commands on remote sites.
It expects them to be given as fully qualified path names (i.e., they must start with a "/"). Let's say we want to run the Linux command hostname on the remote site to verify that we're talking to the resource we think we are.
Run it locally to make sure you are invoking it correctly.
$ hostname
terminable.ci.uchicago.edu
Use the command which to discover the location of the version of hostname that you are using. It will return a fully-qualified path name.
$ which hostname
/bin/hostname
This tells you that to run hostname via globus-job-run, use /bin/hostname.
Use which to discover the location of the following commands on the system:
Now run hostname remotely, on osg-edu.cs.wisc.edu, to verify that you really are reaching a remote system:
$ globus-job-run osg-edu.cs.wisc.edu /bin/hostname
osg-edu.cs.wisc.edu
Next, see what else you can learn about the remote system with this approach.
Discover what user ID your job ran under using id.
Discover what environment variables are set using env.
Discover the load on the remote Grid server using uptime.
Discover the default working directory in which your remote job will run using pwd.
Do an ls of this working directory.
Use df to discover how much storage space exists in this working directory.
Use df to discover how much storage space exists in the remote /tmp directory.
Can you create a file on the remote system?
Can you remove it?
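The which lookups for the tools named in the exercises above can be done in one pass. The following is a minimal sketch, not part of the lab software: find_in_path is a hypothetical helper that walks $PATH the way a shell does, and the loop prints a candidate globus-job-run line for each tool. It assumes each tool lives at the same path on the remote host as on your local machine, which is typical for common system commands but not guaranteed.

```shell
# find_in_path NAME: print the first absolute path in $PATH that holds an
# executable file called NAME. Returns non-zero if nothing is found.
# (Hypothetical helper for illustration; "which" does the same job.)
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # Skip empty or relative entries so the result is always fully qualified.
        case $dir in
            /*) ;;
            *) continue ;;
        esac
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            echo "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Print (but do not run) a remote command line for each exercise tool.
for tool in id env uptime pwd ls df; do
    if tool_path=$(find_in_path "$tool"); then
        echo "globus-job-run osg-edu.cs.wisc.edu $tool_path"
    else
        echo "# $tool was not found in the local PATH"
    fi
done
```

Nothing is submitted to the Grid here; the lines are only printed so that you can review them and run them yourself.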
Fully qualified pathnames are necessary when running commands under globus-job-run because, by default, it does not start a UNIX shell on the remote system, and it's the shell that implements mechanisms like searching for commands in your $PATH variable, along with many other features such as input/output redirection (e.g., >foo), pipes (e.g., cmd1|cmd2), and $VAR substitution. However, you can tell globus-job-run to run a shell for you on the remote Grid site and pass the command string to that remote shell.
For example:
$ globus-job-run osg-edu.cs.wisc.edu /bin/sh -c \
"grep osgedu /etc/passwd | wc -l"
1
Experiment for a few minutes with this to try a few shell commands and pipelines.
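To see what the shell contributes, the sketch below compares running a command directly with running it through /bin/sh -c. It uses only local commands as a stand-in for the remote side, and the variable names are illustrative.

```shell
# Without a shell to interpret it, "|" is just another argument:
# /bin/echo receives it as plain text and prints it back.
no_shell=$(/bin/echo hello '|' wc -c)

# Handed to /bin/sh -c, the same kind of string is parsed as a pipeline.
with_shell=$(/bin/sh -c 'echo hello | tr a-z A-Z')

echo "no shell:   $no_shell"
echo "with shell: $with_shell"

# Quoting decides which side performs $VAR substitution.
literal_cmd='echo $HOME'    # single quotes: the text $HOME is passed along as-is
expanded_cmd="echo $HOME"   # double quotes: your local shell fills in $HOME first

echo "single-quoted: $literal_cmd"
echo "double-quoted: $expanded_cmd"
```

When a command string contains $VAR references that should refer to the remote environment, single quotes keep your local shell from substituting its own values before globus-job-run receives the string.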
Common Linux system commands are typically, but not always, found in the same directory on all Linux systems, so you can usually expect the path to be the same on every system. However, application-specific software is usually installed in different places on each system. Virtual organizations often establish conventions for such things, which will be covered in the national grids chapter later.
GRAM, the Globus component for running remote jobs, supports the concept of a job manager as an adapter to Local Resource Managers (LRMs). Each site - or collection of resources - can support one or more such job managers. osg-edu.cs.wisc.edu supports two job managers: the fork job manager, which runs a job immediately through the UNIX fork() interface, and the Condor job manager, which is an interface to the Condor batch scheduling system.
Now we will investigate some of the differences between the fork and Condor job managers.
Which do you think will be faster? Use the command time to test which job manager is faster.
To time a command, enter time commandname:
terminable$ time sleep 3
real 0m3.007s
user 0m0.004s
sys 0m0.000s
Use this to time a few trivial Grid jobs to compare Fork and Condor:
terminable$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
terminable$ time globus-job-run osg-edu.cs.wisc.edu/jobmanager-fork /bin/hostname
The fork job manager is very fast - it has low scheduling latency. It runs trivial commands very quickly. But it also has no compute power - it's usually just a single CPU on a cluster-controlling computer called the head node. A batch job manager (such as the Condor job manager), on the other hand, has a higher scheduling overhead, but gives you access to all computers in a cluster, and the opportunity to do real parallel computing.
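If you want to compare the two job managers several times, it can help to capture the elapsed time as a number. The following is a rough sketch: elapsed_seconds is a hypothetical helper, it assumes your date command supports the +%s format (GNU and BSD versions do), and it only has whole-second resolution, so use time when you need finer detail.

```shell
# elapsed_seconds CMD [ARGS...]: run the command, discard its output,
# and print the wall-clock time it took in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Local stand-in so the sketch runs without Grid access.
echo "sleep 1 took $(elapsed_seconds sleep 1) second(s)"

# On the lab machine, the comparison might look like this:
#   elapsed_seconds globus-job-run osg-edu.cs.wisc.edu/jobmanager-fork /bin/hostname
#   elapsed_seconds globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
```

Because the helper discards the job's output, it is only useful for timing; run the command without it when you want to see the hostname.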
While our lab hosts use the Condor scheduler, other systems use other schedulers. For example, systems using the Portable Batch System (PBS) require that you specify jobmanager-pbs. You'll get a chance to try this later in this lab.
Look at the hostnames that are returned by the job execution commands above. jobmanager-fork will always return osg-edu.cs.wisc.edu because fork jobs always run on the head node of a cluster. jobmanager-condor jobs will often run on other computers within the cluster, so you may see a different hostname.
Next try starting five such jobs on gridlab2 at once: put an "&" at the end of the line, and either use cut-and-paste, or shell command history, or a simple shell script to run five of these commands at once.
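One way to write the "simple shell script" option is sketched below. run_parallel is a hypothetical helper, not a Globus command: it starts several copies of a command in the background with "&" and then waits for all of them to finish.

```shell
# run_parallel COUNT CMD [ARGS...]: start COUNT copies of the command
# in the background, then block until every copy has exited.
run_parallel() {
    copies=$1
    shift
    i=1
    while [ "$i" -le "$copies" ]; do
        "$@" &
        i=$((i + 1))
    done
    # With no arguments, wait blocks on all background jobs of this shell.
    wait
}

# Local stand-in so the sketch runs without Grid access.
run_parallel 5 echo "job finished"

# On the lab machine, substitute the contact string for your target site:
#   run_parallel 5 globus-job-run osg-edu.cs.wisc.edu/jobmanager-condor /bin/hostname
```

The output lines can arrive in any order, because the copies run concurrently.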
Throughout this tutorial we will use a simple application that tests whether a number is prime (see Wikipedia and The Prime Pages).
The prime testing application is invoked using the primetest command. In its simplest use, it takes a single parameter: the number to test.
For example, this command will test the number 122 for primality.
$ /home/benc//primetest 122
NO - 2 is a factor
As you can see, the number is not prime, because the application determined that 2 is a factor.
This algorithm can take some time. Use time to measure how long the command takes. This time will vary a lot, depending on several factors - for example, the size of the number, the structure of its factors, and how many other people are running jobs on the same computer.
Test how long it takes to test the integers: 3, 524287, 524288, and 1500450271.
$ time /home/benc//primetest 524287
The prime testing application uses a simple algorithm to test for primes - each possible factor below the target number is tested. So when primetest is run with input 122, it tests every number between 2 and 121 in sequence. This is a very simple example of a parameter sweep.
We can split this sweep into several separate pieces, and combine the results:
$ /home/benc//primetest 122 2 50
NO - 2 is a factor
$ /home/benc//primetest 122 51 121
NO - 61 is a factor
The first run tested potential factors between 2 and 50, and the second run tested potential factors between 51 and 121.
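The split points do not have to be chosen by hand. As a rough sketch, the hypothetical helper sweep_ranges below divides the candidate factors 2 through n-1 into a given number of roughly equal pieces and prints one "low high" pair per line. It produces even pieces, so its output for 122 differs from the hand-picked 2-50 and 51-121 split used above.

```shell
# sweep_ranges N PIECES: split the candidate factors 2..N-1 into PIECES
# contiguous ranges of (nearly) equal size, one "low high" pair per line.
# Assumes PIECES is at least 1.
sweep_ranges() {
    n=$1
    pieces=$2
    lo=2
    last=$((n - 1))
    count=$((last - lo + 1))
    # Round up so that the ranges cover every candidate.
    size=$(( (count + pieces - 1) / pieces ))
    while [ "$lo" -le "$last" ]; do
        hi=$((lo + size - 1))
        if [ "$hi" -gt "$last" ]; then
            hi=$last
        fi
        echo "$lo $hi"
        lo=$((hi + 1))
    done
}

sweep_ranges 122 2
# prints:
#   2 61
#   62 121
```

Each printed pair can be passed as the second and third arguments to primetest, in the same way as the 2 50 and 51 121 arguments above.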
We can combine the results from several runs of primetest as follows: A number is prime if no run of primetest finds a factor. A number is not prime if any run of primetest finds a factor. So in the case of 122, we can see that 122 is not prime because at least one of the pieces returned 'NO'.
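The combining rule can be sketched as a small filter. combine_results is a hypothetical helper that reads the collected output of several primetest runs on standard input. Only the "NO - ... is a factor" output format is shown in this tutorial, so the sketch assumes that any line not starting with NO means that piece found no factor.

```shell
# combine_results: read primetest output lines on stdin and apply the rule
# "not prime if any piece reports a factor, prime otherwise".
# Assumption: a piece that finds a factor prints a line starting with "NO".
combine_results() {
    if grep -q '^NO'; then
        echo "NOT PRIME"
    else
        echo "PRIME"
    fi
}

# The two pieces for 122 shown above, fed in as sample text.
printf 'NO - 2 is a factor\nNO - 61 is a factor\n' | combine_results
```

Note that empty input is also reported as PRIME, so a piece that failed to run would go unnoticed; check that every piece actually produced output before trusting the combined answer.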
Now test some of the same integers: 3, 524287, 524288, 1500450271, with the sweep divided into several sections.
The prime application is installed on osg-edu.cs.wisc.edu in the directory /nfs/osg-app/osgedu/. We can use GRAM to execute that application remotely.
Run primetest on osg-edu.cs.wisc.edu using GRAM:
$ globus-job-run osg-edu.cs.wisc.edu /nfs/osg-app/osgedu/primetest 143
NO - 11 is a factor
Can you run the same application on a different grid site? Which steps above do you need to do? What do you need to change from the above to make them work with osg-edu.cs.wisc.edu? What extra information do you need? In a later section, we'll learn about how to do this.