Part IV. Data Management

Getting set up

Make some working directories for this exercise. All of the remaining work in this exercise should be done in them.

First make a directory on osgedu.cs.clemson.edu.

$ mkdir dataex
$ cd dataex

Then use GRAM to create a working directory on osgce.cs.wisc.edu:

$ globus-job-run osgce.cs.wisc.edu mkdir /export/osg/data/osgedu/YOURLOGIN

This will give you a directory on osgce.cs.wisc.edu which you can use for storing files in this exercise.

Next create some files of different sizes, to use for exercises:

$ dd if=/dev/zero of=smallfile-YOURLOGIN bs=1M count=10
$ dd if=/dev/zero of=mediumfile-YOURLOGIN bs=1M count=50
$ dd if=/dev/zero of=largefile-YOURLOGIN bs=1M count=200
$ ls -sh
total 261M
201M largefile-YOURLOGIN   51M mediumfile-YOURLOGIN   11M smallfile-YOURLOGIN
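The sizes printed by ls -sh are rounded, so they do not show the exact byte counts. As a rough sketch, the exact sizes can be worked out and checked as follows. The helper names expected_bytes and actual_bytes are made up for this illustration and are not part of any tool used in the exercise.

```shell
#!/bin/sh
# expected_bytes COUNT: size in bytes of a file written by
# "dd bs=1M count=COUNT", where dd's 1M means 1048576 bytes.
expected_bytes() {
    echo $(( $1 * 1048576 ))
}

# actual_bytes FILE: exact size of FILE in bytes, as counted by wc.
actual_bytes() {
    wc -c < "$1" | tr -d ' '
}

expected_bytes 10    # 10485760
expected_bytes 50    # 52428800
expected_bytes 200   # 209715200
```

For example, actual_bytes smallfile-YOURLOGIN should print the same number as expected_bytes 10.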

Moving Files with GridFTP

Use globus-url-copy to move your small file from your working directory on osgedu.cs.clemson.edu to your working directory on osgce.cs.wisc.edu.

$ globus-url-copy file:///home/YOURLOGIN/dataex/smallfile-YOURLOGIN gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/YOURLOGIN/ex1
$ echo $?
0

The command echo $? checks to see what the return value was for the previous command. If you see a 0 (zero), then globus-url-copy succeeded. A different number indicates a problem. In that case you should also see an error message.
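In a script, the same check can be made by branching on the exit status directly instead of printing it. A minimal sketch, in which report_status is a hypothetical helper name rather than a Globus command:

```shell
#!/bin/sh
# report_status is a hypothetical helper (not part of Globus): it runs the
# command given as its arguments and reports on that command's exit status.
report_status() {
    if "$@"; then
        echo "succeeded"
    else
        status=$?
        echo "failed with exit status $status"
        return "$status"
    fi
}

# Usage with the transfer from this exercise:
#   report_status globus-url-copy \
#       file:///home/YOURLOGIN/dataex/smallfile-YOURLOGIN \
#       gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/YOURLOGIN/ex1
```

The helper passes the failing status back to its caller, so it can itself be used in a larger if statement.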

Measuring transfer speed

See how fast the file transfer is by using the -vb flag when copying the large file. Since this is a transfer over a local network [1] that should not be too busy, it should be fairly quick:

$ globus-url-copy -vb file:///home/YOURLOGIN/dataex/largefile-YOURLOGIN gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/YOURLOGIN/ex1
Source: file:///home/YOURLOGIN/dataex/
Dest:   gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/YOURLOGIN/
  largefile-YOURLOGIN  ->  ex1
    207618048 bytes         8.81 MB/sec avg         9.09 MB/sec inst

URL formats

A quick reminder on URL formats: we've seen two kinds of URLs so far.

  • file:///home/YOURLOGIN/dataex/largefile - a file called largefile on the local file system, in directory /home/YOURLOGIN/dataex/.

  • gsiftp://osgce.cs.wisc.edu/scratch/YOURLOGIN/ - the directory /scratch/YOURLOGIN on the host osgce.cs.wisc.edu, accessed via gsiftp.
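The way the two URL forms are put together can be sketched with a pair of small shell functions. The names file_url and gsiftp_url are hypothetical and exist only for this illustration; the functions just assemble strings and do not contact any server.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they only assemble strings.

# file_url ABSOLUTE_PATH: the path already starts with "/", so appending it
# to "file://" gives the three slashes seen in file:///home/...
file_url() {
    printf 'file://%s\n' "$1"
}

# gsiftp_url HOST ABSOLUTE_PATH: the host name, then the path on that host.
gsiftp_url() {
    printf 'gsiftp://%s%s\n' "$1" "$2"
}

file_url /home/YOURLOGIN/dataex/largefile
gsiftp_url osgce.cs.wisc.edu /scratch/YOURLOGIN/
```

Running this prints the two example URLs from the list above.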

Parallel streams

Try using 4 parallel data streams by adding the -p flag with an argument of 4.

Once you are on osgce.cs.wisc.edu, rerun the dd commands from earlier in the exercise to create the small, medium, and large files in your home directory on that site.

Use the following globus-url-copy command to transfer the file from osgedu.cs.clemson.edu to osgce.cs.wisc.edu:

$ globus-url-copy -p 4 -vb file:///home/YOURLOGIN/dataex/smallfile-YOURLOGIN gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/YOURLOGIN/ex1

Experiment with transferring files of different sizes using different numbers of parallel streams, to both local and remote sites, and see how the speed varies.
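One way to organize such an experiment is to generate the command lines first and look them over before running anything. The sketch below is a dry run that only prints commands. The stream counts 1, 2, 4 and 8 are arbitrary choices for illustration, and print_experiments is a hypothetical function name.

```shell
#!/bin/sh
# Dry run: print one globus-url-copy command per combination of file and
# stream count. Nothing is transferred. The stream counts are arbitrary
# values chosen for illustration.
login=YOURLOGIN
dest=gsiftp://osgce.cs.wisc.edu/export/osg/data/osgedu/$login

print_experiments() {
    for size in small medium large; do
        for streams in 1 2 4 8; do
            echo "globus-url-copy -p $streams -vb" \
                "file:///home/$login/dataex/${size}file-$login" \
                "$dest/ex1"
        done
    done
}

print_experiments
```

Once the printed commands look right, they can be run by piping them into a shell, for example print_experiments | sh.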

Third party transfers

Next try a third-party transfer. You do this by specifying two gsiftp URLs, instead of one gsiftp URL and one file URL.

globus-url-copy will control the transfers but data will not pass through the local machine. Instead, it will go directly between the source and destination machines.

Transfer a file between two remote sites, and see if it is faster than if you had transferred it to osgedu.cs.clemson.edu and then back out again.

Try to make up a command line for this yourself, using two gsiftp URLs instead of a file URL and a gsiftp URL.

Reliable File Transfer (RFT)

Next use RFT, the reliable file transfer service, to transfer a block of files between two sites.

First, create a transfer job file, which lists some RFT parameters and all of the files to transfer. You can get an example from /soft/globus-4.0.3-r1/share/globus_wsrf_rft_client/transfer.xfr. Read through this and change the URLs at the end to refer to some files of your choice.

The RFT command and the transfer job file are described in the RFT documentation.

The example above lists one transfer in the last two lines of the file: from the local machine to itself, transferring the file /tmp/rftTest.tmp to rftTest_Done.tmp. You should change the two gsiftp URLs to two other gsiftp URLs. For example, you could use the URLs that were used in the previous GridFTP exercise.
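If you would rather add a transfer from the command line than in an editor, something like the sketch below may work. It assumes that each additional transfer is written as another pair of lines at the end of the file (source URL, then destination URL), in the same layout as the existing pair, and that the file ends with a newline. add_transfer is a hypothetical helper name, and SOURCEHOST and DESTHOST are placeholders.

```shell
#!/bin/sh
# add_transfer is a hypothetical helper: it appends one transfer to a job
# file as two lines, source URL first and destination URL second. This
# assumes extra transfers are listed as further pairs of lines at the end
# of the file, which should be checked against the comments in the file.
add_transfer() {
    jobfile=$1
    printf '%s\n%s\n' "$2" "$3" >> "$jobfile"
}

# Usage, with SOURCEHOST and DESTHOST as placeholders:
#   add_transfer rft.xfr gsiftp://SOURCEHOST/path/to/file \
#       gsiftp://DESTHOST/path/to/copy
```

Work on your own copy of the example file, not on the installed original.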

You can launch an RFT transfer as follows. The client will periodically output the transfer status. You can watch jobs move from the Pending state to the Active state and then to the Finished state.

$ cp /soft/globus-4.0.3-r1/share/globus_wsrf_rft_client/transfer.xfr rft.xfr
$ vi rft.xfr
... make your changes ...
$ rft -h osgedu.cs.clemson.edu -f ./rft.xfr 
Number of transfers in this request: 3
Subscribed for overall status
Termination time to set: 60 minutes

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
0/1/0/0/2

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
1/0/0/0/2

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
1/1/0/0/1

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
2/0/0/0/1

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
2/1/0/0/0

 Overall status of transfer:
Finished/Active/Failed/Retrying/Pending
3/0/0/0/0
All Transfers are completed

Initially all jobs start in the Pending state, move to the Active state, and then ideally to the Finished state. A job that fails goes to the Failed state instead.
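Each counts line in the client output is read against the header Finished/Active/Failed/Retrying/Pending. As a small sketch, one such line can be split into named fields in the shell. parse_status is a hypothetical helper name, not part of the rft client.

```shell
#!/bin/sh
# parse_status is a hypothetical helper: it splits one counts line such as
# "2/1/0/0/0" into the five fields named in the client's header line
# (Finished/Active/Failed/Retrying/Pending).
parse_status() {
    echo "$1" | {
        IFS=/ read -r finished active failed retrying pending
        echo "finished=$finished active=$active failed=$failed retrying=$retrying pending=$pending"
    }
}

parse_status 2/1/0/0/0
```

For the line 2/1/0/0/0 this prints finished=2 active=1 failed=0 retrying=0 pending=0, meaning two transfers are done and the last one is in progress.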

The transfer file has a number of options, documented in-line. You can experiment with changing them. Interesting ones to try:

  • Add more URLs to transfer

  • Transfer between two remote sites

  • Use parallel streams

  • Increase the transfer concurrency

In particular, you should check that you understand the difference between parallel streams (the number of streams used when transferring one file) and concurrency (the number of files that can be transferred at once).
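A quick way to see how the two settings combine is to multiply them. The sketch below is a simplification: it assumes every file in flight uses the full number of parallel streams, so the result is an upper bound on the data streams open at once. max_streams is a hypothetical helper name.

```shell
#!/bin/sh
# max_streams CONCURRENCY PARALLEL: upper bound on data streams open at once
# when CONCURRENCY files are in flight and each uses PARALLEL streams.
max_streams() {
    echo $(( $1 * $2 ))
}

max_streams 1 4   # one file at a time over 4 streams: 4
max_streams 3 1   # three files at once, one stream each: 3
max_streams 3 4   # three files at once, 4 streams each: 12
```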