1. Overview

This guide explains how to run Swift on various system types, with specifics for installations on which Swift is currently used. For a given system type, most instructions should work on any system of that type; however, details such as queue names and file system locations will need to be customized by the user.

2. Prerequisites

This guide assumes that you have already downloaded and installed Swift. It assumes that Swift is in your PATH and that you have a working version of Sun Java 1.5+. For more information on downloading and installing Swift, see the Swift Quickstart Guide.

3. Beagle (Cray XE6)

Beagle is a Cray XE6 supercomputer at UChicago. It employs a batch-oriented computational model wherein a PBS scheduler accepts users' jobs and places them in a queue for execution.

This computational model requires a user to prepare submit files, track job submissions, checkpoint, manage input/output data, and handle exceptional conditions manually. Running Swift on Beagle can accomplish these tasks with minimal manual intervention and maximal opportunistic use of computation time on Beagle queues. The following sections discuss the specifics of running Swift on Beagle. More detailed information about Swift and its workings can be found on the Swift documentation page:

http://www.ci.uchicago.edu/swift/wwwdev/docs/index.php

More information on Beagle can be found on UChicago Beagle website here:

http://beagle.ci.uchicago.edu

3.1. Requesting Access

If you do not already have a Computation Institute (CI) account, you can request one at https://www.ci.uchicago.edu/accounts/. This page will give you a list of resources you can request access to.

If you already have an existing CI account, but do not have access to Beagle, send an email to support@ci.uchicago.edu to request access.

3.2. Connecting to a login node

Once you have an account, you should be able to access a Beagle login node with the following command:

ssh yourusername@login.beagle.ci.uchicago.edu

3.3. Getting Started with Swift

Follow the steps outlined below to get started with Swift on Beagle:

step 1. Load the Swift and Sun Java modules on Beagle as follows: module load swift sun-java

step 2. Create a directory for your Swift-related work and change to it (for example, mkdir swift-lab followed by cd swift-lab).

step 3. To get started with a simple example that runs /bin/cat to read an input file data.txt and write to an output file f.nnn.out, start by writing a simple Swift source script (catsn.swift) as follows:

type file;

/* App definition */
app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file out[]<simple_mapper; location="outdir", prefix="f.",suffix=".out">;
file data<"data.txt">;

/* App invocation: n times */
foreach j in [1:@toint(@arg("n","1"))] {
  out[j] = cat(data);
}

step 4. The next step is to create a sites file. An example sites file (sites.xml) is shown as follows:

<config>
  <pool handle="pbs">
    <execution provider="coaster" jobmanager="local:pbs"/>
    <!-- replace with your project -->
    <profile namespace="globus" key="project">CI-CCR000013</profile>

    <profile namespace="globus" key="providerAttributes">
                     pbs.aprun;pbs.mpp;depth=24</profile>

    <profile namespace="globus" key="jobsPerNode">24</profile>
    <profile namespace="globus" key="maxTime">1000</profile>
    <profile namespace="globus" key="slots">1</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>

    <profile namespace="karajan" key="jobThrottle">.63</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="local"/>
    <!-- replace this with your home on lustre -->
    <workdirectory >/lustre/beagle/ketan/swift.workdir</workdirectory>
  </pool>
</config>

step 5. In this step, create the config and tc files. The config file (cf) is as follows:

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=1
lazy.errors=true
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=100
provenance.log=false

The tc file (tc) is as follows:

pbs cat /bin/cat null null null

More about config and tc file options can be found in the swift userguide here: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_swift_configuration_properties

step 6. Run the example using the following command line:

swift -config cf -tc.file tc -sites.file sites.xml catsn.swift -n=1

You can change the value of -n to any number to run that many concurrent cat tasks.

step 7. Swift will report a final status of "Finished successfully" once the job has completed its run in the queue. Check the output in the generated outdir directory (ls outdir):

Swift 0.93RC5 swift-r5285 cog-r3322

RunID: 20111218-0246-6ai8g7f0
Progress:  time: Sun, 18 Dec 2011 02:46:33 +0000
Progress:  time: Sun, 18 Dec 2011 02:46:42 +0000  Active:1
Final status:  time: Sun, 18 Dec 2011 02:46:43 +0000  Finished successfully:1

3.4. Larger Runs on Beagle

A key factor in scaling up Swift runs on Beagle is setting the sites.xml parameters correctly. The following sites.xml parameters must be set to match the scale intended for a large run:

  • maxTime : The expected walltime for completion of your run. This parameter is accepted in seconds.

  • slots : This parameter specifies the maximum number of pbs jobs/blocks that the coaster scheduler will have running at any given time. On Beagle, this number will determine how many qsubs swift will submit for your run. Typical values range between 40 and 60 for large runs.

  • nodeGranularity : Determines the number of nodes per job. It restricts the number of nodes in a job to a multiple of this value. The total number of workers will then be a multiple of jobsPerNode * nodeGranularity. For Beagle, jobsPerNode value is 24 corresponding to its 24 cores per node.

  • maxNodes : Determines the maximum number of nodes a job may pack into its qsub. This parameter determines the largest single job that your run will submit.

  • jobThrottle : A factor that determines the number of tasks dispatched simultaneously. The intended number of simultaneous tasks must match the number of cores targeted. The number of tasks is calculated from the jobThrottle factor as follows:

Number of Tasks = (jobThrottle x 100) + 1
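The relationship between pool size and jobThrottle can be checked with a small shell helper. This is a sketch, not part of Swift: the function name throttle_for is hypothetical, and it simply multiplies slots, nodes per slot (maxNodes), and jobsPerNode, then inverts the formula above to find the jobThrottle that yields that many tasks.

```shell
#!/bin/sh
# Hypothetical helper: compute the total cores a Beagle pool will use and
# the jobThrottle value that dispatches about that many tasks at once.
# Usage: throttle_for SLOTS NODES_PER_SLOT JOBS_PER_NODE
throttle_for() {
  slots=$1
  nodes=$2
  jpn=$3
  # total concurrent workers the pool can run
  cores=$((slots * nodes * jpn))
  # invert Number of Tasks = (jobThrottle x 100) + 1
  #   =>   jobThrottle = (tasks - 1) / 100
  awk -v c="$cores" 'BEGIN { printf "cores=%d jobThrottle=%.2f\n", c, (c - 1) / 100 }'
}
```

For the 200-node example below, throttle_for 50 4 24 prints cores=4800 jobThrottle=47.99, which is close to the 48.00 used there.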

Following is an example sites.xml for a 50 slots run with each slot occupying 4 nodes (thus, a 200 node run):

<config>
  <pool handle="pbs">
    <execution provider="coaster" jobmanager="local:pbs"/>
    <profile namespace="globus" key="project">CI-CCR000013</profile>

    <profile namespace="globus" key="ppn">24:cray:pack</profile>

    <!-- For swift 0.93
    <profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
    -->

    <profile namespace="globus" key="jobsPerNode">24</profile>
    <profile namespace="globus" key="maxTime">50000</profile>
    <profile namespace="globus" key="slots">50</profile>
    <profile namespace="globus" key="nodeGranularity">4</profile>
    <profile namespace="globus" key="maxNodes">4</profile>

    <profile namespace="karajan" key="jobThrottle">48.00</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="local"/>
    <workdirectory >/lustre/beagle/ketan/swift.workdir</workdirectory>
  </pool>
</config>

3.5. Troubleshooting

This section discusses some common issues, and their remedies, when using Swift on Beagle. These issues can originate in Swift, in Beagle's configuration or state, or in the user's configuration, among other factors. The most common known issues are addressed here:

  • Command not found: Swift is installed on Beagle as a module. If you see the following error message:

If 'swift' is not a typo you can run the following command to lookup the package that contains the binary:
    command-not-found swift
-bash: swift: command not found

The most likely cause is that the module is not loaded. Load the Swift module as follows:

$ module load swift sun-java
Swift version swift-0.93 loaded
sun-java version jdk1.7.0_02 loaded
  • Failed to transfer wrapperlog for job cat-nmobtbkk and/or Job failed with an exit code of 254. Check the <workdirectory> element in the sites.xml file:

<workdirectory >/home/ketan/swift.workdir</workdirectory>

It is likely set to a path to which the compute nodes cannot write, e.g., your /home directory. The remedy for this error is to set your workdirectory to a path under /lustre, which the compute nodes can write to:

<workdirectory >/lustre/beagle/ketan/swift.workdir</workdirectory>
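This misconfiguration can be caught before submitting a run. The sketch below extracts the <workdirectory> value from sites.xml with sed and warns if it is not under /lustre. check_workdir is a hypothetical helper, and it assumes the element sits on a single line, as in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: warn if the <workdirectory> in sites.xml is not on
# /lustre, the file system the Beagle compute nodes can write to.
check_workdir() {
  sites=${1:-sites.xml}
  # text between <workdirectory ...> and </workdirectory> (single-line element)
  wd=$(sed -n 's:.*<workdirectory[^>]*>\(.*\)</workdirectory>.*:\1:p' "$sites" | head -n 1)
  case $wd in
    /lustre/*)
      echo "ok: $wd" ;;
    "")
      echo "no <workdirectory> found in $sites" >&2
      return 1 ;;
    *)
      echo "warning: $wd is not under /lustre; compute nodes may not be able to write there" >&2
      return 1 ;;
  esac
}
```

Run it as check_workdir sites.xml from the directory holding your configuration; a non-zero exit status means the path should be changed.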

4. Fusion (x86 cluster)

Fusion is a 320-node computing cluster at Argonne's Laboratory Computing Resource Center (LCRC). The primary goal of the LCRC is to facilitate mid-range computing in all of the scientific programs of Argonne and the University of Chicago.

This section will walk you through running a simple Swift script on Fusion.

4.1. Requesting Access

If you do not already have a Fusion account, you can request one at https://accounts.lcrc.anl.gov/request.php. Email support@lcrc.anl.gov for additional help.

4.2. Projects

In order to run a job on a Fusion compute node, you must first be associated with a project.

Each project has one or more Principal Investigators, or PIs. These PIs are responsible for adding users to and removing users from a project. Contact the PI of your project to be added.

More information on this process can be found at http://www.lcrc.anl.gov/info/Projects.

4.3. SSH Keys

Before accessing Fusion, make sure your SSH keys are configured correctly; SSH keys are required to access Fusion. You should see information about this when you request your account. Check the SSH FAQ or email support@lcrc.anl.gov for additional help.

4.4. Connecting to a login node

Once your keys are configured, you should be able to access a Fusion login node with the following command:

ssh <yourusername>@fusion.lcrc.anl.gov

4.5. Creating sites.xml

Swift uses various configuration files to determine how to run an application. This section will provide a working configuration file which you can copy and paste to get running quickly. The sites.xml file tells Swift how to submit jobs, where working directories are located, and various other configuration information. More information on sites.xml can be found in the Swift user guide.

The first step is to paste the text below into a file named sites.xml.

<config>
<pool handle="fusion">
  <execution jobmanager="local:pbs" provider="coaster" url="none"/>
  <filesystem provider="local" url="none" />
  <profile namespace="globus" key="maxtime">750</profile>
  <profile namespace="globus" key="jobsPerNode">1</profile>
  <profile namespace="globus" key="slots">1</profile>
  <profile namespace="globus" key="nodeGranularity">1</profile>
  <profile namespace="globus" key="maxNodes">2</profile>
  <profile namespace="globus" key="queue">shared</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <workdirectory>_WORK_</workdirectory>
</pool>
</config>

This file requires one customization. Create a directory called swiftwork, then change _WORK_ in sites.xml to point to that directory. For example:

<workdirectory>/tmp/swiftwork</workdirectory>
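If you prefer not to edit the file by hand, the customization can be scripted. This is a minimal sketch, assuming a POSIX shell and a sites.xml that still contains the literal _WORK_ placeholder shown above; set_workdir is a hypothetical helper name, and /tmp/swiftwork is only an example path.

```shell
#!/bin/sh
# Hypothetical helper: create the work directory and substitute it for the
# _WORK_ placeholder in sites.xml.
# Usage: set_workdir WORKDIR [SITES_FILE]
set_workdir() {
  workdir=$1
  sites=${2:-sites.xml}
  mkdir -p "$workdir" || return 1
  # write to a temporary file and move it back, which works with both GNU and BSD sed
  sed "s|_WORK_|$workdir|" "$sites" > "$sites.tmp" && mv "$sites.tmp" "$sites"
}
```

For example, run set_workdir /tmp/swiftwork from the directory containing sites.xml.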

4.6. Creating tc

The tc configuration file gives information about the applications that will be called by Swift. More information about the format of tc can be found in the Swift user guide.

Paste the following example into a file named tc

fusion  echo            /bin/echo       INSTALLED       INTEL32::LINUX
fusion  cat             /bin/cat        INSTALLED       INTEL32::LINUX
fusion  ls              /bin/ls         INSTALLED       INTEL32::LINUX
fusion  grep            /bin/grep       INSTALLED       INTEL32::LINUX
fusion  sort            /bin/sort       INSTALLED       INTEL32::LINUX
fusion  paste           /bin/paste      INSTALLED       INTEL32::LINUX
fusion  wc              /usr/bin/wc     INSTALLED       INTEL32::LINUX

4.7. Copy a Swift Script

Within the Swift directory is an examples directory containing several introductory Swift scripts. The example used in this section is called catsn.swift. The script copies the contents of an input file to another file using the Unix cat utility. Copy this script, along with its input file data.txt, to the directory where your sites.xml and tc files are located.

$ cp ~/swift/examples/misc/catsn.swift .
$ cp ~/swift/examples/misc/data.txt .
Tip: The location of your swift directory may vary depending on how you installed it. Adjust the path to point to the examples/misc directory of your installation as needed.

4.8. Run Swift

Finally, run the script

$ swift -sites.file sites.xml -tc.file tc catsn.swift

You should see 10 new text files get created, named catsn*.out. If you see these files, then you have successfully run Swift on Fusion!

4.9. Queues

Fusion has two queues: shared and batch. The shared queue has a maximum walltime of 1 hour and is limited to 4 nodes. The batch queue is for all other jobs.

Edit your sites.xml file and edit the queue option to modify Swift’s behavior. For example:

<profile namespace="globus" key="queue">batch</profile>

More information on Fusion queues can be found at http://www.lcrc.anl.gov/info/BatchJobs.

4.10. More Help

The best place for additional help is the Swift user mailing list. You can subscribe to this list at swift-user. When submitting information, send your sites.xml file, your tc file, and any Swift log files that were created during your attempt.

5. Futuregrid Cloud

The NSF-funded FutureGrid cloud is administered by Indiana University. It offers a variety of resources via a multitude of interfaces. Currently, it offers cloud resources via three different interfaces: Eucalyptus, Nimbus (www.nimbusproject.org), and OpenStack (www.openstack.org). In total, FutureGrid provides close to 5000 CPU cores and 220 TB of storage across more than six physical clusters. We use the resources offered by one such cluster via the Nebula middleware.

More information on FutureGrid can be found here.

5.1. Requesting Futuregrid Access

If you do not already have a FutureGrid account, you can follow the instructions here to get started. This page provides information on how to create an account, how to join a project, how to set up your SSH keys, and how to create a new project.

5.2. Downloading Swift VM Tools

A set of scripts based around cloudinitd is used to start virtual machines easily. To download them, change to your home directory and run the following command:

$ svn co https://svn.ci.uchicago.edu/svn/vdl2/usertools/swift-vm-boot swift-vm-boot

5.3. Download your Credentials

Run the following commands to retrieve your credentials:

$ cd swift-vm-boot
$ scp yourusername@hotel.futuregrid.org:nimbus_creds.tar.gz .
$ tar xvfz nimbus_creds.tar.gz

When you extract your credential file, look at the file called hotel.conf. Near the bottom of this file will be two settings called vws.repository.s3id and vws.repository.s3key. Copy these values for the next step.

5.4. Configuring coaster-service.conf

To run on FutureGrid, you will need a file called coaster-service.conf. This file contains many options that control how things run. Here is an example of a working coaster-service.conf on FutureGrid:

# Where to copy worker.pl on the remote machine for sites.xml
export WORKER_LOCATION=/tmp

# How to launch workers: local, ssh, cobalt, or futuregrid
export WORKER_MODE=futuregrid

# Do all the worker nodes you're using have a shared filesystem? (yes/no)
export SHARED_FILESYSTEM=no

# Username to use on worker nodes
export WORKER_USERNAME=root

# Enable SSH tunneling? (yes/no)
export SSH_TUNNELING=yes

# Directory to keep log files, relative to working directory when launching start-coaster-service
export LOG_DIR=logs

# Location of the swift-vm-boot scripts
export SWIFTVMBOOT_DIR=$HOME/swift-vm-boot

# Futuregrid settings
export FUTUREGRID_IAAS_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXX
export FUTUREGRID_IAAS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
export FUTUREGRID_HOTEL_NODES=0
export FUTUREGRID_SIERRA_NODES=2
export FUTUREGRID_CPUS_PER_NODE=1

# Swift information for creating sites.xml
export WORK=/tmp
export JOBS_PER_NODE=$FUTUREGRID_CPUS_PER_NODE
export JOB_THROTTLE=$( echo "scale=5; ($JOBS_PER_NODE * $(($FUTUREGRID_HOTEL_NODES + $FUTUREGRID_SIERRA_NODES)))/100 - 0.00001"|bc )

# Application locations for tc.data
#app convert=/usr/bin/convert

Paste your credentials from the hotel.conf file into the FUTUREGRID_IAAS_ACCESS_KEY and FUTUREGRID_IAAS_SECRET_KEY fields. Adjust the number of nodes you would like to allocate here by changing the values of FUTUREGRID_HOTEL_NODES and FUTUREGRID_SIERRA_NODES. Add a list of any applications you want to run in the format "#app myapp=/path/to/app".
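Copying the two values by hand is error-prone, so this step can also be scripted. The sketch assumes hotel.conf stores the settings as key=value lines (for example vws.repository.s3id=...), which you should confirm against your own file; fill_creds is a hypothetical helper, not part of swift-vm-boot.

```shell
#!/bin/sh
# Hypothetical helper: copy the S3 id/key from hotel.conf into coaster-service.conf.
# ASSUMPTION: hotel.conf holds "vws.repository.s3id=..." and
# "vws.repository.s3key=..." lines.
fill_creds() {
  conf=${1:-hotel.conf}
  target=${2:-coaster-service.conf}
  # keep everything after the first '=' (the values may themselves contain '=')
  id=$(grep '^vws\.repository\.s3id=' "$conf" | head -n 1 | cut -d= -f2-)
  key=$(grep '^vws\.repository\.s3key=' "$conf" | head -n 1 | cut -d= -f2-)
  if [ -z "$id" ] || [ -z "$key" ]; then
    echo "s3id/s3key not found in $conf" >&2
    return 1
  fi
  # '|' is the sed delimiter because secret keys can contain '/';
  # values are assumed not to contain '|', '&' or '\'
  sed -e "s|^export FUTUREGRID_IAAS_ACCESS_KEY=.*|export FUTUREGRID_IAAS_ACCESS_KEY=$id|" \
      -e "s|^export FUTUREGRID_IAAS_SECRET_KEY=.*|export FUTUREGRID_IAAS_SECRET_KEY=$key|" \
      "$target" > "$target.tmp" && mv "$target.tmp" "$target"
}
```

Run it from the swift-vm-boot directory after extracting your credentials, e.g. fill_creds hotel.conf coaster-service.conf; the node counts and other settings in coaster-service.conf are left untouched.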

5.5. Starting the Coaster Service Script

Now that everything is configured, change to the location of the coaster-service.conf file and run this command to start the coaster service:

$ start-coaster-service

This command will start the VMs, start the required processes on the worker nodes, and generate Swift configuration files for you to use. The configuration files will be generated in your current directory. These files are sites.xml, tc.data, and cf.

5.6. Running Swift

Now that you have all of your configuration files generated, run the following command:

$ swift -sites.file sites.xml -tc.file tc.data -config cf <yourscript.swift>

If you would like to create a custom tc and/or cf file for repeated use, rename it to something other than tc.data/cf to prevent it from being overwritten. The sites.xml file, however, will need to be regenerated every time you start the coaster service. If you need to repeatedly modify some sites.xml options, you may edit the template in Swift's etc/sites/persistent-coasters. You may also create your own custom tc files with the hostname of persistent-coasters. More information about this can be found in the Swift userguide at http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html.

5.7. Stopping the Coaster Service Script

To stop the coaster service, run the following command:

$ stop-coaster-service

This will kill the coaster service, kill the worker scripts on remote systems and terminate the virtual machines that were created during start-coaster-service.

5.8. More Help

The best place for additional help is the Swift user mailing list. You can subscribe to this list at http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When submitting information, send your sites.xml file, your tc.data, and any error messages you run into.

6. Grids: Open Science Grid and TeraGrid

6.1. Overview of running on grid sites

  • Get a DOEGrids cert. Then register it in the OSG Engage VO, and/or map it using gx-request on TeraGrid sites.

  • Run GridSetup to configure Swift to use the grid sites. This tests for correct operation and creates a "green list" of good sites.

  • Prepare an installation package for the programs you want to run on grid sites via Swift, and install that package using foreachsite.

  • Run swift-workers to start and maintain a pool of Swift workers on each site.

  • Run Swift scripts that use the grid site resources.

Note: This revision only supports a single-entry sites file which uses provider staging, and assumes that the necessary apps are locatable through the same tc entries (i.e., either absolute or PATH-relative paths) on all sites.
Note: This revision has been tested using the bin/grid code from trunk (which gets installed into trunk's bin/ directory) and the base Swift code from the 0.93 branch. No other configurations have been tested at the moment. I intend to put this code in bin/grid in 0.93, as it should have no ill effects on other Swift usage.

6.2. Requesting Access

For OSG: Obtain a DOEGrids certificate and register the certificate in the OSG "Engage" VO following the procedure at:

FIXME: access to OSG wiki pages may request the user to present a certificate. Is this a problem from users without one? If so, make a copy of the page on the Swift web.

For TeraGrid: Obtain a DOEGrids certificate using the OSG Engage instructions above. Ask a TeraGrid PI to add you to a TeraGrid project. Once you obtain a login and project access (via US Mail), use gx-request to add your certificate.

Detailed step-by-step instructions for requesting your certificate and installing it in the browser and on the client machine are as follows:

Step1. Apply for a certificate: https://pki1.doegrids.org/ca/; use ANL as affiliation (registration authority) in the form.

Step2. When you receive your certificate via a link by mail, download and install it in your browser; this has been tested with Firefox on Linux and Mac, and with Chrome on Mac.

In Firefox, when you click the link you received in the mail, Firefox will prompt you to install the certificate: set a passphrase and click Install. Next, back up this certificate in .p12 form, via Preferences > Advanced > Encryption > View Certificates > Your Certificates.

Step3. Install DOE CA and ESnet root CA into your browser by clicking the top left links on this page: http://www.doegrids.org/

Step4. Go to the Engage VO registration point here: https://osg-engage.renci.org:8443/vomrs/Engage/vomrs from the same browser that has the above certs installed. Also see https://twiki.grid.iu.edu/bin/view/Engagement/EngageNewUserGuide for more details.

Step5. To install the certificate on the client machine from which you will access OSG resources, place the certificate from your browser in the client's ~/.globus directory. The certificate must be in the form of .pem files, with separate .pem files for the key and the certificate. For this conversion, use the .p12 file backed up above as follows:

$ openssl pkcs12 -in your.p12 -out usercert.pem -nodes -clcerts -nokeys
$ openssl pkcs12 -in your.p12 -out userkey.pem -nodes -nocerts
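After the conversion, the two files still have to be placed in ~/.globus. Below is a sketch of that step, assuming the PEM files are in the current directory. The permissions used here (certificate 644, private key 400) follow the common Globus convention that the key be readable only by its owner, and install_globus_creds is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: install usercert.pem and userkey.pem (produced by the
# openssl commands above) into ~/.globus with conventional permissions.
install_globus_creds() {
  src=${1:-.}
  dest="$HOME/.globus"
  mkdir -p "$dest" || return 1
  # remove old copies first: an existing read-only key may not be overwritable
  rm -f "$dest/usercert.pem" "$dest/userkey.pem"
  cp "$src/usercert.pem" "$dest/usercert.pem" || return 1
  # copy the key under a restrictive umask so it is never group/world readable
  (umask 077 && cp "$src/userkey.pem" "$dest/userkey.pem") || return 1
  chmod 644 "$dest/usercert.pem"   # certificate: public, readable by all
  chmod 400 "$dest/userkey.pem"    # private key: readable only by you
}
```

Running install_globus_creds in the directory where you ran openssl leaves the credentials where voms-proxy-init looks for them by default.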

Step6. Test it:

$ voms-proxy-init --voms Engage -hours 48

To run jobs using the procedure described here, you need to be logged in to a "submit host" on which you will run Swift and other grid-related utilities. A submit host is any host where the OSG client stack or equivalent tools are installed. Such hosts include the OSG Engage submit host (engage-submit3.renci.org), and the two Swift lab servers {bridled,communicado}.ci.uchicago.edu.

Obtain a login on engage-submit3.renci.org following instructions on the OSG URL above.

Obtain a CI login with access to the Swift lab servers by requesting "OSG Gridlab" access at:

6.3. Connecting to a submit host

ssh yourusername@bridled.ci.uchicago.edu
ssh yourusername@communicado.ci.uchicago.edu
ssh yourusername@engage-submit3.renci.org

6.4. Downloading and installing Swift

The current version of Swift can be downloaded from http://www.ci.uchicago.edu/swift/downloads/index.php.

Fetch and untar the latest release. Then add the Swift bin/ directory to your PATH. For example:

cd $HOME
wget http://www.ci.uchicago.edu/swift/packages/swift-0.92.1.tar.gz
tar xzf swift-0.92.1.tar.gz
export PATH=$PATH:$HOME/swift-0.92.1/bin

6.5. Set up OSG environment

Depending on your shell type, run:

source /opt/osg-<version>/setup.sh
or
source /opt/osg-<version>/setup.csh
Note: The above step is not required on the engage-submit3 host.

6.6. Create a VOMS Grid proxy

$ voms-proxy-init -voms Engage -valid 72:00

6.7. Generating Configuration Files

cd $HOME
mkdir swiftgrid
cd swiftgrid
gen_gridsites
# Wait a few minutes to a few hours for Swift to validate grid sites
get_greensites >greensites

You can repeatedly try the get_greensites command, which simply concatenates the names of all sites that successfully returned an output file from the site tests.

6.8. Installing software on OSG sites

The command "foreachsite" executes a local shell script, passed to it as an argument, on each OSG site in the Engage VO ReSS site catalog. The user's install script will execute either on the head node (as a GRAM "fork" job) or on a worker node, as controlled by the -resource option. Its syntax is:

$ ./foreachsite -help
./foreachsite [-resource fork|worker ] [-sites alt-sites-file] scriptname
$

To install your software, create a script similar to "myinstall.sh" below, which (in this example) downloads and unpacks a tar file of a pre-compiled application "myapp-2.12" and executes a test script for that application. The test script should print some recognizable indication of its success or failure:

$ cat myinstall.sh
renice 2 -p $$
IDIR=$OSG_APP/extenci/myname/myapp
mkdir -p $IDIR
cd $IDIR
wget http://my.url.edu/~mydir/myapp-2.12.tar.tgz
tar zxf myapp-2.12.tar.tgz
myapp-2.12/bin/test_myapp.sh
if [ $? = 0 ]; then
  echo INSTALL SUCCEEDED
else
  echo INSTALL FAILED
fi
$
$ foreachsite -resource fork myinstall.sh
$
$ # Wait a while here, then poll for successfully installed apps...
$
$ grep SUCCEEDED run.89/*/*.stdout
$
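Rather than grepping by hand, the per-site results can be tallied. This sketch assumes the run.NN/<site>/*.stdout layout implied by the grep above and the INSTALL SUCCEEDED / INSTALL FAILED markers printed by the install script; summarize_install is a hypothetical helper, and sites with neither marker are counted as pending.

```shell
#!/bin/sh
# Hypothetical helper: tally foreachsite install results, one line per site
# plus a summary line. Assumes one subdirectory per site under the run
# directory, holding the *.stdout files returned by foreachsite.
summarize_install() {
  rundir=$1
  ok=0; failed=0; pending=0
  for site in "$rundir"/*/; do
    [ -d "$site" ] || continue          # glob matched nothing
    name=$(basename "$site")
    # concatenate whatever stdout files exist for this site (none is fine)
    out=$(cat "$site"*.stdout 2>/dev/null)
    case $out in
      *"INSTALL SUCCEEDED"*) ok=$((ok + 1));           echo "succeeded $name" ;;
      *"INSTALL FAILED"*)    failed=$((failed + 1));   echo "failed    $name" ;;
      *)                     pending=$((pending + 1)); echo "pending   $name" ;;
    esac
  done
  echo "succeeded=$ok failed=$failed pending=$pending"
}
```

For example, summarize_install run.89 lists each site and ends with a one-line summary; rerun it while foreachsite jobs are still returning.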

Following is an example of the installation script for DSSAT app on OSG:

#!/bin/bash

cd ${OSG_APP}/extenci/swift/

#pull
wget http://www.ci.uchicago.edu/~ketan/DSSAT.tgz

#extract
tar zxf DSSAT.tgz

# test
cd DSSAT/data

../DSSAT040.EXE A H1234567.MZX > std.OUT

if [ $? = 0 ]; then
  echo INSTALL SUCCEEDED
else
  echo INSTALL FAILED
fi

6.9. Starting a single coaster service

This single coaster service will service all grid sites:

start-grid-service --loglevel INFO --throttle 3.99 --jobspernode 1 \
                   >& start-grid-service.out

6.10. Starting workers on OSG sites through GRAM

Make sure that your "greensites" file is in the current working directory.

The swiftDemand file should contain the number of workers you want to start across all OSG sites. Eventually this will be set dynamically by watching your Swift script execute. (Note that this number only includes workers started by the swift-workers factory command, not any workers added manually from TeraGrid; see below.)

The condor directory must be pre-created and will be used by Condor to return stdout and stderr files from the Condor jobs, which will execute the wrapper script "run-workers.sh".

Note: This script is currently built manually; it wraps around and transports the worker.pl script. This step needs to be automated.

echo 250 >swiftDemand
mkdir -p condor

swift-workers greensites extenci \
              http://communicado.ci.uchicago.edu:$(cat service-0.wport) \
              >& swift-workers.out &

6.11. Starting workers on OSG sites through GlideinWMS

As an alternative to the GRAM-based direct worker submission above, workers can be submitted through GlideinWMS. The service start step is the same as above.

GlideinWMS is a Glidein Based WMS (Workload Management System) that works on top of condor.

As with the case of Gram based workers, the condor directory must be pre-created and will be used by Condor to return stdout and stderr files from the Condor jobs, which will execute the wrapper script "run-workers.sh".

run-gwms-workers http://communicado.ci.uchicago.edu:$(cat service-0.wport) \
100 >& gwms-workers.out &
Note: The run-gwms-workers script is available from the bin/grid directory of the Swift trunk code. You will need to include it in your PATH.

In the above commandline, one can change the number of workers by changing the second commandline argument, which is 100 in this example.

6.12. Adding workers from TeraGrid sites

The command below submits jobs to TeraGrid (Ranger only at the moment) to add more workers to the execution pool. The same requirements hold there as for OSG sites, namely, that the app tools listed in tc for the single execution site must be locatable on the TeraGrid site(s).

start-ranger-service --nodes 1 --walltime 00:10:00 --project TG-DBS123456N \
                     --queue development --user tg12345 --startservice no \
                     >& start-ranger-service.out
Note: Change the project and user names to match your TeraGrid parameters.

6.13. Running Swift

Now that everything is in place, run Swift with the following command:

export SWIFT_HEAP_MAX=6000m # Add this for very large scripts

swift -config cf.ps -tc.file tc -sites.file sites.grid-ps.xml \
      catsn.swift -n=10000 >& swift.out &

You should see several new files being created, called catsn.0001.out, catsn.0002.out, etc. Each of these files should contain the contents of what you placed into data.txt. If this happens, your job has run successfully on the grid sites.
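To confirm this without opening every file, each output can be compared against data.txt. Below is a minimal sketch using cmp; check_outputs is a hypothetical helper, and its default prefix catsn. matches the file names above (passing outdir and f. as the second and third arguments would check the Beagle example instead).

```shell
#!/bin/sh
# Hypothetical helper: compare each Swift output file with the input it was
# copied from. Arguments (all optional): input file, output dir, file prefix.
check_outputs() {
  input=${1:-data.txt}
  dir=${2:-.}
  prefix=${3:-catsn.}
  n=0; bad=0
  for f in "$dir/$prefix"*.out; do
    [ -e "$f" ] || continue             # glob matched nothing
    n=$((n + 1))
    if ! cmp -s "$input" "$f"; then
      echo "mismatch: $f"
      bad=$((bad + 1))
    fi
  done
  echo "checked=$n mismatched=$bad"
  # succeed only if at least one file was found and all of them matched
  [ "$n" -gt 0 ] && [ "$bad" -eq 0 ]
}
```

Run check_outputs in the directory where Swift was started; a zero exit status means every output matched.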

6.14. More Help

The best place for additional help is the Swift user mailing list. You can subscribe to this list at http://mail.ci.uchicago.edu/mailman/listinfo/swift-user. When submitting information, please send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.

7. Intrepid (Blue Gene/P)

Intrepid is an IBM Blue Gene/P (BG/P) supercomputer located at the Argonne Leadership Computing Facility. More information on Intrepid can be found at http://www.alcf.anl.gov. Surveyor and Challenger are similar, smaller machines.

7.1. Requesting Access

If you do not already have an account on Intrepid, you can request one here. More information about this process and requesting allocations for your project can be found here.

7.2. SSH Keys

Intrepid can be accessed via SSH with any SSH software package. Before logging in, you will need to generate an SSH public key and send it to support@alcf.anl.gov for verification and installation.

7.3. Cryptocard

This security token uses one-time passwords for controlled access to the BG/P login systems.

7.4. Connecting to a login node

When you gain access to Intrepid, you should receive a cryptocard and a temporary PIN. You must have a working cryptocard, know your PIN, and have your SSH key in place before you may login.

You can connect to Intrepid with the following command:

ssh yourusername@intrepid.alcf.anl.gov

You will be presented with a password prompt. The first part of your password is your PIN. Enter your PIN, press the Cryptocard button, and then enter the password your Cryptocard generates. If this is the first time you are logging in, you will be prompted to change your PIN.

7.5. Downloading and building Swift

The most recent versions of Swift can be found at http://www.ci.uchicago.edu/swift/downloads/index.php. Follow the instructions provided on that site to download and build Swift.

7.6. Adding Swift to your PATH

Once you have installed Swift, add the Swift binary to your PATH so you can easily run it from any directory.

In your home directory, edit the file ".bashrc".

If you have installed Swift via a source repository, add the following line at the bottom of .bashrc.

export PATH=$PATH:$HOME/cog/modules/swift/dist/swift-svn/bin

If you have installed Swift via a binary package, add this line:

export PATH=$PATH:$HOME/swift-<version>/bin

Replace <version> with the actual name of the swift directory in the example above.

7.7. What You Need To Know Before Running Swift

Note that on Intrepid, the compute nodes cannot create or write to the /home filesystem. Consequently, for Swift to interface correctly between the login node and the compute nodes, Swift must write all internal and intermediate files to /intrepid-fs0, which is writable by the compute nodes. To accomplish this, export the environment variable SWIFT_USERHOME as follows:

export SWIFT_USERHOME=/intrepid-fs0/users/`whoami`/scratch
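To avoid retyping the export in every session, it can be persisted in ~/.bashrc. This is a sketch that reuses the /intrepid-fs0/users/<user>/scratch path from above; setup_swift_userhome is a hypothetical helper that also creates the directory and avoids adding the line twice.

```shell
#!/bin/sh
# Hypothetical helper: persist SWIFT_USERHOME in ~/.bashrc and create the
# directory it names.
setup_swift_userhome() {
  # default follows the /intrepid-fs0/users/<user>/scratch convention above
  dir=${1:-/intrepid-fs0/users/$(whoami)/scratch}
  mkdir -p "$dir" || return 1
  line="export SWIFT_USERHOME=$dir"
  # append the export only once, so re-running is harmless
  if ! grep -qxF "$line" "$HOME/.bashrc" 2>/dev/null; then
    echo "$line" >> "$HOME/.bashrc"
  fi
}
```

Run setup_swift_userhome once on a login node, then start a new shell (or source ~/.bashrc) before running Swift.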

Before you can create a Swift configuration file, there are some things you will need to know.

7.7.1. Swift Work Directory

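The Swift work directory is a directory which Swift uses for processing work. This directory needs to be writable; because the compute nodes cannot write to /home on Intrepid (see above), a location under /intrepid-fs0 is the safer choice there. Common options for this are: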

/home/username/swiftwork
/home/username/work
/tmp/swift.work

7.7.2. Which project(s) are you a member of?

Intrepid requires that you are a member of a project. You can determine this by running the following command:

$ projects
HTCScienceApps

If you are not a member of a project, you must first request access to a project. More information on this process can be found at https://wiki.alcf.anl.gov/index.php/Discretionary_Allocations

7.7.3. Determine your Queue

Intrepid has several queues you can submit jobs to, depending on the type of work you will be doing. Run "qstat -q" to print the most up-to-date list.

Table 1. Intrepid Queues
User Queue   Queue             Nodes         Time (hours)   User Maxrun   Project Maxrun
prod-devel   prod-devel        64-512        0-1            5             20
prod         prod-short        512-4096      0-6            5             20
prod         prod-long         512-4096      6-12           5             20
prod         prod-capability   4097-32768    0-12           2             20
prod         prod-24k          16385-24576   0-12           2             20
prod         prod-bigrun       32769-40960   0-12           2             20
prod         backfill          512-8192      0-6            5             10

7.8. Generating Configuration Files

Now that you know what queue to use, your project, and your work directory, it is time to set up Swift. Swift uses a configuration file called sites.xml to determine how it should run. There are two methods you can use for creating this file. You can manually edit the configuration file, or generate it with a utility called gensites.

7.8.1. Manually Editing sites.xml

Below is the template that is used by Swift’s test suite for running on Intrepid.

<config>

  <pool handle="localhost" sysinfo="INTEL32::LINUX">
    <gridftp url="local://localhost" />
    <execution provider="local" url="none" />
    <workdirectory>/scratch/wozniak/work</workdirectory>
    <!-- <profile namespace="karajan" key="maxSubmitRate">1</profile> -->
    <profile namespace="karajan" key="jobThrottle">0.04</profile>
    <profile namespace="swift"   key="stagingMethod">file</profile>
  </pool>

  <pool handle="coasters_alcfbgp">
    <filesystem provider="local" />
    <execution provider="coaster" jobmanager="local:cobalt"/>
    <!-- <profile namespace="swift"   key="stagingMethod">local</profile> -->
    <profile namespace="globus"  key="internalHostname">_HOST_</profile>
    <profile namespace="globus"  key="project">_PROJECT_</profile>
    <profile namespace="globus"  key="queue">_QUEUE_</profile>
    <profile namespace="globus"  key="kernelprofile">zeptoos</profile>
    <profile namespace="globus"  key="alcfbgpnat">true</profile>
    <profile namespace="karajan" key="jobthrottle">5.11</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus"  key="jobsPerNode">1</profile>
    <profile namespace="globus"  key="workerLoggingLevel">DEBUG</profile>
    <profile namespace="globus"  key="slots">1</profile>
    <profile namespace="globus"  key="maxTime">900</profile> <!-- seconds -->
    <profile namespace="globus"  key="nodeGranularity">512</profile>
    <profile namespace="globus"  key="maxNodes">512</profile>
    <workdirectory>_WORK_</workdirectory>
  </pool>

</config>

Copy this template into a file named sites.xml and replace the placeholder values.

The placeholders are the names enclosed in underscores: _HOST_, _PROJECT_, _QUEUE_, and _WORK_.

HOST

The IP address on which Swift runs and to which workers must connect. To obtain this, run ifconfig and select the IP address that starts with 172.

PROJECT

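The project to charge, as listed by the projects command.

QUEUE

The queue to submit to (see Table 1).

WORK

The Swift work directory. On Intrepid, this must be a directory the compute nodes can write to, such as one under /intrepid-fs0.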
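Filling in the placeholders can be scripted. The sketch below assumes the template has been saved to a file (sites.template.xml in the usage comment is a hypothetical name). first_172_addr picks the first 172.x.x.x address from ifconfig output and accepts both the older "inet addr:" and the newer "inet " layouts; fill_sites substitutes the four placeholders with sed. Both function names are illustrative, not Swift tools, and values containing | or & would need escaping before being passed to sed.

```shell
# Print the first 172.x.x.x address found in ifconfig output read from stdin.
# Matches both "inet addr:172.16.0.5" and "inet 172.16.0.5" layouts.
first_172_addr() {
  grep -Eo 'inet (addr:)?172\.[0-9]+\.[0-9]+\.[0-9]+' |
    grep -Eo '172(\.[0-9]+){3}' |
    head -n 1
}

# Substitute the placeholders in a template file.
# Usage: fill_sites TEMPLATE HOST PROJECT QUEUE WORK
# '|' is the sed delimiter, so paths containing '/' need no escaping.
fill_sites() {
  sed -e "s|_HOST_|$2|g" \
      -e "s|_PROJECT_|$3|g" \
      -e "s|_QUEUE_|$4|g" \
      -e "s|_WORK_|$5|g" "$1"
}

# Usage (hypothetical template file name):
# HOST=$(ifconfig | first_172_addr)
# fill_sites sites.template.xml "$HOST" HTCScienceApps prod-devel \
#   "/intrepid-fs0/users/$(whoami)/scratch/work" > sites.xml
```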

7.9. Manually Editing tc.data

Below is the tc.data file used by Swift’s test suite.

coasters_alcfbgp        cp              /bin/cp         INSTALLED       INTEL32::LINUX  null

Copy this line and save it as tc.data.
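Each tc.data line has six whitespace-separated fields: site handle, transformation name, path to the executable, installation status, platform, and profile entries. A quick check can catch lines with missing or extra fields before you run Swift. check_tc below is a hypothetical helper, not a Swift tool, and it assumes that no field contains spaces.

```shell
# Report every non-blank, non-comment line of a tc.data file that does not
# have exactly six whitespace-separated fields. Returns non-zero if any do.
check_tc() {
  awk 'NF > 0 && $1 !~ /^#/ && NF != 6 {
         printf "%s:%d: expected 6 fields, found %d\n", FILENAME, FNR, NF
         bad = 1
       }
       END { exit bad }' "$1"
}

# Usage:
# check_tc tc.data && echo "tc.data looks well-formed"
```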

7.10. Catsn.swift

The Swift script we will run is called catsn.swift. It runs cat on a file and saves the result, which makes it a simple test that jobs are running correctly. Create a file called data.txt containing some simple input; "hello world" will do.

type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

string t = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string char[] = @strsplit(t, "");

file out[]<simple_mapper; location=".", prefix="catsn.",suffix=".out">;
foreach j in [1:@toInt(@arg("n","10"))] {
  file data<"data.txt">;
  out[j] = cat(data);
}

7.11. Running Swift

Now that everything is in place, run Swift with the following command:

swift -sites.file sites.xml -tc.file tc.data catsn.swift -n=10

You should see several new files being created, called catsn.0001.out, catsn.0002.out, etc. Each of these files should contain the contents of what you placed into data.txt. If this happens, your job has run successfully!
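To confirm the run without inspecting each file by eye, a small check can compare every output file with data.txt. check_catsn is a hypothetical helper; it assumes the four-digit numbering shown above and that the outputs and data.txt are in the current directory (or in the directory given as a second argument).

```shell
# Check that catsn.0001.out ... catsn.NNNN.out exist and match data.txt.
# Usage: check_catsn N [DIR]   (DIR defaults to the current directory)
check_catsn() {
  n=$1 dir=${2:-.} i=1 bad=0
  while [ "$i" -le "$n" ]; do
    f=$(printf '%s/catsn.%04d.out' "$dir" "$i")
    if ! cmp -s "$f" "$dir/data.txt"; then
      echo "missing or different: $f" >&2
      bad=$((bad + 1))
    fi
    i=$((i + 1))
  done
  [ "$bad" -eq 0 ] && echo "all $n outputs match data.txt"
}

# After the run above:
# check_catsn 10
```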

7.12. More Help

The best place for additional help is the Swift user mailing list, which you can subscribe to. When asking for help, send your sites.xml file, your tc.data, and any Swift log files that were created during your attempt.

8. MCS Compute Servers (x86 workstations)

This section describes how to use the general-use compute servers of the MCS division of Argonne National Laboratory.

8.1. Create a coaster-service.conf

To begin, copy the text below into a file named coaster-service.conf in your Swift distribution's etc directory.

# Location of SWIFT. If empty, PATH is referenced
export SWIFT=

# Where to place/launch worker.pl on the remote machine for sites.xml
export WORKER_LOCATION=/home/${USER}/work

# How to launch workers: local, ssh, or cobalt
export WORKER_MODE=ssh

# Worker logging setting passed to worker.pl for sites.xml
export WORKER_LOGGING_LEVEL=INFO

# User name to use for all systems
export WORKER_USERNAME=$USER

# Worker host names for ssh
export WORKER_HOSTS="crush thwomp stomp crank grind churn trounce thrash vanquish"

# Directory to keep log files, relative to working directory when launching start-coaster-service
export LOG_DIR=logs
export WORKER_LOG_DIR=/home/${USER}/work

# Manually define ports. If not specified, ports will be automatically generated
export LOCAL_PORT=
export SERVICE_PORT=

# Set to yes when the work directory is on a filesystem shared by all worker hosts; set to no if work is done in a local /sandbox directory
export SHARED_FILESYSTEM=yes

# start-coaster-service tries to automatically detect IP address.
# Specify here if auto detection is not working correctly
export IPADDR=

# Below are various settings to give information about how to create sites.xml
export WORK=/home/${USER}/work
export JOBS_PER_NODE=4

# Try to determine throttle automatically based on the number of nodes and jobs per node
export JOB_THROTTLE=$( echo "scale=5; ($JOBS_PER_NODE * $( echo $WORKER_HOSTS | wc -w ))/100 - 0.00001"|bc )

# Swift applications
#app cat=/bin/cat
#app bash=/bin/bash
#app echo=/bin/echo
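The JOB_THROTTLE line multiplies the number of jobs per node by the number of names in WORKER_HOSTS, divides by 100, and subtracts a small amount. The sketch below repeats that arithmetic with awk instead of bc, which is an assumption for systems where bc is not installed; job_throttle is a hypothetical name. A Swift jobThrottle value t is usually described as allowing about 100·t + 1 concurrent jobs. Under that reading, 0.35999 caps the site at 36 jobs, one per worker slot (4 × 9), and the subtraction keeps the limit from reaching 37. Treat this interpretation as an assumption and check the Swift User Guide for the exact rule.

```shell
# Same arithmetic as the JOB_THROTTLE line, done with awk instead of bc.
# Argument 1: jobs per node. Argument 2: space-separated list of worker hosts.
job_throttle() {
  nhosts=$(echo "$2" | wc -w)
  awk -v j="$1" -v h="$nhosts" 'BEGIN { printf "%.5f\n", (j * h) / 100 - 0.00001 }'
}

# With the nine MCS hosts listed above and 4 jobs per node this prints 0.35999.
job_throttle 4 "crush thwomp stomp crank grind churn trounce thrash vanquish"
```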

8.2. Starting the Coaster Service

Change directories to the location you would like to run a Swift script and start the coaster service with this command:

start-coaster-service

This also generates sites.xml, the configuration file that Swift needs.

Warning Any existing sites.xml files in this directory will be overwritten. Be sure to make a copy of any custom configuration files you may have.
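Because start-coaster-service replaces sites.xml, taking a timestamped backup beforehand is a cheap safeguard. The following is a minimal sketch that uses a hypothetical backup_sites helper.

```shell
# Copy sites.xml (or the file named in argument 1) to a timestamped backup
# before start-coaster-service overwrites it. Does nothing if it is absent.
backup_sites() {
  f=${1:-sites.xml}
  [ -f "$f" ] || return 0                  # nothing to protect
  backup="$f.$(date +%Y%m%d-%H%M%S).bak"
  cp -p "$f" "$backup" && echo "saved $f as $backup"
}

# Usage:
# backup_sites && start-coaster-service
```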

8.3. Run Swift

Next, run Swift. If you do not have a particular script in mind, you can test Swift by using a Swift script in the examples/ directory.

Run the script with the following command:

swift -sites.file sites.xml -tc.file tc.data yourscript.swift

8.4. Stopping the Coaster Service

The coaster service will run indefinitely. The stop-coaster-service script will terminate the coaster service.

$ stop-coaster-service

This will kill the coaster service and kill the worker scripts on remote systems.

9. Midway (x86 cluster)

Midway is a cluster maintained by the Research Computing Center at the University of Chicago. Midway uses Slurm to handle job scheduling. For more details about Midway, and to request an account, visit http://rcc.uchicago.edu.

9.1. Connecting to a login node

Once you have access to Midway, connect to a Midway login node.

$ ssh userid@midway.rcc.uchicago.edu

9.2. Loading Swift

Swift is available on Midway as a module. To load Swift, run:

$ module load swift

9.3. Example sites.xml

Below is an example that uses two of the queues available on Midway, sandyb and westmere. Be sure to adjust walltime, work directory, and other options as needed.

<config>
  <pool handle="midway-sandyb">
    <execution provider="coaster" jobmanager="local:slurm"/>
    <profile namespace="globus" key="jobsPerNode">16</profile>
    <profile namespace="globus" key="maxWalltime">00:05:00</profile>
    <profile namespace="globus" key="highOverAllocation">100</profile>
    <profile namespace="globus" key="lowOverAllocation">100</profile>
    <profile namespace="globus" key="queue">sandyb</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/scratch/midway/{env.USER}/work</workdirectory>
  </pool>

  <pool handle="midway-westmere">
    <execution provider="coaster" jobmanager="local:slurm"/>
    <profile namespace="globus" key="jobsPerNode">12</profile>
    <profile namespace="globus" key="maxWalltime">00:05:00</profile>
    <profile namespace="globus" key="highOverAllocation">100</profile>
    <profile namespace="globus" key="lowOverAllocation">100</profile>
    <profile namespace="globus" key="queue">westmere</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/scratch/midway/{env.USER}/work</workdirectory>
  </pool>
</config>

9.4. Example sites.xml for use with MPI

Below is an example sites.xml that is suitable for running with MPI. The jobtype profile must be set to single. The ppn value sets the number of cores (slots) your application uses per node, and count sets the number of nodes to request. The example below requests 2 nodes with 12 slots per node, 24 slots in total. Replace _WALLTIME_ with the walltime you need, in the same hh:mm:ss format as the previous example.

<config>
  <pool handle="midway-westmere">
    <execution provider="coaster" jobmanager="local:slurm"/>
    <profile namespace="globus" key="jobsPerNode">1</profile>
    <profile namespace="globus" key="ppn">12</profile>
    <profile namespace="globus" key="maxWalltime">_WALLTIME_</profile>
    <profile namespace="globus" key="highOverAllocation">100</profile>
    <profile namespace="globus" key="lowOverAllocation">100</profile>
    <profile namespace="globus" key="queue">westmere</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="jobtype">single</profile>
    <profile namespace="globus" key="count">2</profile>
    <filesystem provider="local"/>
    <workdirectory>/scratch/midway/{env.USER}/work</workdirectory>
  </pool>
</config>

9.5. Various tips for running MPI jobs

  • You’ll need to load an MPI module. Run "module load openmpi" to add MPI to your PATH.

  • The app that Swift runs should be a wrapper script that invokes your MPI application by running "mpiexec /path/to/yourMPIApp".

10. UC3 (x86 cluster)

10.1. Requesting Access

To request access to UC3, you must have a University of Chicago CNetID and be a member of the UC3 group. More information about UC3 can be found at https://wiki.uchicago.edu/display/uc3/UC3+Home or by contacting uc3-support@lists.uchicago.edu.

10.2. Connecting to a login node

To access the UC3 login node, you will use your CNetID and password.

ssh -l <cnetid> uc3-sub.uchicago.edu

10.3. Installing Swift

Swift should be available by default on the UC3 login nodes. You can verify this by running the following command:

swift -version

If for some reason Swift is not available, you can follow the instructions at http://www.ci.uchicago.edu/swift/guides/release-0.93/quickstart/quickstart.html. Note that Swift 0.94 or later is required to work with the condor provider on UC3.
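To check the installed version against the 0.94 requirement, one option is a version comparison with GNU sort -V. This is a sketch: version_ge is a hypothetical helper, and the parsing of swift -version output in the usage comment assumes that the version number is the second word of the first line, which may not hold for every build.

```shell
# Succeed if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes output of the form "Swift 0.94 ..."):
# v=$(swift -version | awk '{ print $2; exit }')
# version_ge "$v" 0.94 || echo "Swift $v is too old for the condor provider on UC3"
```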

10.4. Creating sites.xml

This section provides a working configuration file that you can copy and paste to get running quickly. The sites.xml file tells Swift how to submit jobs, where working directories are located, and other configuration information. More information on sites.xml can be found in the Swift User’s Guide.

The first step is to paste the text below into a file named sites.xml:

<config>
  <pool handle="uc3">
    <execution provider="coaster" url="uc3-sub.uchicago.edu" jobmanager="local:condor"/>
    <profile namespace="karajan" key="jobThrottle">999.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus"  key="jobsPerNode">1</profile>
    <profile namespace="globus"  key="maxWalltime">3600</profile>
    <profile namespace="globus"  key="nodeGranularity">1</profile>
    <profile namespace="globus"  key="highOverAllocation">100</profile>
    <profile namespace="globus"  key="lowOverAllocation">100</profile>
    <profile namespace="globus"  key="slots">1000</profile>
    <profile namespace="globus"  key="maxNodes">1</profile>
    <profile namespace="globus"  key="condor.+AccountingGroup">"group_friends.{env.USER}"</profile>
    <profile namespace="globus"  key="jobType">nonshared</profile>
    <filesystem provider="local" url="none" />
    <workdirectory>.</workdirectory>
  </pool>
</config>

10.5. Creating tc.data

The tc.data configuration file gives information about the applications that will be called by Swift. More information about the format of tc.data can be found in the Swift User’s guide.

Paste the following example into a file named tc.data:

uc3 echo /bin/echo null null null

10.6. Create a configuration file

A Swift configuration file enables or disables various Swift settings. More information on what these settings do can be found in the Swift User’s Guide.

Paste the following lines into a file called cf:

wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false

10.7. Creating echo.swift

Now we need to create a swift script to test with. Let’s use a simple application that calls /bin/echo.

type file;

app (file o) echo (string s) {
   echo s stdout=@o;
}

foreach i in [1:5] {
  file output_file <single_file_mapper; file=@strcat("output/output.", i, ".txt")>;
  output_file = echo( @strcat("This is test number ", i) );
}

10.8. Running Swift

Putting everything together now, run your Swift script with the following command:

swift -sites.file sites.xml -tc.file tc.data -config cf echo.swift

If everything runs successfully, five files will be created in the output directory.

10.9. Controlling where jobs run

Swift automatically generates Condor submit files containing the basic information about how to run your jobs. However, Condor supports hundreds of commands that let you customize how jobs run. If you need one of these advanced commands, you can add it to your sites.xml. The basic template for this is:

<profile namespace="globus" key="condor.key">value</profile>

For example, suppose you want to control where your jobs run by adding a requirement. The Condor command for this is:

Requirements = UidDomain == "osg-gk.mwt2.org"

To have Swift generate this command, add a line to your sites.xml in the key/value style shown above:

<profile namespace="globus" key="condor.Requirements">UidDomain == "osg-gk.mwt2.org"</profile>
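Converting a Condor submit line to this profile form is mechanical. The text before the first = becomes the part of the key after "condor.", and the rest becomes the value. Because sites.xml is XML, any &, < or > in the value must be escaped (for example, && becomes &amp;&amp;). condor_profile below is a hypothetical helper that performs this conversion.

```shell
# Turn a Condor submit-file line such as
#   Requirements = UidDomain == "osg-gk.mwt2.org"
# into the matching sites.xml profile entry, escaping &, < and > in the value.
condor_profile() {
  # Key: text before the first '=', with surrounding blanks trimmed.
  key=$(printf '%s\n' "$1" | sed -e 's/=.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
  # Value: text after the first '=', leading blanks dropped, XML-escaped.
  # '&' is escaped first so the later &lt; and &gt; are not double-escaped.
  value=$(printf '%s\n' "$1" | sed -e 's/^[^=]*=[[:space:]]*//' \
            -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
  printf '<profile namespace="globus" key="condor.%s">%s</profile>\n' "$key" "$value"
}

condor_profile 'Requirements = UidDomain == "osg-gk.mwt2.org"'
```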

10.10. Installing Application Scripts on HDFS

Note This section only works if the application you want to use is an interpreted script (bash, python, perl, etc.). HDFS cannot set the execute bit on files, so programs stored there cannot be run directly. If the application you want to run on UC3 is a compiled executable, skip this section and read the next one.

Once your simple echo test is running, you’ll want to start using your own applications. One way to do this is to use the Hadoop filesystem (HDFS), which is only available on the UC3 Seeder Cluster. To restrict your jobs to machines that can access this filesystem, add the following line to your sites.xml file:

<profile namespace="globus" key="condor.Requirements">UidDomain == "osg-gk.mwt2.org" &amp;&amp; regexp("uc3-c*", Machine)</profile>

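Now you can install your script somewhere under the directory /mnt/hadoop/users/<yourusername>. Here is a complete example that puts everything together. Note that && must be written as &amp;&amp; inside sites.xml, because & is a special character in XML.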

sites.xml

<config>
  <pool handle="uc3">
    <execution provider="coaster" url="uc3-sub.uchicago.edu" jobmanager="local:condor"/>
    <profile namespace="karajan" key="jobThrottle">999.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus"  key="jobsPerNode">1</profile>
    <profile namespace="globus"  key="maxWalltime">3600</profile>
    <profile namespace="globus"  key="nodeGranularity">1</profile>
    <profile namespace="globus"  key="highOverAllocation">100</profile>
    <profile namespace="globus"  key="lowOverAllocation">100</profile>
    <profile namespace="globus"  key="slots">1000</profile>
    <profile namespace="globus"  key="maxNodes">1</profile>
    <profile namespace="globus"  key="condor.+AccountingGroup">"group_friends.{env.USER}"</profile>
    <profile namespace="globus"  key="jobType">nonshared</profile>
    <profile namespace="globus" key="condor.Requirements">UidDomain == "osg-gk.mwt2.org" &amp;&amp; regexp("uc3-c*", Machine)</profile>
    <filesystem provider="local" url="none" />
    <workdirectory>.</workdirectory>
  </pool>
</config>

tc.data

uc3 bash /bin/bash null null null

myscript.swift

type file;

app (file o) myscript ()
{
   bash "/mnt/hadoop/users/<yourusername>/myscript.sh" stdout=@o;
}

file out[]<simple_mapper; location="outdir", prefix="myscript.",suffix=".out">;
int ntasks = @toInt(@arg("n","1"));

foreach n in [1:ntasks] {
   out[n] = myscript();
}

/mnt/hadoop/users/<yourusername>/myscript.sh

#!/bin/bash

echo This is my script

cf

wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false

Example run

$ swift -sites.file sites.xml -tc.file tc.data -config cf myscript.swift -n=10
Swift trunk swift-r6146 cog-r3544

RunID: 20130109-1657-tf01jpaa
Progress:  time: Wed, 09 Jan 2013 16:58:00 -0600
Progress:  time: Wed, 09 Jan 2013 16:58:30 -0600  Submitted:10
Progress:  time: Wed, 09 Jan 2013 16:59:00 -0600  Submitted:10
Progress:  time: Wed, 09 Jan 2013 16:59:12 -0600  Stage in:1  Submitted:9
Final status: Wed, 09 Jan 2013 16:59:12 -0600  Finished successfully:10
$ ls outdir/*
outdir/myscript.0001.out  outdir/myscript.0003.out  outdir/myscript.0005.out  outdir/myscript.0007.out  outdir/myscript.0009.out
outdir/myscript.0002.out  outdir/myscript.0004.out  outdir/myscript.0006.out  outdir/myscript.0008.out  outdir/myscript.0010.out

10.11. Staging in Applications with Coaster Provider Staging

If you want your application to be as portable as possible, you can use coaster provider staging to send your application(s) to a remote node. By removing the condor requirements in the previous section, you will have more cores available. Here is a simple script that stages in and executes a shell script.

type file;

app (file o) sleep (file script, int delay)
{
   # chmod +x script.sh ; ./script.sh delay
   bash "-c" @strcat("chmod +x ./", @script, " ; ./", @script, " ", delay) stdout=@o;
}

file sleep_script <"sleep.sh">;

foreach i in [1:5] {
  file o <single_file_mapper; file=@strcat("output/output.", i, ".txt")>;
  o = sleep(sleep_script, 10);
}

Mapping our script to a file and passing it as an argument to an app function causes the application to be staged in. The only thing we need on the worker node is /bin/bash.
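The script above maps a local file named sleep.sh, whose contents are not shown. A minimal version is sketched below as an assumption. It takes the delay as its first argument (matching the "./sleep.sh delay" call that the app builds), checks that the delay is a whole number, and prints a line before and after sleeping so that each output file has some content.

```shell
#!/bin/bash
# sleep.sh: hypothetical contents for the script mapped as sleep_script.
# The app runs it as ./sleep.sh <delay>.
sleep_task() {
  local delay="${1:-1}"          # default to 1 second if no argument is given
  case "$delay" in
    *[!0-9]*) echo "usage: sleep.sh SECONDS" >&2; return 1 ;;
  esac
  echo "Sleeping for $delay seconds on $(uname -n)"
  sleep "$delay"
  echo "Done"
}

sleep_task "$@"
```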

Tip If the program you are staging in is an executable, statically compiling it will increase the chances of a successful run.
Tip If the application you are staging is more complex, with multiple files, package everything it needs into a single compressed tar file, then extract it on the worker node.
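One way to apply the single-tarball tip is sketched below. pack_app and unpack_and_run are hypothetical helpers, and run.sh stands in for whatever launcher your application uses. On the worker, the unpacking step would be run from the app command, in the same way the sleep example runs its staged script through /bin/bash.

```shell
# On the submit host: bundle everything in APP_DIR into one compressed tarball.
# Usage: pack_app APP_DIR TARBALL
pack_app() {
  tar -czf "$2" -C "$1" .
}

# On the worker node: unpack into ./app and run the entry point with any
# remaining arguments. run.sh is a placeholder for your own launcher script.
# Usage: unpack_and_run TARBALL [ARGS...]
unpack_and_run() {
  tarball=$1
  shift
  mkdir -p app && tar -xzf "$tarball" -C app && bash app/run.sh "$@"
}
```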

11. Stampede

Stampede is a 10 petaflop supercomputer available as part of XSEDE resources. It employs a batch-oriented computational model in which a SLURM scheduler accepts users' jobs and queues them for execution. This model requires users to prepare submit files, track job submissions, checkpoint, manage input/output data, and handle exceptional conditions manually.

Running Swift on Stampede can accomplish these tasks with minimal manual intervention. The following sections discuss the specifics of running Swift on Stampede. More detailed information about Swift and its workings can be found on the Swift documentation page: http://www.ci.uchicago.edu/swift/wwwdev/docs/index.php

More information on Stampede can be found on the XSEDE Stampede website: https://www.xsede.org/stampede

11.1. Requesting Access

Initial access to XSEDE resources can be obtained by submitting a startup proposal. Advanced users can submit a proposal for a research allocation. An educational allocation is available for teaching and training purposes. More on XSEDE allocations can be found here: https://www.xsede.org/allocations

11.2. Connecting to a login node

Once you have an account, you should be able to access a Stampede login node with the following command:

ssh yourusername@stampede.tacc.utexas.edu

Follow the steps outlined below to get started with Swift on Stampede:

step 1. Install Swift using one of the installation methods documented on the Swift home page: http://www.ci.uchicago.edu/swift/downloads/index.php

If installing from source, Java can be loaded on Stampede with "module load jdk32", and Apache Ant can be downloaded from http://ant.apache.org.

step 2. Create and change to a directory where your Swift-related work will be kept (for example, mkdir swift-work followed by cd swift-work).

step 3. To get started with a simple example that runs the Linux /bin/cat command to read an input file data.txt and write it to an output file, write a simple Swift script like the following and save it as catsn.swift:

type file;

/* App definition */
app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file out[]<simple_mapper; location="outdir", prefix="f.",suffix=".out">;
file data<"data.txt">;

/* App invocation: n times */
foreach j in [1:@toint(@arg("n","1"))] {
  out[j] = cat(data);
}

Make sure a file named data.txt exists in the directory where you save the Swift script.

step 4. Next, create a sites file. An example sites file (sites.xml) follows; replace the project, e-mail address, and work directory with your own values:

<config>
  <pool handle="stampede">
    <execution provider="coaster" jobmanager="local:slurm"/>

    <!-- **replace with your project** -->
    <profile namespace="globus" key="project">TG-EAR130015</profile>

    <profile namespace="globus" key="jobsPerNode">1</profile>
    <profile namespace="globus" key="maxWalltime">00:11:00</profile>
    <profile namespace="globus" key="maxtime">800</profile>

    <profile namespace="globus" key="highOverAllocation">100</profile>
    <profile namespace="globus" key="lowOverAllocation">100</profile>

    <!-- queues on stampede: development, normal, large, etc. -->
    <profile namespace="globus" key="queue">development</profile>

    <!-- for mail notification -->
    <profile namespace="globus" key="slurm.mail-user">me@dept.org</profile>
    <profile namespace="globus" key="slurm.mail-type">ALL</profile>

    <filesystem provider="local"/>
    <workdirectory>/path/to/workdir</workdirectory>
  </pool>
</config>

step 5. Next, create the config and tc files. The config file (cf) is as follows:

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false
use.wrapper.staging=false

The tc file (tc) is as follows:

stampede cat /bin/cat null null null

More information about config and tc file options can be found in the Swift User Guide: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_swift_configuration_properties.

step 6. Run the example using the following command line:

swift -config cf -tc.file tc -sites.file sites.xml catsn.swift -n=1

You can change the value of -n to run that many cat invocations in parallel.

step 7. Swift will report "Finished successfully" once the job has completed its run in the queue. Check the output in the generated outdir directory (ls outdir). A sample run looks like this:

login3$ swift -sites.file sites.stampede.xml -config cf -tc.file tc catsn.swift
Swift trunk swift-r6290 cog-r3609

RunID: 20130221-1030-faapk389
Progress:  time: Thu, 21 Feb 2013 10:30:21 -0600
Progress:  time: Thu, 21 Feb 2013 10:30:22 -0600  Submitting:1
Progress:  time: Thu, 21 Feb 2013 10:30:29 -0600  Submitted:1
Progress:  time: Thu, 21 Feb 2013 10:30:51 -0600  Active:1
Progress:  time: Thu, 21 Feb 2013 10:30:54 -0600  Finished successfully:1
Final status: Thu, 21 Feb 2013 10:30:54 -0600  Finished successfully:1

11.3. Troubleshooting

This section discusses some common issues, and their remedies, when using Swift on Stampede. These issues can originate in Swift or in Stampede’s configuration, state, and usage load, among other factors. The most common known issues are listed here:

  • Command not found: Make sure the bin directory of Swift installation is in PATH.

  • Failed to transfer wrapperlog for job cat-nmobtbkk and/or Job failed with an exit code of 254. Check the <workdirectory> element in the sites.xml file.

<workdirectory>/work/your/path/swift.workdir</workdirectory>

It is likely set to a path where the compute nodes cannot write or where no space is available, such as your /home directory. To fix this, set the work directory to a path that the compute nodes can write to and that has enough space, such as a /scratch directory.

  • If jobs do not reach the active state for a long time, check their status with the Slurm squeue command:

$ squeue -u `whoami`

The output indicates the status of your jobs. See the Slurm manual for more information on job management commands:

$ man slurm
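For the wrapperlog / exit code 254 case, a quick diagnostic is to check whether the candidate work directory can be written and how much space is left on its filesystem. check_swift_workdir is a hypothetical helper. Run it on the login node, or inside a batch job to see what a compute node sees.

```shell
# Report whether a candidate <workdirectory> can be created and written from
# this node, and how much space is free on its filesystem.
check_swift_workdir() {
  dir=$1
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "cannot create $dir" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "$dir is not writable" >&2
    return 1
  fi
  # df -P prints a header line, then one line for the filesystem;
  # field 4 is the available space in 1024-byte blocks.
  free_kb=$(df -P "$dir" | awk 'NR == 2 { print $4 }')
  echo "$dir is writable; $free_kb KB free"
}

# Example (adjust the path to your own scratch area):
# check_swift_workdir /scratch/your/path/swift.workdir
```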